Video dialog via multi-grained convolutional self-attention context networks

  • Weike Jin
  • Jun Yu
  • Zhou Zhao*
  • Jun Xiao
  • Mao Gu
  • Yueting Zhuang
  • *Corresponding author for this work
  • Zhejiang University
  • Hangzhou Dianzi University

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Video dialog is a new and challenging task that requires an AI agent to maintain a meaningful natural-language dialog with humans about video content. Specifically, given a video, a dialog history, and a new question about the video, the agent has to combine video information with the dialog history to infer the answer. Because video information is complex, image dialog methods may be ineffective when applied directly to video dialog. In this paper, we propose a novel approach for video dialog called the multi-grained convolutional self-attention context network, which combines video information with dialog history. Instead of using an RNN to encode sequence information, we design a multi-grained convolutional self-attention mechanism that captures both element-level and segment-level interactions, which together contain multi-grained sequence information. We then design a hierarchical dialog history encoder to learn a context-aware question representation and a two-stream video encoder to learn a context-aware video representation. We evaluate our method on two large-scale datasets. Owing to the flexibility and parallelism of the new attention mechanism, our method achieves higher time efficiency, and extensive experiments also show its effectiveness.
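The abstract does not give the formulation of the attention mechanism, so the following is only an illustrative sketch of the general idea of attending at two granularities, not the authors' method. It assumes fixed-width, non-overlapping segments as a stand-in for convolution-style windowing, unparameterized scaled dot-product attention, mean pooling to form segment vectors, and additive fusion; the names `self_attention`, `mean_pool`, and `multi_grained_attention` are hypothetical.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    # Scaled dot-product self-attention with no learned projections
    # (a simplification made for this sketch).
    dim = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return out


def mean_pool(vectors):
    # Average a list of equal-length vectors into one vector.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def multi_grained_attention(sequence, segment_len):
    # Split the sequence into fixed-width segments (assumed windowing scheme).
    segments = [sequence[i:i + segment_len] for i in range(0, len(sequence), segment_len)]
    # Element level: attention among the elements inside each segment.
    element_level = [self_attention(seg) for seg in segments]
    # Segment level: attention among pooled segment summaries.
    segment_level = self_attention([mean_pool(seg) for seg in element_level])
    # Fuse both granularities by adding each segment's context to its elements.
    fused = []
    for seg_out, seg_ctx in zip(element_level, segment_level):
        for vec in seg_out:
            fused.append([a + b for a, b in zip(vec, seg_ctx)])
    return fused


# Toy input: five 2-dimensional feature vectors (e.g. word or frame features).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
out = multi_grained_attention(sequence, segment_len=2)
print(len(out), len(out[0]))
```

Unlike an RNN, nothing here depends on the output of a previous time step, so each segment could be processed independently; this is the kind of parallelism the abstract credits for its time efficiency.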

Original language: English
Title of host publication: SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 465-474
Number of pages: 10
ISBN (Electronic): 9781450361729
DOIs
State: Published - 18 Jul 2019
Externally published: Yes
Event: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019 - Paris, France
Duration: 21 Jul 2019 to 25 Jul 2019

Publication series

Name: SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019
Country/Territory: France
City: Paris
Period: 21/07/19 to 25/07/19

Keywords

  • Convolution
  • Multi-grained self-attention
  • Video dialog
