TY - GEN
T1 - Video dialog via multi-grained convolutional self-attention context networks
AU - Jin, Weike
AU - Yu, Jun
AU - Zhao, Zhou
AU - Xiao, Jun
AU - Gu, Mao
AU - Zhuang, Yueting
N1 - Publisher Copyright: © 2019 Association for Computing Machinery.
PY - 2019/7/18
Y1 - 2019/7/18
AB - Video dialog is a new and challenging task that requires an AI agent to maintain a meaningful natural-language dialog with humans about video content. Specifically, given a video, a dialog history, and a new question about the video, the agent has to combine video information with the dialog history to infer the answer. Because of the complexity of video information, image dialog methods may be ineffective when applied directly to video dialog. In this paper, we propose a novel approach for video dialog, called the multi-grained convolutional self-attention context network, which combines video information with dialog history. Instead of using an RNN to encode sequence information, we design a multi-grained convolutional self-attention mechanism that captures both element-level and segment-level interactions, which contain multi-grained sequence information. We then design a hierarchical dialog history encoder to learn a context-aware question representation and a two-stream video encoder to learn a context-aware video representation. We evaluate our method on two large-scale datasets. Owing to the flexibility and parallelism of the new attention mechanism, our method achieves higher time efficiency, and extensive experiments also demonstrate its effectiveness.
KW - Convolution
KW - Multi-grained self-attention
KW - Video dialog
UR - https://www.scopus.com/pages/publications/85073786898
U2 - 10.1145/3331184.3331240
DO - 10.1145/3331184.3331240
M3 - Conference contribution
AN - SCOPUS:85073786898
T3 - SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 465
EP - 474
BT - SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019
Y2 - 21 July 2019 through 25 July 2019
ER -