TY - GEN
T1 - Video-based cross-modal recipe retrieval
AU - Cao, Da
AU - Fang, Jiansheng
AU - Yu, Zhiwang
AU - Nie, Liqiang
AU - Zhang, Hanling
AU - Tian, Qi
N1 - Publisher Copyright:
© 2019 Copyright held by the owner/author(s).
PY - 2019/10/15
Y1 - 2019/10/15
N2 - As a natural extension of image-based cross-modal recipe retrieval, retrieving a specific video given a recipe as the query is seldom explored. Cooking videos contain various hidden temporal and spatial elements. In addition, current image-based cross-modal recipe retrieval approaches mostly emphasize the understanding of textual and visual content independently, overlooking the interaction between the two. In this work, we propose the new problem of video-based cross-modal recipe retrieval and thoroughly investigate it under the attention paradigm. In particular, we first exploit a parallel-attention network to independently learn the representations of videos and recipes. Next, a co-attention network is proposed to explicitly emphasize the cross-modal interactive features between videos and recipes. Meanwhile, a cross-modal fusion sub-network is proposed to learn both the independent and collaborative dynamics, which enhances the associated representation of videos and recipes. Finally, the embedding vectors of videos and recipes stemming from the joint network are optimized with a pairwise ranking loss. Extensive experiments on a self-collected dataset verify the effectiveness and rationality of the proposed solution.
AB - As a natural extension of image-based cross-modal recipe retrieval, retrieving a specific video given a recipe as the query is seldom explored. Cooking videos contain various hidden temporal and spatial elements. In addition, current image-based cross-modal recipe retrieval approaches mostly emphasize the understanding of textual and visual content independently, overlooking the interaction between the two. In this work, we propose the new problem of video-based cross-modal recipe retrieval and thoroughly investigate it under the attention paradigm. In particular, we first exploit a parallel-attention network to independently learn the representations of videos and recipes. Next, a co-attention network is proposed to explicitly emphasize the cross-modal interactive features between videos and recipes. Meanwhile, a cross-modal fusion sub-network is proposed to learn both the independent and collaborative dynamics, which enhances the associated representation of videos and recipes. Finally, the embedding vectors of videos and recipes stemming from the joint network are optimized with a pairwise ranking loss. Extensive experiments on a self-collected dataset verify the effectiveness and rationality of the proposed solution.
KW - Co-Attention Network
KW - Cross-Modal Retrieval
KW - Parallel-Attention Network
KW - Recipe Retrieval
KW - Video Retrieval
UR - https://www.scopus.com/pages/publications/85074853523
U2 - 10.1145/3343031.3351067
DO - 10.1145/3343031.3351067
M3 - Conference contribution
AN - SCOPUS:85074853523
T3 - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
SP - 1685
EP - 1693
BT - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 27th ACM International Conference on Multimedia, MM 2019
Y2 - 21 October 2019 through 25 October 2019
ER -