TY - GEN
T1 - Much
T2 - 27th ACM International Conference on Multimedia, MM 2019
AU - Song, Xinhang
AU - Wang, Bohan
AU - Chen, Gongwei
AU - Jiang, Shuqiang
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/10/15
Y1 - 2019/10/15
N2 - Due to the abstraction of scenes, comprehensive scene understanding requires semantic modeling in both global and local aspects. Scene recognition is usually researched from a global point of view, while dense captioning is typically studied for local regions. Previous works separately research on the modeling of scene recognition and dense captioning. In contrast, we propose a joint learning framework that benefits from the mutual coupling of scene recognition and dense captioning models. Generally, these two tasks are coupled through two steps, 1) fusing the supervision by considering the contexts between scene labels and local captions, and 2) jointly optimizing semantically symmetric LSTM models. Particularly, in order to balance bias between dense captioning and scene recognition, a scene adaptive non-maximum suppression (NMS) method is proposed to emphasize the scene related regions in region proposal procedure, and a region-wise and category-wise weighted pooling method is proposed to avoid over attention on particular regions in local to global pooling procedure. For the model training and evaluation, scene labels are manually annotated for Visual Genome database. The experimental results on Visual Genome show the effectiveness of the proposed method. Moreover, the proposed method also can improve previous CNN based works on public scene databases, such as MIT67 and SUN397.
AB - Due to the abstraction of scenes, comprehensive scene understanding requires semantic modeling in both global and local aspects. Scene recognition is usually researched from a global point of view, while dense captioning is typically studied for local regions. Previous works separately research on the modeling of scene recognition and dense captioning. In contrast, we propose a joint learning framework that benefits from the mutual coupling of scene recognition and dense captioning models. Generally, these two tasks are coupled through two steps, 1) fusing the supervision by considering the contexts between scene labels and local captions, and 2) jointly optimizing semantically symmetric LSTM models. Particularly, in order to balance bias between dense captioning and scene recognition, a scene adaptive non-maximum suppression (NMS) method is proposed to emphasize the scene related regions in region proposal procedure, and a region-wise and category-wise weighted pooling method is proposed to avoid over attention on particular regions in local to global pooling procedure. For the model training and evaluation, scene labels are manually annotated for Visual Genome database. The experimental results on Visual Genome show the effectiveness of the proposed method. Moreover, the proposed method also can improve previous CNN based works on public scene databases, such as MIT67 and SUN397.
KW - Dense captioning
KW - Joint model
KW - Pooling
KW - Scene recognition
KW - Visual Genome
UR - https://www.scopus.com/pages/publications/85074863683
U2 - 10.1145/3343031.3350913
DO - 10.1145/3343031.3350913
M3 - 会议稿件
AN - SCOPUS:85074863683
T3 - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
SP - 793
EP - 801
BT - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 21 October 2019 through 25 October 2019
ER -