TY - GEN
T1 - A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation
AU - Li, Yunxin
AU - Hu, Baotian
AU - Luo, Wenhan
AU - Ma, Lin
AU - Ding, Yuxin
AU - Zhang, Min
N1 - Publisher Copyright:
© 2024 ELRA Language Resource Association: CC BY-NC 4.0.
PY - 2024
Y1 - 2024
N2 - In this paper, we propose a new setting for generating product descriptions from images, augmented by marketing keywords. It leverages the combined power of visual and textual information to create descriptions that are more tailored to the unique features of products. For this setting, previous methods utilize visual and textual encoders to encode the image and keywords and employ a language model-based decoder to generate the product description. However, the generated description is often inaccurate and generic since same-category products have similar copy-writings, and optimizing the overall framework on large-scale samples makes models concentrate on common words yet ignore the product features. To alleviate the issue, we present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces the similar product sample as the reference and utilizes the in-context learning capability of language models to produce the description. During training, we keep the visual encoder and language model frozen, focusing on optimizing the modules responsible for creating multimodal in-context references and dynamic prompts. This approach preserves the language generation prowess of large language models (LLMs), facilitating a substantial increase in description diversity. To assess the effectiveness of ModICT across various language model scales and types, we collect data from three distinct product categories within the E-commerce domain. Extensive experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods. Our findings underscore the potential of ModICT as a valuable tool for enhancing the automatic generation of product descriptions in a wide range of applications. Data and code are at https://github.com/HITsz-TMG/Multimodal-In-Context-Tuning.
AB - In this paper, we propose a new setting for generating product descriptions from images, augmented by marketing keywords. It leverages the combined power of visual and textual information to create descriptions that are more tailored to the unique features of products. For this setting, previous methods utilize visual and textual encoders to encode the image and keywords and employ a language model-based decoder to generate the product description. However, the generated description is often inaccurate and generic since same-category products have similar copy-writings, and optimizing the overall framework on large-scale samples makes models concentrate on common words yet ignore the product features. To alleviate the issue, we present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces the similar product sample as the reference and utilizes the in-context learning capability of language models to produce the description. During training, we keep the visual encoder and language model frozen, focusing on optimizing the modules responsible for creating multimodal in-context references and dynamic prompts. This approach preserves the language generation prowess of large language models (LLMs), facilitating a substantial increase in description diversity. To assess the effectiveness of ModICT across various language model scales and types, we collect data from three distinct product categories within the E-commerce domain. Extensive experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods. Our findings underscore the potential of ModICT as a valuable tool for enhancing the automatic generation of product descriptions in a wide range of applications. Data and code are at https://github.com/HITsz-TMG/Multimodal-In-Context-Tuning.
KW - Multimodal Generation
KW - Multimodal In-Context Tuning
KW - Product Description Generation
UR - https://www.scopus.com/pages/publications/85195993252
M3 - 会议稿件
AN - SCOPUS:85195993252
T3 - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
SP - 850
EP - 861
BT - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
A2 - Calzolari, Nicoletta
A2 - Kan, Min-Yen
A2 - Hoste, Veronique
A2 - Lenci, Alessandro
A2 - Sakti, Sakriani
A2 - Xue, Nianwen
PB - European Language Resources Association (ELRA)
T2 - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
Y2 - 20 May 2024 through 25 May 2024
ER -