TY - GEN
T1 - PromptST
T2 - 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
AU - Yu, Tengfei
AU - Ding, Liang
AU - Liu, Xuebo
AU - Chen, Kehai
AU - Zhang, Meishan
AU - Tao, Dacheng
AU - Zhang, Min
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - An end-to-end speech-to-text (S2T) translation model is usually initialized from a pretrained speech recognition encoder and a pretrained text-to-text (T2T) translation decoder. Although this straightforward setting has been shown empirically successful, there do not exist clear answers to the research questions: 1) how are speech and text modalities fused in S2T model and 2) how to better fuse the two modalities? In this paper, we take the first step toward understanding the fusion of speech and text features in S2T model. We first design and release a 10GB linguistic probing benchmark, namely Speech-Senteval, to investigate the acoustic and linguistic behaviors of S2T models. Preliminary analysis reveals that the uppermost encoder layers of the S2T model can not learn linguistic knowledge efficiently, which is crucial for accurate translation. Based on the finding, we further propose a simple plug-in prompt-learning strategy on the uppermost encoder layers to broaden the abstract representation power of the encoder of S2T models. We call such a prompt-enhanced S2T model PromptST. Experimental results on four widely-used S2T datasets show that PromptST can deliver significant improvements over a strong baseline by capturing richer linguistic knowledge. Benchmarks, code, and scripts are freely available at https://github.com/ytf-philp/PromptST.
AB - An end-to-end speech-to-text (S2T) translation model is usually initialized from a pretrained speech recognition encoder and a pretrained text-to-text (T2T) translation decoder. Although this straightforward setting has been shown empirically successful, there do not exist clear answers to the research questions: 1) how are speech and text modalities fused in S2T model and 2) how to better fuse the two modalities? In this paper, we take the first step toward understanding the fusion of speech and text features in S2T model. We first design and release a 10GB linguistic probing benchmark, namely Speech-Senteval, to investigate the acoustic and linguistic behaviors of S2T models. Preliminary analysis reveals that the uppermost encoder layers of the S2T model can not learn linguistic knowledge efficiently, which is crucial for accurate translation. Based on the finding, we further propose a simple plug-in prompt-learning strategy on the uppermost encoder layers to broaden the abstract representation power of the encoder of S2T models. We call such a prompt-enhanced S2T model PromptST. Experimental results on four widely-used S2T datasets show that PromptST can deliver significant improvements over a strong baseline by capturing richer linguistic knowledge. Benchmarks, code, and scripts are freely available at https://github.com/ytf-philp/PromptST.
UR - https://www.scopus.com/pages/publications/85184828045
U2 - 10.18653/v1/2023.emnlp-main.627
DO - 10.18653/v1/2023.emnlp-main.627
M3 - 会议稿件
AN - SCOPUS:85184828045
T3 - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
SP - 10140
EP - 10154
BT - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
A2 - Bouamor, Houda
A2 - Pino, Juan
A2 - Bali, Kalika
PB - Association for Computational Linguistics (ACL)
Y2 - 6 December 2023 through 10 December 2023
ER -