
Data augmentation for sentiment classification with semantic preservation and diversity

  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Data augmentation is a commonly used technique to avoid over-fitting in deep learning. However, the mechanism behind effective data augmentation methods is unclear. To address this issue, we explore and identify two critical factors, semantic preservation and diversity, for assessing the quality of data augmentation in natural language processing. Our study focuses on text sentiment classification and examines these two factors in two commonly used data augmentation methods: synonym replacement and random deletion. Based on these findings, we propose two new augmentation methods: TF-IDF word dropout and adaptive synonym replacement. Experimental results demonstrate that these two new data augmentation methods are effective. Moreover, with further experiments, we summarize three strategies for improving data augmentation methods in the sentiment classification task: employing online augmentation, introducing word importance into the word sampling process, and filtering augmented data based on the current model state. We hope that our study will inspire new perspectives on the underlying principles of data augmentation's effectiveness and contribute to a systematic study of data augmentation methods in the future.
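
The abstract names TF-IDF word dropout and the idea of introducing word importance into word sampling, but it does not give the algorithm. The following is only a minimal sketch of one plausible reading, in which a word is less likely to be dropped the higher its TF-IDF score, so that sentiment-bearing words tend to survive (semantic preservation) while low-importance words vary between samples (diversity). The function names, the smoothed IDF formula, and the linear scaling of the drop probability are all assumptions made for illustration and are not taken from the paper. Plain random deletion is included as the baseline for comparison.

```python
import math
import random
from collections import Counter


def idf_table(corpus):
    """Smoothed inverse document frequency for each token in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def random_deletion(tokens, p, rng):
    """Baseline: drop each token independently with probability p."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Keep at least one token so the example is never empty.
    return kept or [rng.choice(tokens)]


def tfidf_word_dropout(tokens, idf, p, rng):
    """Hypothetical TF-IDF-weighted dropout (an assumption, not the paper's method).

    Each token is dropped with probability p * (1 - score / max_score), so the
    highest-scoring token is never dropped and low-scoring tokens are dropped
    with probability approaching p.
    """
    if not tokens:
        return []
    tf = Counter(tokens)
    scores = [tf[t] / len(tokens) * idf.get(t, 1.0) for t in tokens]
    max_score = max(scores)
    kept = []
    for tok, score in zip(tokens, scores):
        drop_prob = p * (1.0 - score / max_score)
        if rng.random() >= drop_prob:
            kept.append(tok)
    return kept


# Toy corpus: "the" and "was" occur in every document, so their IDF is lowest.
corpus = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "dull"],
    ["the", "acting", "was", "fine"],
]
idf = idf_table(corpus)
rng = random.Random(0)
augmented = tfidf_word_dropout(corpus[0], idf, p=0.5, rng=rng)
baseline = random_deletion(corpus[0], p=0.5, rng=rng)
```

Calling `tfidf_word_dropout` afresh on each training step, rather than precomputing a fixed augmented set, would correspond to the online augmentation strategy mentioned in the abstract, again under the assumptions stated above.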

Original language: English
Article number: 111038
Journal: Knowledge-Based Systems
Volume: 280
State: Published - 25 Nov 2023
Externally published: Yes

Keywords

  • Data augmentation
  • Deep learning
  • Natural language processing
  • Sentiment classification
