TY - GEN
T1 - Short Text Clustering Enhanced by Semantic Matching Model
AU - Peng, Zijun
AU - Xin, Guodong
AU - Wei, Yuliang
AU - Wang, Wei
AU - Wang, Bailing
AU - Wang, Lianhai
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
AB - With the popularity of social networks, short text clustering has become an increasingly important and widely used task. It is challenging because short texts from social networks are characterized by irregular words, heavy noise, and sparse features. We propose Short Text Clustering enhanced by Semantic Matching Model (STCSMM). STCSMM transfers knowledge from a labeled text similarity dataset to short text clustering through a semantic matching model, thereby improving clustering performance. First, we train a semantic matching network on the text similarity dataset; the network contains a feature extraction layer and a vector distance calculation layer. Then, we use the learned feature extraction layer to extract short text features, and use the vector distance calculation layer in place of the distance metrics commonly used in traditional K-means, such as cosine distance and Euclidean distance. Finally, K-means is run on the extracted text features with the vector distance calculation layer as its distance measure. On a microblog text clustering dataset, this improved K-means clustering (STCSMM) outperforms some existing methods, such as K-means clustering with LDA, LSI, or average word embedding feature vectors.
KW - K-means
KW - STCSMM
KW - Semantic matching model
KW - Short text clustering
UR - https://www.scopus.com/pages/publications/85084916924
U2 - 10.1109/ICISCAE48440.2019.221680
DO - 10.1109/ICISCAE48440.2019.221680
M3 - Conference contribution
AN - SCOPUS:85084916924
T3 - 2019 2nd International Conference on Information Systems and Computer Aided Education, ICISCAE 2019
SP - 480
EP - 484
BT - 2019 2nd International Conference on Information Systems and Computer Aided Education, ICISCAE 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd IEEE International Conference on Information Systems and Computer Aided Education, ICISCAE 2019
Y2 - 28 September 2019 through 30 September 2019
ER -