Bayesian constituent context model for grammar induction

  • Min Zhang
  • Xiangyu Duan*
  • Wenliang Chen

*Corresponding author for this work

Soochow University

Research output: Contribution to journal › Article › peer-review

Abstract

The Constituent Context Model (CCM) is an effective generative model for grammar induction, the task of inducing hierarchical syntactic structure from natural text. The CCM defines a simple multinomial distribution over constituents, which leads to a severe data sparsity problem because long constituents are unlikely to reappear in unseen data. This paper proposes a Bayesian method for constituent smoothing that places one of two kinds of prior distribution over constituents: a Dirichlet prior or a Pitman-Yor Process (PYP) prior. The Dirichlet prior functions as an additive smoothing method, and the PYP prior functions as a back-off smoothing method. Furthermore, a modified CCM is proposed that differentiates left constituents from right constituents in binary branching trees. Experiments show that both the proposed Bayesian smoothing method and the modified CCM are effective, and that combining them matches or significantly improves on the state-of-the-art performance of grammar induction evaluated on standard treebanks of various languages.
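
To illustrate why a Dirichlet prior behaves like additive smoothing, the sketch below computes the standard posterior predictive probability of a multinomial under a symmetric Dirichlet prior, (count + alpha) / (total + alpha * K). It is not the paper's exact formulation: the function name, the toy counts of part-of-speech sequences, and the values of `alpha` and `vocab_size` are all assumptions made for illustration.

```python
from collections import Counter


def dirichlet_smoothed_prob(counts, item, alpha, vocab_size):
    """Posterior predictive probability of `item` for a multinomial with a
    symmetric Dirichlet(alpha) prior over `vocab_size` possible outcomes."""
    total = sum(counts.values())
    # A Counter returns 0 for unseen items, so they receive mass alpha only.
    return (counts[item] + alpha) / (total + alpha * vocab_size)


# Hypothetical counts of constituent yields, written as POS-tag sequences.
counts = Counter({("DT", "NN"): 3, ("DT", "JJ", "NN"): 1})
vocab_size = 4  # assumed number of distinct possible yields
alpha = 0.5     # assumed concentration parameter

seen = dirichlet_smoothed_prob(counts, ("DT", "NN"), alpha, vocab_size)
unseen = dirichlet_smoothed_prob(counts, ("DT", "JJ", "JJ", "NN"), alpha, vocab_size)
print(seen, unseen)
```

The long, unseen sequence receives a small non-zero probability instead of zero, which is the effect additive smoothing is meant to have on sparse constituent counts.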

Original language: English
Pages (from-to): 531-541
Number of pages: 11
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 22
Issue number: 2
State: Published - Feb 2014
Externally published: Yes

Keywords

  • Bayesian
  • Constituent context model
  • Grammar induction
  • Smoothing
