
Identifying plant pentatricopeptide repeat coding gene/protein using mixed feature extraction methods

Kaiyang Qu, Leyi Wei, Jiantao Yu, Chunyu Wang*

*Corresponding author for this work

  • Tianjin University
  • Northwest Agriculture and Forestry University
  • School of Computer Science and Technology, Harbin Institute of Technology
  • University of Missouri

Research output: Contribution to journal, Article, peer-review

Abstract

Motivation: Pentatricopeptide repeat (PPR) is a triangular pentapeptide repeat domain that plays a vital role in plant growth. In this study, we seek to identify PPR coding genes and proteins using a mixture of feature extraction methods. We use four single feature extraction methods focusing on the sequence, physical, and chemical properties as well as the amino acid composition, and mix the features. The Max-Relevant-Max-Distance (MRMD) technique is applied to reduce the feature dimension. Classification uses the random forest, J48, and naïve Bayes with 10-fold cross-validation.

Results: Combining two of the feature extraction methods with the random forest classifier produces the highest area under the curve of 0.9848. Using MRMD to reduce the dimension improves this metric for J48 and naïve Bayes, but has little effect on the random forest results.
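As a rough illustration of two ideas the abstract mentions, the sketch below computes an amino acid composition feature vector and an area under the ROC curve. It is not the authors' implementation: the abstract does not name the four feature extraction methods, so the function names `amino_acid_composition` and `auc`, the 20-letter residue alphabet, and the pairwise AUC formulation are all assumptions made for this example.

```python
from collections import Counter

# Assumption: the 20 standard amino acid one-letter codes, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-dimensional relative-frequency vector of a protein sequence.

    Characters outside the 20 standard residues are ignored (illustrative choice).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In a pipeline like the one described, vectors of this kind from several extraction methods would be concatenated, optionally reduced in dimension, and passed to a classifier whose scores are then evaluated with a metric such as `auc` on each cross-validation fold.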

Original language: English
Article number: 1961
Journal: Frontiers in Plant Science
Volume: 9
State: Published - 10 Jan 2019
Externally published: Yes

Keywords

  • J48
  • Maximum relevant maximum distance
  • Mixed feature extraction methods
  • Naïve Bayes
  • Pentatricopeptide repeat
  • Random forest
