
CRISPRCasStack: A stacking strategy-based ensemble learning framework for accurate identification of Cas proteins

Tianjiao Zhang, Yuran Jia, Hongfei Li, Dali Xu, Jie Zhou, Guohua Wang*
*Corresponding author for this work
Northeast Forestry University

Research output: Contribution to journal › Article › peer-review

Abstract

The CRISPR-Cas system is an adaptive immune system, widely found in most bacteria and archaea, that defends against invasion by exogenous genes. Identifying the Cas proteins in CRISPR-Cas systems is one of the most critical steps in exploring and classifying novel CRISPR-Cas systems and their functional diversity. The discovery of novel Cas proteins has also laid the foundation for technologies such as CRISPR-Cas-based gene editing and gene therapy. Currently, accurate and efficient screening of Cas proteins from metagenomic and proteomic sequences remains a challenge. For Cas proteins with low sequence conservation, existing homology-based tools for Cas protein identification cannot guarantee accuracy or efficiency. In this paper, we developed a novel stacking-based ensemble learning framework for Cas protein identification, called CRISPRCasStack. In particular, we applied the SHAP (SHapley Additive exPlanations) method to analyze the features used in CRISPRCasStack. Extensive experimental validation and independent testing demonstrate that CRISPRCasStack addresses the accuracy and efficiency shortcomings of existing state-of-the-art tools. We also provide a toolkit to accurately identify and analyze potential Cas proteins, Cas operons, CRISPR arrays and CRISPR-Cas loci in prokaryotic sequences. The CRISPRCasStack toolkit is available at https://github.com/yrjia1015/CRISPRCasStack.
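For readers unfamiliar with the stacking strategy named in the abstract, the following is a minimal, generic sketch of the idea and not the CRISPRCasStack implementation. Base learners are trained on held-in folds, their out-of-fold scores become the input features of a meta-learner, and the base learners are then refit on all data. Everything concrete here is an assumption made for illustration: the amino-acid composition features, the hypothetical `CentroidScorer` base learner, the logistic-regression meta-learner, and the synthetic sequences, which are random strings rather than real Cas proteins.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    return [seq.count(a) / max(len(seq), 1) for a in AMINO_ACIDS]

class CentroidScorer:
    """Toy base learner (hypothetical) that only sees the feature columns in `cols`."""
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y):
        def centroid(label):
            rows = [x for x, t in zip(X, y) if t == label]
            return [sum(r[c] for r in rows) / len(rows) for c in self.cols]
        self.pos, self.neg = centroid(1), centroid(0)
        return self

    def score(self, x):
        v = [x[c] for c in self.cols]
        # Positive when x lies closer to the positive-class centroid.
        return math.dist(v, self.neg) - math.dist(v, self.pos)

def make_bases():
    # Two base learners, each looking at a different half of the features.
    return [CentroidScorer(range(0, 10)), CentroidScorer(range(10, 20))]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, s))))

def fit_logistic(Z, y, lr=0.5, epochs=300):
    """Meta-learner: logistic regression trained by plain SGD; w[0] is the bias."""
    w = [0.0] * (len(Z[0]) + 1)
    for _ in range(epochs):
        for z, t in zip(Z, y):
            err = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], z))) - t
            w[0] -= lr * err
            for j, zj in enumerate(z, start=1):
                w[j] -= lr * err * zj
    return w

def stack_fit(X, y, k=5):
    """Build out-of-fold base scores, fit the meta-learner, refit bases on all data."""
    n = len(X)
    Z = [None] * n
    for f in range(k):
        held = set(range(f, n, k))
        train = [i for i in range(n) if i not in held]
        bases = [b.fit([X[i] for i in train], [y[i] for i in train])
                 for b in make_bases()]
        for i in held:
            Z[i] = [b.score(X[i]) for b in bases]
    meta = fit_logistic(Z, y)
    bases = [b.fit(X, y) for b in make_bases()]
    return bases, meta

def predict(bases, meta, seq):
    """Probability that `seq` belongs to the positive class."""
    z = [b.score(composition(seq)) for b in bases]
    return sigmoid(meta[0] + sum(a * b for a, b in zip(meta[1:], z)))

# Synthetic toy data (NOT real proteins): positives are enriched in K/R,
# negatives in D/E, purely so the two classes are separable.
rng = random.Random(0)

def toy_seq(enriched, length=120):
    weights = [5 if a in enriched else 1 for a in AMINO_ACIDS]
    return "".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))

seqs = [toy_seq("KR") if i % 2 == 0 else toy_seq("DE") for i in range(40)]
y = [1 if i % 2 == 0 else 0 for i in range(40)]
X = [composition(s) for s in seqs]
bases, meta = stack_fit(X, y)
```

Calling `predict(bases, meta, some_sequence)` returns a probability between 0 and 1. Training the meta-learner on out-of-fold scores, rather than on scores the base learners produced for their own training data, is what keeps it from simply trusting overfit base models.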

Original language: English
Article number: bbac335
Journal: Briefings in Bioinformatics
Volume: 23
Issue number: 5
State: Published - 1 Sep 2022
Externally published: Yes

Keywords

  • CRISPR-Cas system
  • Cas proteins identification
  • machine learning
  • stacking strategy
