From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery

  • Yuhan Chen
  • Nuwa Xi
  • Yanrui Du
  • Haochun Wang
  • Jianyu Chen
  • Sendong Zhao*
  • Bing Qin
  • *Corresponding author for this work
  • Harbin Institute of Technology

Research output: Contribution to journal, Conference article, peer-review

Abstract

Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments of in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter the issue of data scarcity, hampering their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model scale, reduced data size and lower training cost, highlighting its efficiency. Furthermore, our method shows a sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery.
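The abstract names a retrieval-based prompting strategy but gives no implementation details here. The following is a minimal sketch of the general idea only, not the authors' method: for a query molecule, retrieve the most similar annotated examples and place them in a few-shot prompt for an LLM. Everything in it is an assumption made for illustration, including the character n-gram Jaccard similarity (a crude stand-in for a real molecular fingerprint), the prompt wording, the direction of generation (molecule to description), and the hypothetical names `retrieve`, `build_prompt`, and `POOL`.

```python
from typing import List, Set, Tuple

# Hypothetical toy pool of (SMILES, description) pairs, for illustration only.
POOL: List[Tuple[str, str]] = [
    ("CCO", "Ethanol is a primary alcohol."),
    ("CC(=O)O", "Acetic acid is a simple carboxylic acid."),
    ("c1ccccc1", "Benzene is an aromatic hydrocarbon."),
]


def ngrams(smiles: str, n: int = 2) -> Set[str]:
    """Character n-grams of a SMILES string (assumed similarity feature)."""
    if len(smiles) < n:
        return {smiles}
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, pool: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k pool entries whose SMILES are most similar to the query."""
    q = ngrams(query)
    ranked = sorted(pool, key=lambda pair: jaccard(q, ngrams(pair[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, pool: List[Tuple[str, str]], k: int = 2) -> str:
    """Assemble a few-shot prompt from retrieved examples plus the query."""
    lines = ["Describe the molecule given its SMILES string.", ""]
    for smiles, description in retrieve(query, pool, k):
        lines += [f"SMILES: {smiles}", f"Description: {description}", ""]
    lines += [f"SMILES: {query}", "Description:"]
    return "\n".join(lines)
```

Calling `build_prompt("CCCO", POOL, k=1)` yields a prompt whose single in-context example is the ethanol entry, followed by the unanswered query. In this sketch, the text an LLM returns for such a prompt would be stored alongside the query molecule as one pseudo data pair.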

Original language: English
Pages (from-to): 21958-21966
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 38
Issue number: 20
State: Published - 25 Mar 2024
Event: 38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada
Duration: 20 Feb 2024 to 27 Feb 2024
