Abstract
Large language models (LLMs) have achieved success across a wide range of natural language processing (NLP) tasks. Pretraining is a foundational step in the LLM training process, in which the model gains a general understanding of language through exposure to vast amounts of text data. However, pretraining LLMs is costly and has a significant impact on energy consumption and the environment. For instance, training GPT-3 generated approximately 552 net tCO2e of emissions. To alleviate this issue, we propose a simple and cost-efficient information fusion method that merges LLM checkpoints sharing a training trajectory during the pretraining phase. Previous model merging methods mostly either maximize the posterior approximation of the model on the target dataset or average the model parameters. The former often performs poorly in out-of-distribution settings and overlooks the fact that the target dataset is typically unlabeled, while the latter may get trapped in local minima. In this paper, we propose a method that uses generation quality as an indicator to determine merging weights. By calculating the perplexity of the LLM on the data, we can assess how well each checkpoint has learned the target dataset and determine the merging weights accordingly. Our method avoids overfitting the posterior distribution of the target dataset and relaxes the requirement for labeled information. Extensive experimental results demonstrate that our method consistently achieves more stable and superior overall performance in both in-distribution and out-of-distribution settings.
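The idea of turning perplexity into merging weights can be illustrated with a minimal sketch. The abstract does not state how perplexity is mapped to weights, so the inverse-perplexity normalization below is an assumption made purely for illustration, not the paper's formula. The function names (`perplexity`, `merging_weights`, `merge_checkpoints`) and the toy representation of a checkpoint as a dictionary of flat parameter lists are likewise hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of a sequence: exp of the mean negative log-likelihood.

    token_log_probs holds the natural-log probability the model assigned
    to each observed token. No labels are needed, only the raw text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def merging_weights(perplexities):
    """Map per-checkpoint perplexities to weights that sum to 1.

    ASSUMPTION: inverse-perplexity normalization, so a checkpoint with
    lower perplexity (better generation quality) gets a larger weight.
    The paper may use a different mapping.
    """
    inverse = [1.0 / p for p in perplexities]
    total = sum(inverse)
    return [v / total for v in inverse]


def merge_checkpoints(checkpoints, weights):
    """Weighted average of parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats (a toy stand-in for a real tensor state dict). All checkpoints
    must share the same names and shapes.
    """
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(reference))
        ]
    return merged


# Toy usage: two checkpoints from the same training run.
ckpt_a = {"w": [0.0, 3.0]}
ckpt_b = {"w": [3.0, 0.0]}
ppl = [2.0, 4.0]  # hypothetical perplexities on unlabeled target text
weights = merging_weights(ppl)
merged = merge_checkpoints([ckpt_a, ckpt_b], weights)
```

Because perplexity is computed from the text alone, a scheme of this kind does not depend on labels for the target dataset, which is the property the abstract highlights.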
| Original language | English |
|---|---|
| Article number | 103415 |
| Journal | Information Fusion |
| Volume | 125 |
| DOIs | |
| State | Published - Jan 2026 |
| Externally published | Yes |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 7 Affordable and Clean Energy
Keywords
- Checkpoint merging
- Large language model
- Perplexity
Fingerprint
Dive into the research topics of 'Towards enhanced LLM pretraining: Dynamic checkpoint merging via generation quality'. Together they form a unique fingerprint.