UT Austin’s ‘Inheritune’ Supports Efficient Language Model Training: Leveraging Inheritance and Less Data for Comparable Performance

Scaling LLMs poses significant challenges because of the massive computational resources required and the need for high-quality datasets. Pre-training typically involves models with billions of parameters trained on datasets containing trillions of tokens. This demanding procedure requires substantial computing power and access to high-quality data to achieve better performance in language understanding and generation tasks.

Researchers at UT Austin have developed “Inheritune”, a method for deriving smaller base LMs from larger ones. They inherit a few transformer blocks from a larger LM and then train the smaller model on a small fraction (0.1%) of the original pretraining data. This approach produces an LM with 1.5 billion parameters using only 1 billion tokens, on a single GPU, in less than 12 hours. Despite using significantly less data, the resulting models perform comparably to publicly available LMs trained on larger datasets, demonstrating effectiveness in a variety of settings.

Previous approaches to training small base LMs include extensive training from scratch on trillions of tokens or the use of high-quality synthetic data. For example, TinyLlama-1B was trained from scratch on 3 trillion tokens for 90 days. In contrast, Inheritune trains small base LMs efficiently by inheriting transformer blocks from larger models and training on a small subset of data, achieving comparable performance with significantly fewer computing resources. Although model compression techniques have been successful for other kinds of neural networks, they have not yet proven as effective for large LMs, which learn more complex functions.
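To make the scale of the data savings concrete, the short calculation below uses only the figures quoted above (1 billion tokens, 0.1% of the original data, and 3 trillion tokens for TinyLlama-1B). The function names are illustrative, and the "implied" dataset size is simply what the two quoted numbers work out to, not a figure reported separately.

```python
# Figures quoted in the article; nothing here is measured independently.
INHERITUNE_TOKENS = 1_000_000_000        # 1 billion training tokens
TINYLLAMA_TOKENS = 3_000_000_000_000     # 3 trillion training tokens
SUBSET_PER_MILLE = 1                     # 0.1% written as 1 part in 1000


def implied_full_dataset_tokens(subset_tokens: int, per_mille: int) -> int:
    """Full dataset size if subset_tokens is per_mille/1000 of it."""
    return subset_tokens * 1000 // per_mille


def data_ratio(baseline_tokens: int, method_tokens: int) -> int:
    """How many times more tokens the baseline consumed."""
    return baseline_tokens // method_tokens


if __name__ == "__main__":
    full = implied_full_dataset_tokens(INHERITUNE_TOKENS, SUBSET_PER_MILLE)
    ratio = data_ratio(TINYLLAMA_TOKENS, INHERITUNE_TOKENS)
    print(f"Implied original pretraining data: {full:,} tokens")
    print(f"TinyLlama-1B used {ratio:,}x more tokens")
```

In other words, the quoted numbers imply an original corpus of roughly 1 trillion tokens and a 3,000-fold difference in training tokens relative to TinyLlama-1B.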

The Inheritune approach creates a small base LM using a small fraction of the pre-training data and a few layers from an existing large LM. First, the first n layers of the reference model are inherited and used to initialize the target model. Then the target model is trained on the available subset of training data for a specified number of epochs. In the experiments, the researchers use a 1-billion-token subset of the RedPajama v1 dataset to train an LM with 1.5 billion parameters, achieving competitive performance compared with both scratch-trained and derived LMs. The researchers evaluate the approach against several baseline models, chosen mainly with the quality of their pre-training data in mind so that the comparison is fair.
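The layer-inheritance step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it models a checkpoint as a plain dictionary, assumes a hypothetical parameter naming scheme of the form `blocks.<index>.<param>`, and assumes that non-block parameters (such as embeddings and the output head) are copied unchanged. The subsequent training on the data subset is omitted.

```python
import re
from typing import Any, Dict

# Hypothetical naming scheme: transformer block parameters are stored as
# "blocks.<index>.<param>"; everything else (embeddings, output head) is
# treated as a non-block parameter.
BLOCK_KEY = re.compile(r"^blocks\.(\d+)\.")


def inherit_first_n_blocks(reference_state: Dict[str, Any], n: int) -> Dict[str, Any]:
    """Return the parameters of blocks 0..n-1 plus all non-block parameters."""
    if n < 1:
        raise ValueError("n must be at least 1")
    inherited = {}
    for name, value in reference_state.items():
        match = BLOCK_KEY.match(name)
        # Keep non-block parameters, and block parameters with index < n.
        if match is None or int(match.group(1)) < n:
            inherited[name] = value
    return inherited


def make_toy_reference(num_blocks: int) -> Dict[str, Any]:
    """Build a toy 'checkpoint' with placeholder values instead of tensors."""
    state: Dict[str, Any] = {"embed.weight": [0.1], "head.weight": [0.2]}
    for i in range(num_blocks):
        state[f"blocks.{i}.attn.weight"] = [float(i)]
        state[f"blocks.{i}.mlp.weight"] = [float(i) + 0.5]
    return state


if __name__ == "__main__":
    reference = make_toy_reference(num_blocks=4)
    target_init = inherit_first_n_blocks(reference, n=2)
    for name in sorted(target_init):
        print(name)
```

With a real framework the same filtering would be applied to the reference model's checkpoint before loading it into a shallower target model; the target would then be trained on the data subset as described above.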

Inheritune allows the extraction of smaller target LMs without sacrificing performance, showing comparable zero-shot performance on relevant downstream tasks. Furthermore, these LMs outperform similarly sized models trained from scratch, and do so after fewer training steps. Experiments on GPT-2 medium models show that initialization with Inheritune, especially of the attention and MLP weights, yields faster convergence and a better final validation loss. Surprisingly, initializing either the attention or the MLP weights alone yields similar improvements in convergence speed and validation loss.
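The attention-only and MLP-only variants mentioned above amount to copying a subset of the reference weights into an otherwise freshly initialized target. The sketch below shows one way this selection could look. It again uses plain dictionaries and hypothetical parameter names containing `.attn.` or `.mlp.`; the function `partial_init` is an illustrative helper, not part of any library or of the authors' release.

```python
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]


def partial_init(target: Params, reference: Params, submodules: Iterable[str]) -> Params:
    """Copy `target`, overwriting parameters whose names contain one of the
    given submodule tags (e.g. "attn", "mlp") with the reference values."""
    tags = tuple(f".{s}." for s in submodules)
    result = {name: list(value) for name, value in target.items()}
    for name in result:
        if name in reference and any(tag in name for tag in tags):
            result[name] = list(reference[name])
    return result


# Toy parameters: the reference holds "trained" values, the target holds a
# fresh initialization (all zeros here, purely for readability).
REFERENCE: Params = {
    "blocks.0.attn.weight": [1.0, 1.0],
    "blocks.0.mlp.weight": [2.0, 2.0],
    "blocks.0.norm.weight": [3.0],
}
FRESH_TARGET: Params = {name: [0.0] * len(v) for name, v in REFERENCE.items()}

if __name__ == "__main__":
    attn_only = partial_init(FRESH_TARGET, REFERENCE, ["attn"])
    mlp_only = partial_init(FRESH_TARGET, REFERENCE, ["mlp"])
    both = partial_init(FRESH_TARGET, REFERENCE, ["attn", "mlp"])
    print("attention only:", attn_only)
    print("MLP only:", mlp_only)
    print("attention + MLP:", both)
```

Each variant leaves the remaining parameters at their fresh initialization, which is what makes it possible to compare how much each group of inherited weights contributes to convergence.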

Limitations of the Inheritune method include the inability to change the architectural design beyond the number of transformer blocks, which limits flexibility in adjusting hidden dimensions and attention heads. Sensitivity to the quality of the training dataset is another concern because of its small size. Furthermore, the selection of blocks to keep, dataset curation, and hyperparameter tuning remain open areas for improvement. Nevertheless, the study concludes that Inheritune effectively pretrains small base language models with minimal data and computational resources, providing a simple approach to deriving smaller models from large reference models.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to tackle real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-world solutions.