don't stop pretraining
Get more out of pretrained LMs by continuing to pretrain

An illustration of data distributions. Task data comprises an observable task distribution, usually non-randomly sampled from a wider distribution (light grey ellipse) within an even larger target domain, which is not necessarily one of the domains covered by the original LM pretraining corpus, though overlap is possible. We explore the benefits of continued pretraining on data from the task distribution and the domain distribution.
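The core recipe (domain- or task-adaptive pretraining, DAPT/TAPT) is continued masked-LM training on unlabeled domain or task text before task fine-tuning. As a rough sketch of that step only, the snippet below continues pretraining RoBERTa with the Hugging Face Transformers API rather than this repo's own training scripts; the corpus file name and hyperparameters are placeholders, not values from the paper.

```python
# Sketch of continued (domain-adaptive) pretraining: resume masked-LM training
# of RoBERTa on an unlabeled in-domain corpus, then fine-tune the saved
# checkpoint on the end task. Hyperparameters and file names are illustrative.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# One document (or sentence) per line of unlabeled domain/task-distribution text.
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard 15% token masking for the masked-LM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="roberta-dapt",
    num_train_epochs=1,              # adjust passes/steps to your corpus size
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    save_steps=10_000,
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=corpus).train()
model.save_pretrained("roberta-dapt")  # fine-tune this checkpoint on the task
```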
Cite our paper:
```
@inproceedings{Gururangan2020DontSP,
  title={Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
  author={Suchin Gururangan and Ana Marasovi{\'c} and Swabha Swayamdipta and Kyle Lo and Iz Beltagy and Doug Downey and Noah A. Smith},
  booktitle={ACL},
  year={2020}
}
```