Don't Stop Pretraining

Get more out of pretrained LMs by continuing to pretrain them on domain and task data

An illustration of data distributions. Task data consists of an observable task distribution, usually non-randomly sampled from a wider distribution (light grey ellipse) within an even larger target domain. The target domain is not necessarily one of the domains included in the original LM pretraining domain, though overlap is possible. We explore the benefits of continued pretraining on data from the task distribution and the domain distribution.

Cite our paper:

  title={Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
  author={Suchin Gururangan and Ana Marasovi{\'c} and Swabha Swayamdipta and Kyle Lo and Iz Beltagy and Doug Downey and Noah A. Smith},