Natural language processing (NLP) has been revolutionized by advances in deep learning, which have enabled large-scale language models. However, state-of-the-art models still rely on large datasets for supervision, whether labeled or not. In this survey class, we will focus on the datasets behind language technologies. Starting with dataset creation methodologies, we will survey widely used NLP datasets and the key characteristics that make them useful for learning. The course will describe how biases arising from creation methodologies can inadvertently lead to models that perform exceedingly well within some data distributions but do not generalize to others, and how these biases can be addressed. We will also explore literature on how dataset quality can be estimated automatically, how datasets can be contextualized via data sheets, and how they fit within the broader literature on data ethics. Finally, we will look at extensions of these concepts to unlabeled datasets and to datasets with modalities beyond language.
Staff and Logistics
Instructor: Swabha Swayamdipta
Teaching Assistant: TBD
You must have taken an undergraduate or graduate-level class in either Natural Language Processing or Machine Learning, equivalent to CSCI 544 (Applied Natural Language Processing) or CSCI 567 (Machine Learning), respectively. Please email the instructor about special circumstances or for specific clarifications.