
Data-Centric NLP

🍂 Fall 2022     ⏰ Mon / Wed 10:00 am - 11:50 am     📍 WPH 203

Instructor: Swabha Swayamdipta

Office hours TBD


Jul 21: Week 0

Looking for TAs

Applications are open! All PhD students and Masters students in the Honors program are eligible to apply. The standard offer is for 25% TAship (10 hours a week), but there might be some flexibility.

Masters Student Enrollment

Since this is a 600-level class, preference will be given primarily to PhD students. Once enrollment starts on 15th Aug, if slots remain open, I might be able to accommodate Masters students in good standing. You will still need my official recommendation for enrollment, so please reach out by 17th Aug. Before enrollment starts, there is not much I can do.


Natural language processing (NLP) has been revolutionized by advances in deep learning, which have enabled large-scale language models. However, state-of-the-art models are still supervised via large datasets, whether labeled or not. In this survey class, we will focus on the datasets behind language technologies. Starting with dataset creation methodologies and principles, we will survey widely used datasets in NLP, and some of their key characteristics, essential for learning. Later on, we will study some automatic methods for dataset creation. The course will describe how biases arising from creation methodologies might inadvertently lead to models that perform exceedingly well within some data distributions, but do not generalize to other distributions. We will also explore literature on how the quality of datasets can be estimated automatically, either as corpus aggregates, or based on individual instances. We will discuss questions on using data for accountability, where we will investigate how evaluation sets are built, as well as how model decisions can be contextualized in data. Finally, we will look at some extensions of the above concepts to unlabeled datasets, as well as datasets with modalities beyond language.


Students are highly encouraged to have taken undergraduate and/or graduate level classes in either Machine Learning or Natural Language Processing, equivalent to CSCI 567 (Machine Learning) or CSCI 544 (Applied Natural Language Processing), respectively. Please email the instructor for special circumstances or specific clarifications.


The calendar is subject to change. More details, e.g. reading materials and additional resources, will be added as the semester continues. All work is due on the specified date by 11:59 PM PT.

Datasets in NLP

Aug 22
Introduction and Historical Perspective
Aug 24
Data Provenance
Aug 29
Collection and Crowdsourcing
Aug 31
Lessons from Data Ethics

Biases and Mitigation

Sep 5
No Class
Labor Day
Sep 7
Biases: Good and Bad
Sep 12
Spurious Biases and Annotation Artifacts
Sep 14
Social Biases

Estimating Data Quality

Sep 19
Estimating Data and Label Quality
Sep 21
Corpus-wide Aggregate Estimates
Project Proposal
Sep 26
Instance-Specific Point-wise Estimates
Sep 28
Disagreements, Subjectivity and Ambiguity

Addressing Spurious Biases

Oct 3
Mitigating Known and Predefined Biases
Oct 5
Discovering Unknown Biases
Oct 10
Automatic Data Augmentation
Oct 12
Adversarial Augmentation

Data Accountability

Oct 17
Contextualizing Decisions in Datasets
Oct 19
Data Sheets
Oct 24
Creating Evaluation Sets
Oct 26
More on Evaluation Sets
Project Progress Report

Beyond Labeled Datasets in NLP

Oct 31
Unlabeled Data
Nov 2
Exploring Multiple Modalities
Nov 7
Is anything truly zero-shot?
Nov 9
Putting it all together

Project Presentations and Final Report

Nov 14
Project Presentations
Nov 16
Project Presentations
Nov 21
Project Presentations
Nov 23
No Class
Nov 28
Project Presentations
Nov 30
Project Presentations
Dec 5
No Class
Finals Week
Dec 7
No Class
Project Final Report


There will be three components to course grades. These are subject to change before the beginning of the semester.

  • Paper Discussion (20%).
    • Lead at least one discussion.
    • Bonus (10%): Lead a second discussion.
  • Class Participation (15%).
    • Participate in small and large group discussions.
    • Turn in quiz sheets.
    • Engage in Q/A during the lead paper discussion.
  • Class Project in groups of 1-2 students (65%).
    • Turn in a project proposal: 5%
    • Turn in a progress report: 10%
    • Present your main findings: 20%
    • Turn in a final report: 30%

Note: Please read about the academic policies and a note about student well-being here.