
Data-Centric NLP

🍂 Fall 2022     ⏰ Mon / Wed 10:00 am - 11:50 am     📍 WPH 203

Instructor: Swabha Swayamdipta

Office Hours: Fridays 10 - 11 am; SAL 238

Teaching Assistant: Qinyuan Ye


Nov 7: Week 12

  • Feedback on your project progress reports has been shared; let me know ASAP if you did not receive yours.
  • Project Presentations (see updated instructions) start on Nov 16.
  • Intermediate scores will be shared by the end of this week.
  • There will be no office hours on:
    • 18th Nov (SoCal NLP)
    • 25th Nov (Thanksgiving)
    • 9th Nov onwards


While natural language processing (NLP) has been revolutionized by advances in deep learning, data remains central to the development and application of NLP. Contrary to the common belief that data is just a necessary evil to be dealt with, this class will present datasets in NLP as central to designing tasks and models, as well as to measuring progress in the field. Starting with dataset creation methodologies and principles, we will survey the ethics behind dataset collection. This will lead us to discuss certain harms related to dataset collection, such as spurious biases and representational or social biases, along with data-based techniques to address them. We will explore literature on how the quality of datasets can be estimated automatically, and how annotator disagreements and subjectivity are natural components of crowdsourcing. We will discuss questions on using data for accountability, where we will investigate how evaluation sets are built, as well as how model decisions can be contextualized in data. We will also discuss these concepts as they relate to unlabeled datasets. To wrap up, we will revisit data ethics towards better data citizenship.


There are four class formats:

  • lectures,
  • 3 paper presentations (25 mins + 10 mins discussion, each),
  • 2 paper presentations (30 mins + 10 mins discussion, each) + quiz, or
  • final project presentation.

Datasets in NLP

Aug 22
Introduction and Overview
Additional Readings
Aug 24
Data Collection and Data Ethics
Additional Readings
Aug 29
More on Data Collection
Aug 31
More on Data Ethics

Biases and Mitigation

Sep 5
No Class
Labor Day
Sep 7
Biases: An Overview
Additional Readings
Sep 12
Spurious Biases I
Sep 14
Spurious Biases II
Sep 19
Data-Centric Bias Mitigation
Sep 21
Data Augmentation
Readings Project Proposal

Estimates of Data Quality

Sep 26
Estimating Data Quality
Additional Readings
Sep 28
Aggregate vs. Point-wise Estimates
Oct 3
Anomalies, Outliers, OOD
Oct 5
Disagreements, Subjectivity and Ambiguity I
Oct 10
No Class
Indigenous Peoples Day
Oct 12
Disagreements, Subjectivity and Ambiguity II

Data for Accountability

Oct 17
Creating Evaluation Sets
Oct 19
Counterfactual Evaluation
Oct 24
Adversarial Evaluation
Oct 26
Contextualizing Decisions in Datasets
Oct 28
Project Progress Report

Beyond Labeled Datasets in NLP

Oct 31
Unlabeled Data
Nov 2
Prompts are data too?
Project Feedback Readings

Privacy, More on Ethics, Outro

Nov 7
Data Privacy and Security
Nov 9
Towards better data citizenship
Nov 14
Putting it all together

Outro and Project Presentations

Nov 16
Project Presentations
Nov 21
Project Presentations
Nov 23
No Class
Nov 28
Project Presentations
Nov 30
Project Presentations
Dec 5
No Class
Finals Week
Dec 7
No Class
Project Final Report

The calendar is subject to change. More details, e.g. reading materials and additional resources, will be added as the semester continues. All work is due on the specified date by 11:59 PM PT.


There will be three components to course grades; see more details.

Students are allowed a maximum of 4 late days in total across all assignments (quiz sheets excluded).

Note: Please read about the academic policies and a note about student well-being.


Students are required to have taken undergraduate and/or graduate level classes in either Machine Learning or Natural Language Processing, equivalent to CSCI 544 (Applied Natural Language Processing), CSCI 567 (Machine Learning), or CSCI 662 (Advanced Natural Language Processing). Please email the instructor about special circumstances or for specific clarifications.