Data-Centric NLP

Author: Swabha Swayamdipta

🍂 Fall 2022 ⏰ Mon / Wed 10:00 am - 11:50 am 📍 WPH 203

Instructor: Swabha Swayamdipta

swabhas@usc.edu

Office Hours: Fridays 10 - 11 am; SAL 238

Teaching Assistant: Qinyuan Ye

qinyuany@usc.edu

Announcements

Nov 7: Week 12

Feedback on your project progress reports has been shared; let me know ASAP if you did not receive yours.
Project Presentations (see updated instructions) start on Nov 16.
Intermediate scores will be shared by the end of this week.
There will be no office hours on:
- 18th Nov (SoCal NLP)
- 25th Nov (Thanksgiving)
- 9th Nov onwards

Summary

While natural language processing (NLP) has been revolutionized by advances in deep learning, data remains central to the development and application of NLP. Contrary to the common belief that data is just a necessary evil to be dealt with, this class will present datasets as central to designing tasks and models, as well as to measuring progress in the field. Starting with dataset creation methodologies and principles, we will survey the ethics behind dataset collection. This will lead us to discuss certain harms related to dataset collection, such as spurious biases and representational or social biases, along with data-based techniques to address them. We will explore literature on how the quality of datasets can be estimated automatically, and how annotator disagreements and subjectivity are natural components of crowdsourcing. We will discuss questions on using data for accountability, where we will investigate how evaluation sets are built, as well as how model decisions can be contextualized in data. We will also discuss these concepts as they relate to unlabeled datasets. To wrap up, we will revisit data ethics towards better data citizenship.

Calendar

There are four class formats:

lectures,
3 paper presentations (25 mins + 10 mins discussion, each),
2 paper presentations (30 mins + 10 mins discussion, each) + quiz, or
final project presentation.

Datasets in NLP

Aug 22

Introduction and Overview: Lecture; Additional Readings

Aug 24

Data Collection and Data Ethics: Lecture; Additional Readings

Aug 29

More on Data Collection: Readings

Aug 31

More on Data Ethics: Readings

Biases and Mitigation

~~Sep 5~~

No Class: Labor Day

Sep 7

Biases: An Overview: Lecture; Additional Readings

Sep 12

Spurious Biases I: Readings

Sep 14

Spurious Biases II: Readings

Sep 19

Data-Centric Bias Mitigation: Readings

Sep 21

Data Augmentation: Readings; Project Proposal

Estimates of Data Quality

Sep 26

Estimating Data Quality: Lecture; Additional Readings

Sep 28

Aggregate vs. Point-wise Estimates: Readings

Oct 3

Anomalies, Outliers, OOD: Readings

Oct 5

Disagreements, Subjectivity and Ambiguity I: Readings

~~Oct 10~~

No Class: Indigenous Peoples Day

Oct 12

Disagreements, Subjectivity and Ambiguity II: Readings

Data for Accountability

Oct 17

Creating Evaluation Sets: Readings

Oct 19

Counterfactual Evaluation: Readings

Oct 24

Adversarial Evaluation: Readings

Oct 26

Contextualizing Decisions in Datasets: Readings

Oct 28

Project Progress Report

Beyond Labeled Datasets in NLP

Oct 31

Unlabeled Data: Readings

Nov 2

Prompts are data too?: Readings; Project Feedback

Privacy, More on Ethics, Outro

Nov 7

Data Privacy and Security: Readings

Nov 9

Towards better data citizenship: Readings

Nov 14

Putting it all together: Lecture

Outro and Project Presentations

Nov 16

Project Presentations

Nov 21

Project Presentations

~~Nov 23~~

No Class: Thanksgiving

Nov 28

Project Presentations

Nov 30

Project Presentations

~~Dec 5~~

No Class: Finals Week

~~Dec 7~~

No Class: Project Final Report

The calendar is subject to change. More details, e.g., reading materials and additional resources, will be added as the semester continues. All work is due on the specified date by 11:59 PM PT.

Assignments

Course grades have three components; see more details.

Paper Discussion (20%).
Class Participation (20%).
Class Project (60%).

Students are allowed a maximum of 4 late days total for all assignments (not quiz sheets).

Note: Please read about the academic policies and a note about student well-being.

Pre-Requisites

Students are required to have taken undergraduate- and/or graduate-level classes in either Machine Learning or Natural Language Processing, equivalent to CSCI 544 (Applied Natural Language Processing), CSCI 567 (Machine Learning), or CSCI 662 (Advanced Natural Language Processing). Please email the instructor about special circumstances or for specific clarifications.