Data-Centric NLP

🍂 Fall 2022     ⏰ Mon / Wed 10:00 am - 11:50 am     📍 WPH 203

Instructor: Swabha Swayamdipta

swabhas@usc.edu

Office hours TBD

Announcements

Jul 21: Week 0

Looking for TAs

Applications are open! All PhD students and Masters students in the Honors program are eligible to apply. The standard offer is a 25% TAship (10 hours a week), but there might be some flexibility.

Masters Student Enrollment

Since this is a 600-level class, preference will be given primarily to PhD students. Once enrollment starts on 15th Aug, if we have empty slots, I might be able to accommodate Masters students in good standing. You will still need my official recommendation for enrollment, so please reach out by 17th Aug. Until enrollment starts, however, there is not much I can do.

Summary

Natural language processing (NLP) has been revolutionized by advances in deep learning, which have enabled large-scale language models. However, state-of-the-art models are still supervised via large datasets, whether labeled or not. In this survey class, we will focus on the datasets behind language technologies. Starting with dataset creation methodologies and principles, we will survey widely used datasets in NLP and some of their key characteristics that are essential for learning. Later on, we will study some automatic methods for dataset creation. The course will describe how biases arising from creation methodologies might inadvertently lead to models that perform exceedingly well within some data distributions but do not generalize to other distributions. We will also explore literature on how the quality of datasets can be estimated automatically, either as corpus aggregates or based on individual instances. We will discuss questions on using data for accountability, where we will investigate how evaluation sets are built, as well as how model decisions can be contextualized in data. Finally, we will look at some extensions of the above concepts to unlabeled datasets, as well as datasets with modalities beyond language.

Pre-Requisites

Students are highly encouraged to have taken undergraduate and/or graduate level classes in either Machine Learning or Natural Language Processing, equivalent to CSCI 567 (Machine Learning) or CSCI 544 (Applied Natural Language Processing), respectively. Please email the instructor for special circumstances or specific clarifications.

Calendar

The calendar is subject to change. More details, e.g., reading materials and additional resources, will be added as the semester continues. All work is due on the specified date by 11:59 PM PT.

Datasets in NLP

Aug 22
Introduction and Historical Perspective
Lecture
Aug 24
Data Provenance
Readings
Aug 29
Collection and Crowdsourcing
Readings
Aug 31
Lessons from Data Ethics
Lecture
Readings

Biases and Mitigation

Sep 5
No Class
Labor Day
Sep 7
Biases: Good and Bad
Lecture
Readings
Sep 12
Spurious Biases and Annotation Artifacts
Readings
Sep 14
Social Biases
Readings

Estimating Data Quality

Sep 19
Estimating Data and Label Quality
Lecture
Readings
Sep 21
Corpus-wide Aggregate Estimates
Readings
Project Proposal
Sep 26
Instance-Specific Point-wise Estimates
Readings
Sep 28
Disagreements, Subjectivity and Ambiguity
Readings

Addressing Spurious Biases

Oct 3
Mitigating Known and Predefined Biases
Lecture
Readings
Oct 5
Discovering Unknown Biases
Readings
Oct 10
Automatic Data Augmentation
Readings
Oct 12
Adversarial Augmentation
Readings

Data Accountability

Oct 17
Contextualizing Decisions in Datasets
Lecture
Readings
Oct 19
Data Sheets
Readings
Oct 24
Creating Evaluation Sets
Readings
Oct 26
More on Evaluation Sets
Readings
Project Progress Report

Beyond Labeled Datasets in NLP

Oct 31
Unlabeled Data
Readings
Nov 2
Exploring multiple modalities
Readings
Nov 7
Is anything truly zero-shot?
Readings
Nov 9
Putting it all together
Lecture

Project Presentations and Final Report

Nov 14
Project Presentations
Nov 16
Project Presentations
Nov 21
Project Presentations
Nov 23
No Class
Thanksgiving
Nov 28
Project Presentations
Nov 30
Project Presentations
Dec 5
No Class
Finals Week
Dec 7
No Class
Project Final Report

Assignments/Grading

There will be three components to course grades. These are subject to change before the beginning of the semester.

  • Paper Discussion (20%).
    • Lead at least one discussion.
    • Bonus (10%): Lead a second discussion.
  • Class Participation (15%).
    • Participate in small and large group discussions.
    • Turn in quiz sheets.
    • Engage in Q/A during the lead paper discussion.
  • Class Project in groups of 1-2 students (65%).
    • Turn in a project proposal: 5%
    • Turn in a progress report: 10%
    • Present your main findings: 20%
    • Turn in a final report: 30%
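
As a rough illustration of how the percentages above combine, here is a minimal Python sketch. The component names, the 0-to-1 score scale, and the `course_grade` helper are hypothetical, and the assumption that the bonus is simply added on top of the total is ours; the breakdown above does not say how the bonus is applied or whether the total is capped.

```python
# Weights (percent of course grade) copied from the breakdown above.
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "paper_discussion": 20,
    "class_participation": 15,
    "project_proposal": 5,
    "project_progress_report": 10,
    "project_presentation": 20,
    "project_final_report": 30,
}

# Bonus for leading a second discussion, as listed above.
BONUS_SECOND_DISCUSSION = 10


def course_grade(scores, bonus_score=0.0):
    """Combine per-component scores, each in [0, 1], into a percentage.

    Assumption: the bonus is added on top of the weighted total and is
    not capped; the actual policy may differ.
    """
    total = sum(weight * scores[name] for name, weight in WEIGHTS.items())
    return total + BONUS_SECOND_DISCUSSION * bonus_score


# Example: full marks everywhere except a half-credit progress report.
example_scores = {name: 1.0 for name in WEIGHTS}
example_scores["project_progress_report"] = 0.5
print(course_grade(example_scores))
```

The four project items sum to the 65% listed for the class project, and all components together sum to 100% before any bonus.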

Note: Please read about the academic policies and a note about student well-being here.