Data-Centric NLP
🍂 Fall 2022 ⏰ Mon / Wed 10:00 am - 11:50 am 📍 WPH 203

Announcements
Jul 21: Week 0
Looking for TAs
Applications are open! All PhD students and Masters students in the Honors program are eligible to apply. The standard offer is for 25% TAship (10 hours a week), but there might be some flexibility.
Masters Student Enrollment
Since this is a 600-level class, preference will be given primarily to PhD students. If slots remain open after enrollment starts on 15th Aug, I might be able to accommodate Masters students in good standing. You will still need my official recommendation for enrollment, so please reach out by 17th Aug. However, before then, there is not much I can do.
Summary
Natural language processing (NLP) has been revolutionized by the advancement of deep learning, which has enabled large-scale language models. However, state-of-the-art models are still supervised via large datasets, whether labeled or not. In this survey class, we will focus on the datasets behind language technologies. Starting with dataset creation methodologies and principles, we will survey widely used datasets in NLP and some of their key characteristics that are essential for learning. Later on, we will study some automatic methods for dataset creation. The course will describe how biases arising from creation methodologies might inadvertently lead to models that perform exceedingly well within some data distributions, but do not generalize to other distributions. We will also explore literature on how the quality of datasets can be estimated automatically, either as corpus aggregates or based on individual instances. We will discuss questions on using data for accountability, where we will investigate how evaluation sets are built, as well as how model decisions can be contextualized in data. Finally, we will look at some extensions of the above concepts to unlabeled datasets, as well as datasets with modalities beyond language.
Pre-Requisites
Students are highly encouraged to have taken undergraduate and/or graduate level classes in either Natural Language Processing or Machine Learning, equivalent to CSCI 544 (Applied Natural Language Processing) or CSCI 567 (Machine Learning), respectively. Please email the instructor for special circumstances or specific clarifications.
Calendar
The calendar is subject to change. More details, e.g., reading materials and additional resources, will be added as the semester continues. All work is due on the specified date by 11:59 PM PT.
Datasets in NLP
Biases and Mitigation
Estimating Data Quality
Addressing Spurious Biases
Data Accountability
Beyond Labeled Datasets in NLP
Project Presentations and Final Report
- Nov 14: Project Presentations
- Nov 16: Project Presentations
- Nov 21: Project Presentations
- Nov 23: No Class (Thanksgiving)
- Nov 28: Project Presentations
- Nov 30: Project Presentations
- Dec 5: No Class (Finals Week)
- Dec 7: No Class (Project Final Report)
Assignments/Grading
There will be three components to course grades. These are subject to change before the beginning of the semester.
- Paper Discussion (20%).
  - Lead at least one discussion.
  - Bonus (10%): Lead a second discussion.
- Class Participation (15%).
  - Participate in small and large group discussions.
  - Turn in quiz sheets.
  - Engage in Q/A during the lead paper discussion.
- Class Project in groups of 1-2 students (65%).
  - Turn in a project proposal: 5%
  - Turn in a progress report: 10%
  - Present your main findings: 20%
  - Turn in a final report: 30%
Note: Please read about the academic policies and a note about student well-being here.