Detailed Calendar

Required and additional readings, to be updated (bi)weekly. Additional readings are not mandatory.

Datasets in NLP

Weeks 1 and 2

Aug 22 Lecture Introduction, Historical Perspective and Overview

Additional Readings

Aug 24 Lecture Data Collection and Data Ethics

Additional Readings

Aug 29 More on Collection

Additional Readings

  • Bowman et al., 2015 A large annotated corpus for learning natural language inference
  • Nie et al., 2020 Adversarial NLI: A New Benchmark for Natural Language Understanding

Aug 31 More on Data Ethics

  • Bender et al., 2021 On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
  • Koch et al., 2021 Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research

Additional Readings

Biases and Mitigation

Weeks 3, 4 and 5

Sep 7 Lecture Biases: An Overview

Additional Readings

Sep 12 Spurious Biases I

Sep 14 Spurious Biases II

  • Gardner et al., 2021 Competency Problems: On Finding and Removing Artifacts in Language Data
  • Eisenstein, 2022 Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language

Sep 19 Data-Centric Bias Mitigation

Sep 21 Data Augmentation for Bias Mitigation

  • Ng et al., 2020 SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness
  • Kaushik et al., 2019 Learning the Difference that Makes a Difference with Counterfactually-Augmented Data

Project Proposal due by 11:59 PM PT.

Estimating Data Quality

Weeks 6, 7 and 8

Sep 26 Lecture Estimates of Data Quality

Additional Readings

Sep 28 Aggregate vs. Point-wise Estimates of Data Quality

Oct 3 Anomalies, Outliers, and Out-of-Distribution Examples

Oct 5 Disagreements, Subjectivity and Ambiguity I

Oct 12 Disagreements, Subjectivity and Ambiguity II

  • Miceli et al., 2020 Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision
  • Davani et al., 2021 Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Data for Accountability

Weeks 9 and 10

Oct 17 Creating Evaluation Sets

Additional Readings

Oct 19 Counterfactual Evaluation

Oct 24 Adversarial Evaluation

Oct 26 Contextualizing Decisions

Oct 28

Project Proposal due by 11:59 PM PT.

Beyond Labeled Datasets

Weeks 11, 12 and 13

Oct 31 Unlabeled Data

  • Dodge et al., 2021 Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
  • Lee et al., 2022 Deduplicating Training Data Makes Language Models Better
  • Gururangan et al., 2022 Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Nov 2 Prompts as Data?

  • Wei et al., 2022 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Nov 7 Data Privacy and Security

Nov 9 Towards Better Data Citizenship

  • Jo & Gebru, 2019 Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
  • Hutchinson et al., 2021 Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

Outro and Presentations

Weeks 14, 15 and 16

Nov 14 Lecture Outro

Nov 16 Project Presentations

Nov 21 Project Presentations

Nov 28 Project Presentations

Nov 30 Project Presentations

Dec 7

Project Final Report due by 11:59 PM PT.