Welcome

Welcome to our tutorial on Item Response Theory for Natural Language Processing!

This tutorial will introduce the NLP community to Item Response Theory (IRT). IRT is a method from the field of psychometrics for model and dataset assessment. IRT has been used for decades to build test sets for human subjects and estimate latent characteristics of dataset examples. Recently, there has been an uptick in work applying IRT to tasks in NLP. It is our goal to introduce the wider NLP community to IRT and show its benefits for a number of NLP tasks. From this tutorial, we hope to encourage wider adoption of IRT among NLP researchers.
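
For newcomers, a representative example is the two-parameter logistic (2PL) model, which gives the probability that subject i (here, an NLP model) answers item j (a test example) correctly in terms of the subject's latent ability theta_i, the item's difficulty b_j, and the item's discrimination a_j:

  p(y_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}}

Fitting this model to a matrix of model-by-example responses yields difficulty and discrimination estimates for each example alongside an ability estimate for each model, which is what makes IRT useful for both dataset and model assessment.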

As NLP models improve in performance and increase in complexity, new evaluation methods are needed to appropriately assess performance improvements. Data quality also remains important: models' exploitation of annotation artifacts, annotation errors, and misalignment between model ability and dataset difficulty can all hinder an appropriate assessment of model performance. As models reach and exceed human performance on certain tasks, it becomes more difficult to distinguish genuine improvements and innovations from changes in scores due to chance. In this three-hour introductory tutorial, we will review the current state of evaluation in NLP, then introduce IRT as a tool for NLP researchers to use when evaluating their data and models. We will also introduce and demonstrate the py-irt Python package for fitting IRT models, to encourage adoption and facilitate IRT use.
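
As a small preview, the sketch below fits a one-parameter logistic (1PL, or Rasch) model with py-irt, following the usage pattern in the package's documentation. The file name responses.jsonlines and the hyperparameters are illustrative assumptions, and the exact API may differ slightly across py-irt versions.

# Minimal py-irt sketch; the data path and hyperparameters are illustrative.
# Each line of the JSON-lines input records one subject's (e.g., one model's)
# binary responses to the test items:
#   {"subject_id": "model_1", "responses": {"item_1": 1, "item_2": 0}}
from py_irt.config import IrtConfig
from py_irt.training import IrtModelTrainer

config = IrtConfig(model_type="1pl")  # Rasch model: one difficulty per item
trainer = IrtModelTrainer(config=config, data_path="responses.jsonlines")
trainer.train(epochs=1000, device="cpu")  # variational inference via Pyro
trainer.save("parameters.json")  # learned item difficulties and subject abilities

The package also ships a command-line interface for the same train-and-save workflow.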

While this methodology has been applied successfully to NLP, broader community exposure, particularly for graduate students, can provide a new methodological perspective. We aim to make the tutorial interactive, with hands-on Jupyter notebooks that give concrete, simple examples.

Stay Connected

We’re building a list of individuals interested in working with, or learning more about, IRT in NLP.

Please fill out this form if you’d like to be part of the community.

Date and Place

The tutorial took place at EACL 2024 on March 21st, 2024.

Speakers

John P. Lalor, Pedro Rodriguez, João Sedoc, and José Hernández-Orallo

Schedule

  • Evaluation in NLP
  • Introduction to IRT
    • Defining IRT Models
    • IRT Model Fitting
    • Introduction to py-irt
  • IRT in NLP
    • Building Test Sets
      • Model Evaluation
      • Chatbot Evaluation
    • Training Dynamics
      • Example Mining
      • Curriculum Learning
    • Model and Data Evaluation
      • Rethinking Leaderboards
      • Features Related to Difficulty
  • Advanced Topics and Opportunities for Future Work

Material

Tutorial Recording

Presentation Materials

We have also put together a structured reading list (pdf) for references.

Reference

If you build on top of this tutorial and want to cite it, please use the following bib entry:

@inproceedings{irt4nlp2024,
  title =        "Item Response Theory for Natural Language Processing",
  author =       "Lalor, John P. and Rodriguez, Pedro and Sedoc, Jo{\~a}o and Hern{\'a}ndez-Orallo, Jos{\'e}",
  booktitle =    "Proceedings of the 18th Conference of the European
                  Chapter of the Association for Computational
                  Linguistics: Tutorial Abstracts",
  month =        mar,
  year =         "2024",
  address =      "St. Julian{'}s, Malta",
  publisher =    "Association for Computational Linguistics",
  abstract =     "This tutorial will introduce the NLP community to
                  Item Response Theory (IRT). IRT is a method from
                  the field of psychometrics for model and dataset
                  assessment. IRT has been used for decades to build
                  test sets for human subjects and estimate latent
                  characteristics of dataset examples. Recently, there
                  has been an uptick in work applying IRT to tasks in
                  NLP. It is our goal to introduce the wider NLP
                  community to IRT and show its benefits for a number
                  of NLP tasks. From this tutorial, we hope to encourage
                  wider adoption of IRT among NLP researchers."
}