2024 Recap
In 2024, I attended SIGDIAL and INLG in Japan, both smaller NLP conferences. I also finished writing and submitted my dissertation. Meanwhile, I tried to keep track of the major conferences (ACL, EMNLP, COLM, NeurIPS, ICLR, CHI, FAccT, etc.) for papers relevant to my research interests. In late December, I started to write short gists for my favorite papers; below, they are arranged into research directions.
NLP Interpretability
Do Large Language Models Latently Perform Multi-Hop Reasoning? (Yang et al., ACL 2024)
This paper studies whether LLMs latently perform multi-hop reasoning when recalling entities and finds strong evidence for such reasoning pathways when certain relation types are present in the prompt. However, their use is highly contextual, and the evidence for the second hop and for the full traversal is only moderate. Evidence for the first hop becomes more pronounced with increasing model size.
DynamicQA: Tracing Internal Knowledge Conflicts in Language Models (Marjanović, Yu et al., EMNLP 2024 Findings)
DynamicQA is a dataset for real-world knowledge conflicts where facts can change over time (temporal) or have dynamic properties (disputable). The authors propose two measures of intra-memory conflict, semantic entropy and a coherent persuasion score, and find that LMs struggle more often with dynamic facts than with facts that have a single possible value.
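For intuition, here is a minimal sketch of the general semantic-entropy idea: entropy over clusters of semantically equivalent samples. The paper's exact formulation (and the coherent persuasion score) may differ, and the `equivalent` callable below is a stand-in for the NLI-style equivalence check usually used in practice.

```python
import math

def semantic_entropy(answers, equivalent):
    """Entropy over clusters of semantically equivalent sampled answers.

    `equivalent(a, b)` decides whether two answers express the same fact;
    in practice this would be an NLI model, here it is any callable.
    """
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if equivalent(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    n = len(answers)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)

# Toy usage: simple string normalisation stands in for semantic equivalence.
samples = ["Joe Biden", "joe biden", "Donald Trump", "Biden"]
print(semantic_entropy(samples, lambda a, b: a.lower() == b.lower()))
```

A prompt whose sampled answers all land in one cluster yields zero entropy (no conflict), while answers spread over many clusters yield high entropy.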
Information Flow Routes: Automatically Interpreting Language Models at Scale (Ferrando & Voita, EMNLP 2024)
Revealing the Parametric Knowledge of Language Models: A Unified Framework for Attribution Methods (Yu, Atanasova & Augenstein, ACL 2024)
Interpretability of Language Models via Task Spaces (Weber et al., ACL 2024)
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation (Qi, Sarti et al., EMNLP 2024)
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models (Ghandeharioun, Caciularu et al., ICML 2024)
Explaining Text Similarity in Transformer Models (Vasileiou & Eberle, NAACL 2024)
SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists (Zhao et al., EMNLP 2024 Findings)
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs (Feucht et al., EMNLP 2024)
Insights into LLM Long-Context Failures: When Transformers Know but Don’t Tell (Gao et al., EMNLP 2024 Findings)
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps (Chuang et al., EMNLP 2024)
Estimating Knowledge in Large Language Models Without Generating a Single Token (Gottesman & Geva, EMNLP 2024)
A Primer on the Inner Workings of Transformer-based Language Models (Ferrando et al., 2024)
Rationale Generation
Digital Socrates: Evaluating LLMs through explanation critiques (Gu, Tafjord & Clark, ACL 2024)
The authors create a dataset of multiple-choice QA problems with model-generated explanations and test whether critique models can detect flaws in the reasoning chains underlying these explanations.
ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation (Brassard et al., COLM 2024)
This paper presents a new dataset of free-text explanations alongside aspect-wise quality ratings to evaluate how LLMs rate explanations. The authors find that while larger models predict labels on the commonsense reasoning tasks that align more often with human annotators, they can be inconsistent across the quality aspects.
RORA: Robust Free-Text Rationale Evaluation (Jiang, Lu et al., ACL 2024)
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales (Xu et al., EMNLP 2024)
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? (Yona, Aharoni & Geva, EMNLP 2024)
Properties and Challenges of LLM-Generated Explanations (Kunz & Kuhlmann, HCI+NLP @ NAACL 2024)
Human Evaluation of Explanations
Unraveling the Dilemma of AI Errors: Exploring the Effectiveness of Human and Machine Explanations for Large Language Models (Pafla, Larson & Hancock, CHI 2024)
Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting (Kayser et al., EMNLP 2024)
Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking (Elaraby et al., EMNLP 2024 Findings)
Interpretability Gone Bad: The Role of Bounded Rationality in How Practitioners Understand Machine Learning (Kaur et al., CSCW 2024)
In a controlled experiment with 119 ML practitioners, access to interpretability tools led to 5x less time spent on the task and 17% lower accuracy on questions about the data and the model, showing an inclination towards “good enough” understanding.
EXMOS: Explanatory Model Steering Through Multifaceted Explanations and Data Configurations (Bhattacharya et al., CHI 2024)
Why is ‘Problems’ Predictive of Positive Sentiment? A Case Study of Explaining Unintuitive Features in Sentiment Classification (Gu, Arguello & Wang, FAccT 2024)
On Evaluating Explanation Utility for Human-AI Decision Making in NLP (Chaleshtori et al., EMNLP 2024 Findings)
Facilitating Human-LLM Collaboration through Factuality Scores and Source Attributions (Do, Ostrand et al., TREW @ CHI 2024)
Dialogue
MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning (Li et al., NeurIPS 2024)
Soda-Eval: Open-Domain Dialogue Evaluation in the Age of LLMs (Mendonça, Trancoso & Lavie, EMNLP 2024 Findings)
Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs (Siro, Aliannejadi & de Rijke, SIGIR 2024)
Modeling the Quality of Dialogical Explanations (Alshomary et al., LREC-COLING 2024)
IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering (Li et al., NeurIPS 2024)
Fact Checking
HealthFC: Verifying Health Claims with Evidence-Based Medical Fact-Checking (Vladika, Schneider & Matthes, LREC-COLING 2024)
What Evidence Do Language Models Find Convincing? (Wan, Wallace & Klein, ACL 2024)
Their ConflictingQA dataset addresses the problem that RAG models are increasingly tasked with subjective and ambiguous queries; it pairs such queries with evidence documents of different categories. The authors find that models rely heavily on a website's relevance to the query and ignore stylistic features that humans find important, such as the presence of scientific references or a neutral tone.
The Earth is Flat because…: Investigating LLMs’ Belief towards Misinformation via Persuasive Conversation (Xu et al., ACL 2024)
Enhancing Perception: Refining Explanations of News Claims with LLM Conversations (Hsu et al., NAACL 2024 Findings)
Memorization
Generalisation First, Memorisation Second? Memorisation Localisation for Natural Language Classification Tasks (Dankers & Titov, ACL 2024 Findings)
Demystifying Verbatim Memorization in Large Language Models (Huang, Yang & Potts, EMNLP 2024)
Readability & Education
RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework (Wang & Demberg, EMNLP 2024)
Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts (Rooein et al., BEA @ NAACL 2024)
Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification (Yang, Cherian & Vucetic, ACL 2024 Findings)
Annotation
‘Seeing the Big through the Small’: Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations? (Chen et al., EMNLP 2024 Findings)
If in a Crowdsourced Data Annotation Pipeline, a GPT-4 (He et al., CHI 2024)
Dataset Analysis
Explaining Datasets in Words: Statistical Models with Natural Language Parameters (Zhong et al., NeurIPS 2024)
Benchmark Transparency: Measuring the Impact of Data on Evaluation (Kovatchev & Lease, NAACL 2024)
Concept-based Explanations
CoSy: Evaluating Textual Explanations of Neurons (Kopf et al., NeurIPS 2024)
The proposed Concept Synthesis (CoSy) framework evaluates textual explanations of latent neurons in (computer-vision) DNNs: a generative model conditioned on the textual explanation synthesizes matching inputs, and the explained neuron's response to these inputs is compared with its response to control data.
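As a rough illustration of the comparison step only (not the authors' implementation): given the explained neuron's activations on explanation-conditioned synthetic images and on control images, one can score how well the explanation separates the two. The function name and the mean-difference/AUC-style scoring below are assumptions for the sketch.

```python
import numpy as np

def explanation_separation_score(concept_acts: np.ndarray, control_acts: np.ndarray):
    """Compare a neuron's activations on images synthesised from its textual
    explanation with its activations on control images."""
    mean_diff = concept_acts.mean() - control_acts.mean()
    # AUC-style score: probability that a synthetic concept image activates
    # the neuron more strongly than a random control image.
    auc = (concept_acts[:, None] > control_acts[None, :]).mean()
    return mean_diff, auc

# Toy usage: random numbers stand in for real neuron activations.
rng = np.random.default_rng(0)
print(explanation_separation_score(rng.normal(1.0, 1.0, 200), rng.normal(0.0, 1.0, 200)))
```

A faithful explanation should produce synthetic images that activate the neuron clearly above the control distribution; an unrelated explanation should not.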
Generalization
PRobELM: Plausibility Ranking Evaluation for Language Models (Yuan et al., COLM 2024)
Prompting, Tool Use & Self-Correction
GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction (Sainz et al., ICLR 2024)
TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools (Caciularu et al., NeurIPS 2024 D&B)
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs (Kamoi et al., TACL 2024)
The Prompt Report: A Systematic Survey of Prompting Techniques (Schulhoff et al., 2024)
What Are Tools Anyway? A Survey from the Language Model Perspective (Wang et al., COLM 2024)
Meta-analyses
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP (Mosbach et al., EMNLP 2024)
To rebut the criticism that interpretability research lacks actionable insights and impact on NLP, the authors construct a manually annotated citation graph and survey members of the NLP community. They find that NLP researchers build on findings from interpretability research, perceive it as important for progress in NLP, and that highly influential work outside the field often cites these findings.
Mechanistic? (Saphra & Wiegreffe, BlackboxNLP @ EMNLP 2024)
Position: Key Claims in LLM Research Have a Long Tail of Footnotes (Luccioni & Rogers, ICML 2024)
The order in which the papers are listed does not indicate any ranking.
Most papers are in their final (peer-reviewed and published) version, save for very few exceptions (mostly surveys). If I expect an arXiv preprint to appear at a conference in 2025, I do not mention it here but save it for next year.
Log
- 2024-12-29 – Initial version. Future updates will fill in the missing gists.