Simplifying Outcomes of Language Model Component Analyses
How can results and findings from LM component analysis and mechanistic interpretability studies, which are often hard for non-experts to comprehend, be simplified and illustrated?
[1] NLEs for Neurons (Huang et al., BlackboxNLP 2023)
[2] Summarize and Score (Singh et al., 2023)
[3] Interpreting the Semantic Flow with VISIT (Katz & Belinkov, EMNLP 2023 Findings)
[4] Knowledge-Critical Subnetworks (Bayazit et al., EMNLP 2024)
[5] Function Vectors (Todd et al., ICLR 2024)
[6] LM Transparency Tool (Tufanov et al., ACL 2024 Demos)
[7] Primer on Component Analysis Methods (Ferrando et al., 2024)
[8] Mechanistic? (Saphra & Wiegreffe, BlackboxNLP @ EMNLP 2024)
The Mindful Mechanic: Interpreting LLMs’ Decision-Making in Tool Use
API calling and tool use [1, 2] are expected capabilities of performant LLMs and feature prominently in evaluations and benchmarks [3, 4, 5], since models rely on external knowledge sources and calculations to ensure temporal generalization and factual correctness. However, it remains unclear which parts of a prompt and which mechanisms within LLMs are responsible for deciding when, and which, tool or API should be used in the next generation step. In a comprehensive study across both instruction-tuned and out-of-the-box LLMs, we examine decision-making on tool use benchmarks with interpretability methods offering information flow routes [6] and feature attributions [7]; a minimal attribution sketch follows the reference list below.
[1] Toolformer (Schick et al., NeurIPS 2023)
[2] Chameleon (Lu et al., NeurIPS 2023)
[3] ToolLLM (Qin et al., ICLR 2024)
[4] "What Are Tools Anyway?" Survey (Wang et al., COLM 2024)
[5] TACT dataset (Caciularu et al., NeurIPS 2024 D&B)
[6] LM Transparency Tool (Tufanov et al., ACL 2024 Demos)
[7] Inseq (Sarti et al., ACL 2023 Demos)
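As a starting point for the feature-attribution side, the sketch below uses Inseq [7] to attribute a generated continuation back to the prompt tokens. The model name ("gpt2"), the toy prompt, and the attribution method ("integrated_gradients") are illustrative assumptions rather than fixed choices of the study.

    # Minimal Inseq sketch (illustrative): attribute which prompt tokens
    # influence the tokens the model generates next, e.g. the start of a
    # tool or API call. Model, prompt, and method are placeholder choices.
    import inseq

    # Load a causal LM with an attribution method attached.
    model = inseq.load_model("gpt2", "integrated_gradients")

    # Toy prompt whose continuation we want to trace back to the input.
    prompt = "Q: What is 123 * 456? Decide whether to call a calculator.\nA:"

    # Generate a continuation and attribute it to the prompt tokens.
    out = model.attribute(prompt)

    # Display token-level attribution scores.
    out.show()

The same loop could be repeated over a tool use benchmark to compare attribution patterns between prompts where the model does and does not emit a tool call.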
Explaining Blind Spots of Model-Based Evaluation Metrics for Text Generation
[1] Blindspot NLG (He, Zhang et al., ACL 2023)
[2] AdvEval (Chen et al., ACL 2024 Findings)
[3] LLM Comparative Assessment (Liusie et al., EACL 2024)
[4] Explainable Evaluation Metrics for MT (Leiter et al., JMLR 2024)
[5] TICKing All the Boxes (Cook et al., 2024)
[6] ROSCOE (Golovneva et al., ICLR 2023)
[7] RORA (Jiang, Lu et al., ACL 2024)
Analyzing User Behavior in Explanatory Fact Checking Systems
In the fact checking domain, how can we measure and mitigate human overreliance on persuasive language and on explanations generated by LLM-based fact checking systems?
[1] LLMs Help Humans Verify Truthfulness (Si et al., NAACL 2024)
[2] Explanations Can Reduce Overreliance (Vasconcelos et al., CSCW 2023)
[3] Explanations to Prevent Overtrust (Mohseni et al., ICWSM 2021)
[4] Explanation Details Affecting Human Performance (Linder et al., Applied AI Letters 2021)
[5] Perception of Explanations in Subjective Decision-Making (Ferguson et al., CHI 2024 TREW)
[6] Role of XAI in Collaborative Disinformation Detection (Schmitt et al., FAccT 2024)
[7] Belief Bias and Explanations (González et al., ACL 2021 Findings)