Open Topics

The Mindful Mechanic: Interpreting LLMs’ Decision-Making in Tool Use
API calling and tool use [1, 2] are expected capabilities of performant LLMs and feature prominently in evaluations and benchmarks [3, 4, 5], because the models rely on external knowledge sources and calculations to ensure temporal generalization and factual correctness. However, it remains unclear which parts of a prompt and which mechanisms within LLMs are responsible for deciding when, and which, tool or API should be used in the next generation step. In a comprehensive study across both instruction-tuned and out-of-the-box LLMs, we examine the decision-making on tool-use benchmarks with interpretability methods offering information flow routes [6] and feature attributions [7].
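
As a starting point, here is a minimal sketch of how feature attribution with Inseq [7] could surface which prompt tokens drive a tool-call decision; the model name, prompt format, and tool definitions are illustrative placeholders rather than part of any specific benchmark.

    # Sketch: attribute a (possibly tool-calling) continuation back to the prompt with Inseq.
    import inseq

    # Any decoder-only Hugging Face model works; "saliency" is one of Inseq's gradient-based methods.
    model = inseq.load_model("gpt2", "saliency")

    # Illustrative prompt that should trigger a calculator call.
    prompt = (
        "Tools: calculator(expr), search(query)\n"
        "User: What is 37 * 48?\n"
        "Assistant:"
    )

    # Which prompt tokens does the decision to call a tool depend on?
    attribution = model.attribute(prompt, generation_args={"max_new_tokens": 10})
    attribution.show()  # token-level attribution heatmap over the prompt

Information flow routes from the LM Transparency Tool [6] would complement this token-level view by tracing which attention heads and feed-forward blocks carry the decision inside the model.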

[1] Toolformer (Schick et al., ICLR 2023)
[2] Chameleon (Lu et al., NeurIPS 2023)
[3] ToolLLM (Qin et al., ICLR 2024)
[4] "What Are Tools Anyway?" Survey (Wang et al., COLM 2024)
[5] TACT dataset (Caciularu et al., NeurIPS 2024 D&B)
[6] LM Transparency Tool (Tufanov et al., ACL 2024 Demos)
[7] Inseq (Sarti et al., ACL 2023 Demos)

Explaining Knowledge Conflicts and Factual Errors (of Temporal Generalization) in LLM Generations
How can we expose and express knowledge conflicts in LLMs resulting from poor temporal generalization?
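
One way to expose such a conflict is to compare the probabilities the model assigns to answers that were correct at different points in time. Below is a minimal sketch with Hugging Face transformers, assuming a small placeholder model and an illustrative time-sensitive fact.

    # Sketch: surface a potential temporal knowledge conflict by scoring two answers
    # to a time-sensitive prompt. Model and example fact are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    prompt = "The current Prime Minister of the United Kingdom is"
    candidates = [" Boris Johnson", " Rishi Sunak"]  # answers correct at different points in time

    def continuation_logprob(prompt: str, continuation: str) -> float:
        """Sum of log-probabilities the model assigns to the continuation tokens."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        # Score only the continuation positions.
        positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
        targets = input_ids[0, prompt_ids.shape[1]:]
        return sum(log_probs[0, pos, tok].item() for pos, tok in zip(positions, targets))

    for cand in candidates:
        print(f"{cand!r}: {continuation_logprob(prompt, cand):.2f}")
    # Comparable scores for conflicting answers hint at an unresolved knowledge conflict.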

[1] DYNAMICQA (Marjanović et al., EMNLP 2024 Findings)
[2] Survey on Factuality Challenges (Augenstein et al., 2023)
[3] Unfaithful Explanations in CoT Prompting (Turpin et al., NeurIPS 2023)
[4] Interventions for Explaining Factual Associations (Geva et al., EMNLP 2023)
[5] Self-Bias in LLMs (Xu et al., 2024)
[6] Mismatches between Token Probabilities and LLM Outputs (Wang et al., ACL 2024 Findings)
[7] Resolving Knowledge Conflicts (Wang et al., COLM 2024)
[8] SAT Probe (Yuksekgonul et al., ICLR 2024)
[9] MONITOR metric (Wang et al., NAACL 2024)


Simplifying Outcomes of Language Model Component Analyses
How can results and findings from LM component analyses and mechanistic interpretability studies, which are often hard for non-experts to comprehend, be simplified and illustrated?
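
As one concrete illustration of the gap, the sketch below turns a layer-by-layer "logit lens" reading, a typical component-analysis result, into a plain-language readout a non-expert could follow. The model and wording are placeholders, and the submodule names (transformer.ln_f, lm_head) are GPT-2-specific.

    # Sketch: a plain-language, layer-by-layer readout of how a prediction forms.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder small model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    prompt = "The Eiffel Tower is located in the city of"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Project each layer's last-position hidden state through the unembedding matrix
    # (the "logit lens") and report the top candidate token per layer.
    final_ln = model.transformer.ln_f
    for layer_idx, hidden in enumerate(outputs.hidden_states):
        logits = model.lm_head(final_ln(hidden[:, -1]))
        top_token = tokenizer.decode(logits.argmax(dim=-1))
        print(f"After layer {layer_idx:2d}, the model's best guess is: {top_token!r}")
    # Non-expert readout: "the answer only emerges in the upper layers of the model".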

[1] NLEs for Neurons (Huang et al., BlackboxNLP 2023)
[2] Summarize and Score (Singh et al., 2023)
[3] Interpreting the Semantic Flow with VISIT (Katz & Belinkov, EMNLP 2023 Findings)
[4] Knowledge-Critical Subnetworks (Bayazit et al., 2024)
[5] Function Vectors (Todd et al., ICLR 2024)
[6] LM Transparency Tool (Tufanov et al., ACL 2024 Demos)
[7] Primer on Component Analysis Methods (Ferrando et al., 2024)


Conversational Model Refinement

  1. Can we elicit expert human feedback through targeted question generation in a mixed-initiative dialogue setting?
  2. Can we use human feedback on natural language explanations to improve model performance and align it with user preferences?
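
A minimal sketch of the mixed-initiative loop behind question 1: the model asks targeted questions, an expert answers, and the feedback on the explanation is folded back into a revised answer. The query_llm wrapper is hypothetical and stands in for any chat LLM.

    # Sketch of a mixed-initiative refinement loop; query_llm is a hypothetical stand-in.
    def query_llm(prompt: str) -> str:
        """Hypothetical stand-in for any chat LLM; replace with a real model call."""
        return f"[model output for: {prompt[:40]}...]"

    def refine_with_expert(task: str, max_turns: int = 3) -> str:
        answer = query_llm(f"Task: {task}\nAnswer with a short natural language explanation.")
        for _ in range(max_turns):
            # Model takes the initiative: one targeted question about its own weak spot.
            question = query_llm(
                f"Task: {task}\nCurrent answer: {answer}\n"
                "Ask the expert one targeted question that would most improve this answer."
            )
            feedback = input(f"Expert, please answer: {question}\n> ")
            # Fold the expert's feedback on the explanation back into a revision.
            answer = query_llm(
                f"Task: {task}\nPrevious answer: {answer}\n"
                f"Expert feedback: {feedback}\nRevise the answer and its explanation."
            )
        return answer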

[1] Compositional Explanations (Yao et al., NeurIPS 2021)
[2] Digital Socrates (Gu et al., ACL 2024)
[3] Explanation Formats (Malaviya et al., NAACL 2024)
[4] FeedbackQA (Li et al., ACL 2022 Findings)
[5] Synthesis Step by Step (Wang et al., EMNLP 2023 Findings)