Seven papers accepted at EMNLP 2025!

August 2025

We are very proud to have gotten a total of seven papers accepted to EMNLP 2025:

In “Reason to Rote: Rethinking Memorization in Reasoning”, Yuekun Yao and incoming postdoc Yupei Du investigate the tension between memorization and generalization in LLMs. By injecting controlled label noise into synthetic datasets, they find that memorizing noisy labels does not override a transformer language model’s ability to reason generalizably: instead, memorization subtly adapts the same underlying computational mechanisms. This offers an explanation of how models can handle clean and noisy labels simultaneously, and highlights their inductive bias towards reusing existing structures.
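To make the experimental idea concrete, here is a minimal sketch of this kind of label-noise injection. The function name, the index-mod labeling rule, and the 10% noise rate are illustrative assumptions, not the paper’s actual setup:

```python
import random

def make_noisy_dataset(n_examples: int, noise_rate: float,
                       n_labels: int = 10, seed: int = 0):
    """Build a toy dataset whose clean rule is label = index mod n_labels,
    then corrupt a random fraction of the labels. A model that generalizes
    can learn the rule; the corrupted examples can only be memorized."""
    rng = random.Random(seed)
    data = []
    for i in range(n_examples):
        label = i % n_labels                              # clean, rule-based label
        if rng.random() < noise_rate:                     # inject label noise
            label = rng.choice([l for l in range(n_labels) if l != label])
        data.append((f"example-{i}", label))
    return data

train = make_noisy_dataset(1000, noise_rate=0.1)
```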

The paper “Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions” summarizes Luisa Geiger’s MSc thesis, co-supervised by Mareike Hartmann and Michael Sullivan. Geiger tackles the problem of generating natural-language instructions from visual sewing patterns. A particular challenge is evaluation: Geiger develops a novel tree-based method for automatically checking the correctness of the generated instructions, and it correlates well with human judgments.

“Playpen: An Environment for Exploring Learning From Dialogue Game Feedback” sets the stage for a planned shared task on learning to play dialogue games in interaction. The paper presents the Playpen architecture and evaluates a number of baseline learning methods on it; Michael Sullivan contributed a GRPO-based learner. Playpen is an initiative that grew out of an ELLIS workshop in 2024.

In “Procedural Environment Generation for Tool-Use Agents”, Michael Sullivan and Mareike Hartmann show how to improve the accuracy of a tool-use agent by training it on synthetic data. They develop a typed language for generating both complex tasks and random tools to solve them. A focus is on the use of unseen tools for real-world tasks that require orchestrating tools from different domains.

“Language Models Can Learn Implicit Multi-Hop Reasoning, but Only if They Have Lots of Training Data” studies under what circumstances transformers can be trained to perform k-hop reasoning in a single forward pass, without chain-of-thought. Yuekun Yao and Yupei Du find that transformers can learn to do this, but the amount of training data required grows exponentially in k, and the required transformer depth grows linearly in k. The paper also contains a theoretical result by co-author Michael Hahn proving this lower bound on the depth.
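For readers unfamiliar with the task family, here is a hypothetical illustration of what a k-hop composition example can look like; the fact format and function naming are assumptions for exposition, not the paper’s exact data format:

```python
import random

def make_khop_example(k: int, n_entities: int = 26, seed: int = 0):
    """Present k facts of the form f(a) = b in shuffled order and ask for
    the k-fold composition f(...f(a)...), which the model must answer in
    a single forward pass, i.e. without chain-of-thought."""
    rng = random.Random(seed)
    entities = [chr(ord("a") + i) for i in range(n_entities)]
    chain = rng.sample(entities, k + 1)                   # a -> b -> c -> ...
    facts = [f"f({chain[i]}) = {chain[i + 1]}" for i in range(k)]
    rng.shuffle(facts)                                    # order must not reveal the answer
    question = "f(" * k + chain[0] + ")" * k              # e.g. f(f(f(a)))
    return "; ".join(facts) + f". {question} = ?", chain[-1]

prompt, answer = make_khop_example(k=3)
```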

In “A Knapsack by Any Other Name: Presentation impacts LLM performance on NP-hard problems”, former intern Alex Duchnowski investigates whether LLMs can solve NP-hard optimization problems as they occur in the real world. Alex finds a wide gap in problem-solving accuracy between a standard NP-hard problem “costumed” as an everyday problem and its textbook formulation. This holds both for methods that solve the problem through LLM reasoning and for methods that only use the LLM to map the problem to a linear program, which is then solved with an exact solver.
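To give a sense of the target representation in that second pipeline, here is a minimal hand-written sketch of a knapsack instance as an integer linear program, assuming the PuLP modeling library; in the paper, the formalization itself would be produced by the LLM, and the instance numbers below are illustrative:

```python
import pulp

# Toy knapsack instance (illustrative numbers, not from the paper).
values   = [60, 100, 120]
weights  = [10, 20, 30]
capacity = 50

prob = pulp.LpProblem("knapsack", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(values))]
prob += pulp.lpSum(v * xi for v, xi in zip(values, x))               # maximize total value
prob += pulp.lpSum(w * xi for w, xi in zip(weights, x)) <= capacity  # weight limit
prob.solve(pulp.PULP_CBC_CMD(msg=False))

chosen = [i for i, xi in enumerate(x) if xi.value() == 1]            # -> items 1 and 2
```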

Finally, “Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests” studies how the “cognitive abilities” of LLMs can best be measured. The paper finds that static benchmarks and interactive evaluations in the context of dialogue games have complementary strengths in predicting models’ performance on cognitive tests, with dialogue games outperforming static benchmarks in discriminating between models. This provides additional motivation for the focus on interactive dialogue games in the Playpen Challenge mentioned above.

Congratulations to all authors!