In the Retroeval workshop, I gave a talk on NLG Evaluation: Past, Present, Future. People were mostly interested in my thoughts about the future, which I try to summarise below. Basically, I hope that NLG evaluation will become more meaningful, and in particular become (A) more rigorous and (B) better connected to real-world effectiveness. (A) means better designed and executed experiments; (B) means moving away from benchmarks and doing more impact, qualitative, and safety evaluation.
Experimental rigour
Scientific experiments are meaningless if they are poorly designed or executed. Unfortunately, many experiments in NLG and NLP are not rigorous; for example, they use inappropriate datasets or metrics, are not reproducible, suffer from data contamination or reward hacking, or use buggy code. I have written numerous blog posts about these problems and will not repeat them here. What is most worrying is that NLP research culture tolerates, and in some cases even encourages, poor experiments (blog).
We know how to do proper experiments, so the challenge is to change NLP research culture so that researchers value good experiments (currently some do, but many do not). One encouraging sign to me is that other communities with high standards for experimental rigour, such as medicine, are using our technology. They will insist on rigorous evaluation, and hopefully the NLP community will learn from this.
Meaningful evaluation
NLP evaluation is dominated by benchmarks and metrics. Even if these are done well (see above), scores on artificial benchmarks do not tell us much about how useful and effective a system/model/prompt will be when used in real contexts. Also, benchmark numbers become outdated very quickly as new models are announced (blog), and most users do not care about the small differences that SOTA-chasers fixate on.
If NLG and NLP are important real-world technologies, we need to evaluate them in a way which provides meaningful and useful insights to the people who want to use our tech. I deliberately say “insight” instead of numbers, because insights are what our users care about. As above, this means more focus on impact, qualitative, and safety evaluation.
Impact evaluation (blog) involves directly measuring how a deployed NLG or NLP system changes key performance indicators (KPIs). In short, instead of looking at artificial benchmarks, we measure how our system actually helps users. There are a number of ways of doing this, including randomised controlled trials and before-and-after studies. Impact evaluation is incredibly rare in the NLP community (Reiter 2025), perhaps because it takes a lot more time, resources, and planning than running benchmarks. But if we truly care about how effective our technology is, we need to do impact evaluations!
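To make the randomised controlled trial idea concrete, here is a minimal sketch of how the outcome of such a trial might be analysed. Everything in it is an assumption for illustration: the KPI (minutes taken to complete a task), the made-up numbers, and the choice of a permutation test on the difference in means. A real impact evaluation would need a proper study design, a pre-agreed KPI, and far more participants.

```python
import random
from statistics import mean


def permutation_test(control, treatment, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean KPI.

    Returns (observed_difference, p_value), where the difference is
    mean(treatment) - mean(control).
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels, as if the system had no effect.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical KPI values (minutes per task); these are NOT real results.
control = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]    # users without the system
treatment = [9.0, 10.0, 8.0, 11.0, 9.0, 10.0]     # users with the system

observed, p_value = permutation_test(control, treatment)
print(f"difference in mean KPI: {observed:.1f} minutes, p = {p_value:.4f}")
```

The point of the sketch is that the quantity being measured is a KPI from real use, not a benchmark score; the statistics are the easy part.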
Qualitative evaluation (blog) involves qualitatively analysing data for useful insights. This is often done with textual data such as user feedback, interviews, and focus groups; another useful technique is qualitative error analysis. Regardless, the idea is to carefully and qualitatively analyse a relatively small number of items, looking for deeper insights that go beyond simple numbers. Qualitative evaluation is currently pretty rare in NLP, but it can provide very valuable insights on how people react to and use NLP technology in complex and noisy real-world contexts. And these insights may remain useful and valid long after quantitative numbers become obsolete because of new models being released.
Safety evaluation (blog) involves checking if systems can be dangerous. Crucially, while most evaluation focuses on average-case behaviour, safety evaluation looks at what happens in the worst case, and in some cases what happens when adversaries (such as hackers) attack a system. Safety evaluation is already widely recognised as being very important, including by governments, and is a major research area (Bengio et al 2026). Ultimately, users and society usually care much more about whether a system is safe than whether it is marginally better than a competitor.
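The difference between average-case and worst-case reporting can be shown with a small sketch. The per-item scores below are invented for illustration (higher is better), and the function name and the 10% tail fraction are my own assumptions, not a standard safety metric.

```python
from statistics import mean


def summarise_scores(scores, tail_fraction=0.1):
    """Report average, worst-case, and worst-tail performance."""
    ordered = sorted(scores)
    # Number of items in the worst tail; always at least one item.
    k = max(1, int(len(ordered) * tail_fraction))
    return {
        "mean": mean(ordered),
        "worst": ordered[0],
        "tail_mean": mean(ordered[:k]),
    }


# Hypothetical per-item quality scores; these are NOT real results.
scores = [0.9, 0.95, 0.92, 0.1, 0.93, 0.91, 0.94, 0.9, 0.96, 0.89]
summary = summarise_scores(scores)
print(summary)
```

Here the mean looks respectable, but one output is very poor, and that single output is exactly what a safety evaluation is meant to find. Adversarial evaluation goes further, since attackers actively search for such cases rather than sampling them by chance.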
Summary
Evaluation needs to become more rigorous, in the sense of carefully designed and executed experiments; many people are saying this, not just me! We know how to do rigorous evaluations; the challenge is changing a research culture which tolerates poor experiments.
In addition, we need to change the type of evaluation we do, and move away from a fixation on quantitative evaluations based on average performance on an artificial benchmark. A bit of this is OK, but we need to look at real-world impact as well as artificial benchmarks, do qualitative as well as quantitative analyses, and look at behaviour in worst-case and adversarial contexts as well as average performance, especially if we care about safety.
I hope that we will see good progress towards these goals by 2030, and widespread adoption and acceptance by 2035.