I had a book launch in December 2024, with lots of interesting discussion. One question which made me think was how NLG evaluation 5-10 years ago compared to NLG evaluation now; let's choose 2015 (there was actually not much difference in NLG evaluation between 2015 and 2020). The short answer is that evaluation in 2015 was very disappointing; evaluation in 2025 is definitely better, but still not where it should be.
Please note that my comments below are about evaluation of text generation; I’m not talking about evaluation of problem solving, world knowledge, sentiment analysis, etc.
Automatic (metric) evaluation (better in 2025)
In 2015 (as in 2020), the great majority of NLG evaluation was done with automatic metrics. The most common metrics were BLEU and ROUGE. The frustrating thing was that these metrics were known to be poor predictors of text quality in real-world usage (blog, blog), but most people in the academic community did not care. One senior researcher agreed with me that metrics didn't mean much, but said he still used them because they made it easy to publish papers and get grants. In effect, the community insisted on numbers showing improvement over the state-of-the-art (SOTA), but didn't care whether the numbers meant anything. This made no sense to me, and was very disappointing.
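To make concrete what these metrics actually measure, here is a minimal sketch of a BLEU-based evaluation, assuming the sacrebleu package; the weather texts are invented purely for illustration. BLEU rewards n-gram overlap with reference texts, which is why a high score need not mean the text is accurate or useful to a real reader.

```python
# Minimal sketch of a metric-based evaluation, assuming the sacrebleu package.
# The reference and system texts below are invented for illustration.
import sacrebleu

references = ["Heavy rain is expected in Aberdeen on Tuesday morning."]
system_outputs = ["Heavy rain is expected in Aberdeen on Wednesday evening."]

# corpus_bleu rewards n-gram overlap with the references; it cannot tell that
# this output gets the day and time wrong, which is what a real user cares about.
bleu = sacrebleu.corpus_bleu(system_outputs, [references])
print(f"BLEU = {bleu.score:.1f}")
```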
In 2025, the best automatic evaluations use LLMs (LLM-as-judge). If this is done carefully, taking into consideration limitations, biases, and issues such as data contamination, it does often have some predictive power about real-world utility. So still some issues and concerns, but much better than BLEU!
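As a rough illustration of what LLM-as-judge looks like in practice, here is a minimal sketch; `call_llm` is a hypothetical stand-in for whatever LLM API is being used, and the rubric and JSON format are invented. Careful use also means validating the judge's scores against human judgements and checking that the test texts were not in the judge's training data.

```python
# Minimal sketch of an LLM-as-judge evaluation. call_llm() is a hypothetical
# wrapper around whatever LLM API is used; the rubric and output format are
# invented for illustration.
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send the prompt to an LLM and return its reply."""
    raise NotImplementedError("plug in a real LLM client here")

JUDGE_PROMPT = """You are evaluating a generated summary against its source data.
Source data: {data}
Generated text: {text}
Rate accuracy and fluency from 1 (very poor) to 5 (excellent) and list any
factual errors. Reply only with JSON such as
{{"accuracy": 4, "fluency": 5, "errors": ["..."]}}"""

def judge(data: str, text: str) -> dict:
    reply = call_llm(JUDGE_PROMPT.format(data=data, text=text))
    return json.loads(reply)  # in practice, validate and retry on malformed replies
```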
Human evaluation (better in 2025)
In 2015, most human evaluation involved asking crowdworkers to rate texts on Likert scales, or give a preference between texts. The problem here is that rating/preference is subjective, and crowdworkers often do not care much about doing this well. If this kind of evaluation is done very carefully (WMT direct assessment is a good example), it does mean something. But most papers I saw were not nearly as careful as WMT, and a sloppy human evaluation does not mean much. So somewhat disappointing.
In 2025, high-quality human evaluations are often done by asking domain experts to annotate texts for mistakes and problems (blog). This is a far better approach, and gives results which are less subjective and more meaningful.
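To sketch how such an annotation-based evaluation might be summarised (the error categories and severities below are invented examples, not a standard scheme): experts mark individual mistakes in each text, and the evaluation reports error counts rather than subjective ratings.

```python
# Minimal sketch of summarising an annotation-based human evaluation: domain
# experts mark each mistake they find, and we report error counts per category.
# The categories, severities, and data structures are invented for illustration.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Error:
    text_id: str
    category: str   # e.g. "incorrect number", "omission", "misleading wording"
    severity: str   # e.g. "minor" or "major"

def summarise(errors: list[Error], num_texts: int) -> None:
    per_text = len(errors) / num_texts
    major = sum(e.severity == "major" for e in errors)
    print(f"{per_text:.2f} errors per text ({major} major errors in total)")
    for category, count in Counter(e.category for e in errors).most_common():
        print(f"  {category}: {count}")
```

Unlike a Likert score, these counts can be traced back to concrete mistakes in concrete texts, which is part of what makes the results less subjective.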
Replication and robustness (still disappointing in 2025)
In 2015, very few people cared about replication; I suspect most people didn't even understand what it meant. The community's insistence on “impressive numbers, but we don't care if they mean anything” meant that most people did not care about doing robust experiments; after all, what is the point of robustly measuring something which is meaningless? Definitely depressing…
In 2025, I think most researchers do understand what replication means, but unfortunately, in my experience, most NLP researchers are hostile to replication and do not support it (blog). We are seeing more papers about experimental quality and robustness (blog), but again, in practice many people do not seem to care much about this (blog). Maybe this goes back to reviewing: since reviewers cannot check whether authors will support replication, or whether an experiment was properly done, some researchers may simply decide to ignore these issues. Post-publication monitoring (blog) would help address this, but it remains unheard of in NLP venues. So still pretty depressing.
Impact evaluation (still disappointing in 2025)
In 2015, it was almost unheard of to evaluate NLG systems on the basis of real-world impact; unfortunately, this remains true in 2025. We’ve done some impact evaluations in my group (blog), but I’ve seen very little elsewhere looking at the real-world impact of text generation (although there has been some work on the real-world impact of code generation). I was recently trying to help some PhD students (not at Aberdeen) write a survey related to evaluation, and it was clear that they had little exposure to, or knowledge of, impact evaluation.
This is very depressing, especially in 2025. We tell people that AI will change everything and have a huge impact on human lives and societies, but we refuse to actually try to measure this impact…
Commercial pressure (worse in 2025)
In 2015, commercial NLG companies (such as Arria) existed, but they didn't have much influence over how academic NLG systems were evaluated. Unfortunately, in 2025 the big LLM vendors (especially OpenAI) have a lot of influence over evaluation, including in academic papers, and I strongly suspect they push evaluations which make their systems look good (which explains the problems I described in blog); essentially, they view evaluation as marketing. The LLM vendors also do not encourage NLG evaluations (blog), despite the fact that LLMs are generative language engines.
So this is an area where the situation is worse in 2025 than in 2015.
Final Thoughts
In 2025 we have much better evaluation techniques than in 2015, especially LLM-as-judge for automatic evaluation and annotation-based human evaluation. This is great, and so is the pace of progress: evaluation in 2020 was largely unchanged from 2015 and even 2010 (e.g., focus on BLEU), so it is very encouraging to see how much has changed by 2025. Progress continues; for example, I am very interested in human evaluations based on error-span annotation, and we plan to use this in one of our projects in 2025.
What is discouraging is the lack of progress in replication, experimental robustness, and (especially) evaluating impact. Unfortunately, much of the research community still views evaluation as a kind of game, where the goal is to come up with impressive numbers, with little concern for whether the numbers are produced robustly or mean anything in real-world usage. Looking forward, I am concerned about the strong and growing influence of LLM vendors who view evaluation as a marketing exercise.
So overall, I see a lot of progress in techniques and technology, but much less progress in culture and attitudes, which perhaps is not surprising; tech changes much faster than people!