I’m writing a book on NLG (blog), and for the past few weeks I’ve been working on the evaluation section. This has encouraged me to take a ‘big picture’ perspective, and one possibility I see for the future is that evaluations based on LLMs such as PaLM and GPT will replace current metrics (BLEU, BLEURT, etc) and also lower-quality human evaluations (eg, most studies which ask Turkers for Likert ratings). At the same time, widespread use of LLMs will encourage the development of high-quality human evaluations to assess aspects of quality that are difficult for LLMs, such as semantic and pragmatic correctness.
In short, the current mix of evaluation in NLG (mostly metrics, including obsolete ones such as BLEU; some human evaluations, many of questionable quality) will be replaced by a combination of LLM-based evaluations and high-quality human evaluations.
Current state of evaluation in NLG
In 2023 most evaluation in NLG is done using automatic metrics. Older and obsolete metrics such as BLEU are still heavily used, despite the existence of much better alternatives. However, even modern trained metrics such as BLEURT struggle to measure semantic correctness (eg, hallucinations and omissions) and pragmatic correctness.
Some evaluation is done using human evaluation, but the scientific quality of a lot of human evaluation is poor, and recent work emphasises that poor-quality human evaluation is not very useful. However, we are seeing more work on high-quality human evaluation protocols. I’m especially excited by annotation-based protocols such as MQM and our work on evaluating factual accuracy in data-to-text.
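To make the contrast with Likert ratings concrete, here is a minimal sketch of the kind of record an annotation-based protocol such as MQM collects, and how it might be aggregated; the field names and severity weights below are illustrative placeholders, not the actual MQM specification.

```python
# Sketch of annotation-based evaluation: instead of one Likert rating per text,
# annotators mark individual error spans with a category and severity,
# which can then be aggregated into a score.
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    text_id: str      # which generated text the error is in
    span: tuple       # (start, end) character offsets of the error
    category: str     # eg "accuracy/hallucination", "fluency/grammar"
    severity: str     # eg "minor", "major", "critical"

annotations = [
    ErrorAnnotation("sys1-doc3", (42, 57), "accuracy/hallucination", "major"),
    ErrorAnnotation("sys1-doc3", (103, 110), "fluency/grammar", "minor"),
]

# Weighted error count per text (weights are illustrative only)
weights = {"minor": 1, "major": 5, "critical": 10}
score = sum(weights[a.severity] for a in annotations)
print(score)
```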
LLM-based evaluation
At the time of writing, there is a lot of excitement about using LLMs to evaluate generated texts. I was especially impressed by a recent paper by Kocmi and Federmann, which showed that GPT-3.5 could evaluate machine translation outputs better than existing metrics, using straightforward prompts, no examples, and no reference texts. I know a lot of other people are exploring this space, and it seems plausible to me that LLM-based evaluation could replace BLEU, BLEURT, lower-quality human evaluations, etc. Overall I think this would be a good thing; maybe we’ll finally see the end of BLEU…
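To make this concrete, here is a rough sketch of what such a reference-free, zero-shot evaluation loop might look like. The prompt wording is only in the spirit of the paper, not the exact prompt used there, and `call_llm` is a placeholder for whatever LLM API is being used.

```python
# Sketch of zero-shot, reference-free LLM evaluation of a translation:
# the LLM is simply asked to rate the translation on a 0-100 scale.

PROMPT_TEMPLATE = """Score the following translation from {src_lang} to {tgt_lang} \
on a continuous scale from 0 (no meaning preserved) to 100 (perfect meaning and grammar).

{src_lang} source: "{source}"
{tgt_lang} translation: "{translation}"
Score:"""

def evaluate_translation(source, translation, src_lang, tgt_lang, call_llm):
    prompt = PROMPT_TEMPLATE.format(
        src_lang=src_lang, tgt_lang=tgt_lang,
        source=source, translation=translation)
    response = call_llm(prompt)      # single completion request, no examples, no reference
    return float(response.strip())   # assumes the model returns just a number
```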
However, it is essential that researchers carefully validate where LLM-based evaluation correlates well with high-quality human evaluations, and what sorts of problems it misses. I attended an evaluation webinar led by someone from OpenAI, who claimed that the best way to evaluate GPT4 outputs was by using GPT4 as an evaluation tool. He did not present any evidence to support this other than the fact that he had eyeballed a few ad-hoc evaluation examples and they looked good. GPT4 may well be an excellent evaluation tool, but we need to scientifically validate this claim, and understand where it works and where it doesn’t!
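For example, a basic validation step might correlate LLM scores with high-quality human judgements on the same texts, rather than eyeballing a few examples; the numbers below are placeholders, purely to show the shape of the analysis.

```python
# Sketch of validating an LLM evaluator against human judgements.
from scipy.stats import pearsonr, kendalltau

human_scores = [4.5, 2.0, 3.5, 5.0, 1.5]        # eg MQM-derived or expert ratings (placeholders)
llm_scores   = [88.0, 40.0, 75.0, 95.0, 35.0]   # eg LLM 0-100 ratings of the same texts (placeholders)

pearson_r, _ = pearsonr(human_scores, llm_scores)
kendall_tau, _ = kendalltau(human_scores, llm_scores)
print(f"Pearson r = {pearson_r:.2f}, Kendall tau = {kendall_tau:.2f}")

# Equally important: inspect the texts where LLM and human scores disagree most,
# to understand what sorts of problems the LLM evaluator misses.
```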
Another important point is that in order to make it possible to replicate evaluations, we need to use LLMs which are fixed and do not change (many commercial LLMs such as GPT4 are constantly being updated, which improves performance but makes replicability hard). Open-source LLMs would be ideal from a replicability perspective.
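A minimal sketch of what pinning an open-source evaluator might look like, assuming a Hugging Face style setup; the model id and revision hash are placeholders, and the point is simply that the checkpoint and decoding settings are fixed and reportable.

```python
# Sketch of a replicable open-source LLM evaluator: pin the exact model
# checkpoint (via a specific revision) and use deterministic decoding.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-org/some-open-llm"   # placeholder model id
REVISION = "abc123"                     # placeholder: pin an exact commit/checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, revision=REVISION)

def score_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding (no sampling) so the same prompt always gives the same output
    outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```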
The other impact of LLMs is that people are taking semantic and even pragmatic correctness more seriously. For many years evaluation techniques for semantic correctness were awful, but few researchers cared, perhaps because they were more interested in leaderboard positions based on rubbish metrics than in real-world utility. But this is changing because semantic correctness is a huge problem in real-world usage of LLMs. I think/hope this will lead to more emphasis on high-quality human evaluations, especially task-based and annotation-based evaluations; these are the only reliable way (at least in 2023) to evaluate semantic and pragmatic quality.
NLG Evaluation in 2025
Putting this together, perhaps by 2025 NLG evaluation will look like the following:
- Automatic evaluation is mostly done using LLMs. We will have a small number of standard protocols (eg, LLM+prompt pairs) which have been carefully validated as described above. Most of these will use fixed LLMs in order to enhance replicability. BLEU and ROUGE will be ancient history, and BLEURT will be fading.
- High-quality human evaluation will be widely used and indeed expected for high-prestige scientific papers. Annotation-based and task-based protocols will largely replace Likert scales, and researchers will be very careful in both designing and executing their experiments.
- For all types of evaluation, researchers will be expected to respond to concerns and answer questions after their paper is published; this will be supported by formal discussion forums for papers (blog).
The above would definitely be progress and lead to better and more meaningful evaluations. Maybe I’m being naive, but I especially hope that we will finally end the use of BLEU (and ROUGE). I’ve been complaining about BLEU for almost 20 years (paper), as have others. There is no scientific justification for its use in 2023, but unfortunately many researchers are reluctant to change. Sometimes a “shock” is the best agent of change, and perhaps LLMs can provide such a shock!