
Do LLM benchmarks ignore NLG?

Over the holiday period I am trying to catch up on reading, and I just finished looking at the paper announcing the Amazon Nova LLMs. When I read the evaluation section, which mentioned dozens of benchmarks, it struck me that very few of these evaluated language generation, ie the ability to produce an appropriate, high-quality text. This seems weird to me, since LLMs are language-generation engines (this is certainly how the public views them); so why is there so little interest in evaluating their language-generation capabilities?

Benchmarks used to evaluate Nova

More concretely, most of the text-related benchmarks are based on question answering, where the result is either selecting a multiple-choice option or providing a short answer (eg MMLU, BBH, GPQA). These evaluate world knowledge, reasoning, and language understanding, but they do not evaluate text generation. The report does show a text generated for ChartQA (fig 10), but the ChartQA benchmark only assesses the correctness of a short answer; it does not assess the quality of a paragraph-length text.

A few of the metrics do assess things related to text generation, but I am mostly not impressed. Most of them are n-gram-based metrics such as BLEU, ROUGE, and CIDEr, which score a text by its surface overlap with reference text(s) (a minimal sketch of this kind of scoring appears after the list below); no one should be using such techniques to assess text quality in 2024.

  • IFEval: tests whether generated texts obey simple constraints, such as containing at least 25 sentences. This doesn't say much about the ability to generate good texts.
  • COMET and spBLEU for machine translation: COMET is fine; it is the one NLG-related metric used here in which I have confidence. As above, no one should be using BLEU in 2024.
  • CIDEr for image/video captioning: again, I am not impressed by CIDEr as a way of assessing NLG quality (as opposed to the ability to “understand” an image/video).
  • ROUGE for summarisation; no one should be using ROUGE!
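To make concrete what these metrics are doing, here is a minimal sketch of n-gram-overlap scoring. This is a simplified illustration, not the actual BLEU/ROUGE/CIDEr formulas (which add clipping, multiple n-gram orders, length penalties, etc.), and the example texts are invented; the point is that a perfectly good paraphrase can get a near-zero score simply because it uses different words from the reference.

```python
# Simplified illustration of n-gram-overlap scoring (not an official
# BLEU/ROUGE/CIDEr implementation); example texts are invented.
from collections import Counter

def ngrams(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "the home team won the match three to one"
candidate = "a 3-1 victory for the hosts"   # accurate and fluent, but worded differently

print(ngram_precision(candidate, reference))  # 0.0: no bigram overlap with the reference
```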

I am happy that COMET was used to evaluate MT; this is a solid evaluation. Otherwise I am disappointed.

In 2024, the best way to automatically evaluate text quality is LLM-as-judge (blog). However, the Nova report only mentions this once, for CRAG, which uses LLMs to simply assess whether an answer is accurate, incorrect, or missing (ie, the LLM chooses one of these three categories).
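For illustration, below is a minimal sketch of what an LLM-as-judge check of generated-text quality could look like. This is not the protocol used in the Nova report; `call_llm` is a placeholder for whichever model API is available, and the criteria and rating scale are just examples.

```python
# Sketch of an LLM-as-judge quality check (illustrative only).
# `call_llm` is a placeholder: any function that takes a prompt string
# and returns the judge model's reply as a string.
import json

JUDGE_PROMPT = """You are evaluating a machine-generated text.

Source data / task description:
{task}

Generated text:
{text}

Rate the text from 1 (unusable) to 5 (publication quality) on each of:
fluency, accuracy (faithful to the source data), and coverage (includes
the important content). Reply with JSON only, for example
{{"fluency": 4, "accuracy": 3, "coverage": 5, "comment": "..."}}"""

def judge_text(task: str, text: str, call_llm) -> dict:
    """Ask a judge model to score one generated text; returns parsed scores."""
    reply = call_llm(JUDGE_PROMPT.format(task=task, text=text))
    return json.loads(reply)
```

In practice such judge scores should be validated against human judgements, which is why I suggest combining LLM-as-judge with expert evaluation below.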

What is missing

I would love to see a proper evaluation of a real text-generation task. This could include a mixture of careful evaluation by domain experts (which Amazon and friends can easily afford) and LLM-as-judge. Of course such an evaluation would not be perfect, but it would tell us much more about how well Nova can generate texts than the current evaluation does!

Below are some concrete suggestions, based on the NLG application categories I discuss in my book: journalism, business intelligence, summarisation, and healthcare. There are of course many other possibilities (eg, see Kasner and Dušek 2024).

Sports summaries (journalism): This task is to generate a sports story (suitable to appear in the media) about a sports game, based on game data (both statistics and play-by-play data). My students have worked on generating summaries of basketball games, but other sports could be used. There are already some sports datasets (eg, SportSett), but they contain old data, so it would be better to create a new dataset (as suggested by Kasner and Dušek 2024) in order to address data-contamination concerns.
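To illustrate, one test item for this task might look something like the sketch below. The team names, numbers, and field names are invented, and real resources such as SportSett have much richer schemas.

```python
# Hypothetical shape of one sports-summary test item (all values invented).
game_data = {
    "home_team": "Rovers", "away_team": "United",
    "final_score": {"Rovers": 88, "United": 84},
    "top_scorers": [{"player": "J. Smith", "team": "Rovers", "points": 31}],
    "play_by_play": [
        {"quarter": 4, "clock": "00:12", "event": "J. Smith makes 3-pt shot"},
    ],
}

prompt = (
    "Write a short news story about this basketball game, suitable for a "
    "sports website, based only on the data below.\n"
    f"{game_data}"
)
# The generated story would then be assessed by domain experts and/or an
# LLM judge (eg the judge_text sketch above), not by n-gram overlap
# with a reference story.
```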

Investor reports (business intelligence): Nova has been evaluated on financial QA based on earnings reports. How about taking this a step further and generating a financial summary (looking at broader data, not just earnings reports) for potential investors? This is not a task I am very familiar with, but it seems a natural extension of Nova’s current FinQA evaluation.

Summarising doctor-patient consultations (summarisation and healthcare): This task is to generate a summary of a doctor-patient consultation, suitable to be entered into the patient’s electronic health record; again this is something my students have worked on. There are some such datasets already, such as PriMock57, but as with sports (above) it would be better to create a new dataset because of data-contamination concerns.

Final Thoughts

The evaluation suite for Amazon Nova, which I assume is typical of other LLMs (I have not checked), emphasises question answering based on world knowledge, reasoning, and language understanding. This is fine, but I am very disappointed to see so little decent evaluation of the quality of generated texts. To put this another way, the coverage of mathematical problem-solving ability seems better than the coverage of NLG (and at the time of writing there is a lot of excitement about new benchmarks in mathematical problem solving). But I suspect very few people use LLMs to solve math problems, while many people use them to produce summaries, letters, documents, etc.

Perhaps I have misunderstood something; if so, please let me know! Also, if anyone wants to build a proper “NLG benchmark” for LLMs, I would be happy to help.

Postscript (27 Dec 2024): I want to be clear that I think Amazon’s benchmarking of Nova is better than a lot of what I see in this space! I focus on it exactly because it is a good recent example of LLM benchmarking. For example, someone asked me on X/Twitter about Stanford’s HELM benchmark suite. HELM evaluates machine translation using BLEU scores on the WMT14 data set, which is a ridiculous way to evaluate MT in 2024; Amazon’s approach of using COMET scores on the FLORES data set is much better.
