
One-day class on NLG evaluation

Last week I ran a one-day class on NLG evaluation for IBM in Dublin. It covered many topics at a fairly high level. The overall goal was to give people more insight into different types of evaluation and what goes wrong in evaluations; hopefully this will both help people do better evaluations themselves, and also make them more critical of weak evaluations in published papers. Ie, I want people to think carefully about how evaluation is done, rather than just run standard libraries and take numbers at face value.

Anyway, it was a nice experience, and people have asked me about my slides, so below I summarise what I did and give links to my slides. Note that my slides are designed to be visual aids for a presentation; they are not intended to be read stand-alone.

Session 1: Introduction to NLG (PDF)

The first session was a general high-level introduction to NLG and data-to-text, which was not specifically about evaluation.

The most important take-home message was that there are many ways to build an NLG system, and using the latest LLM is not always the best option (blog).

Session 2: Requirements (PDF)

The second session was about requirements for NLG systems, focusing on NLG quality criteria, human-AI workflows, and how to acquire requirements. Of course there is a large literature on software requirements; I tried to focus on important aspects of NLG requirements which may not be covered in standard textbooks on software requirements.

The key take-home message was that we cannot usefully evaluate an NLG system without knowing what its users care about. For example, in medical contexts users may insist that NLG systems always produce texts which do not have critical accuracy errors. Hence an evaluation which ignores this and just looks at readability (for example) is not going to tell us much about how useful the NLG system will be to users (blog).

I had a number of discussions about this with attendees during breaks, and hopefully the message got across. There is an unfortunate tendency in NLP for researchers to evaluate things which are easy to measure even if users have little interest in them; I hope I convinced attendees not to do this.

Session 3: Evaluation Concepts (PDF)

The third session looked at some basic evaluation concepts, including experimental perspective, statistical hypothesis testing, experimental design, replication, and evaluation challenges of LLMs. I have of course discussed these topics in many of my previous blogs.

One very important take-home message was that there are some common mistakes which degrade the quality of evaluations, including evaluating the wrong thing (as mentioned above), comparing against weak baselines, using outdated evaluation techniques, using unrepresentative test sets, data contamination, and buggy experimental code (blog). Unfortunately, I suspect that most published papers in NLP suffer from at least one of these problems, which is depressing.

I had some nice chats about this during breaks, especially with people from other communities (not AI), some of whom were surprised that the NLP and AI communities are not more concerned about this. It might be possible to change reviewing processes (especially for top journals) so that they do a better job of detecting such problems, but I won't discuss this here.

Session 4: Automatic evaluation (PDF)

The fourth session gave an overview of automatic (metric) evaluation. I discussed different types (eg, reference-based, referenceless, LLM-as-judge), and gave examples of some popular techniques. I also looked at experimental design, including problems specific to automatic evaluation such as running an evaluation 1000 times (different random seeds, maybe different parameters) and only reporting the best result, and using low-quality reference texts.
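The "run the evaluation many times and report only the best result" problem is easy to demonstrate with a toy simulation (a hypothetical sketch I made up to illustrate the point, not material from the slides): even when a system's true metric score is fixed, the maximum over many seeded runs is systematically inflated.

```python
import random

random.seed(0)

TRUE_SCORE = 0.60   # hypothetical "true" metric score of a system
NOISE = 0.02        # run-to-run variation (different seeds, parameters)

def one_run():
    """Simulate one evaluation run with random variation."""
    return random.gauss(TRUE_SCORE, NOISE)

# Honest reporting: average over a few runs.
honest = sum(one_run() for _ in range(5)) / 5

# Questionable reporting: run 1000 times and keep only the best.
best_of_1000 = max(one_run() for _ in range(1000))

print(f"mean of 5 runs:    {honest:.3f}")
print(f"best of 1000 runs: {best_of_1000:.3f}")  # noticeably above the true score
```

The "best" number here does not reflect any real improvement; it is pure selection on noise.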

The most important take-home message, at least for me, was the importance of validation. Metrics should not be used unless there is strong evidence that they correlate well with high-quality “gold standard” human evaluations (blog). Unfortunately I’ve seen people propose and use metrics purely because they “look sensible” (no experimental evaluation data), and also use metrics which are poorly validated (ie, compared to a low-quality human evaluation which suffers from serious experimental flaws).
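Validating a metric essentially means checking how well it tracks gold-standard human judgments; a common approach is rank correlation. Below is a minimal sketch with invented scores (the data and the five-output setup are my own illustration, not from the class):

```python
def ranks(values):
    """Rank values (1 = smallest); ties not handled, fine for this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: metric scores and gold-standard human ratings
# for the same five system outputs.
metric_scores = [0.31, 0.45, 0.52, 0.78, 0.90]
human_ratings = [2, 3.5, 3, 4, 5]

print(f"Spearman correlation: {spearman(metric_scores, human_ratings):.2f}")
```

Of course this only tells us something if the human ratings themselves come from a careful, well-designed evaluation; correlating against a flawed human study is exactly the poor-validation problem described above.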

Session 5: Human evaluation (PDF)

The fifth session was about human evaluation. I gave examples, briefly discussed research ethics, and looked at experimental design. Problems specific to human evaluation include choosing inappropriate subjects, not giving subjects enough guidance on their task, and not checking that subjects (especially crowdworkers) are taking the task seriously.

A very important take-home message was that there are many types of human evaluations. Most people ask subjects to rate texts (eg, on a Likert scale) or rank a set of texts, both of which are subjective (based on the subject's opinion). There are more objective ways of doing human evaluation, including asking subjects to annotate individual errors and problems (blog), measuring the impact of a text on how well a subject does a task (blog), and assessing real-world impact on Key Performance Indicators (KPIs) of deployed systems (blog). All of these are better (ie, give more meaningful evaluations) than asking people to rate or rank texts.

Session 6: Hands-on Evaluation Exercise (Google Form)

I had discussions throughout the day, but after covering the above material I asked participants to fill out a Google Form which asked them to perform ratings-based, annotation-based, and task-based human evaluations of simple texts which describe basketball games, and then repeat this exercise by asking their favourite LLM to rate, annotate, and post-edit a generated text. Most people took around 30 minutes to complete the form, and then we had a discussion.

With regard to the human evaluation, it was noticeable that agreement was much better for the annotation exercise (most people found the same errors) than for the ratings (eg, Likert responses for "Text is accurate" ranged from Agree to Somewhat disagree). This is typical; annotation usually gives more consistent results between subjects. It does however take longer than just rating a text.
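The agreement difference can be made concrete with a toy sketch (the annotations and ratings below are invented for illustration, not the actual class data): annotators largely find the same errors, while Likert ratings of the same text spread widely.

```python
# Invented results for three evaluators of the same basketball text.
# Annotation: each evaluator lists the error types they found.
annotations = [
    {"wrong score", "wrong player name"},
    {"wrong score", "wrong player name"},
    {"wrong score"},
]

# Rating: each evaluator's Likert response to "Text is accurate"
# (1 = strongly disagree ... 5 = strongly agree).
ratings = [4, 2, 3]

# Annotation agreement: fraction of evaluator pairs who found
# exactly the same error set (crude but transparent).
pairs = [(a, b) for i, a in enumerate(annotations)
         for b in annotations[i + 1:]]
ann_agreement = sum(a == b for a, b in pairs) / len(pairs)

# Rating agreement: range of scores (smaller = more consistent).
rating_spread = max(ratings) - min(ratings)

print(f"annotation pairs in full agreement: {ann_agreement:.2f}")
print(f"rating spread (Likert points):      {rating_spread}")
```

A real study would use a proper agreement statistic (eg, Krippendorff's alpha), but even this crude version shows why error annotation tends to be more reproducible than ratings.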

We had a good discussion about LLM-based evaluation. It didn't work very well, nor did it agree with the human ratings, somewhat to the surprise of some of the participants. A few people said that they thought careful prompt engineering would improve the quality of LLM evaluation; I'm sure this is true, but I suspect problems would remain.

Session 7: Other topics (PDF)

The final session looked at a few other topics of interest, including commercial evaluation (costs, benefits, risks, return on investment) and relevant blogs I have written.

I also briefly described six recent (2024) evaluation research papers which I really liked:

I highly recommend all of the above papers!

Final Thoughts

Feedback on the above presentations is very welcome. This is the first time I've focused so heavily on "what goes wrong". I'm partly inspired by Greenhalgh's classic How to Read a Paper material for medics, and amongst other things I hope that my material helps researchers be more discerning readers as well as better experimenters.

3 thoughts on "One-day class on NLG evaluation"

  1. I have some doubts regarding various evaluation measures.

    Many measures assume that the user is rational and neutral. However, real users (humans) are boundedly rational and hence susceptible to cognitive biases. For example, a person’s assessment of an NLG output or a summary may vary from time to time depending on their environment.

    How “ecologically valid” are the assumptions behind evaluation metrics?


    1. I agree that subjective evaluation (eg, asking subjects to rate texts on a Likert scale) has robustness issues, with wide differences between subjects and (as you say) some within-subject differences as well. This is why I recommend doing human evaluation with more objective techniques if possible, such as annotating errors, measuring impact on task performance, or assessing real-world impact on KPIs.

