
Can LLM-based eval replace human evaluation?

I’ve had several chats over the past month about whether LLM-based evaluation can replace human evaluation. I.e., evaluate texts by asking GPT (etc) to assess their quality, instead of asking people. Of course the LLM evaluation must be done well; for example, LLMs should not be asked to evaluate their own output (i.e., do not ask GPT-4 to evaluate text produced by GPT-4).

Expanding on what I said in a previous blog, I think LLMs can replace most “ask Turkers to rate texts on a Likert scale” evaluations, which may be the most common kind of human evaluation in NLP. But LLM evaluation cannot replace high-quality evaluations based on real-world impact and/or annotation by domain experts.
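To make this concrete, below is a minimal sketch of the kind of LLM-based Likert rating I have in mind. It assumes the OpenAI Python client; the model name, prompt wording, and 1–5 scale are purely illustrative assumptions on my part, not a recommended protocol.

```python
# Minimal sketch of LLM-based Likert rating (illustrative only).
# Assumes the OpenAI Python client; the model name, prompt wording and the
# 1-5 scale are placeholder choices, not a recommended protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_likert_rating(text: str, criterion: str = "fluency") -> int:
    """Ask an LLM to rate one generated text on a 1-5 Likert scale."""
    prompt = (
        f"Rate the {criterion} of the following text on a scale from 1 (very poor) "
        f"to 5 (excellent). Reply with a single digit only.\n\nText:\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the evaluator LLM should not be the system being evaluated
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the model really does reply with a single digit; real code should check.
    return int(response.choices[0].message.content.strip())
```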

In this blog I assume that the goal of evaluation is to assess quality criteria such as readability, accuracy, and usefulness, as they would be perceived by real-world users of the NLG system and its texts. So for a quality criterion Q, evaluation A is better than evaluation B if A is a better predictor of Q in real-world usage.
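To make “better predictor” concrete, here is a toy sketch of comparing two evaluations by correlating their scores with a measure of real-world quality. Spearman correlation is just one plausible choice, and the function and variable names are my own.

```python
# Toy sketch of comparing two evaluations by predictive power (illustrative).
# Spearman correlation is one reasonable choice; the real-world quality
# measure Q must come from actual usage data.
from scipy.stats import spearmanr

def predictive_power(eval_scores: list[float], real_world_q: list[float]) -> float:
    """Correlation between an evaluation's scores and real-world quality Q."""
    rho, _ = spearmanr(eval_scores, real_world_q)
    return rho

# Evaluation A is better than evaluation B (for criterion Q) if
# predictive_power(scores_A, q) > predictive_power(scores_B, q).
```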

Mostly replaceable: Rating/ranking evaluations with crowdworkers or students

The most popular human evaluation in NLP is to ask students, colleagues, or crowdworkers to read generated texts and either

  • Rate the texts with regard to a quality criterion such as clarity or accuracy; rating is usually done on a Likert scale.
  • Rank a set of texts on the basis of a quality criterion; for example, given two texts, say which of them is more accurate.

Van der Lee et al. describe such evaluations in detail, and give best-practice recommendations for carrying them out.

One problem with evaluations of this type is that many suffer from poor design (blog) or poor execution (blog). These are definitely inferior to LLM-based evaluations.

However, even a well-designed and well-executed evaluation of this type is inherently subjective, since it is based on the opinions of the subjects used in the experiment. This is an issue because students, colleagues, and/or crowdworkers are unlikely to be representative of real-world users of the NLG system, which limits the predictive power of the evaluation. Partially for this reason, I suspect that LLM-based evaluation metrics may give more objective and meaningful assessments for the most common quality criteria.

Having said this, I think rating/ranking evaluations are still very useful (and tell us more than LLM evaluation) for more subtle quality criteria such as trust and safety, *if* they are done with representative users and/or domain experts (and of course well designed and carefully executed). For example, Balloccu et al (2024) asked experts to rate the safety of texts produced by GPT in response to dietary struggles; we have not been able to get LLMs to robustly replicate these ratings.

Partially replaceable: evaluation by annotations

There is more to human evaluation than asking Turkers to assess texts on a Likert scale (blog)! In particular, I am a strong believer in annotation-based evaluation (blog), especially when done by domain experts.

Annotation evaluations ask people to carefully read generated texts and mark up errors and other problems. We have done a lot of this kind of evaluation in my group, and I am convinced that it gives much more meaningful results than crowdworker Likert ratings. Annotation-based evaluation by domain experts is also at the heart of the MQM evaluation protocol in machine translation.
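As a rough illustration of what annotation-based evaluation produces, here is a sketch of an error-annotation record and a simple weighted error score. The error categories and severity weights below are illustrative choices of mine, not the official MQM values.

```python
# Sketch of annotation-based evaluation: annotators mark error spans, which
# we summarise into a weighted error score. Categories and severity weights
# are illustrative, not the official MQM values.
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    text_id: str
    span: tuple[int, int]   # character offsets of the problem in the text
    category: str           # e.g. "accuracy/omission", "fluency/grammar"
    severity: str           # "minor", "major", or "critical"
    comment: str = ""       # annotator's free-text explanation

SEVERITY_WEIGHT = {"minor": 1, "major": 5, "critical": 10}  # illustrative weights

def weighted_error_score(annotations: list[ErrorAnnotation]) -> float:
    """Higher score = more / worse errors in the annotated text(s)."""
    return sum(SEVERITY_WEIGHT[a.severity] for a in annotations)
```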

I am also convinced that annotation evaluation can give us a much better understanding of real-world quality than LLM-based evaluation. A great example of this was a recent paper (Magesh et al 2024) which asked legal experts to annotate legal texts produced by LLMs, and identified numerous problems and issues; we have done something similar in healthcare. I doubt that LLMs could have detected these errors and issues.

There is work on prompting LLMs to do annotation-based evaluation (eg Kocmi et al 2023), which has limitations but presumably will get better over time. My view is that we can only ask LLMs to do annotation-based evaluations when we have an excellent understanding of the annotation process and what we expect to find (which is the case for the MQM protocol that Kocmi et al looked at); which means we first need to ask domain experts to do the annotation, in order to develop and get experience with the protocol. I also suspect LLMs may struggle to annotate subtle quality criteria like the ones mentioned above.
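For illustration, below is a sketch of asking an LLM to return MQM-style error annotations as JSON. It is in the spirit of this line of work rather than any actual published prompt; the prompt wording, model name, and output schema are my own assumptions, and real code would need to validate the model's output.

```python
# Sketch of prompting an LLM to produce MQM-style error annotations.
# In the spirit of work such as Kocmi et al (2023), but the prompt, model
# name and JSON schema here are my own illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def llm_error_annotations(source: str, generated: str) -> list[dict]:
    """Ask an LLM to list errors in a generated text, relative to its source."""
    prompt = (
        "List the errors in the generated text relative to the source. "
        "Return a JSON array of objects with keys 'span', 'category' and "
        "'severity' (minor/major/critical). Return [] if there are no errors.\n\n"
        f"Source:\n{source}\n\nGenerated text:\n{generated}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # LLM output is not guaranteed to be valid JSON; real code needs error handling.
    return json.loads(response.choices[0].message.content)
```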

Not replaceable: impact evaluation

If we really want to understand how useful an NLG system is, there is no substitute for an impact-based evaluation (blog). In such evaluations the system is deployed and used for real, and we measure its impact on users and key performance indicators (KPIs). Impact evaluations by definition must look at real-world usage; they cannot be done by LLMs!

For example, many years ago we evaluated a smoking-cessation system using a randomised controlled trial (and discovered that it was not effective), and more recently my student Francesco Moramarco evaluated a medical report generator by measuring the impact of the deployed system on the productivity of its users; both of these are described in a previous blog.

We can also evaluate the impact an NLG system has on task performance in an artificial context. For example, in the Babytalk project, we showed doctors NLG and human-written summaries of patient data, asked them to make an intervention decision, and then checked if their decision matched the gold standard (Portet et al 2009).
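Analysing such a study is straightforward in principle: compare each participant's decision against the gold standard, separately for each condition (e.g. NLG summary vs human-written summary). The sketch below is a toy version of that calculation; the data structure and names are illustrative, not the actual Babytalk analysis.

```python
# Toy sketch of scoring a task-performance evaluation: how often do
# participants' decisions match the gold-standard decision, per condition?
# Illustrative only; not the actual Babytalk analysis.
from collections import defaultdict

def accuracy_by_condition(decisions: list[dict]) -> dict[str, float]:
    """decisions: [{'condition': 'nlg', 'decision': ..., 'gold': ...}, ...]"""
    correct = defaultdict(int)
    total = defaultdict(int)
    for d in decisions:
        total[d["condition"]] += 1
        correct[d["condition"]] += int(d["decision"] == d["gold"])
    return {cond: correct[cond] / total[cond] for cond in total}
```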

Impact evaluations in NLG require substantial amounts of time, money, and effort, and also require addressing ethical issues; if people are going to use a system for real, we need to show that it will not harm them. I suspect they are more common in commercial (as opposed to academic) contexts, but it is hard to say for sure, since companies tend to only publish evaluations that make their product look good.

Final thoughts

There may not be much of a future for the types of human evaluation which have dominated NLP in the past, namely asking crowdworkers, students, or colleagues to rate or rank texts. But this doesn't mean that human evaluation is no longer important or needed, because there are other (better) types of human evaluation which are still extremely useful and which will tell us things about our NLG systems and texts which we cannot learn any other way!
