
Are Experts Needed in Human Evaluation?

** Paper will be presented at ACL in the 16:15 Generation session on Tuesday 11 July

One of the perennial questions in human evaluation of NLP systems is whether the evaluation needs to be done by domain experts, or whether we can use non-experts such as students, colleagues, and crowd workers (which of course is much cheaper, quicker, and easier to organise). Sometimes the answer is clear (eg, an NLG system that produces complex clinical texts must be evaluated by clinicians), but sometimes it’s not immediately obvious whether experts are needed.

A team from the PhilHumans project, led by Zixiu (Alex) Wu and Simone Balloccu, has a paper on this topic at ACL2023, called Are Experts Needed? On Human Evaluation of Counselling Reflection Generation. From my perspective, the key finding is that the usefulness of evaluations with non-experts decreases as LLMs get better. You can read the paper for more details.

Domain

The team is working in a therapy domain (see previous work on AnnoMI), and in particular is interested in using large language models such as GPT2 and GPT3 to generate reflections, where the therapist reflects back to the patient what the patient has said, for example “It sounds like you’re really upset with her because she invaded your privacy.”

From an evaluation perspective, the challenge is to identify when a reflection is appropriate and when it is not. The team also wants to characterise inappropriate reflections as:

  • Malformed: suffers from unclear references, bad grammar, and/or confusing logic.
  • Dialogue-contradicting: contradicts context partially or fully.
  • Parroting: repeats a part of context unnaturally.
  • Off-topic: little to no relevance to context.
  • On-topic but unverifiable: relevant to context but including content that cannot be verified based on context alone.

Study

The team used both GPT2 and GPT3 to generate reflections, and asked both therapists (experts) and lay people to evaluate these reflections as well as gold-standard reflections (what therapists actually said). One finding is that the correlation between lay people and therapists was never dreadful, especially if we take into consideration that this is a difficult subjective task with only moderate inter-annotator agreement (Randolph’s kappa was mostly around 0.4). So it’s not a disaster to use lay people in this task.
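
For readers unfamiliar with Randolph’s kappa (a free-marginal multirater agreement statistic), the sketch below shows roughly how such a score is computed. This is a minimal illustration, not the paper’s code; the function name and the example ratings are made up.

```python
import numpy as np

def randolphs_kappa(ratings, n_categories):
    """Free-marginal multirater kappa (Randolph, 2005).

    ratings: array of shape (n_items, n_raters) with integer category labels.
    """
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    # Count how many raters assigned each category to each item.
    counts = np.zeros((n_items, n_categories))
    for j in range(n_categories):
        counts[:, j] = (ratings == j).sum(axis=1)
    # Observed pairwise agreement per item, averaged over items.
    p_o = (((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))).mean()
    # Expected agreement under the free-marginal assumption (uniform over categories).
    p_e = 1.0 / n_categories
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: 5 reflections, 3 raters, binary incoherent (0) / coherent (1) labels.
example = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]]
print(randolphs_kappa(example, n_categories=2))  # ~0.2 on this toy data
```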

However, there are two important caveats which suggest that evaluations by lay people may be less useful for newer LLMs which generate higher-quality texts.

Correlation between experts and lay people is worse with higher-quality LLMs: One key finding is that the correlation between lay people and therapists, for coherence (overall acceptability), was quite good for GPT2 reflections (Spearman correlation of 0.74), but less good for GPT3 reflections (Spearman correlation of 0.44). In other words, as language models become better, it’s harder for non-experts to evaluate whether their output is acceptable, perhaps because the problems are more subtle (or perhaps lay people are misled by the high fluency of GPT3 texts). This makes sense to me, and resonates with what I am seeing in other studies.
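
As an aside, this kind of rank correlation between the two rater groups can be computed with scipy.stats.spearmanr. The scores below are made up purely for illustration; this is not the paper’s analysis code.

```python
from scipy.stats import spearmanr

# Hypothetical per-reflection coherence scores, one value per reflection
# (e.g. the fraction of raters in each group who marked it as coherent).
expert_scores = [0.9, 0.4, 0.7, 0.2, 0.8, 0.5]
lay_scores    = [0.8, 0.5, 0.9, 0.3, 0.7, 0.6]

rho, p_value = spearmanr(expert_scores, lay_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```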

Judgements of human-authored gold texts are harsher when subjects see better GPT texts: Another really interesting finding is that the therapist-authored “gold reflections”, which are the same in the GPT2 and GPT3 experiments, are rated as less coherent in the GPT3 experiment. In other words, in a context where subjects are seeing GPT3 texts, they are harsher in rating therapist-authored gold texts than in contexts where they are seeing GPT2 texts. This effect is seen with both experts and lay people, but is considerably stronger with lay people. The paper presents a case study where the same subject (lay person L7) marked the same gold text as Coherent in the GPT2 experiment, but Incoherent in the GPT3 experiment.

I think there is a really important point here. We tend to assume in human evaluations that ratings and judgments of texts are not influenced by the other texts that raters see. But this paper clearly shows that this assumption is false, and judgements are strongly influenced by other texts, especially for non-expert subjects.

My thoughts

In 2009 we did an evaluation with both lay people and experts, on weather forecasts (paper), and didn’t see a huge difference in results; I believe other people have also observed this in semi-technical domains like weather forecasts (of course we cannot ask a lay person to evaluate clinical notes written in complex medical language). But what this study suggests is that lay-person evaluation is less appropriate for high-quality texts produced by modern LLMs. Ie, as generated texts get closer to human quality, it becomes harder for non-experts to find problems and assess text quality. This resonates with what I am hearing in other fields such as MT.

In short, evaluating texts with domain experts is a pain, but it may become increasingly necessary as the quality of generated texts improves.

3 thoughts on “Are Experts Needed in Human Evaluation?”

  1. Hi…

    The paper and the blog post are highly informative. Thank you.
    I have a couple of doubts.
    How are you defining “coherence”? Can you please provide a good reference on text coherence?
    How is the utility of a reflection different from its coherence? Can we say that when a reflection is highly coherent, it is of high utility?


    1. In this paper, “coherent” basically means appropriate (see the examples in the appendix, Figs 5, 6, 7). Of course more quality criteria could be assessed, but lay people often find it difficult to rate more than one or two quality criteria on a text.

