We had a Human Evaluation workshop last week at COLING. Lots of neat stuff, but what I was most interested in was the results of the shared task on reproducing human evaluations. Participants presented results in papers and posters; we also had a general discussion at the end of the workshop.
There were lots of good insights from this about what makes an evaluation replicable. Some of these are well-known; for example, it is essential that subjects take the task seriously and do not click at random (or use ChatGPT). Strategies to minimise this include attention checks, choosing subjects selectively, treating subjects well, and (if possible) running experiments in person with people you know, instead of using crowdworkers.
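To make the attention-check idea concrete, here is a minimal sketch (in Python, with hypothetical column names such as passed_attention_check and seconds_on_item, which are my own illustration rather than anything used in the shared task) of filtering out unreliable subjects before analysing their ratings:

```python
# Minimal sketch with made-up column names: drop subjects who fail
# attention checks or who answer implausibly fast, before analysis.
import pandas as pd

def filter_unreliable_subjects(ratings: pd.DataFrame,
                               min_attention_score: float = 1.0,
                               min_seconds_per_item: float = 5.0) -> pd.DataFrame:
    """Keep only rows from subjects who passed all attention checks
    and spent a plausible amount of time per item."""
    per_subject = ratings.groupby("subject_id").agg(
        attention_score=("passed_attention_check", "mean"),
        median_time=("seconds_on_item", "median"),
    )
    reliable = per_subject[
        (per_subject["attention_score"] >= min_attention_score)
        & (per_subject["median_time"] >= min_seconds_per_item)
    ].index
    return ratings[ratings["subject_id"].isin(reliable)]
```

Of course, filtering like this only catches careless subjects; as the rest of this post argues, it does nothing for subjects who are trying hard but have understood the task differently from the experimenter.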
But one frequent comment from shared-task participants has perhaps not received enough attention in the past: subjects need to understand the task. In other words, if an experimenter asks subjects to do a task which is fuzzy and/or complicated, and does not give sufficient instructions, training, and/or examples, then subjects will interpret the task differently, which means the results will not be replicable and may not be meaningful.
Examples
It’s useful to look at some examples. For instance, van Miltenburg et al reproduced a fluency evaluation of a system which generated scientific definitions of terms, and describe several cases where raters (subjects) disagreed about fluency. In one scenario the system generated the definition “see etchplain” for the term “etchplain”. This is clearly not a useful definition, but is it fluent? Their subjects disagreed. In this case the subjects were not given a definition of fluency but instead were shown a small number of examples, none of which were similar to the above scenario.
Dinkar et al also reproduced a fluency evaluation and saw that subjects struggled to understand what “fluency” meant; Dinkar et al tried providing a better definition of fluency but this did not help much.
Problems did not just occur in fluency evaluations. For example, Klubička and Kelleher reproduced an evaluation where subjects counted the number of redundancies in generated texts. Their subjects reported very different counts from the original experiment, partially because it was not clear how to count redundancies. For example, if “low price range” is repeated (e.g. “the pub provided take-away deliveries in the low price range. It is called The Vaults and is in the low price range”), does this count as 1, 2, or 3 redundancies?
One final example is Fresen et al, who reproduced an evaluation of the informativeness of summaries of meetings. They saw large differences compared to the original study, and pointed out that the dialogues being summarised were complex and ambiguous, which made it difficult for subjects to assess informativeness.
There are many other such examples in the reproduction studies reported at the workshop; I encourage interested readers to look at the proceedings.
Discussion
Human evaluations are most useful, compared to metrics, when assessing criteria that are complex and fuzzy. Indeed, I suspect that in 2024, fluency and redundancy may be better assessed by GPT than by random crowdworkers. Where human evaluation remains essential is in subtle tasks like assessing the safety of health messages. The complexity of such assessments makes them hard to automate (at least in my experience), but it also means that we need to somehow make it clear to subjects what they are supposed to assess; otherwise we will see problems similar to the above.
One approach, described by Balloccu et al, is to work with a group of subjects (domain experts in Balloccu’s case) to define and agree on what is being measured, before actually doing the evaluation. This requires more time, effort, and money than throwing tasks at crowdworkers, but I suspect it gives much better results when evaluating complex attributes of texts.
If it is essential to use crowdworkers, then piloting experiments can be a very useful way to identify problems and clarify the task. For example, when Craig Thomson and I developed a protocol for annotating factual errors (blog), we did a lot of piloting with friends and coworkers before releasing the experiment to crowdworkers; this helped us to simplify and clarify what we wanted our crowdworkers to do.
Of course, sometimes there is genuine disagreement between subjects about how to assess a text; in such cases it can be useful to record the distribution of opinions instead of a single assessment. Indeed, one of the invited talks at the human evaluation workshop was about this, and there was a separate workshop at COLING (unfortunately at the same time as the human evaluation workshop) on the related topic of Perspectivist Approaches to NLP.
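As a small illustration of what "recording the distribution" can mean in practice, here is a sketch (with invented item names and labels, not data from any of the studies above) that reports the full spread of ratings per item rather than collapsing them into a single majority or mean score:

```python
# Minimal sketch: report the distribution of ratings for each item
# instead of a single aggregated judgement. Item names and labels
# are hypothetical.
from collections import Counter

ratings = {
    "summary_01": ["fluent", "fluent", "disfluent"],
    "summary_02": ["fluent", "disfluent", "disfluent"],
}

for item_id, labels in ratings.items():
    counts = Counter(labels)
    total = sum(counts.values())
    distribution = {label: count / total for label, count in counts.items()}
    print(item_id, distribution)
```

Reporting distributions like this makes genuine disagreement visible instead of hiding it, which is exactly the point the perspectivist work argues for.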
Final Thoughts
Human evaluations are most useful when they evaluate complex and fuzzy aspects of texts, but in such cases there is a real danger that evaluators (subjects) will disagree about what exactly they are evaluating; if so, the results of the evaluation will be less meaningful and harder to replicate. We saw many examples of this at the human evaluation workshop, and the problem is likely to get worse when evaluating complex attributes such as safety.
It is essential to think about this *before* starting an evaluation. If you are using subjects you don’t know and will never work with again (most crowdworker scenarios), you should pilot your experiment first and check that subjects know what they are supposed to do. If you are using the same group of subjects over time, then work directly with them to refine and clarify the task. But if you care about meaningful evaluation results, do NOT just throw together a complicated experiment that makes sense to you and ship it to crowdworkers without any checking or piloting.