A few weeks ago I was talking online to a PhD student I met at EACL. This student was using metrics to evaluate the texts produced by his system, so I told him he should do a human evaluation as well. He responded that he had very little confidence in the human evaluations he had seen, and was not convinced that asking a bunch of random Turkers to annotate readability and accuracy of a generated text on a Likert scale gave more meaningful results than Bertscore or Bleurt.
A similar point was made in a recent paper where professional translators were asked to carefully assess the output of MT systems, in order to produce a “gold-standard” human evaluation. The authors showed that Turker-based evaluations were not good predictors of the gold-standard human evaluation, and indeed had worse correlations with the gold-standard human evaluations than some of the metrics (including Bleurt).
I have a lot of sympathy for the above, although I also think that in a lot of NLG contexts even a weak human evaluation is going to be more meaningful than Bleurt. But anyway, I think the main point is that we need high-quality human evaluations. And we should only use “cheap” human evaluations if we know their results are well correlated with high-quality human evaluations (just as we should only use metrics if we know their results are well correlated with high-quality human evaluations).
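One way to check this kind of correlation is a rank correlation between the scores each method gives the same set of systems. The sketch below uses Spearman's rho with entirely made-up numbers; the systems, scores, and helper functions are illustrative, not from any real study.

```python
# Hypothetical sketch (invented numbers): checking whether a cheap evaluation
# agrees with a high-quality "gold-standard" human evaluation by computing
# Spearman's rank correlation between their per-system scores.

def ranks(scores):
    """Rank scores from 1 (lowest) upward; assumes no ties, for simplicity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via the classic no-ties formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Scores for the same five systems under each evaluation method (made up).
gold_human = [4.1, 3.2, 3.8, 2.5, 4.5]   # e.g. careful expert ratings
cheap_eval = [3.9, 3.0, 2.9, 3.4, 4.2]   # e.g. Turker ratings, or a metric

print(f"Spearman rho = {spearman(gold_human, cheap_eval):.2f}")  # 0.60
```

A high rho on a reasonable sample of systems is evidence (not proof) that the cheap evaluation ranks systems the same way the gold-standard one does; in practice you would also want a confidence interval, since correlations over a handful of systems are very noisy.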
Emerging research topic
I’ve always been interested in high-quality human evaluations, and one of my frustrations was that until recently many NLP researchers didn’t seem to see much difference between (A) a medical-grade randomised controlled clinical trial on 2500 subjects and (B) asking 50 Turkers to rate texts for readability (etc) on a 5-pt Likert scale. But I’m glad to say that I think this is changing, and I’ve already seen a lot of exciting work on high-quality human evaluation in 2021 (and we’re only halfway through the year!), including
- Above mentioned paper: Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
- Excellent survey paper: Human evaluation of automatically generated text: Current trends and best practice guidelines
- First-ever EACL workshop on Human Evaluation of NLP Systems
- Upcoming Shared Task on Reproducibility of Human Evaluations in NLG
This is only a sample; there is plenty of other recent research on high-quality evaluation, including the work we are doing at Aberdeen on evaluating accuracy.
Exciting to see so much activity in this space!
But how do I do a good human evaluation?
I won’t give detailed advice on experimental design here; at least for NLG, I instead recommend that people read the above-mentioned survey paper. But I will make some high-level comments. These comments are hardly rocket science, in fact most of them are pretty obvious; however, I have seen many papers which report experiments that I suspect violated *all* of the below principles.
Good evaluations take time, effort, and planning: Expect to spend several weeks carrying out a good-quality human evaluation, and several months doing a very high-quality evaluation. Similarly, a good evaluation could easily require tens of hours from your subjects, and a very high-quality evaluation could require hundreds of hours. You also need to carefully plan the evaluation; don’t try to “wing it”.
Find good subjects and treat them well: Find subjects who take the task seriously, and have appropriate domain knowledge, language skills and other characteristics (e.g., don’t ask a CS undergrad to evaluate a medical NLG system); ideally subjects should also be representative of the system’s intended users. Once you have found the subjects, treat them well. Explain what you are doing, answer questions and (if they are paid) give them a reasonable hourly wage. Ideally subjects should feel like they are part of your research team.
Ecologically valid experiments: Make your experiments as realistic as possible. For very high-quality evaluations, ideally you would deploy your system and get people to use it for real (e.g., Hunter et al 2011, Braun et al 2018). If this is not possible, make the experiment as close to real-world usage as possible.
Honest statistics: Compute statistical significance (p value), and do it honestly, including applying multiple-hypothesis corrections if appropriate. It’s very easy to cheat in statistics in ways that are hard for readers and reviewers to check, for example by tweaking hypotheses and details of the statistical analysis. Please don’t do this!
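To make the multiple-hypothesis point concrete, here is a minimal sketch with invented data: an exact two-sided sign test on paired A-vs-baseline preferences, with a Bonferroni correction because we test three criteria at once. The test choice, counts, and function names are all illustrative assumptions, not a recommendation for your particular design.

```python
# Hypothetical sketch (invented counts): exact two-sided sign test on paired
# preferences, with a Bonferroni correction for testing several hypotheses.
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact sign test: P(result at least this extreme | fair coin)."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Suppose 20 subjects each compared system A to a baseline on three criteria,
# ignoring ties. (wins, losses) counts for system A are made up.
results = {"readability": (16, 4), "accuracy": (14, 6), "fluency": (12, 8)}

alpha = 0.05
m = len(results)  # Bonferroni: divide alpha by the number of hypotheses tested
for criterion, (wins, losses) in results.items():
    p = sign_test_p(wins, losses)
    verdict = "significant" if p < alpha / m else "not significant"
    print(f"{criterion}: p = {p:.4f} -> {verdict} at corrected alpha {alpha / m:.4f}")
```

Note that with the corrected threshold only the readability difference comes out significant here; reporting all three raw p values against an uncorrected 0.05 would overstate the evidence, which is exactly the kind of (often unintentional) cheating to avoid.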
Provide enough information for replication: Provide enough information about your experiment so that other people can replicate it. Sometimes you can add this info to a paper; it’s also fine to have a detailed experimental document in a repository or as supplemental material.
Do pilot studies: Do a small-scale pilot study before you start a large experiment. People often do unexpected things, and I’ve seen many cases where an experiment which made sense to the experimenter did not work well because it confused subjects. The best way to guard against this is to do a small pilot study, whose purpose is to ensure that your experimental protocol works.
It’s great to see the growing interest in doing human evaluations in the NLG and NLP communities, but many of the human evaluations I see are not as good as they should or could be. Doing high-quality evaluations is a lot of work, but it is usually the best way to really understand how well our ideas work.