There’s been a lot of interest recently in improving human evaluations in NLG, which is great. The best paper awards in both INLG 2019 and INLG 2020 went to papers on this topic (van der Lee et al 2019, Belz et al 2020), and I’m very happy to be one of the organisers of the first ever Workshop on Human Evaluation of NLP Systems, which will be at EACL 2021. It’s great to see the community agreeing that human evaluation (A) is very important and (B) must be done well!
One thing that I’ve not yet seen much of in NLG is experimental comparisons of different human evaluations. Ie, experiments where we conduct two or more different human evaluations (eg, different raters, different materials) and compare the results. There have been a few papers along this line, such as Belz and Gatt 2008, but they are unusual. Such papers are more common in Machine Translation (MT), and I think the NLG community can learn from and be inspired by the MT community in this regard.
Comparing Human Evaluations in MT
I’ve seen a number of MT papers, usually linked to the WMT shared task and conference, which compare different human evaluations. For example, WMT 2016 ran two human evaluations: researcher rankings where researchers ranked texts best-to-worst (TrueSkill was used to produce system scores), and direct assessment where Turkers were asked to individually rate each text. These gave similar results, with Pearson correlation above 0.9 for all language pairs.
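To make the comparison concrete, here is a minimal sketch of how one might check agreement between two human evaluations at the system level using Pearson correlation. The system names and scores below are invented purely for illustration (they are not WMT 2016 data), and the helper `pearson` is my own, not taken from any of the papers mentioned.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical system-level scores from two different human evaluations
# of the same four systems (illustrative numbers only).
ranking_scores = {"sysA": 1.2, "sysB": 0.4, "sysC": -0.3, "sysD": -1.1}
direct_assessment = {"sysA": 71.0, "sysB": 65.5, "sysC": 60.2, "sysD": 52.8}

# Align the two score lists by system name before correlating.
systems = sorted(ranking_scores)
r = pearson(
    [ranking_scores[s] for s in systems],
    [direct_assessment[s] for s in systems],
)
print(f"Pearson r between the two evaluations: {r:.3f}")
```

A high correlation here only says that the two protocols order and space the systems similarly; it does not say which protocol is closer to the truth.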
I did some reading over the holidays, and really liked a paper on this topic by Läubli et al, A Set of Recommendations for Assessing Human-Machine Parity in Language Translation. They were interested in the claims that MT systems were better than human translators, and investigated how such claims depend on the design of the human evaluation of the systems. They looked at four aspects of the evaluation design:
- Raters: The authors compared ratings from professional translators (experts) to ratings from bilinguals who were not translators (non-experts). All raters preferred the human translations, but experts saw a bigger gap between human and MT translations, perhaps because they were more sensitive to nuances. Inter-annotator agreement was higher for experts than non-experts.
- Text size: The authors compared the task of evaluating translated sentences with the task of evaluating translations of complete news articles. In both cases, evaluators thought human translations were more fluent. However, with regard to adequacy, evaluators had a small (non-significant) preference for MT at the sentence level, but a significant preference for human translations at the article level. The authors point out a number of document/discourse level phenomena which are handled poorly by MT systems.
- Human translations: The MT texts were compared to human translations, and of course there are many ways of producing human translations. The authors show that if translators are asked to prioritise fluency, this increases fluency but may decrease accuracy in the human translations.
- Text source: In the context of Chinese->English translation, the authors compare translations of (A) texts originally written in Chinese and (B) Chinese translations of English texts. They show that human translations (Chinese->English) are preferred over MT for (A) but not for (B). Ie, MT systems are really good at “back-translating” a translated document into its original language.
In other words, if you want to experimentally check whether MT systems are better than human translators, you will get different answers depending on whether you
- ask non-experts to evaluate sentences which are “back translations” of translated sentences into their original source language, OR
- ask professional translators to evaluate translations of complete articles which are not themselves translations (ie, the articles were written in the language being translated).
Overall, this was a great example of an experimental meta-evaluation of human evaluation, where the experimenters change the design of the human evaluation and see what impact this has on the result. Their results are also really interesting, and resonate with NLG. Eg, neural NLG systems certainly are better at generating sentences than they are at generating documents, and my own experience shows that domain experts are better at considering nuances than non-experts.
Back in 2005, we showed that human evaluators preferred wind descriptions produced by our SumTime NLG weather forecast generator over wind descriptions written by human meteorologists. Looking back at this, I suspect the issues that Läubli et al investigated would also impact this finding.
Comparing Human Evaluations in NLG?
It would be great to see similar experiments in NLG! In fact, Craig Thomson and I are running a shared task on evaluating accuracy in NLG texts (paper) (GitHub), where we are soliciting submissions which are human evaluation protocols as well as submissions which are automatic metrics. If we get several submissions which are human evaluation protocols, this will effectively be an experiment which compares different human evaluation protocols. I encourage people to submit such protocols to our shared task!