Last week I was at INLG 2019 in Tokyo, which (as usual) was really interesting and exciting. Very wide variety of papers (psycholinguistics, novel applications, reference, hallucinations, deep learning, etc, etc), and of course a chance to meet and catch up with colleagues and former students who I have not seen in ages!
One thing I ***really*** liked about INLG was the large number of papers which discussed evaluation and other “methodological” issues. Since I care deeply about this kind of thing, it was great to see so much discussion about it at the conference!
For the first time in my memory, there were a number of papers at INLG about how to perform good-quality human “intrinsic” evaluations (i.e., evaluations based on asking subjects to rate or rank texts, not on task outcomes). These included:
- Best practices for the human evaluation of automatically generated text (winner of best long paper award!)
- Agreement is overrated: A plea for correlation to assess human evaluation reliability
- The use of rating and Likert scales in Natural Language Generation human evaluation tasks: A review and some recommendations
- Towards Best Experiment Design for Evaluating Dialogue System Output
I won’t comment on these papers individually, but overall they explore how to improve the quality of human evaluations of NLG systems, a very important topic which has not received sufficient attention in the research community. One of the invited talks, by Anja Belz, also addressed this issue, and pointed out that more consistency in human evaluations would make it much easier to compare the results of such evaluations.
There was also talk of organising a “shared task” to try out different experimental designs for human intrinsic evaluation and see which was most effective. This is a tricky task to organise (much harder than the usual shared tasks), so we’ll need to see how this materialises in practice, but it’s great to see people talking and enthusiastic about this!
Overall, I was very happy to see so many people talking about how to do better human evaluations; this topic is finally starting to get the attention it deserves!
(Some) Other Methodology Papers
Quality of training data: A nice paper (Semantic Noise Matters for Neural Natural Language Generation) on the impact on system performance of cleaning up training data. Training data quality and size were also discussed by Philipp Koehn in his invited talk. I’ve complained about low quality data in the past, and indeed I suspect that in a lot of cases improving data quality will be of more benefit to our systems than swapping in the latest/trendiest learning approach. So again it was very nice to see people highlighting the importance of high-quality clean data.
Replication: I saw a paper replicating earlier work, On task effects in NLG corpus elicitation: a replication study using mixed effects modeling; such papers are unusual in the NLG community. Discussions with other participants suggest that more replication studies are being undertaken, and I look forward to seeing their results! Again, replication is a sign of scientific maturity.
Automatic evaluation: In addition to papers on improving human evaluation, there were also papers on automatic evaluation, including Automatic Quality Estimation for Natural Language Generation: Ranting (Jointly Rating and Ranking) and Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement. I was happy to see that these papers were cautious in their claims. Automatic/metric evaluation of NLG systems, at least in 2019, should be seen as a supplement to human evaluation, not a replacement for it.
Expanding a blog into a paper
As a footnote and on a totally different topic, I presented a paper, Natural Language Generation Challenges for Explainable AI, for the NL4XAI2019 workshop at INLG. The paper was essentially an expanded version of one of my blogs (more details, examples, references, etc). The workshop was organised by the NL4XAI project, which I am part of, so I wanted to contribute to it, and I decided to expand a blog post rather than write something from scratch. Anyways, this is the first time I’ve turned a blog post into a proper paper, and I’m interested in thoughts from readers as to whether this is a useful thing to do.