We can build much better NLG systems if we understand what users want the systems to do! This may sound trite, but there is very little research in the academic community on understanding user needs and requirements, which is a shame and a lost opportunity.
I was very happy to win an INLG Test of Time award for my paper “An Architecture for Data-to-Text Systems”, so I thought I’d write a few comments on it.
A travelogue about a recent bike trip. After two years of being limited in my holidays by Covid, it was great to finally be able to do some cycle touring again!
Society (and most funding agencies) wants to see real-world benefits or “impact” from academic research. Of course not all research will have real-world impact, and impact may take years or decades to appear! I share some thoughts on types of impact, barriers to impact, and my personal experiences.
I am excited by the idea of using error annotation to evaluate NLG systems, where domain experts or other knowledgeable people mark up individual errors in generated texts. I think this is usually more meaningful and gives better insights than asking crowdworkers to rate or rank texts, which is how most human evaluations are currently done.
Progress in NLG requires understanding what users want, creating high quality data sets, building models and algorithms, and thoroughly evaluating systems. I remain disappointed that the research community seems fixated on building models and pays much less attention to user needs, datasets, and evaluation.
The most meaningful evaluation is when we test whether an NLG system actually achieves its communicative goal, e.g. helps people make better decisions or write documents faster. Unfortunately such “extrinsic” or “task” evaluation is rare in NLP in 2022; we need to see more such evaluations!
I’ve come to realise that there is some confusion, especially amongst newcomers to NLP/AI, about when a research paper can be presented at two venues. I try to explain the rules and principles as I understand them.
The ROUGE metric dominates evaluation of summarisation, and I do not understand why. I am not aware of good evidence that ROUGE predicts utility, and recent work by one of my students shows that character-level edit (Levenshtein) distance against a reference text is a better predictor of utility than ROUGE.
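For readers unfamiliar with the metric mentioned above, character-level edit (Levenshtein) distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is purely illustrative; the function names and the normalisation step are my own, not the student's actual implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character edits (insert/delete/substitute) to turn a into b."""
    # Classic dynamic-programming formulation, keeping only one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

def edit_similarity(candidate: str, reference: str) -> float:
    """Scale distance to [0, 1]; 1.0 means the candidate matches the reference exactly."""
    if not candidate and not reference:
        return 1.0
    return 1 - levenshtein(candidate, reference) / max(len(candidate), len(reference))
```

A score like `edit_similarity(generated_text, reference_text)` can then be correlated against human utility judgements, in the same way ROUGE scores usually are.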
Some of my PhD students have recently looked at how many mistakes people (professionals, not Turkers) make when they do NLG-like tasks. The number of mistakes is considerably higher than we expected (although still much lower than the number of mistakes made by current neural NLG systems).