I’ve come to realise that there is some confusion, especially amongst newcomers to NLP/AI, about when a research paper can be presented at two venues. I try to explain the rules and principles as I understand them.
The ROUGE metric dominates evaluation of summarisation, and I do not understand why. I am not aware of good evidence that ROUGE predicts utility, and recent work by one of my students shows that character-level edit (Levenshtein) distance against a reference text is a better predictor of utility than ROUGE.
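For readers unfamiliar with the metric, character-level Levenshtein distance is straightforward to compute; here is a minimal sketch (my own illustrative code, not my student's implementation), with a simple length-normalised similarity score:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit (Levenshtein) distance between strings a and b."""
    # One-row dynamic programming: prev[j] holds the distance between
    # the prefix a[:i-1] and the prefix b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

def edit_similarity(candidate: str, reference: str) -> float:
    """Normalise the distance into a 0-1 similarity (1 = identical texts)."""
    if not candidate and not reference:
        return 1.0
    return 1.0 - levenshtein(candidate, reference) / max(len(candidate), len(reference))
```

For example, `levenshtein("kitten", "sitting")` returns 3 (two substitutions plus one insertion). How best to normalise, and how well any such score predicts utility, are of course exactly the empirical questions at issue.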
Some of my PhD students have recently looked at how many mistakes people (professionals, not Turkers) make when they do NLG-like tasks. The number of mistakes is considerably higher than we expected (although still much lower than the number of mistakes made by current neural NLG systems).
Both academic researchers and commercial NLG developers are interested in building NLG systems which describe sporting events. However, they care about different things. For example, many academics show little interest in use cases, domain knowledge, robustness, and high-quality input data, all of which are very important to commercial NLG developers.
NLG texts must be correct pragmatically as well as semantically. In particular, they must not contain statements which are contextually misleading even if they are literally true. We badly need better techniques for evaluating pragmatic accuracy as well as generating pragmatically correct texts.
Like many others, I am trying to do too much in my university academic role. I’m looking for areas where I can “do less” without having a major impact on research and teaching.
There is a lot of uninformed criticism of rule-based NLG in academic papers. In this blog I explain at a very high level how such systems work and what some of the main challenges are in building them.
One of the challenges in data-to-text NLG is creating good summaries and insights when the input is flawed (incomplete, incorrect, or inconsistent). One of my PhD students has been working on this problem, and it is a hard one! But a good solution would be hugely valuable for society. I may be able to offer a PhD studentship in this area; please contact me if interested.
I’m excited by the potential of adding conversational capabilities to data-to-text systems, so that users can provide context, ask follow-up questions, etc. I think this is essential to my vision of using NLG to humanise data and AI!
I teach an MSc course on Evaluating AI, which several people have asked me about. In this blog I give an overview of what is in the course. Hopefully this will be useful to people who are interested in learning about (or teaching) evaluation.