Language is diverse: different domains and genres use different syntax, vocabulary, document structures, and so on. NLG developers and researchers need to keep this in mind if they are trying to develop generic NLG components.
I am excited by the idea of using a neural language model to improve the output of rule- or template-based NLG. Many academics probably regard this as a boring use of LMs (see my previous blog), but I think it could be very useful in many real-world applications.
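As a rough sketch of one way this could work (the model choice, scoring function, and candidate texts below are my own illustrative assumptions, not from the post), a template-based generator could produce several candidate realisations and a pretrained LM such as GPT-2 could pick the most fluent one:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative assumption: plain GPT-2 as a fluency scorer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def fluency_score(text: str) -> float:
    """Mean token log-likelihood under the LM (higher = more fluent)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the model returns the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return -loss.item()

# Hypothetical candidate realisations from a template-based generator.
candidates = [
    "Rainfall was 12 mm, well above the monthly average.",
    "The rainfall amount of 12 mm is above of the monthly average.",
]
print(max(candidates, key=fluency_score))
```

One appeal of reranking like this is that the LM only chooses among template outputs rather than generating text itself, so it can improve fluency without introducing hallucinated content.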
There is lots of excitement and hype about “gee whiz” uses of language models in NLG, such as generating stories from prompts. However, I suspect there may be more real-world value in using language models for more mundane tasks such as quality assurance.
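One mundane-but-useful possibility, sketched below (the model, threshold, and example texts are my own assumptions, not from the post), is to use an off-the-shelf NLI model to flag generated statements that are not entailed by the source data, so a human can review them:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative assumption: a standard NLI model as a faithfulness checker.
name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # roberta-large-mnli label order: contradiction, neutral, entailment.
    return probs[2].item()

premise = "Rainfall on Monday was 12 mm."       # hypothetical source data
generated = "It was a dry day on Monday."       # hypothetical NLG output
if entailment_prob(premise, generated) < 0.5:   # threshold is an assumption
    print("Flag for human review:", generated)
```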
We can build much better NLG systems if we understand what users want the systems to do! This may sound trite, but there is very little research in the academic community on understanding user needs and requirements, which is a shame and indeed a lost opportunity.
I was very happy to win an INLG Test of Time award for my paper “An Architecture for Data-to-Text Systems”, so I thought I’d write a few comments on it.
A travelogue about a recent bike trip. After two years of holidays limited by Covid, it was great to finally be able to do some cycle touring again!
Society (and most funding agencies) wants to see real-world benefits or “impact” from academic research. Of course not all research will have real-world impact, and impact may take years or decades to appear! I share some thoughts on types of impact, barriers to impact, and my personal experiences.
I am excited by the idea of using error annotation to evaluate NLG systems, where domain experts or other knowledgeable people mark up individual errors in generated texts. I think this is usually more meaningful and gives better insights than asking crowdworkers to rate or rank texts, which is how most human evaluations are currently done.
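To make the idea concrete (the record schema and the error categories below are illustrative assumptions, not a standard annotation scheme), error annotations can be as simple as span-plus-category records that are then tallied per system:

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class ErrorAnnotation:
    text_id: str
    span: tuple    # (start, end) character offsets in the generated text
    category: str  # e.g. "incorrect number", "word choice", "omission"
    note: str = ""

# Hypothetical annotations from domain experts reading generated texts.
annotations = [
    ErrorAnnotation("game-07", (14, 16), "incorrect number",
                    "says 3 goals, source data says 2"),
    ErrorAnnotation("game-07", (40, 52), "word choice"),
    ErrorAnnotation("game-09", (0, 27), "omission", "injury not mentioned"),
]

# Summarise error frequencies per category across the annotated corpus.
print(Counter(a.category for a in annotations))
```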
Progress in NLG requires understanding what users want, creating high-quality datasets, building models and algorithms, and thoroughly evaluating systems. I remain disappointed that the research community seems fixated on building models and pays much less attention to user needs, datasets, and evaluation.
The most meaningful evaluation is when we test whether an NLG system actually achieves its communicative goal, e.g. helps people make better decisions or write documents faster. Unfortunately such “extrinsic” or “task” evaluation is rare in NLP in 2022; we need to see more such evaluations!