I’m planning to do a systematic review of the validity of BLEU, and am very keen to get comments and suggestions on study design from others!
A short travelogue about a holiday cycling trip I did in May 2017 (nothing to do with NLG!)
I’m looking for a PhD student to work on Advanced Data Storytelling!
Some explanation and advice about regression to the mean, which is a statistical phenomenon that can impact NLG evaluations.
People who use corpora to build NLG systems need to understand what is in the corpora. The widely used Weathergov corpus, for example, probably contains computer-generated texts rather than human-written texts. So learning from it is essentially reverse-engineering a rule-based NLG system.
I am really dubious about evaluations based on BLEU and other metrics. I explain why, and also give advice on best practice for people who are committed to using metrics.
Advice on how to evaluate an NLG system by getting people to use it in the real world, and then measuring how effective the system was.