How to do an NLG Evaluation: Metrics
I am really dubious about evaluations based on BLEU and other metrics. I explain why, and also give advice on best practice for people who are committed to using metrics.
Advice on how to evaluate an NLG system by getting people to use it in the real world, and then measuring how effective the system was.
I’ve just finished delivering a commercial training course; it was a very different (and in many ways better) experience from the academic teaching I’ve done for decades.
Why isn’t there more open-source NLG software, and what can be done to encourage more open-source NLG?
Advice on how to evaluate an NLG system by asking human subjects to rate the system, when they are using it in a real-world context.
My experiences in writing NLG pages for Wikipedia; mostly positive, but sometimes frustrating.
How do we test an NLG system, and ensure that it is robust, reliable, and meets its quality goals?
The key to success in building an NLG system is getting requirements right; this is true of software in general, and it’s true of NLG.
What are the best tools for building NLG systems?
A high-level discussion of the different ways of evaluating NLG systems.