The real-world usefulness of NLG systems depends on many factors, not just the accuracy and fluency of generated texts. We should evaluate the real-world utility of our systems, and check how well existing evaluation techniques (automatic metrics and Turker-based human evaluations) correlate with that utility.
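As a rough illustration of the kind of check I mean, here is a minimal sketch (with made-up numbers and hypothetical variable names, not data from any real study) of how one might measure how well an automatic metric tracks a measure of real-world utility, such as task success with real users:

```python
# Minimal sketch: correlating an automatic metric with a utility measure.
# All scores below are invented for illustration.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores: an automatic metric and a utility measure
# (e.g. task success rate with real users), one value per system.
metric_scores  = [0.42, 0.55, 0.61, 0.48, 0.70]
utility_scores = [0.35, 0.50, 0.52, 0.58, 0.66]

pearson_r, pearson_p = pearsonr(metric_scores, utility_scores)
spearman_rho, spearman_p = spearmanr(metric_scores, utility_scores)

print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.2f})")
print(f"Spearman rho = {spearman_rho:.2f} (p = {spearman_p:.2f})")
```

Of course a real validation study needs many more systems (or outputs), and a carefully chosen utility measure; the point is simply that the correlation can and should be measured rather than assumed.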
I encourage students to do “exercises” where they critically read an academic paper, looking for problems in its evaluation. This helps develop skills for writing papers as well as reading them. So give it a go!
I’m a strong proponent of human evaluations, but they need to be high quality in order to give meaningful results; a quick/cheap/sloppy human evaluation may not be very useful.
I was impressed by a recent paper by Läubli et al., which experimentally compared the results of different human evaluations in MT (e.g., how results differ between expert and non-expert raters), in the context of understanding when MT systems are “better” than human translators. It would be great to see more experimental comparisons of different human evaluations in NLG!
Craig Thomson and I will present a paper at INLG on a methodology for evaluating the accuracy of generated texts, based on asking human annotators to mark up factual errors in a text. This approach is not cheap, but I believe it is the most robust and reliable way to measure accuracy.
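The paper describes the protocol properly; below is only a rough sketch of how marked-up errors from several annotators might be aggregated into per-text error counts. The example data, category labels, and the averaging choice are my own illustration here, not the paper’s actual procedure or tooling:

```python
# Minimal sketch: aggregating annotators' error markups into per-text counts.
# Annotations, categories, and aggregation choices are illustrative only.
from statistics import mean
from collections import Counter

# Hypothetical annotations: for each generated text, each annotator lists the
# factual errors they marked, as (error category, offending phrase) pairs.
annotations = {
    "text_01": {
        "annotator_A": [("incorrect number", "scored 30 points"),
                        ("incorrect name", "Lakers")],
        "annotator_B": [("incorrect number", "scored 30 points")],
    },
    "text_02": {
        "annotator_A": [],
        "annotator_B": [("incorrect word", "won")],
    },
}

for text_id, per_annotator in annotations.items():
    # Average error count across annotators (a crude aggregation choice).
    avg_errors = mean(len(errs) for errs in per_annotator.values())
    categories = Counter(cat for errs in per_annotator.values() for cat, _ in errs)
    print(f"{text_id}: {avg_errors:.1f} errors/annotator, categories: {dict(categories)}")
```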