I’m a strong proponent of human evaluations, but they need to be high quality in order to give meaningful results; a quick/cheap/sloppy human evaluation may not be very useful.
I was impressed by a recent paper by Läubli et al which experimentally compared the results of different human evaluations in MT (eg, how do results differ between expert and non-expert human raters), in the context of understanding when MT systems are “better” than human translators. Would be great to see more experimental comparisons of different human evaluations in NLG!
Craig Thomson and I will present a paper at INLG on a methodology for evaluating the accuracy of generated texts, based on asking human annotators to mark up factual errors in a text. This is not cheap, but I think it is the most robust and reliable approach to measuring accuracy.