Comparing Human Evaluations

I was impressed by a recent paper by Läubli et al. which experimentally compared the results of different human evaluations in MT (e.g., how results differ between expert and non-expert human raters), in the context of understanding when MT systems are “better” than human translators. It would be great to see more experimental comparisons of different human evaluations in NLG!


Get Your Hands Dirty!

I would like to see more PhD students and postdocs “getting their hands dirty” by collecting real-world data, working with real-world users and experts, and conducting real-world evaluations with users. It's not easy, but engaging with the real world does help scientific and technological progress.