Let's use error annotations to evaluate systems!

I am excited by the idea of using error annotation to evaluate NLG systems, where domain experts or other knowledgeable people mark up individual errors in generated texts. I think this is usually more meaningful and gives better insights than asking crowdworkers to rate or rank texts, which is how most human evaluations are currently done.
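
To make the idea concrete, here is a minimal sketch (in Python) of what an individual error annotation might record. The field names and error categories are my own illustration, not a standard schema from any particular annotation tool or paper:

```python
# A minimal sketch of what an individual error annotation might record.
# The fields and category names are illustrative assumptions, not a
# standard schema.
from dataclasses import dataclass
from collections import Counter

@dataclass
class ErrorAnnotation:
    text_id: str              # which generated text the error was found in
    span: tuple[int, int]     # character offsets of the erroneous span
    category: str             # e.g. "incorrect number", "contradiction", "omission"
    severity: str             # e.g. "minor" or "major", as judged by the annotator
    comment: str = ""         # optional free-text explanation from the annotator

# Example: annotations from one expert on two generated texts.
annotations = [
    ErrorAnnotation("text_01", (12, 18), "incorrect number", "major"),
    ErrorAnnotation("text_01", (44, 52), "omission", "minor"),
    ErrorAnnotation("text_02", (3, 20), "contradiction", "major"),
]

# One simple way to turn annotations into a system-level summary:
# count errors per category.
print(Counter(a.category for a in annotations))
```

The point is that each error is recorded with its location, type and severity, so we can count and compare errors across systems instead of relying on a single overall rating.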