We need more extrinsic (task) evaluation!
The most meaningful evaluation is when we test whether an NLG system actually achieves its communicative goal, e.g., helps people make better decisions or write documents faster. Unfortunately, such “extrinsic” or “task” evaluation is rare in NLP in 2002; we need to see more such evaluations!