We should evaluate real-world impact!
It is very rare to see evaluations in the NLP research literature that measure the impact of systems on real-world users. I’d love to see more such evaluations, so in this post I describe some ways of doing this, along with a few examples.