evaluation

How effective is prompting?

I was very impressed by a recent paper that compared prompting-based MT to MT based on trained models. The results are very interesting: prompting-based MT generates fluent texts, but these often have accuracy problems. The paper itself is also an excellent example of a high-quality NLP evaluation, and I recommend it to anyone who wants to do good NLP evaluations.
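For readers who have not seen this setup, below is a minimal sketch of what prompting-based MT looks like in practice: the translation is produced by giving a large language model a few-shot prompt, with no MT-specific training. The language pair, the example sentences, and the build_mt_prompt function are my own illustrative assumptions, not details from the paper.

```python
# Minimal sketch of prompting-based MT: translation via a few-shot prompt
# to a large language model, rather than via a trained MT model.
# The example pairs below are illustrative assumptions.

few_shot_examples = [
    ("Le chat dort sur le canapé.", "The cat is sleeping on the sofa."),
    ("Il pleut depuis ce matin.", "It has been raining since this morning."),
]

def build_mt_prompt(source_sentence: str) -> str:
    """Assemble a few-shot French-to-English translation prompt."""
    lines = ["Translate French to English."]
    for src, tgt in few_shot_examples:
        lines.append(f"French: {src}\nEnglish: {tgt}")
    # The model is expected to continue the text after "English:"
    lines.append(f"French: {source_sentence}\nEnglish:")
    return "\n\n".join(lines)

print(build_mt_prompt("Elle lit un livre dans le jardin."))
```

The prompt string would then be sent to a language model; the fluency-versus-accuracy trade-off the paper reports arises because the model has no translation-specific supervision beyond these in-context examples.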

evaluation

Comparing Human Evaluations

I was impressed by a recent paper by Läubli et al which experimentally compared the results of different human evaluations in MT (e.g., how do results differ between expert and non-expert human raters?), in the context of understanding when MT systems are “better” than human translators. It would be great to see more experimental comparisons of different human evaluations in NLG!