evaluation

How effective is prompting?

I was very impressed by a recent paper that compared prompting-based MT to MT based on trained models. The results are very interesting: prompting-based MT generates fluent texts, but these often have accuracy problems. The paper itself is also an excellent example of a high-quality NLP evaluation, and I recommend it to anyone who wants to do good NLP evaluations.
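For readers who have not seen this setup, below is a minimal sketch of what prompting-based MT looks like in practice: the translation is produced by giving a large language model a few-shot prompt, with no MT-specific training. The language pair, the example sentences, and the build_mt_prompt function are my own illustrative assumptions, not details from the paper.

```python
# Minimal sketch of prompting-based MT: translation via a few-shot prompt
# to a large language model, rather than via a trained MT model.
# The example pairs below are illustrative assumptions.

few_shot_examples = [
    ("Le chat dort sur le canapé.", "The cat is sleeping on the sofa."),
    ("Il pleut depuis ce matin.", "It has been raining since this morning."),
]

def build_mt_prompt(source_sentence: str) -> str:
    """Assemble a few-shot French-to-English translation prompt."""
    lines = ["Translate French to English."]
    for src, tgt in few_shot_examples:
        lines.append(f"French: {src}\nEnglish: {tgt}")
    # The model is expected to continue the text after "English:"
    lines.append(f"French: {source_sentence}\nEnglish:")
    return "\n\n".join(lines)

print(build_mt_prompt("Elle lit un livre dans le jardin."))
```

The prompt string would then be sent to a language model; the fluency-versus-accuracy trade-off the paper reports arises because the model has no translation-specific supervision beyond these in-context examples.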

evaluation

Comparing Human Evaluations

I was impressed by a recent paper by Läubli et al which experimentally compared the results of different human evaluations in MT (e.g., how do results differ between expert and non-expert human raters?), in the context of understanding when MT systems are “better” than human translators. It would be great to see more experimental comparisons of different human evaluations in NLG!