other

What LLMs cannot do

I summarise a few papers I have recently read on what LLMs can and cannot do. One (not surprising) finding is that LLMs' skill profile is very different from that of humans. This is good: it means that a human and an LLM working together can do things that neither could do alone. It also means that it makes little sense to evaluate LLMs using tests and techniques designed to evaluate people.

evaluation

There are many types of human evaluation!

Many people assume that “human evaluation” means asking people to rate or rank outputs. However, there are many other types of human evaluation, most of which give more meaningful results than rating or ranking! I discuss some of these, including task-based evaluation, annotation-based evaluation, and real-world evaluation.