I was very impressed by a recent paper from a team at Facebook about a production-ready end-to-end neural NLG system. Especially interesting to me was the “engineering” approach to key issues such as accuracy, data collection, and latency.
I was shocked when a PhD student recently told me that he thought he had to focus on end-to-end neural approaches, because they dominate the conferences he wants to publish in. I'm all for research in end-to-end neural NLG, but fixating on it to the exclusion of everything else is a mistake, especially since end-to-end neural approaches do not currently work very well.
Craig Thomson and I will present a paper at INLG on a methodology for evaluating the accuracy of generated texts, based on asking human annotators to mark up factual errors in a text. This is not cheap, but I think it is the most robust and reliable approach to measuring accuracy.
I would like to see more PhD students and postdocs “getting their hands dirty” by collecting real-world data, working with real-world users and experts, and conducting real-world evaluations with users. It’s not easy, but engaging with the real world does help scientific and technological progress.
I recently attended a workshop on Safety for Conversational AI, which discussed how such systems could potentially harm people. Is it possible that NLG systems could harm their users, maybe even contributing to death in the worst case?
This is a personal blog. My son visited home after spending 6 months in a residential special-needs school because of lockdown. It was wonderful to see him again, and I took a few photos of his time at home.
Over the past few weeks, on several occasions I’ve struggled to understand papers because authors made mistakes in references, tables, figures, or formulas. I know that it’s boring for authors to check such things, but it makes life much easier for your readers!
I would love to be able to define objective criteria for evaluating NLG texts. In principle, I think we can use task-based evaluation to measure utility, and some kind of mistake counting to measure accuracy. However, it’s harder to think of a way to measure fluency without relying on human judgements.
Reviewing for big NLP conferences has changed drastically since 1990, when 11 senior researchers reviewed all ACL submissions. Perhaps our expectations about conference papers also need to change, and become more similar to expectations in other scientific fields.
Many people have asked me if OpenAI’s GPT-3 will have a big impact on NLG. I suspect its overall impact will be limited (outside of a few niches), but of course time will tell.