Future of NLG evaluation
In a recent position paper, I argued that NLG evaluation in the future needs to become more rigorous. It also needs to move beyond benchmarks and focus more on impact, qualitative, and safety evaluation.
In most ways, NLG and NLP are much better in 2026 than when I got my PhD in 1990. Unfortunately research culture has gotten *worse* in this period, which really worries me as I retire. We have a culture which does not value scientific rigour, tolerates cheating and fraud, and in many ways is closed to new ideas and new people.
When we create complex prompts for LLMs, we face software engineering challenges similar to those in conventional software development (requirements, design, implementation and debugging, testing, maintenance). We need to better understand good software engineering for prompts.
I am often asked how AI will impact Computer Science teaching. The biggest challenge is adapting what we teach so that it is relevant to a world where AI assistants are heavily used in software development. We should also use AI tutors to help teach. Least important is making assessments more resistant to AI cheating.
25 years ago I proposed personal health assistants as a grand challenge for computer science. LLMs have brought this vision closer to reality, but many challenges remain. These include understanding requirements, adapting to individual users, showing effectiveness in RCTs, and running on cheap phones with limited Internet access.
There is very limited data on harms to real patients from using AI health chatbots. The limited data we have (from incident reports, clinical trials with patients, and health providers) suggests that bots are usually safe, but can cause harm in a few cases. More data is badly needed!
Quantitative comparisons of different LLMs are not very interesting in research papers, because the LLMs in question will probably be out of date by the time the paper is published. However, looking for behaviour which is shared by several LLMs is definitely interesting and worthwhile.
ACL/ARR have rules and guidelines for how papers are written. Unfortunately many authors (and reviewers) ignore these, which makes their papers harder to read and less useful. Please follow the rules!
A group that is reading my book sent me many questions, some of which we discussed in a call last week. I thought I would share the questions and my responses.
Most semantic evaluation of LLMs focuses on accuracy and hallucination. These are very important, but it is also important to look at completeness and omission: does the generated text include all of the key information which the user needs to know? Omissions are a huge problem in medical NLG, and in other NLG tasks as well.