Questions from readers of my book
A group who is reading my book sent me many questions, some of which we discussed in a call last week. I thought I would share the questions and my responses.
Most semantic evaluation of LLMs focuses on accuracy and hallucination. These are very important, but it is also important to look at completeness and omission: does the generated text include all of the key information which the user needs to know? Omissions are a huge problem in medical NLG, and in other NLG tasks as well.
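To make omission-checking concrete, here is a minimal sketch; the hand-written key-fact list and the naive substring matching are my own simplifying assumptions for illustration, not a method from the book (real systems need semantic matching, not string matching):

```python
# Minimal sketch of an omission check: given key facts the output should
# convey, flag any that the generated text does not mention. Substring
# matching is purely illustrative; real checks need semantic matching.

def find_omissions(generated_text: str, key_facts: list[str]) -> list[str]:
    """Return the key facts that do not appear in the generated text."""
    text = generated_text.lower()
    return [fact for fact in key_facts if fact.lower() not in text]

# Example: a discharge summary that omits a drug allergy.
summary = "Patient was treated for pneumonia and discharged on antibiotics."
facts = ["pneumonia", "antibiotics", "penicillin allergy"]
print(find_omissions(summary, facts))  # ['penicillin allergy']
```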
The most exciting and rewarding moments of my research career were when I discovered something genuinely new about NLG, language, etc. I describe a few of these “Eureka” moments. I hope my readers have them too; when I reflect on my career, these moments and insights are what I remember best, much more so than getting papers or proposals accepted.
I am excited by the idea of using AI to help people manage illness and health conditions. This isn't very sexy, but I think there is real potential to improve health outcomes and quality of life.
I hope to retire soon, and many people are asking about my plans. Basically I want to do lots of travel, stay involved in academia, and perhaps do some writing.
I strongly recommend that researchers do “sanity checks” on data, model outputs, and evaluation results, looking for anomalies. This can help detect data errors, model cheating, software bugs, and other flaws which distort experiments.
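As a rough illustration of the kind of checks I mean, the sketch below flags duplicate inputs, skewed label distributions, and suspiciously high scores. The specific checks and thresholds are illustrative assumptions, not a fixed recipe; the point is to surface anomalies for human inspection:

```python
# Illustrative sanity checks on a dataset and on evaluation results.
# Thresholds here are arbitrary; a flag means "look closer", not "reject".
from collections import Counter

def sanity_check_data(texts: list[str], labels: list[str]) -> None:
    # Duplicate inputs can inflate scores if they leak across splits.
    dupes = sum(c - 1 for c in Counter(texts).values() if c > 1)
    if dupes:
        print(f"WARNING: {dupes} duplicate texts in dataset")
    # A heavily skewed label distribution makes raw accuracy misleading.
    counts = Counter(labels)
    if max(counts.values()) > 0.9 * len(labels):
        print(f"WARNING: one label covers >90% of the data: {dict(counts)}")

def sanity_check_scores(scores: list[float]) -> None:
    # Near-perfect scores are more often a bug or leakage than a breakthrough.
    if sum(scores) / len(scores) > 0.99:
        print("WARNING: suspiciously high mean score; check for leakage or bugs")

sanity_check_data(["a", "a", "b"], ["pos", "pos", "pos"])
sanity_check_scores([1.0, 0.99, 1.0])
```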
LLMs often “cheat” on benchmarks via data contamination and reward hacking. Unfortunately, this problem seems to be getting worse, perhaps because of perverse incentives. If we want to genuinely and meaningfully evaluate LLMs, we need to move beyond benchmarks and start measuring real-world impact.
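One simple, if crude, diagnostic for data contamination is to flag benchmark items whose text overlaps heavily with the training corpus. The sketch below uses word n-gram overlap; the 8-gram window and the idea of a manual-inspection threshold are my assumptions for illustration, not an established protocol:

```python
# Rough contamination check: flag benchmark test items whose word n-grams
# overlap heavily with a training corpus. The window size (8) and any
# flagging threshold are arbitrary illustrative choices.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(test_item: str, train_corpus: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus."""
    test_grams = ngrams(test_item, n)
    if not test_grams:  # item shorter than n words
        return 0.0
    return len(test_grams & ngrams(train_corpus, n)) / len(test_grams)

# Items with a high score (say, above 0.5) deserve manual inspection.
```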
Research culture is very important but also very hard to change. I suspect this is one reason why it is so difficult to get people to do more rigorous and meaningful experiments.
When building an NLG system, it really helps to understand what users want; this came up several times at the recent INLG conference. I discuss some of our work in this space, and give a few suggestions.
I review some data on usage of AI in healthcare, and conclude that the most common uses in 2025 are probably (A) giving personalised health information to patients and (B) helping clinicians write documents. We’ve worked on these topics at Aberdeen, but most researchers focus on AI for decision support, which is not widely used.