The most exciting and rewarding moments in research, at least for me, were when I discovered something new and interesting about NLG, language, etc. These were my “Eureka” moments and insights. None of them changed the world, but all of them were personally exciting to me, and many (not all) of them led to influential and highly-cited papers.
Below I list some of my Eureka insights throughout my career, at roughly five-year intervals. These cover a range of topics including algorithms, systems, evaluation, and language. I think most are still relevant in the LLM era of 2026, which is a good feeling. I also include pointers to relevant papers if people want to learn more.
I’m writing this partially because many people I talk to believe that research success is measured by grants, papers in good venues, awards, etc. Of course these are all important, not least because they matter for hiring and promotion. But for me, Eureka moments are the most exciting and rewarding aspect of research; they are certainly what I remember best when I look back at my research career. I hope my readers will have many Eureka moments of their own!
Mathematical analysis of generating referring expressions
My PhD thesis was about generating descriptions of objects. I had a few “Eureka” moments when doing it, but the one I remember the most, and which had the most impact (eg, led to my most-cited journal paper) involved generating referring expressions that identified an object in a visual context. I realised that existing approaches could be mathematically formalised as a set cover problem (where the set being covered was visually salient “distractor” objects). This in turn allowed me to use computational complexity techniques to show that it was computationally difficult (NP-Hard) to find a minimal-length referring expression, which other researchers had argued should be our goal. I suggested using a greedy algorithm instead, which also seemed closer to how people do this task.
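To make this concrete, below is a minimal Python sketch of the greedy, set-cover-style approach: each attribute of the target object "covers" (rules out) some of the distractors. The scene, attribute names, and function are invented for illustration; this is not the exact algorithm from the thesis.

```python
def greedy_referring_expression(target, distractors):
    """Pick attributes of `target` that rule out all `distractors`.

    `target` and each distractor are dicts of attribute -> value,
    e.g. {"type": "cup", "colour": "red"}.
    Returns a list of (attribute, value) pairs for the description.
    """
    remaining = list(distractors)   # distractors not yet ruled out
    description = []

    while remaining:
        # Greedy step: choose the attribute that rules out the most remaining
        # distractors, rather than searching for a minimal-length description
        # (which is NP-hard in general, by reduction from set cover).
        best_attr, best_ruled_out = None, []
        for attr, value in target.items():
            ruled_out = [d for d in remaining if d.get(attr) != value]
            if len(ruled_out) > len(best_ruled_out):
                best_attr, best_ruled_out = attr, ruled_out
        if best_attr is None:       # no attribute helps; target cannot be distinguished
            break
        description.append((best_attr, target[best_attr]))
        remaining = [d for d in remaining if d not in best_ruled_out]

    return description

# Example: describe the red cup in a scene with a blue cup and a red plate.
target = {"type": "cup", "colour": "red"}
distractors = [{"type": "cup", "colour": "blue"}, {"type": "plate", "colour": "red"}]
print(greedy_referring_expression(target, distractors))
# -> [('type', 'cup'), ('colour', 'red')], i.e. "the red cup"
```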
This kind of mathematical formalisation and analysis of NLG was extremely rare in 1990. It's a bit more common (although still unusual) in 2026, which is good to see. Mathematical analyses are not always appropriate, but they can in some cases provide very useful insights about NLG.
Papers: ACL (1990), Cognitive Science (1995)
Pipeline architecture for NLG
After my PhD I spent five years as a post-doc and then working for a company. The Eureka moment I remember best, and which had the most impact, was realising that NLG systems could be constructed as pipelines, with modules for document planning, microplanning, and surface realisation (I later extended this architecture for data-to-text by adding pipeline modules for signal analysis and data interpretation). It's a simple architecture, but it works. Over the years I have seen many people start with one-stage approaches to NLG (whether fill-in-the-blank templates or LLMs), but then move towards structured pipelines when they have to debug, adapt, and maintain complex NLG systems. The pipeline architecture led to my most cited publication (my first book) and an INLG Test-of-Time award (for my data-to-text version of the pipeline).
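Here is a minimal Python sketch of the idea for a toy weather forecast. The stage names follow the pipeline architecture, but the module implementations and data formats are purely illustrative placeholders, not code from any real system.

```python
def document_planning(data):
    """Decide what to say: select content and organise it into messages."""
    return [{"msg": "temperature", "value": data["temp"]},
            {"msg": "wind", "value": data["wind"]}]

def microplanning(messages):
    """Decide how to say it: lexical choice, aggregation, referring expressions."""
    phrases = []
    for m in messages:
        if m["msg"] == "temperature":
            phrases.append(f"temperatures around {m['value']} degrees")
        elif m["msg"] == "wind":
            phrases.append(f"{m['value']} winds")
    return phrases

def surface_realisation(phrases):
    """Turn the abstract phrases into a grammatical, punctuated sentence."""
    return "Expect " + " and ".join(phrases) + "."

def generate(data):
    # The pipeline: each module's output is the next module's input,
    # which makes individual stages easy to debug, adapt, and maintain.
    return surface_realisation(microplanning(document_planning(data)))

print(generate({"temp": 12, "wind": "north-easterly"}))
# -> "Expect temperatures around 12 degrees and north-easterly winds."
```

In a data-to-text system, signal analysis and data interpretation modules would sit in front of document planning, turning raw sensor or database records into the messages the rest of the pipeline works with.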
Papers: ENLG (1994), book (2000), ENLG (2007)
NLG texts can be better than human texts by using words consistently
In the early 2000s I started focusing on data-to-text, looking initially at generating weather forecasts. The most memorable Eureka moment for me in this period was when we used machine learning to analyse lexical (word) choice in a corpus of human-written forecasts, and discovered that the Author feature was very important; ie, different human authors (meteorologists) used words differently in their forecasts. Our user studies with forecast readers showed that they found this confusing, which meant that an NLG system which used words consistently could generate texts which readers preferred over human texts! We showed that this actually happened in an experiment with forecast readers.
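As a toy illustration of this kind of corpus analysis (the data below is invented, not the actual forecast corpus), one can tabulate which near-synonym each forecaster uses for the same underlying concept:

```python
from collections import Counter, defaultdict

# (author, word chosen for the concept "wind speed decreasing") -- invented data
corpus = [
    ("F1", "easing"), ("F1", "easing"), ("F1", "decreasing"),
    ("F2", "decreasing"), ("F2", "decreasing"),
    ("F3", "falling"), ("F3", "falling"), ("F3", "easing"),
]

by_author = defaultdict(Counter)
for author, word in corpus:
    by_author[author][word] += 1

for author, counts in sorted(by_author.items()):
    print(author, dict(counts))
# Each forecaster has a different favourite word for the same concept.
# This is the Author effect: readers found the variation confusing,
# while an NLG system can pick one word and use it consistently.
```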
In 2026 it's common to talk about "superhuman AI", but in the early 2000s this was unheard of. So it was very exciting to show that NLG systems could produce better texts than human writers, and to explain why this was the case, based on a corpus linguistics analysis. This led to the first mention of my research in the popular media.
Papers: Computational Linguistics (2002), Artificial Intelligence (2005)
Simple ngram metrics are meaningless in NLG
I have always been very interested in evaluation, and in the mid-2000s I became concerned by the proliferation of papers which used simple ngram metrics such as BLEU or ROUGE to evaluate NLG systems, usually with no justification other than "this is what everyone else is doing". I decided to try to get actual empirical data on whether these metrics correlated with human judgements of generated texts, so that we could discuss their appropriateness in a scientifically informed manner. I did some experiments and later a structured literature survey, and these uniformly showed that BLEU and ROUGE were very poor predictors of human judgements. More recently, my student Francesco Moramarco showed that simple edit-distance was more meaningful than common ngram metrics.
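The core of such a validation study is simple: collect metric scores and human judgements for the same systems, and compute the correlation. The sketch below assumes scipy is available; the scores are invented purely for illustration, not real results.

```python
from scipy.stats import pearsonr, spearmanr

# One score per NLG system (e.g. averaged over its output texts) -- invented numbers.
metric_scores = [0.42, 0.55, 0.31, 0.61, 0.48]   # e.g. BLEU or ROUGE
human_ratings = [3.1, 2.8, 3.5, 2.9, 4.0]        # e.g. mean human quality judgements

r, r_p = pearsonr(metric_scores, human_ratings)
rho, rho_p = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {r_p:.2f}), Spearman rho = {rho:.2f} (p = {rho_p:.2f})")
# A metric is only a trustworthy surrogate for human evaluation if such
# correlations are consistently high; the studies mentioned above found they were not.
```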
This was not a surprising insight to me, but it was nice to have proper evidence to back up my beliefs! It was also nice that the 2009 paper was nominated for (but did not win) a Test-of-Time award. Thankfully in 2026 I see a lot less use of BLEU and ROUGE to evaluate NLG. But I still see some usage, which is disappointing since the evidence is clear that these metrics are not meaningful!
Papers: EACL (2006), Computational Linguistics (2009), Computational Linguistics (2018), ACL (2022).
Experiments need to be as realistic as possible
Around 2010 we were working on our Babytalk data-to-text system, which summarised electronic patient data about babies in neonatal ICU for doctors, nurses, and parents. One memorable insight for me was about evaluation. Our first evaluation, of Babytalk for doctors, used a carefully controlled psychological experiment: doctors saw different information presentations of data from babies who had been in hospital five years previously and decided on an intervention, which we assessed for correctness. I thought this was the best NLG evaluation I had ever been involved in, but the clinicians we worked with did not like it, because it was not realistic. They insisted that the next evaluation (of the Babytalk system for nurses) be done by installing the system on the ward and getting nurses to use it for real, as part of their workflow in supporting actual babies in the ward.
Other experiences after 2010 convinced me that the doctors were right: if we really want to understand how effective an NLG system is, we need to get people to use it in real-world contexts. I tried to make this point in a recent opinion piece in Computational Linguistics.
Papers: Computational Linguistics (2009) (Section 2), JAMIA (2011), Computational Linguistics (2025)
User requirements are incredibly important
In the 2010s I mostly worked for Arria, and had less involvement with academic research. I learned a lot from this work, but unfortunately I cannot publicly discuss many of my insights. However, I will say that one important "Eureka" insight was that it was essential to really understand user requirements and build systems which met them. This may sound obvious, but most NLP academics have little interest in or experience with user requirements. Let's just say that at Arria, I saw several projects collapse because the academic researchers running them (including me) did not put nearly enough effort into understanding the requirements of the various stakeholders we were trying to please.
I don't have any papers about this from my Arria days. However, my student Francesco Moramarco wrote an excellent paper about this, which won a Best Paper award at NAACL, and my second book has an entire chapter on user requirements for NLG.
Papers: NAACL (2022), book (2025)
Error annotation is a great way to do human evaluation of texts
Around 2020 I was very dissatisfied with metric-based evaluation of NLG, and also with human evaluation based on subjective Likert ratings. Of course the ideal evaluation was impact evaluation, but this was very expensive and time-consuming, and often not possible. We needed a simpler and cheaper evaluation technique which was still meaningful.
I was intrigued when my student Craig Thomson started experimenting with evaluating texts using error annotations. I worked with Craig to develop an evaluation protocol where we asked people to read texts and mark up mistakes (a bit like MQM in machine translation), and then evaluated the texts based on the number and type of errors. This approach has worked really well, and we have used it in many subsequent NLG projects at Aberdeen. I am convinced that it is the most meaningful way to evaluate generated texts if impact evaluation is not possible.
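The sketch below shows one way such annotations can be represented and aggregated into an evaluation. The categories and data are illustrative, loosely inspired by MQM-style schemes rather than our exact protocol.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    text_id: str     # which system/text the annotation belongs to
    span: tuple      # (start_char, end_char) of the marked mistake
    category: str    # e.g. "incorrect_number", "incorrect_word", "context"
    severity: str    # e.g. "major" or "minor"

# Invented annotations for illustration.
annotations = [
    ErrorAnnotation("sys_A/text1", (10, 14), "incorrect_number", "major"),
    ErrorAnnotation("sys_A/text1", (40, 52), "incorrect_word", "minor"),
    ErrorAnnotation("sys_B/text1", (5, 9), "context", "minor"),
]

def error_profile(annos):
    """Count errors per system, category, and severity; fewer and less severe is better."""
    profile = Counter()
    for a in annos:
        system = a.text_id.split("/")[0]
        profile[(system, a.category, a.severity)] += 1
    return profile

for key, count in sorted(error_profile(annotations).items()):
    print(key, count)
```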
Papers: INLG (2020), Computer Speech and Language (2023)
Patients want to understand why an AI model ignores features
If I look at very recent work, it is of course difficult to say what will have lasting impact. But certainly one recent “Eureka” insight for me came from my student Adarsa Sivaprasad. Adarsa is working on explaining simple “white-box” ML models to end users, and wanted to understand what explanation needs arose when real people used an AI model which they really cared about. Adarsa did this by looking at a model developed by Aberdeen’s Medical School which predicts success probability of IVF (ie, chance of having a baby); this is something that IVF patients deeply care about! Adarsa started by analysing user comments on the tool (which is deployed) and followed this up with surveys and interviews. And one thing that came out very strongly was that people wanted an explanation of what features the model looked at in making its prediction, and why some features were ignored. Eg, “does the model take into consideration that I have PCOS? If not, can I believe its prediction?”
We didn't expect this, because it's not something which is much discussed in the XAI community. But it makes sense to me: if Jane Doe wants to know whether to trust an AI model, she may not feel comfortable trying to understand explanations of its reasoning, but she can make judgements about whether the AI model took into consideration what she thinks is important information.
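As a toy sketch of the kind of explanation this suggests, one can simply tell the user, for each factor she cares about, whether the model uses it. The feature names below are invented and are not those of the actual IVF model.

```python
# Hypothetical feature set of a white-box prediction model (illustrative only).
MODEL_FEATURES = {"age", "duration_of_infertility", "previous_pregnancies"}

def coverage_explanation(patient_concerns):
    """For each factor the patient cares about, say whether the model uses it."""
    lines = []
    for factor in patient_concerns:
        if factor in MODEL_FEATURES:
            lines.append(f"The prediction does take {factor} into account.")
        else:
            lines.append(f"The prediction does NOT use information about {factor}.")
    return "\n".join(lines)

print(coverage_explanation(["age", "PCOS"]))
# -> explains that age is used, and that PCOS is not part of the model.
```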
Papers: AIIH (2025) [hopefully more coming soon]
Final Thoughts
The “Eureka” moments described above are the ones which most excited me. There is a good correlation with my most influential and cited papers, but it is by no means perfect. Some of the above insights have not (yet) led to high-impact papers, and some of my most influential papers are not related to any of the above insights. But still, the correlation is reasonable.
So if you want to write high-impact papers, follow up on your “Eureka” insights! Even if this takes time (it took me several years to develop some of the above insights into high-impact papers). I appreciate that many people feel pressured to churn out incremental papers, because this is quicker and more reliable, and hiring and promotion decisions can be heavily influenced by paper counts. But keep in mind that when you look back on your career (as I am doing), it will be the “Eureka” insights that you will remember and value.