NOTE: My survey of impact evaluation is being published in Computational Linguistics (DOI)
I am writing an opinion piece on the need for more evaluation in NLP of real-world impact, by which I mean measuring KPIs (key performance indicators) of real users using deployed systems. As part of this, I am doing a survey of such evaluations in ACL Anthology papers. The survey is pretty depressing. Perhaps 0.1% (1 in 1,000) of Anthology papers contain an evaluation of real-world impact, and 2/3 of these just briefly describe the impact evaluation (e.g., one paragraph giving the results of a live A/B test), usually after spending several pages describing a metric evaluation on a test set.
The latter is especially depressing, because it suggests that even when people do gather data on real-world KPIs, they do not regard it as very important (at least in the context of an academic NLP paper). In my mind, data on real-world impact is far more interesting than metric evaluations, but it seems clear that most people regard metric evaluations as more important; perhaps this is part of the machine learning mindset?
But anyway, even if only 1 in 3,000 Anthology papers properly describes an impact evaluation, this is still a significant number of papers (roughly 30), since there are 100K papers in the Anthology. I describe a few of these papers below, in the hope that these examples will encourage other researchers to consider evaluating real-world impact.
Clinical trial (smoking cessation)
The oldest paper in the Anthology which gives real-world impact data is my paper (Reiter et al 2001), which describes using a medical randomised controlled clinical trial to evaluate whether an NLG system that produced smoking-cessation letters actually helped people to stop smoking. We basically recruited 2500 smokers, asked them for information about their smoking, and then split them into three groups (one got tailored NLG letters, one got default fixed letters, and one got just a thank-you letter). After six months, we measured smoking cessation rates in the three groups, and discovered that the “fixed letter” group had a higher cessation rate than the “NLG” group. Oh, well… Reiter et al (2003) is a journal paper which describes this project and experiment in more detail.
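To make the analysis concrete, here is a minimal sketch (in Python, using scipy) of how cessation counts across three trial arms could be compared with a chi-squared test. The counts are invented purely for illustration; they are not the trial's actual data.

```python
# Sketch of analysing a three-arm RCT outcome (quit vs did not quit).
# All counts below are hypothetical, for illustration only.
from scipy.stats import chi2_contingency

counts = [
    [30, 820],  # tailored NLG letters: [quit, did not quit] (hypothetical)
    [35, 815],  # default fixed letters (hypothetical)
    [25, 825],  # thank-you letter only (hypothetical)
]

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
# A small p-value would indicate that cessation rates differ between the arms.
```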
My literature search did not discover any other papers in the Anthology which give KPIs from a clinical trial. However, there are papers in the medical literature which describe clinical trials of NLP systems. For example, Meystre and Haug (2008) use an RCT to evaluate an NLP information extraction tool.
A/B testing (improving selection of bids in sponsored search)
My search showed that the most common type of real-world impact evaluation in NLP is A/B testing. This has some similarities to a clinical trial: it involves deploying two versions of a system (e.g., a web page) to different subsets of users and comparing KPIs across the two versions. A/B testing is already heavily used for things like web design, so it's relatively straightforward to also use it to evaluate NLP systems.
Mohankumar et al (2024) used an A/B test to evaluate the impact of a system which improves the selection of bids for sponsored search, and showed that it increased revenue by 1%. The application is fairly representative; what I like about this paper is that they give some details about their A/B test evaluation (as mentioned above, most Anthology papers that use A/B testing give very little information about what they actually did). Another paper which gives experimental details is Russell and Gillespie (2016), who use A/B testing to evaluate machine translation systems.
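For readers who have not run an A/B test themselves, here is a minimal sketch of the kind of analysis involved, using simulated per-query revenue rather than data from any of the papers above; real platforms typically use more sophisticated variance estimation (e.g., bootstrapping over users).

```python
# Sketch of an A/B test analysis on a revenue-style KPI.
# The per-query revenue values are simulated, not taken from any cited paper.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
revenue_a = rng.gamma(shape=2.0, scale=1.00, size=10_000)  # control bucket
revenue_b = rng.gamma(shape=2.0, scale=1.01, size=10_000)  # treatment bucket

lift = revenue_b.mean() / revenue_a.mean() - 1
t_stat, p_value = ttest_ind(revenue_b, revenue_a, equal_var=False)  # Welch's t-test
print(f"estimated revenue lift: {lift:+.2%}, p = {p_value:.3f}")
# Per-query revenue is usually heavy-tailed, so in practice bootstrap or
# cluster-robust methods are preferred over a plain t-test.
```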
Before-and-after evaluation (medical scribing tool)
A before-and-after evaluation (sometimes called a pre-post study) measures how KPIs change once an NLP system is deployed, compared with the period before deployment. Duggan et al (2025) is a paper in a medical journal which uses this methodology to evaluate a tool that helps clinicians write some types of clinical documents; they report that using the tool reduced writing time by 20%. One of my PhD students, Francesco Moramarco, described a before-and-after study in a similar domain in his PhD thesis; he found a smaller gain (10%), but this may be because he used older technology (BART).
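As a rough illustration of what such an analysis might look like, here is a sketch comparing simulated per-document writing times before and after deployment; the numbers are made up and are not from Duggan et al or Moramarco's thesis.

```python
# Sketch of a before-and-after (pre-post) comparison of writing times.
# The timings are simulated, for illustration only.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
minutes_before = rng.lognormal(mean=2.0, sigma=0.4, size=500)  # pre-deployment
minutes_after = rng.lognormal(mean=1.8, sigma=0.4, size=500)   # post-deployment

reduction = 1 - minutes_after.mean() / minutes_before.mean()
stat, p_value = mannwhitneyu(minutes_before, minutes_after, alternative="greater")
print(f"mean writing time reduced by {reduction:.0%} (p = {p_value:.3f})")
# Unlike an RCT or A/B test there is no concurrent control group, so changes in
# case mix or user familiarity with the tool can confound the comparison.
```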
I have also seen before-and-after studies of LLMs in other fields, such as Pandey et al (2024) (a blog post) which described the impact of an LLM coding assistant. Unfortunately, the Anthology papers I found which used before-and-after evaluation did not give sufficient detail about what they actually did. For example, Yoon et al (2024) claim “a reduction of processing time by over 60%”, but give no details about how this was calculated.
Observational study (identifying patients with Covid)
Controlled trials, A/B tests, and before-and-after studies are all based on comparing an NLP system against a control or baseline. But if the system is doing something novel, then there may not be a control system, in which case we can simply report KPIs on how well the system did.
For example, Chapman et al (2020) developed an NLP system which analysed patient notes in order to identify people who might have Covid. This was a novel application at the time, so there was no baseline, and instead they reported that their system identified 6,360 patients who had Covid.
Extended study (alerting credit officers to relevant news)
Most impact evaluations report data from a few weeks or months. A rare exception is Nygaard et al (2024), which reports three years of KPI data on an NLP system that alerted credit officers to relevant news items about clients. It would be great to see more papers which give data collected over several years of usage!
Final thoughts
I would love to see more applied NLP papers evaluating real-world impact. I hope the above examples are interesting and perhaps inspirational for some of my readers. Feel free to contact me directly if you are thinking of doing a real-world impact evaluation and want my thoughts.
Hi, thanks a lot for writing this post and the paper (and for giving the talk at the GEM2 workshop yesterday!). A few months ago, we wrote a survey of AI evaluation and identified six paradigms, one of which was “real-world impact”. Confirming your findings, we also found that this was the least represented paradigm: we included 4 papers/preprints originally, one of which was retracted! The non-retracted papers (all about using LLMs) are:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4944588
https://arxiv.org/abs/2409.04109
https://arxiv.org/abs/2306.01694
Hope this is interesting
Many thanks for your survey and the papers. I note that none of the papers you mentioned are in the ACL Anthology, which is consistent with my findings.
About evaluation types, by the way (which you discuss in your survey): I would also like to see more qualitative evaluation; this is accepted and respected in medicine, but not in CS and AI. A colleague made the point to me very strongly that we need more high-quality user studies in our evaluations.
Thanks for the feedback about qualitative evaluations. I guess, under our “paradigms” framework, some of them could be characterised in the “evals” or “exploratory” paradigm, based on their aim, but I suppose some may fall outside any of our paradigms. In a way, the absence of that evaluation paradigm in our survey is due to its relative absence in the field, as we defined the paradigms by looking at what is done in the literature. I fully agree that more qualitative evaluations would be insightful.