
More on evaluating impact

I am very interested in measuring the real-world impact of NLP systems, by which I mean changes in Key Performance Indicators (KPIs) which are caused by using a deployed real-world NLP system. I recently published a paper “We Should Evaluate Real-World Impact” in Computational Linguistics as a “Last Word” opinion piece (DOI) (Arxiv). Amongst other things, the paper shows that impact evaluations are included in only 0.1% of ACL Anthology papers. I also gave an invited talk on impact evaluation at the ACL GEM workshop in July (PDF).

From the paper and talk I got some really good pointers to related papers, as well as very interesting comments and suggestions on evaluating impact; I summarise some of these below. Of course I cannot summarise everything; apologies in advance to people who made interesting comments which I do not mention below!

More examples of impact evaluation

Using RCT to evaluate LLM for software development

Becker et al 2025 (Arxiv) is a fascinating paper which presents a randomised controlled trial (RCT) of the effectiveness of LLM coding assistants. Becker et al recruited 16 experienced developers and asked them to complete 246 tasks, each of which addressed a genuine issue in an open-source repository to which the developers had contributed. Half of these tasks were done with LLM assistance, and half were done without LLMs. Becker et al then measured productivity, and discovered that using LLM tools made developers less productive (it increased completion time by 19%), which is quite a striking finding.
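
To make the design concrete, here is a minimal sketch of how the effect in such an RCT might be estimated from task-level completion times. This is not Becker et al’s actual analysis code, and the numbers are invented purely for illustration.

```python
# Minimal sketch (not Becker et al's analysis code) of estimating the effect of
# LLM assistance in a task-level RCT on completion times. The data are invented:
# in the real study, each task was randomly assigned to be done with or without
# LLM assistance and its completion time was recorded.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical completion times (minutes) for the two arms of the trial.
time_with_llm = rng.lognormal(mean=4.2, sigma=0.6, size=120)
time_without_llm = rng.lognormal(mean=4.0, sigma=0.6, size=120)

# Completion times are skewed, so work on the log scale; a difference in mean
# log-times corresponds to a multiplicative effect on typical task time.
log_with, log_without = np.log(time_with_llm), np.log(time_without_llm)
t_stat, p_value = stats.ttest_ind(log_with, log_without, equal_var=False)

# Ratio of geometric means: values above 1 mean LLM assistance slowed tasks down.
slowdown = np.exp(log_with.mean() - log_without.mean())
print(f"Estimated multiplicative effect of LLM assistance: {slowdown:.2f}x")
print(f"Welch t-test on log completion times: t={t_stat:.2f}, p={p_value:.3f}")
```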

I really like this paper (and mentioned it in my GEM talk). It is a good concrete example of how to evaluate LLMs with an RCT, and it also shows that a careful RCT impact evaluation can give very different results from benchmarks (LLMs do great at coding benchmarks) and subjective human evaluations (the devs in the study who used LLM tools thought they were helpful).

In the coding world, LLM tools are probably useful in some contexts (coding task, dev skill, etc) but not others. Becker et al show that the only way to genuinely find out when LLMs are useful is to do proper impact studies such as RCTs; we cannot rely on benchmarks or subjective human ratings.

Before-and-after study of behaviour change

I am working with several students (in Aberdeen and elsewhere) on building apps that encourage people to change their behaviour, and then evaluating whether the app does actually change behaviour. We have just finished two evaluation experiments, and expect to finish a third in August. These studies have not yet been published (hopefully they will be within the next year or so), so I don't want to say much about them here. But I thought I would mention an earlier paper which illustrates the sorts of experiments done in these studies.

Braun et al 2018 (DOI) described an app that gave feedback to drivers on unsafe driving behaviour (such as speeding), using data acquired by GPS. The system was evaluated by giving it to 6 drivers who used it for a month (that is, they drove as normal and got feedback from the app). Braun et al measured changes in unsafe driving over this period; that is, they compared the frequency of unsafe driving incidents before and after people used the app. The results of the study were inconclusive, probably because it was too small; the above-mentioned current experiments use a similar evaluation approach but with many more subjects.
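
Below is a minimal sketch (again with invented numbers, not Braun et al’s analysis) of the kind of paired before-and-after comparison such a study involves. With only 6 drivers a test like this has very little power, which is one reason a small study can easily come out inconclusive.

```python
# Minimal sketch of a before-and-after comparison of unsafe-driving incidents,
# with one (invented) count per driver for matched periods before and during
# app use. This illustrates the study design; it is not Braun et al's code.
from scipy import stats

incidents_before = [14, 9, 22, 7, 11, 16]   # hypothetical counts, one per driver
incidents_during = [10, 7, 15, 8, 6, 13]

# The observations are paired per driver, so a Wilcoxon signed-rank test is a
# reasonable small-sample choice; with n=6 its power is very limited.
stat, p_value = stats.wilcoxon(incidents_before, incidents_during)
print(f"Wilcoxon signed-rank test: statistic={stat}, p={p_value:.3f}")
```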

Note that these studies do not qualify as impact evaluation under my criteria, because the apps/systems are not production-deployed systems. Nevertheless, they do give insights into the effectiveness of apps in real usage.

Measuring the impact of bias

Bias is a huge issue in NLP and LLMs, with many researchers trying to reduce it. However, from an evaluation perspective, I do not see much impact evaluation of bias, perhaps because it is subtle and difficult to link to quantifiable KPIs.

Savoldi et al (ACL Anthology) present an interesting approach to quantifying bias, by measuring the time required to post-edit generated texts to remove bias. In other words, their KPI is the amount of human post-editing required to fix the problem. I think this is a really interesting approach, which could be used in other contexts where we are trying to create quantifiable KPIs for subtle attributes of texts.
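
As a rough illustration of the idea (the field names and numbers below are made up, and this is not Savoldi et al’s actual setup), post-editing effort can be logged per generated segment and then turned into a simple per-word KPI on which systems can be compared.

```python
# Minimal sketch of turning post-editing effort into a KPI: log, per generated
# segment, how long a human needed to fix the problem (here, bias), then
# aggregate into a per-word figure. All names and numbers are hypothetical.
segments = [
    {"system": "A", "n_words": 24, "postedit_seconds": 31.0},
    {"system": "A", "n_words": 18, "postedit_seconds": 0.0},   # no edits needed
    {"system": "B", "n_words": 24, "postedit_seconds": 55.0},
    {"system": "B", "n_words": 18, "postedit_seconds": 12.5},
]

def seconds_per_word(system: str) -> float:
    """Average post-editing time per generated word for one system."""
    rows = [s for s in segments if s["system"] == system]
    return sum(s["postedit_seconds"] for s in rows) / sum(s["n_words"] for s in rows)

for system in ("A", "B"):
    print(f"System {system}: {seconds_per_word(system):.2f} post-edit seconds per word")
```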

More surveys of impact evaluation

There are other evaluation surveys which show that impact evaluation is very rare.

  • My student Mengxuan Sun did a careful systematic scoping review of NLP in Cancer Care (DOI), including medical as well as NLP venues. Despite the medical topic and inclusion of medical venues, none of the papers surveyed had anything remotely approaching an impact evaluation.
  • Burden et al 2025 (Arxiv) survey 125 evaluation papers (apparently selected on an ad-hoc basis), and characterise them into 6 “paradigms”. Only four papers are in the “Real-World Impact” paradigm, and none of these are in the ACL Anthology.

As a caveat, if any of my readers wish to do their own survey, please use a systematic literature review like Sun did (or like my CL paper). Ad-hoc paper selection is poor practice (large potential biases) and also not replicable.

Other comments

Diversity of evaluation

After my GEM talk, I had a really nice discussion with Iryna Gurevych about impact evaluation. She made the excellent point that the fundamental problem was lack of diversity in NLP/ACL evaluations. We need more impact evaluations, but we should also have more high-quality user studies and more qualitative evaluations (and probably other things as well); and we also need benchmark evals.

One of the reviewers of my CL paper commented on the NLP evaluation “ecosystem”, by which he/she meant the range of evals in NLP. I think this is a great perspective; we will learn more and make more progress (both scientific and applied) if we have an evaluation ecosystem which includes many types of evaluation. A “monoculture” which focuses on benchmark and test set performance will miss many scientifically important insights.

Commercial limitations on publishing

Several people from companies told me that they did conduct impact evaluations internally, but were not able to publish them because they were regarded as commercially confidential. I appreciate this constraint (and have encountered it myself in commercial work). A related issue is that companies are very reluctant to publish negative results about their products (for example, an LLM company would not have published Becker et al’s negative result, mentioned above).

For this reason, I think it is important that academics remain involved in impact evaluation. From a resource perspective, it is often easier for companies to evaluate impact. However, from a scientific perspective it is essential that research is published, and this is often easier for academics. Sometimes commercial-academic collaborations can work well, such as the work we did with Babylon Health on evaluating their consultation summarisation system.

Final thoughts

My recent CL paper shows that there are very few papers about impact evaluation in the ACL Anthology, which is depressing. So the above papers and comments are encouraging, because they show that some people are thinking seriously about impact evaluation!

3 thoughts on “More on evaluating impact”

    1. Really interesting, thanks for pointing this out! I’m slightly annoyed that they treat P = 0.087 as significant, but it’s still great to see an RCT at ACL on impact of LLMs on tutoring

