
We should evaluate real-world impact!

I’ve had several discussions over the past few months about evaluations that measure the real-world impact of NLG/NLP systems. In other words, instead of calculating metrics on test sets or asking human subjects whether they like what they see, we measure the impact on users when the system is deployed in the real world. I see lots of claims that LLMs will revolutionise the world in all sorts of ways; surely some of these claims should be backed up by experiments that measure real-world impact?

Unfortunately, real-world impact evaluations are rare in the NLP research literature; it does not seem to be part of our culture. Indeed, most of my discussions have been with researchers who work in other areas (such as medicine) where it is expected that high-quality evaluations will measure whether something works in the real world.

Of course commercial NLP companies have zillions of white papers and case studies which claim to demonstrate real-world impact, but these are marketing documents, not scientific experiments.

Anyways, I think we need at least some evaluations that scientifically measure real-world impact! Below I describe a few ways of doing this (there are of course others).

Randomised controlled trial

In medicine, the most rigorous and trusted type of evaluation is a randomised controlled clinical trial (RCT). I once did such an evaluation. We had developed an NLG system which produced personalised smoking-cessation letters, and wanted to check whether the letters actually helped smokers quit. So we set up an RCT where 2500 smokers were sent either our NLG letter or control material. We waited six months, identified the individuals who had stopped smoking, and measured whether there were more such individuals in the NLG-letter group than in the control groups (unfortunately, one of the control groups had more quitters than the NLG group, so our system was not effective).
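
To make this concrete, here is a rough sketch (in Python, with made-up quit counts rather than the actual trial data) of the kind of two-arm comparison such an RCT ultimately boils down to:

    # Hypothetical counts for illustration only; not the real trial data.
    # Rows are study arms, columns are [quit, did not quit].
    from scipy.stats import chi2_contingency

    table = [[30, 1220],   # NLG-letter arm
             [35, 1215]]   # control arm

    chi2, p_value, dof, expected = chi2_contingency(table)
    letter_rate = table[0][0] / sum(table[0])
    control_rate = table[1][0] / sum(table[1])
    print(f"quit rates: letter arm {letter_rate:.3f}, control arm {control_rate:.3f}")
    print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
    # Only a small p-value (together with a higher quit rate in the letter arm)
    # would let us claim the letters had a real-world impact.

Of course a real trial typically also needs ethical approval, a pre-registered protocol, and a power calculation to choose the sample size, but the final readout is essentially a comparison of proportions like this.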

Large numbers of RCTs have been done in medicine, and I’ve also seen them used in other fields such as psychology and even economics. But they remain very rare in NLP, even in NLP for clinical applications. One of my PhD students, Mengxuan Sun, who is working on NLP in cancer care, did a survey of papers that applied NLP in this area, looking at papers published up to 2022; she found lots of impressive technology and interesting use cases, but nothing remotely resembling an RCT.

Process efficiency

For a number of years I worked in applications of NLG in the oil industry, and during this period I attended several engineering conferences in this area. They had numerous papers where an engineer applied an innovation to an oil platform, refinery, etc, and then reported how this innovation impacted production, efficiency, downtime, etc. We could do the same in NLG, especially for document production tasks.

Another PhD student, Francesco Moramarco, in fact did this for an NLG system which summarised doctor-patient consultations; these summaries were edited by doctors before being saved in the patient record. Francesco analysed data from 5000 consultations where the system was used with real patients and compared these to consultations which did not use the system. He showed that the system saved doctors some time (although perhaps not as much as hoped) without diminishing quality; indeed reports produced by this process (doctors post-editing NLG summaries) had fewer errors than manually-written reports.
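
As a sketch of what such a process-efficiency analysis can look like, here is some Python with simulated timings (these are invented numbers, not Francesco’s data), comparing note-writing time with and without the system:

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    # Simulated seconds spent per consultation note; purely illustrative.
    with_nlg = rng.normal(loc=150, scale=40, size=200)     # post-editing an NLG summary
    without_nlg = rng.normal(loc=180, scale=45, size=200)  # writing the note from scratch

    stat, p_value = mannwhitneyu(with_nlg, without_nlg, alternative="less")
    print(f"median with NLG: {np.median(with_nlg):.0f}s, "
          f"without NLG: {np.median(without_nlg):.0f}s, p = {p_value:.4f}")
    # alternative="less" asks whether post-editing is faster than writing from
    # scratch; quality (e.g. error counts) has to be checked separately.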

It is very common for LLMs to be used in “human-in-the-loop” contexts, and in principle it should be relatively straightforward to do similar studies in other such use cases; but I have seen very few such studies in the research literature.

A/B Testing

It is common in IT for companies to evaluate new versions of websites and other services by A/B Testing, ie providing the new website to some customers and the old website to others, and measuring whether this has an impact on their behaviour. I’ve never done this myself, but it seems like a sensible way to evaluate the impact of many NLP systems. Indeed, I suspect some NLP companies do this, ie test new versions of models using A/B testing, but I’ve only seen a handful of papers in the research literature that evaluate systems using A/B testing.
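
For illustration, here is a minimal sketch (again Python, with hypothetical counts) of a typical A/B readout, comparing a behavioural metric such as task-completion rate between the two variants:

    from statsmodels.stats.proportion import proportions_ztest, proportion_confint

    # Hypothetical counts: users randomly assigned to the new (A) or old (B) system.
    completed = [530, 480]    # users who completed their task in each variant
    exposed = [5000, 5000]    # users assigned to each variant

    z, p_value = proportions_ztest(count=completed, nobs=exposed)
    ci_a = proportion_confint(completed[0], exposed[0])
    ci_b = proportion_confint(completed[1], exposed[1])
    print(f"completion rate A: {completed[0] / exposed[0]:.3f}, 95% CI {ci_a}")
    print(f"completion rate B: {completed[1] / exposed[1]:.3f}, 95% CI {ci_b}")
    print(f"p = {p_value:.4f}")
    # The hard part is usually not the statistics but choosing a behavioural
    # metric that actually reflects value to users.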

Retrospectives

Finally, it would be useful to see retrospectives, which look back on a product or system and analyse successes and failures in real-world usage. I know lots of organisations do this internally, and see this as a valuable source of lessons for the future, but again I’ve rarely seen this kind of thing published.

One excellent example of a published retrospective in NLP is Strickland’s retrospective analysis of IBM Watson in healthcare (blog). This is full of valuable insights which I suspect generalise to other attempts to use AI in healthcare, such as the difficulty of keeping ML models up-to-date with the latest medical findings.

I’d love to see more such retrospectives!

Let me know if I’ve missed something!

There are thousands of NLP research papers published every year, and I’ve only read a handful of them, so I probably have missed papers that evaluate the real-world impact of NLP. Please let me know if you’ve seen a good paper along these lines, provided that it’s a scientific paper; I’m not interested in commercial white papers or case studies.

Also, do let me know if you are interested in carrying out an evaluation of real-world impact; I am happy to give advice if this is appropriate.

8 thoughts on “We should evaluate real-world impact!”

  1. Interesting article, thanks!

    But wasn’t “standard” NLG evaluation with metrics such as BLEU, BERTscore, etc. kind of thought of as a cost-effective proxy for real-world impact? Like based on the assumption that “If a system scores higher in metric X it will have better real-world impact.”

    Just like I think we could view “quality estimation” as kind of a cheap approximation of “reference-based evaluation”, we could view the “reference-based evaluation” as a cheap approximation of “real-world impact”.

    So imo real-world evaluation is maybe not precluded because people don’t want it / don’t think about it, but because it isn’t affordable in 99.999% of cases, e.g., to run an RCT with solid statistical power. Maybe people will try to approximate real-world eval with some LLM agent stuff in the future (since ChatGPT, everything seems possible…), but that’d then be just another approximation, and its usefulness may be unclear.


  2. Hi, unfortunately we don’t know that metric performance predicts real-world performance. There have been validation studies which check the correlation of metrics with human evaluations (https://ehudreiter.com/2018/07/10/how-to-validate-metrics/), but these all use human eval in artificial contexts. I’ve never seen a paper which correlates metrics with real-world utility/impact, probably because there are so few NLP papers which present real-world impact.

    Ie in medicine there is a combination of studies which measure real impact and studies which use “surrogate measures”, so medics can see how well surrogates agree with real-world outcomes, and indeed tune surrogates so that they better agree with the real world. But this can’t be done in NLP because there are very few studies that measure real-world impact.

    However I have seen papers which raise doubts about how well metrics predict real-world impact. Francesco Moramarco (mentioned in my blog) measured how well various metrics correlate with the time taken by a doctor to edit a computer-generated summary (in an artificial context, not the real world). He found that no metric did this well, and indeed the one that came closest was simple edit distance (better than BertScore, etc): https://aclanthology.org/2022.acl-long.394/

    Evaluating real-world impact takes a lot more time and money than metrics, but most scientific fields accept that expensive experiments are a necessary part of doing science…


    1. Thanks for the answer and the pointers, much appreciated.

      I basically agree with everything you’re saying; I just meant there’s kind of a problem due to how fast people iterate on ML models.

      Like what I mean is that developing a new drug usually takes a long time and is very costly, so an RCT is not only necessary but it should also be financially feasible. On the other hand, a new bell or whistle is quickly attached to some LLM…

      So while for sure you can accept that expensive experiments are a necessary part of science, there’ll always be some limit to what can be done, and for ML models the cost of making a small but impactful modification (e.g., simply re-writing a prompt can make an LLM behave differently) is infinitely cheaper compared to a meaningful downstream evaluation.

      Solid downstream evaluation is definitely ideal, and should actually be required in safety-critical areas like medicine, but given the fast iterations in “ad-hoc” computer science, I fear that it’s just not practical.


      1. I appreciate that people doing product development focus on which models/features/prompts/etc will lead to highest profits and sales, and are much less interested in careful scientific evaluation. This is fine, I don’t object to product development!

        However people who claim to do science should rigorously evaluate their hypotheses, and that requires careful measurements of meaningful outcomes. I for one believe that at least some of these measurements should look at real-world utility and impact. If nothing else, this will enable us to validate and calibrate our metrics.


      2. Much agree with this!

        > which prompts lead to highest profits and sales

        In a sense, isn’t that also real-world utility? Seems like a better and (depending on how it’s measured) possibly more scientific way of testing utility than using standard NLG metrics.


  3. I love your writing. I’m 23 and super inexperienced with NLP/NLG. I bet you’ve seen it all!

    Qualitative metrics indeed require lots of time and money. I found it most insightful that most metrics failed in Francesco Moramarco’s study, with the exception of edit distance. Occam’s razor in real life!

