How accurate do chatGPT texts need to be?

A reader of my blog contacted me and asked whether people would accept some errors in chatGPT texts. That's a very important question which doesn't have a simple answer, because it depends on the use case (what are the texts used for), the workflow (are texts checked and post-edited before being used), and the type of error (some are worse than others), among other things. I will try to explain this a bit below, in case other readers have similar questions.

I should add that nothing I say below is novel, so people who are familiar with this topic may not learn much from this blog. But hopefully it will be useful to people who are new to this topic.

Example: Consultation Summaries

Let me start with an example from medicine. My student Francesco Moramarco worked on generating summaries of doctor-patient consultations; these are entered into the medical record, after being checked and post-edited (fixed) by a doctor. (He didn't use chatGPT, but the principles are the same.) He asked doctors to analyse the errors in the computer-generated summaries (paper), and in doing so asked them to distinguish between critical and non-critical errors; an error is critical if it is a mistake in essential clinical information (e.g., patient is vomiting), and non-critical otherwise (e.g., patient’s wife is vomiting). The final note entered into the patient record system (after being checked and post-edited by a doctor) must not contain any critical errors. It shouldn't contain non-critical errors either, but perhaps a small number of such errors could be tolerated in practice.

What is most important here is whether there are mistakes in the text entered into the patient record. So we need to look at the process and workflow as a whole (including doctors post-editing the computer summaries), not just the computer system in isolation. A related issue is whether errors can easily be detected by people.

Anyway, hopefully this example shows that our tolerance for errors depends on how texts are used (entered into the patient record), workflow (texts are checked and post-edited by doctors), and error type (critical vs non-critical).
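
The acceptance rule in this example can be written down as a small check. This is only a minimal sketch in Python: the class name, the function name, and the `max_non_critical` threshold are hypothetical choices made for illustration, and are not taken from Francesco's system or paper.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedError:
    """One error found by a doctor in the final, post-edited note (hypothetical type)."""
    description: str
    critical: bool  # True if the mistake is in essential clinical information


def note_is_acceptable(errors, max_non_critical=0):
    """Return True if the note could be entered into the patient record.

    Any critical error rules the note out. Non-critical errors are
    tolerated only up to max_non_critical (an assumed policy knob).
    """
    if any(e.critical for e in errors):
        return False
    non_critical = sum(1 for e in errors if not e.critical)
    return non_critical <= max_non_critical


remaining = [AnnotatedError("wrong detail about patient's wife", critical=False)]
print(note_is_acceptable(remaining))                      # strict policy
print(note_is_acceptable(remaining, max_non_critical=1))  # lenient policy
```

Note that the check is applied to the note after post-editing, since what matters is the text that ends up in the record, not the raw computer output.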

Use cases

High accuracy (at least for critical information) is essential in medical applications. But it is less important in some other contexts. For example, a student told me that he was using chatGPT to generate draft cover letters for job applications. Since human-written cover letters often contain some “deviation from the truth”, it's not a disaster if chatGPT-written cover letters are also not 100% factual.

An interesting intermediate case is sports journalism. Readers do expect stories about sports games to be accurate, but errors in such stories may not do much real-world damage. My student Craig Thomson looked at factual errors in both computer-written and human-authored sports stories, and found that a 300-word human-written story about a basketball game contained 1.5 factual errors on average (paper). This is much lower than the error rate in computer-generated sports stories, but it does suggest some tolerance for errors in this genre.
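
When comparing error counts across stories of different lengths, it helps to normalise them to a common unit. The snippet below is a small illustrative sketch; the function name is hypothetical, and the only figures used are the 1.5 errors per 300-word story quoted above.

```python
def errors_per_1000_words(errors: float, words: int) -> float:
    """Normalise an error count to errors per 1000 words."""
    return errors * 1000 / words


# Human-written basketball stories: 1.5 factual errors per 300-word story.
print(errors_per_1000_words(1.5, 300))  # 5.0
```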

Workflows

Text generation systems can be used in a number of different workflows:

Writing assistant. In this workflow, the system produces draft texts or suggestions for someone who is writing a document; for example the above-mentioned student used chatGPT as a writing tool for job cover letters. Since the chatGPT text here is just a suggestion for the human author, errors are probably acceptable (although of course not desirable).
Post-editing. In this workflow, the document is primarily written by the computer system, but a person checks and edits the final version; for example Francesco’s summarisation system for consultations. Errors here are definitely not desirable (and may increase the amount of post-editing required) but they are acceptable if they can be reliably detected and fixed by the human post-editor in an acceptable amount of time.
Fully automatic text generation. In this workflow, generated texts go directly to the user, without any “human-in-the-loop” (although of course the user can decide whether to believe the text or not); for example most computer-generated weather forecasts go directly to the user. Errors are much less acceptable in this workflow.

A related issue is whether a person (user or post-editor) can easily detect errors. For example, I asked chatGPT to generate a simple weather forecast, and its output included the sentence “The wind direction is primarily NWAL and NWAL WEST, but later on, it changes to SOUTH WEST” (blog). It's obvious that there is something wrong with this sentence, which limits the damage it can do (readers hopefully will ignore it).

On the other hand, some errors are very hard to detect. For example, in the context of exploring whether chatGPT could give good answers to exam-type questions, I asked it “how many registers are in the ARM architecture” (I teach computer architecture). chatGPT responded that the number varied, but generally there are 32 general-purpose registers, not counting PC and SP. In fact there are *31* general-purpose registers (documentation). If a student were trying to use chatGPT to answer exam questions (which of course is cheating), I doubt they would detect this error.

Another related issue is whether chatGPT is combined with other software. For example, Microsoft has announced that it will add chatGPT to Bing. Details are unclear at the time of writing, but my understanding is that chatGPT will largely be a “presentation layer” for Bing results. If so, this may reduce the number of accuracy errors.

Error types

Errors of course can have different severities. Francesco distinguished between critical and non-critical errors in his work, and Freitag et al (paper) distinguish between Major, Minor, and Neutral errors in the output of machine translation systems. Francesco, Craig (sports stories), and Freitag et al also distinguish between different types of errors, such as (in Craig’s case) incorrect words and incorrect numbers.

It's also important in many cases to identify sentences which are inappropriate even if they are accurate (blog). For example, my student Simone Balloccu is working on making health chatbots more sensitive to patients' emotional and stress state, and he is looking at using chatGPT to generate comforting and reassuring sentences to make users feel better. This work isn't published yet, so I don't want to say much about it, but Simone has asked therapists to check chatGPT’s suggestions, and most are fine but some are definitely unacceptable (likely to worsen the patient’s mental state) even if they are true.

Detecting errors

Last but not least, it is essential to accurately detect errors and measure factual correctness in texts produced by chatGPT and other text generation systems; if we don't know how many errors are made (and what type they are), we cannot judge whether they are acceptable in a specific context. I personally believe that the only way to accurately detect errors is by asking people with domain knowledge to carefully check texts for mistakes; this is what Francesco, Craig, and Simone do (and also Freitag et al). Craig has written up his methodology in detail (blog).

From this perspective, I am very disappointed that most academics seem to have little interest in properly measuring factual accuracy. A colleague recently pointed me to a paper which purported to evaluate the accuracy of chatGPT outputs using obsolete metrics such as ROUGE and BLEU. If you care about accuracy, measure it properly!!
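
A small illustration of why word-overlap metrics are a poor proxy for accuracy. The function below is a simplified unigram-overlap F1, written in the spirit of ROUGE-1; it is not the official ROUGE or BLEU implementation, and the three sentences are made up for this sketch, loosely based on the ARM register example above.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified word-overlap score (F1 over whitespace-separated tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the architecture has 31 general-purpose registers"
wrong = "the architecture has 32 general-purpose registers"   # factual error
paraphrase = "there are 31 registers for general use"         # correct, reworded

print(f"wrong answer:       {unigram_f1(wrong, reference):.2f}")       # 0.83
print(f"correct paraphrase: {unigram_f1(paraphrase, reference):.2f}")  # 0.31
```

The factually wrong sentence scores far higher than the correct paraphrase, because the metric rewards shared words and has no notion of which word carries the fact.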

Final thoughts

I appreciate the above is a long-winded response to what seemed to be a simple question about whether we could tolerate some errors in chatGPT output! But our tolerance for errors does heavily depend on context, including use case, workflow, and error types. And I suspect this will impact where chatGPT is used successfully.

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.
