Evaluating chatGPT

Update 27-Apr-23: An excellent example of a good evaluation of GPT is “Large language models effectively leverage document-level context for literary translation, but critical errors persist” (https://arxiv.org/abs/2304.03245). I encourage anyone who is interested in rigorous evaluations of GPT to read this paper!

Occasionally people ask for my advice on evaluating chatGPT (or GPT4). I love getting such questions, because they are much more constructive than, say, debating whether chatGPT is “Artificial general intelligence” (AGI) or a threat to humanity. My cynical view is that much (not all) of this debate is driven by people whose “business model” requires getting the attention of lots of people (politicians, journalists, pundits, think tanks, etc). Such people want to capitalise on chatGPT’s prominence and visibility. Since they cannot comment technically and have zero interest in careful scientific experiments, they instead start pontificating on whether AI is a threat to humanity.

To me, the most important question about GPT and LLM technology more generally is what it can and cannot do. We need to understand this in order to know when and how to use the technology, and indeed where it needs to be improved. Exploring this issue requires careful high-quality scientific experiments (and I am sick of the prominence given to press releases, cherry picked examples, dubious experiments, etc). If anyone wants to do such experiments, I am happy to help and give advice! Below I give an example and some pointers on what not to do.

Example: chatGPT in health chatbots

One of my PhD students, Simone Balloccu, is interested in using chatGPT in health chatbots. I don’t want to give details here since this is ongoing work, but you can get a general idea of the sort of thing Simone is interested in from Wu, Balloccu et al 2022 and Balloccu and Reiter 2022.

Anyways, from an evaluation perspective, what Simone has done is

  1. Get a set of novel inputs (essentially statements and queries in a specific health domain); Simone gets these from crowdworkers (Mechanical Turk and Prolific).
  2. Run these inputs through chatGPT to get responses.
  3. Ask domain experts to evaluate the quality and appropriateness of the generated feedback.
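
The three steps above can be sketched as a small script. This is only an illustrative outline, not Simone’s actual setup: the function name `build_annotation_sheet`, the `generate` callable (standing in for whatever way you query chatGPT), the CSV column names, and the two example inputs are all assumptions made up for this sketch.

```python
import csv
import io
from typing import Callable, Iterable


def build_annotation_sheet(
    inputs: Iterable[str],
    generate: Callable[[str], str],
) -> str:
    """Run each novel input through the model and return a CSV sheet
    with empty columns for domain experts to fill in."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Hypothetical columns; adapt them to your own annotation scheme.
    writer.writerow(["item_id", "input", "response", "expert_rating", "expert_comments"])
    for item_id, text in enumerate(inputs, start=1):
        response = generate(text)  # step 2: get the model's response
        # Step 3: rating and comments are left blank for the experts.
        writer.writerow([item_id, text, response, "", ""])
    return buffer.getvalue()


# Placeholder model so the sketch runs without any API access.
def fake_model(prompt: str) -> str:
    return f"(model response to: {prompt})"


# Step 1: novel inputs (made-up examples; in practice these come from crowdworkers).
sheet = build_annotation_sheet(
    ["I keep skipping breakfast.", "Is it OK to snack late at night?"],
    fake_model,
)
print(sheet)
```

Passing the model in as a plain function keeps the sketch independent of any particular API, and makes it easy to record which model version produced each sheet.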

As mentioned, this is ongoing work, but the preliminary results do suggest strengths and weaknesses. I won’t go into this here, except to say that some of what Simone is finding supports my earlier observations that health-related LLM output can be inappropriate even if it is accurate. Incidentally, I myself would not use chatGPT for health information. If I want online information about a health issue, I will use a Google search and click on a website I trust, such as https://www.nhs.uk/.

This approach is not limited to health. I have an MSc student who is doing something similar to investigate chatGPT in a software development context, and he is also finding out interesting things about strengths and weaknesses. Hopefully I’ll have more MSc students working in this area over the summer. The approach probably won’t work everywhere, but I think it does work in many contexts.

An even better experimental approach is to deploy the system in real usage and study its impact. I’ve not seen any such studies of chatGPT to date, but we described one such study (using a different technology) in Knoll et al 2022.

Advice: Do not evaluate chatGPT on Internet data

The above approach involves soliciting new input test data. Of course most people who claim to evaluate chatGPT don’t do this; instead they get test data from the internet. Some people even use test sets where (input, output) pairs are available on the internet, such as leaderboard data sets. This is a **REALLY** bad idea. Since chatGPT was trained on a big chunk of the Internet, “evaluating” it on test data which is available on the net amounts to testing chatGPT on its training data, which violates one of the fundamental principles of machine learning and means the results are worthless.

I’m also wary of testing chatGPT on inputs from the public internet even if a formal (input, output) dataset is not available, because in many cases corresponding outputs do exist for the inputs, even if there isn’t a formal leaderboard data set. For example, if we ask chatGPT to describe a sports match and give it data from a real game, chatGPT could simply find a story written by a human sportswriter about this match and regurgitate it. And if we ask chatGPT to pass an exam, we need to ensure that the exam questions (perhaps in a different form) are not available on the Internet.

Having said this, it is possible to do a proper experiment using Internet data (a good example is Vilar et al 2022). However, this is not easy and requires a lot of careful attention to the above issues. For most people I think the safest approach is to use test data which is not on the internet. This could be crowdsourced data (which Simone used), or it could be data which was never publicly released because of commercial or data-protection issues.

Advice: Evaluate chatGPT outputs with human domain experts

The above approach also asks human domain experts to evaluate the quality of chatGPT’s texts. Again this is unusual; most NLP researchers use metrics or ask random Mechanical Turkers to evaluate quality. I appreciate that using metrics is a lot cheaper and quicker than using domain experts, but the results from domain experts are a lot more meaningful and trustworthy, especially if part of the investigation is identifying texts which are inappropriate in more subtle ways, such as making users unnecessarily scared or depressed. I have discussed this in previous blogs (example).

Advice: Avoid commercial conflict-of-interest

It is better if the experiments are done by researchers who are not connected to the company which is developing and selling this software. My above-mentioned MSc student made this point strongly to me a few days ago. After having read a variety of papers, he is now very skeptical about experiments done by researchers who have a commercial link to the product they are evaluating. Researchers are expected to declare commercial conflicts of interest in medical research; perhaps this should be required in AI as well. I certainly am very wary when reading about experiments where the authors have a clear commercial interest in a positive outcome.

Caveat: Reproducibility

Unfortunately, experiments done with chatGPT are generally *not* reproducible, because OpenAI is continually updating chatGPT. For example, if Simone reruns his inputs through chatGPT a month later, he often gets different responses, which means the evaluation by domain experts may be different.

This is a real pain from the perspective of doing careful science! But I think that we still need to make the effort, and my personal view is that even if the detailed results of the evaluation change when an experiment is rerun, in most cases high-level insights about strengths and weaknesses will persist (but this is based on a few anecdotal examples, I have not properly tested this hypothesis).

To put this another way, even taking this issue into account, any careful scientific evaluation of chatGPT is probably going to be more meaningful than the alternative, which is low-quality experiments, cherry-picked examples, etc.

Final thoughts

In order to use LLM technology to help people, we need to do proper experiments to understand the strengths and weaknesses of the technology. So far I have seen very few such experiments, which is a real shame. High-quality experiments require a lot more work than “experiments” based on applying weak metrics to published leaderboard data, finding impressive cherry-picked examples, etc, but they are the only way to really explore what LLM technology can and cannot do.

In short, I strongly encourage people who want to understand how to use LLM technology to do high-quality experiments, using novel test data as input and domain experts to evaluate output quality. And if you think I can help, please ask!

10 thoughts on “Evaluating chatGPT”

  1. Very important post. Thank you!!

    If we are doing error annotation of ChatGPT answers, what aspects of the answers should we consider (e.g., Omissions, Discourse errors, Factual errors [Thomson et al., 2023], Stylistic, Oversimplification, etc.)?

    When doing human evaluation, should an annotator know whether she is annotating errors in a ChatGPT response or a human response?

    ChatGPT can give multiple answers to a question. Should we evaluate only the first answer?

    Some relevant papers:
    Is GPT-3 Text Indistinguishable from Human Text? SCARECROW: A Framework for Scrutinizing Machine Text (https://arxiv.org/pdf/2107.01294.pdf)
    The Authenticity Gap in Human Evaluation (https://arxiv.org/pdf/2205.11930.pdf)

    Thank you…


    1. Hi, thanks for your nice words about my post.

      My recommendation is to do human evaluation by either (A) annotating specific problems in a generated text or (B) assessing the impact of a text on task performance. I think these are more meaningful than evaluations with Likert scales or preference rankings.

      Lets use error annotations to evaluate systems!

      I see that the second paper you cite seems to assume that human evaluations are always done using Likert scales; this is not true!

      In terms of specific annotation scheme, many have been proposed (I give a few examples in the above blog), and of course which is appropriate depends on the use case and context. So it’s hard to give concrete advice without knowing more about what you are doing. But having said this, I think it usually is important, especially in chatGPT contexts, to identify both explicit factual errors, and discourse/context/pragmatic errors (where texts are literally true but misleading in context). It’s hard to generalise about omissions; they are very important in some contexts but less important in others. Similarly for stylistic errors and oversimplifications.

      We usually tell annotators when they are annotating computer-generated texts, unless the experiment is comparing generated texts to human-written texts (in which case it is important that annotators do not know the source of the text they are annotating).
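
For the comparison case, hiding the source from annotators could look something like the sketch below. It is only an illustration under assumed names (`blind_texts`, the `item-001` style ids, and the placeholder texts are all invented here): the texts are shuffled, given neutral ids, and the id-to-source key is kept by the experimenter.

```python
import random
from typing import Dict, List, Tuple


def blind_texts(
    generated: List[str],
    human: List[str],
    seed: int = 0,
) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """Return (items shown to annotators, private key mapping item id -> source)."""
    labelled = [(t, "generated") for t in generated] + [(t, "human") for t in human]
    rng = random.Random(seed)  # fixed seed so the shuffle can be reproduced
    rng.shuffle(labelled)
    items: List[Dict[str, str]] = []
    key: Dict[str, str] = {}
    for i, (text, source) in enumerate(labelled, start=1):
        item_id = f"item-{i:03d}"
        items.append({"id": item_id, "text": text})  # no source field for annotators
        key[item_id] = source  # kept by the experimenter only
    return items, key


# Placeholder texts, purely for illustration.
items, key = blind_texts(["Generated text A", "Generated text B"], ["Human text C"])
```

Only `items` would go to the annotators; `key` is used afterwards to split the annotations by source.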

      About multiple answers, the main problem is that chatGPT can generate different texts from the same prompt on different runs. In an ideal world we would run chatGPT 10 times on each prompt, and evaluate each of the ten responses, in order to get a performance profile. I’ve never done this, though, because of resource constraints; would be great if someone else tried this!
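
If someone did collect annotations for repeated runs, a performance profile per prompt could be summarised along these lines. This is a minimal sketch with assumed names and made-up numbers: `performance_profile` and the choice of summary statistics are my illustration, and the error counts are invented, not results from any experiment.

```python
from statistics import mean
from typing import Dict, List


def performance_profile(
    errors_per_run: Dict[str, List[int]],
) -> Dict[str, Dict[str, float]]:
    """Summarise annotated error counts across repeated runs of each prompt.

    Assumes every prompt has at least one run.
    """
    profile = {}
    for prompt, counts in errors_per_run.items():
        profile[prompt] = {
            "runs": len(counts),
            "mean_errors": mean(counts),
            "worst_run": max(counts),
            # Fraction of runs in which annotators found at least one error.
            "share_of_runs_with_errors": sum(c > 0 for c in counts) / len(counts),
        }
    return profile


# Invented error counts for ten runs of two prompts, purely illustrative.
example = {
    "prompt-1": [0, 1, 0, 0, 2, 0, 0, 1, 0, 0],
    "prompt-2": [3, 2, 4, 3, 3, 2, 5, 3, 4, 3],
}
profile = performance_profile(example)
for prompt, summary in profile.items():
    print(prompt, summary)
```

Reporting the worst run alongside the mean matters because a prompt that is usually fine but occasionally produces a seriously inappropriate response looks very different from one that is consistently mediocre.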


  2. It’s always enjoyable to read your work, Professor Reiter. One interesting question about the “accurate but inappropriate” behavior of NLP models, though, is whether this behavior is reducible to a series of machine learning problems that can eventually be solved by different NLP models. In other words, if we can quantify/discretize/objectify the “inappropriateness”, then technically it can be solved as a classification/regression/stochastic generation problem in NLP. But even if this kind of “inappropriateness” model can be trained, the question remains whether this iterative approach (identify problem – redefine as an NLP research question – train a model) can be extended to the point of completely replacing the human factor. I somehow feel the human-in-the-loop factor is inevitable (and should be embraced in training and evaluation alike) as long as realistic application of LLMs and other NLP models is the goal.


  3. I think this is a key evaluation challenge. I’m sure models will improve, I’m not going to try to predict how much! But however much they improve, we need good evaluation to understand how and when to use the improved models, including human-in-loop issues.


    1. By ecologically valid, I mean realistic. So if we want to evaluate chatGPT as a tool to help developers, we should ideally get developers to use it in real-world development tasks, and measure their productivity to see if it has increased. If this isn’t possible in a real-world setting, we can do this in an artificial setting, but we should try to make it as realistic as possible.


      1. Just to follow up with another concrete example, if we want to evaluate an AI model in a medical context, such as diagnosing patients, an ecologically valid evaluation is to have the model either directly interact with patients, or support a doctor who is interacting with patients, and assess diagnostic quality (we can also look at things like time taken to diagnose). An evaluation which is not ecologically valid is to test the system on a standard medical exam; this is not ecologically valid because the test-taking context is very different from the real-life medical context; e.g. test-taking ignores the difficulty of getting accurate symptom information from patients (https://ehudreiter.com/2023/04/09/chatgpt-in-health-exciting-if-we-ignore-the-hype/).

