Qualitative evaluation

I’ve had some discussions recently with medical colleagues about evaluation, where they have essentially suggested that I put more emphasis on qualitative evaluation. That is, in AI and NLP we usually focus on numbers and quantitative evaluation, and perhaps in some cases this is a mistake. I think my group does more qualitative work than most NLP groups, but my medical colleagues felt we should consider doing even more.

I don’t think I’ve ever said much about qualitative evaluation in my blog, so I thought I would briefly discuss some techniques here which readers may find useful. Certainly my students have commented that it is difficult to find information on qualitative evaluation of NLG. There are a number of “best practice” guides to some types of quantitative evaluation in NLG (e.g., van der Lee et al 2024), but I am not aware of any guides to qualitative evaluation in NLG or NLP.

There is an extensive literature on qualitative evaluation in medicine. Although it’s quite old, I recommend that people who don’t know much about the topic read the qualitative research chapter in Greenhalgh’s classic book How to Read a Paper, which I think is an easy-to-read introduction to basic concepts and issues.

Qualitative error analysis

One relatively well-known qualitative technique in evaluation is qualitative analysis of errors. In other words, after running a quantitative evaluation (human or metric), the researcher chooses some example cases where the system did poorly, and manually analyses them in order to gain insights about what went wrong.

For example, the Babytalk BT45 system generated short summaries of patient record data which were intended to help clinicians decide on interventions. An evaluation showed the system was less effective than hoped, and Reiter et al 2008 looked qualitatively at cases where BT45 did poorly. This revealed a number of issues related to narrative quality, such as poor long-term overview of the data in the summaries. See the paper for details.
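As a minimal illustration of the selection step in this kind of error analysis, the sketch below picks the lowest-scoring outputs from a quantitative evaluation for manual inspection. The texts and scores are invented illustration data, not from the BT45 evaluation.

```python
# Hypothetical (output, score) pairs from some quantitative evaluation.
outputs = [
    ("Summary A", 0.91),
    ("Summary B", 0.42),
    ("Summary C", 0.77),
    ("Summary D", 0.35),
]

def worst_cases(scored_outputs, n=2):
    """Return the n lowest-scoring (text, score) pairs for qualitative inspection."""
    return sorted(scored_outputs, key=lambda pair: pair[1])[:n]

# Print the cases a researcher would then read and analyse manually.
for text, score in worst_cases(outputs):
    print(f"{score:.2f}  {text}")
```

The manual analysis itself (reading the cases and working out what went wrong) is of course the real work; the code only decides where to look.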

Analysis of free-text comments

Many NLG human evaluations ask subjects to provide free-text comments; these are valuable resources for understanding strengths and weaknesses. Sometimes researchers just find interesting comments and report them in the paper, but more structured analysis is also possible.

In particular, thematic analysis can be used to find common themes in the comments. For example, earlier this year I contacted people involved in the ReproHum project and asked them which factors they thought influenced reproducibility. I did a simple thematic analysis on their responses, which involved identifying themes and then annotating which themes occurred in which response.

For example, one theme I identified was numberSubjects. This was expressed in different ways in the responses, including “Small Sample Size”, “The number of subjects evaluating”, and “Number of subjects is important”; I annotated all of these as numberSubjects. I then looked for the most common themes, and by far the most common was guidelinesAndTraining (blog), which other people have also identified as a major issue (eg, Ruan et al 2024).
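Once responses have been manually annotated with themes, counting which themes are most common is straightforward. The sketch below assumes hypothetical annotations (the responses are invented; the theme names numberSubjects and guidelinesAndTraining follow the example above).

```python
from collections import Counter

# Hypothetical manual annotations: each response is tagged with the
# themes a human annotator judged it to express.
annotations = {
    "response_1": {"numberSubjects", "guidelinesAndTraining"},
    "response_2": {"guidelinesAndTraining"},
    "response_3": {"numberSubjects", "guidelinesAndTraining", "interfaceDesign"},
}

# Tally how many responses mention each theme.
theme_counts = Counter(theme for themes in annotations.values() for theme in themes)

# Most common themes first.
for theme, count in theme_counts.most_common():
    print(theme, count)
```

Note that the hard part of thematic analysis is the human judgement involved in identifying themes and annotating responses; the tallying is trivial once that is done.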

A more complex example of structured analysis of free-text comments is discussed in Section 4.4 of Hunter et al 2012.

Focus groups

Focus groups bring together relatively small groups (often 5–10 people) to discuss a topic. The participants are chosen to have different backgrounds and perspectives, and a moderator leads the discussion.

I have not been involved in many focus groups in the past, but earlier this year my student Mengxuan Sun used focus groups in her research on understanding how well language models can communicate complex medical information to patients. It was an interesting and worthwhile endeavour, and I think NLG researchers should consider making more use of focus groups.

Details are given in Mengxuan’s paper, but at a high level she had 3 focus groups, each of which had around 8 participants, who were a mixture of patients and caregivers, clinicians, NHS IT staff, and computer scientists. The sessions lasted around an hour. Participants first individually looked at GPT-4 summaries of patient data, and then the group as a whole discussed how useful and effective these were.

Many interesting and important issues were raised in the groups, including overuse of medical jargon, lack of personalisation to individual circumstances, and trust (almost all the patients said they would struggle to trust GPT, especially after seeing the mistakes in the examples). Again details are in the paper, but the point I wanted to make here was that the focus group highlighted a number of important issues (especially trust) which we had not paid sufficient attention to. The fact that the focus groups included people with different backgrounds was also really useful, and discussions between patients, clinicians, and computer scientists led to additional valuable insights.

Another example of using focus groups in AI and healthcare is Musbahi et al.

Final thoughts

There are many other techniques which can be used in qualitative evaluation, such as participant observations and various types of interviews (Greenhalgh and Turner); the ones I have mentioned above are just the ones I have the most personal experience with.

A key advantage of qualitative evaluation is that it allows participants to express their thoughts in an open-ended manner, which can raise new issues and/or reveal new insights. For example, Mengxuan also asked people to rate and annotate LLM summaries of medical data, and this gave a lot of interesting data on overall quality and specific problems (eg, Americanisms, wrong URLs, lab results reported wrong). But it was a structured exercise where people followed instructions and focused on what we asked them to look at. In contrast, the focus group discussion was more open-ended and gave insights about broader issues which we had not really considered, such as trust and lack of personalisation.

Another student, Adarsa Sivaprasad, is currently asking patients to fill out a structured questionnaire (about a tool used to predict likelihood of success in IVF) which also asks patients for free-text comments. The free-text comments received so far have been really interesting and provide very useful insights on issues which the main part of the questionnaire did not address; again this is a benefit of encouraging open-ended feedback.

I am by no means an experienced qualitative researcher! But I increasingly think that if we genuinely want to understand how well a system works and what benefits it provides (as opposed to just getting a good position on a leaderboard), then we need to evaluate qualitatively as well as quantitatively.
