I have previously discussed How to do an NLG Evaluation: Human Ratings in Artificial Context. In this post I look at how to do a human-ratings evaluation in a real-world context. Such evaluations have the great advantage of ecological validity; people are using the system on data sets they care about, in real-world contexts.
Roughly speaking, real-world ratings evaluations are done by contacting people who are using an NLG system for real tasks, and asking them to fill out questionnaires where they rate the system on a Likert scale (typically on the basis of readability, accuracy, and usefulness) and make free-text comments about the system and its texts. This exercise has many similarities to the way web sites (and indeed some software packages) are evaluated, where users of web sites are occasionally asked to fill out questionnaires about the web site’s usefulness, etc. A simple example of a real-world ratings evaluation of an NLG system is presented by Turner et al, and a more complex example is described by Hunter et al.
In the rest of this post I will give some advice on experimental design (hypotheses, subjects, materials, procedures, and analysis) for such studies. As always, I will focus on simple advice that I hope will be useful to people who are not experienced in conducting such evaluations.
[Please note that this section is identical to the Hypotheses section of How to do an NLG Evaluation: Human Ratings in Artificial Context]
Your hypotheses are up to you! However, make sure you decide what they are *before* you do your experiment. If you do want to explore alternative hypotheses after the experiment is done, these should be reported in your publications as post-hoc hypotheses which need to be tested in future work. Best practice is to write down your hypotheses before your experiment. If you have multiple hypotheses, you should apply a multiple-hypothesis correction to your statistical analysis. The simplest is Bonferroni, which means setting the statistical significance threshold to .05/N, where N is the number of hypotheses you are testing (eg, if you are testing 5 hypotheses, then the statistical significance threshold is .05/5 = .01). The need to reduce the significance threshold means that you shouldn't list hundreds of hypotheses willy-nilly; you should focus on a small number of key hypotheses.
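As a concrete sketch, the Bonferroni arithmetic above can be written out in a few lines of Python (the alpha level, hypothesis count, and p-value are just illustrative values, not from any real study):

```python
# Sketch of a Bonferroni correction, using the illustrative values
# from the example above (alpha = .05, 5 hypotheses).
alpha = 0.05
num_hypotheses = 5

# Bonferroni: divide the significance threshold by the number of hypotheses.
threshold = alpha / num_hypotheses
print(threshold)  # 0.01

# A hypothetical p-value is only significant if it falls below the
# corrected threshold, not the original alpha.
p_value = 0.03
print(p_value < threshold)  # False: significant at .05, but not after correction
```

This makes concrete why long lists of hypotheses are costly: with 100 hypotheses the threshold drops to .0005, which very few real effects will pass.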
Post-hoc tweaking of hypotheses is very tempting. I’m sure we’ve all been in situations where our original hypothesis was not supported by experimental data, but a very plausible variation of the hypothesis is statistically significant. So why not “tweak” the hypothesis into its plausible variation and report this? However tempting, this is very bad practice, and indeed is one of the leading causes of “bad science” in medicine and elsewhere. If you are in this situation, be honest and report your original hypotheses as not supported, but also say that post-hoc analysis suggests alternative hypotheses which should be investigated in future experiments.
Subjects in real-world studies are of course recruited from real-world users of a system. The usual practice is to simply approach all users, and ask them to rate the NLG system. After all, more data is better!
One danger is that people who agree to fill out the questionnaire will not be representative of the user base of our system; this is sometimes called non-response bias. Another danger is that people who fill out the questionnaire will try to be “nice” and give ratings which do not actually reflect their experiences; this is called response bias (and it is not the opposite of non-response bias!). These are standard problems with questionnaires and there are many techniques for dealing with these issues. However, as far as I know, these techniques have rarely been used in NLG evaluation studies.
Material in a real-world study is simply whatever the subjects normally see whilst using the system. In a real-world study, we do not force subjects to look at specific scenarios which probably are not of interest or relevance to them.
Note that this means that the NLG system must be robust. If we are evaluating an NLG system in an artificial context, we can carefully choose scenarios where we know the system works reasonably well and does not crash. But if we are evaluating an NLG system in real-world usage, then it must be robust, otherwise comments and feedback will probably be dominated by complaints about the system not working or crashing.
Sometimes we show subjects alternate versions of a text; this can be considered a type of A/B testing. In particular, if there is an established NLG system and we wish to evaluate whether a new NLG system is better than the existing one, then we can randomly allocate subjects to get texts from either the old or the new system.
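A minimal sketch of this kind of random allocation, assuming we assign each subject to one system for the duration of the study (the subject IDs, system labels, and seed are all invented for illustration):

```python
import random

# Sketch: randomly allocate subjects to the old or new NLG system
# for an A/B comparison. IDs and labels are hypothetical.
subjects = ["s1", "s2", "s3", "s4", "s5", "s6"]

rng = random.Random(42)  # fixed seed so the allocation is reproducible
allocation = {s: rng.choice(["old_system", "new_system"]) for s in subjects}
print(allocation)
```

In practice you may prefer balanced allocation (shuffle the subject list and split it in half) so the two groups are the same size.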
[Please note that much of this section is identical to the Procedure section of How to do an NLG Evaluation: Human Ratings in Artificial Context. However, there are some differences]
In a human ratings study, subjects are typically asked to rate texts on a Likert scale. I personally think that a 5-pt Likert scale is usually sufficient, but some researchers prefer 7-pt scales. Some people like to use magnitude estimation, where users move a slider on a near-continuous scale; however I am not convinced that this provides much benefit over a Likert scale, at least for NLG evaluations.
In my studies, I generally ask subjects to rate texts on three dimensions: readability, accuracy, and usefulness. I don't always use this exact terminology; eg, I might ask subjects to rate the “helpfulness” rather than the “usefulness” of a decision-support system. These dimensions are not independent; in particular, a text is generally not useful unless it is both readable and accurate. But overall I think asking for usefulness as well as readability and accuracy gives useful insights, and subjects generally seem happy to rate texts on these three dimensions.
I always give subjects an opportunity to make free-text comments about the NLG system and the texts it generates. If a subject sees a small number of texts, we can ask for free-text comments on each generated text. However, if the subject sees dozens of generated texts, it may be better to ask for a single overall comment on the NLG system as a whole.
Finally, from an ethical perspective it is of course essential to get approval from an ethics committee if this is required in your country. In particular, if an NLG system is not currently being operationally used (either because it is completely new or because it is a variant of an existing system), then we may need to demonstrate that using the NLG system cannot harm subjects. For example, we evaluated Babytalk BT-Nurse by installing it in a hospital ward, asking nurses to use it, and soliciting ratings. In order to get ethical approval, we had a research nurse check each BT-Nurse text before the duty nurse saw it, to ensure that the text could not harm patient care. The research nurse did not in fact reject any BT-Nurse texts as potentially harmful, but having the nurse do this check was essential from an ethical approval perspective.
Sometimes similar constraints arise in a commercial context. For example, if a client commissions an NLG system and wishes to evaluate it using real-world ratings, the client may insist on safeguards to ensure that the NLG system does not do something which could damage its reputation (eg, use profanity in texts intended for children).
If a single system is being evaluated (ie, there are no alternative versions or A/B testing), then the most common practice is simply to report the raw questionnaire results (eg, how many subjects selected “Agree” on the Likert scale for helpfulness), together with a simple statistical analysis (eg, with chi-squared) to show that the differences are significant. Sometimes it is useful to present subgroup analyses, for example looking at scores for domain experts vs novices.
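To make the chi-squared step concrete, here is a hand-rolled goodness-of-fit test on invented Likert counts, testing whether the ratings deviate from a uniform distribution; in practice a library routine such as scipy.stats.chisquare would do this for you:

```python
# Sketch of a chi-squared goodness-of-fit test on Likert-scale counts.
# The counts below are invented for illustration only.
observed = {"Strongly disagree": 2, "Disagree": 3, "Neutral": 5,
            "Agree": 18, "Strongly agree": 12}

total = sum(observed.values())          # 40 responses
expected = total / len(observed)        # uniform expectation: 8 per category

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())

# Critical value for 4 degrees of freedom at alpha = 0.05 is about 9.49.
print(chi2)           # 23.25
print(chi2 > 9.49)    # True: ratings are not uniformly distributed
```

Note that this only shows the ratings differ from chance; it does not by itself show the system is good, which is why the raw response counts should also be reported.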
Often the most interesting information is the free-text comments by users, since users often describe how they use a specific text in the context of a specific task; many users also describe software bugs they encountered and give suggestions for improving the system. Free-text comments are especially useful for improving a system, because they list specific bugs and enhancement requests. In the BT-Nurse study, we annotated free-text comments by type, which showed that there were many more comments about content than about language; again this kind of information is often useful for developers.
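The comment-annotation analysis described above can be sketched as a simple tally once the manual annotation is done (the labels and annotated comments below are invented, not from the BT-Nurse study):

```python
from collections import Counter

# Sketch: tallying free-text comments after each has been manually
# annotated with a type. Annotations here are hypothetical.
annotations = ["content", "content", "language", "content",
               "bug", "content", "language", "content"]

counts = Counter(annotations)
print(counts.most_common())  # content comments dominate in this invented data
```

Reporting such counts (eg, content vs language comments) gives developers a quick picture of where improvement effort is most needed.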
If the study compared multiple systems (eg, A/B test), then the analysis is similar to what I described in How to do an NLG Evaluation: Human Ratings in Artificial Context.