Perhaps the most rigorous type of NLG evaluation is a task-based (extrinsic) evaluation in a real-world context. In other words, we deploy our NLG system in the real world, and measure whether it achieves its desired outcomes, such as users making better decisions, changing their behaviour, learning more, etc. This has some similarity to Human Ratings Evaluation in Real-World Context, but we are measuring changes in decision quality, etc, instead of asking users whether they thought the NLG system was effective.
For example, we evaluated the STOP system, which produced personalised smoking-cessation letters, by recruiting several thousand smokers, sending some of them STOP letters and the rest control material (eg, a fixed non-personalised letter), and then measuring how many people managed to stop smoking in the STOP and control groups. Of course, what we hoped to see was that more people quit in the “STOP” group than in the control group. However, it turned out the highest proportion of quitters was in the group who received the fixed non-personalised letters. The difference was not statistically significant, but it was still disappointing. But that’s science: experiments and evaluations don’t always turn out the way you want them to…
In the rest of this post I will give some advice on experimental design (hypotheses, subjects, materials, procedures, and analysis) for such studies. As always, I will focus on simple advice that I hope will be useful to people who are not experienced at conducting such evaluations.
As I’ve discussed elsewhere, it is essential to decide on hypotheses before you do the evaluation. Post-hoc tweaking of hypotheses can be very tempting but it is bad science.
The actual hypotheses of course depend on what your NLG system is intended to achieve. For example,
- If the NLG system is supposed to change behaviour, such as STOP or Braun et al 2015, then the hypothesis will be about behaviour change, such as reduced smoking for STOP, and less unsafe driving behaviour for Braun et al.
- If the NLG system is used to tutor students, such as diEugenio et al 2002, then the hypothesis will be about learning gain.
- If the NLG system is intended to inform users of important information, such as Williams and Reiter 2008, then the hypothesis will often be about whether the subjects can answer questions or otherwise show increased knowledge.
As can be seen, hypotheses for this type of experiment measure a quantitative “outcome variable” (such as percentage of subjects who stop smoking) which reflects the system’s goal, and assert that this variable will be higher (or lower) in the group using the NLG system compared to a control group. The specific hypotheses are quite diverse and depend strongly on the application and use case.
Subjects in real-world studies are of course recruited from real-world users of a system. One approach is simply to ask all users of the system to participate in the study. This is not always feasible, however, especially if measuring outcomes for hypothesis testing costs money or requires the experimenter to physically assess subjects. Regardless, remember that more data is better!
One danger is that people who agree to participate in the evaluation may not be representative of the user population. For example, poor drivers may not wish to participate in the evaluation of a driving behaviour-change system. Analysing outcomes by subgroups (eg, separately analysing outcomes for good drivers and poor drivers) can be useful here.
Real-world task/extrinsic studies often need a lot of subjects (we used 2553 subjects to evaluate STOP), because they are “noisy” in a statistical sense. For example, in a laboratory study we can insist that the subjects read texts in a quiet room where they can focus on what they are reading; whereas in a real-world study, some of the subjects may be quickly glancing at the letter while trying to deal with a screaming baby. Also, the outcome variable we are measuring is usually influenced by many things in addition to the quality of the NLG system; for example if the recipient of a STOP letter fails to stop smoking, this could be because the STOP letter was rubbish, but it could also be because the subject was a committed smoker who was not going to quit smoking regardless of what we told them.
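To get a feel for why noisy outcomes demand so many subjects, here is a minimal sketch of a standard sample-size approximation for comparing two proportions (normal approximation, two-sided test). The function name `subjects_per_group` and the example rates are my own illustrative assumptions; they are not taken from STOP or any other study.

```python
import math
from statistics import NormalDist


def subjects_per_group(p_control, p_nlg, alpha=0.05, power=0.8):
    """Approximate number of subjects needed in EACH group to detect a
    difference between two outcome rates with a two-sided test.

    Uses the usual normal approximation for two independent proportions.
    """
    if p_control == p_nlg:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p_control * (1 - p_control) + p_nlg * (1 - p_nlg)
    n = (z_alpha + z_power) ** 2 * variance / (p_control - p_nlg) ** 2
    return math.ceil(n)


# Hypothetical rates: a small real-world effect vs a large lab-style effect.
small_effect = subjects_per_group(0.03, 0.05)
large_effect = subjects_per_group(0.10, 0.30)
print(small_effect, large_effect)
```

The point of the comparison is that when the expected difference between groups is small, as it often is in real-world behaviour change, the number of subjects required grows very quickly.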
Material in a real-world study is simply whatever the subjects normally see whilst using the system. In a real-world study, we do not force subjects to look at specific scenarios which probably are not of interest or relevance to them.
This means that the NLG system must be robust. If we are evaluating an NLG system in an artificial context, we can carefully choose scenarios where we know the system works reasonably well and does not crash. But if we are evaluating an NLG system in real-world usage, then it must be robust in order to be effective.
Usually we want some subjects to use an alternative “control” or “baseline” system, so we can compare the impact of using the NLG system to the impact of using an alternative system. If the NLG system is intended to replace an existing system (which might use templates to produce texts instead of NLG), then the existing system is the control/baseline; this was the case in Williams and Reiter 2008 and diEugenio et al 2002, for example. If there is no existing system, then in some cases we can use a fixed canned text as a control (which was done in STOP).
In many cases subjects are assigned to different groups; typically one group uses the NLG system and a second group uses a control/baseline system (as described above). Sometimes we may have multiple NLG groups and/or multiple control groups. Subjects should be assigned to groups randomly, and (if this is feasible) should not know which group they have been assigned to.
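Random assignment can be sketched in a few lines. This is only an illustration under my own assumptions: the function name `assign_groups`, the group labels, and the subject IDs are hypothetical, and a real study may need something more elaborate (for example, stratified randomisation).

```python
import random


def assign_groups(subject_ids, groups=("nlg", "control"), seed=None):
    """Randomly assign each subject to one group, keeping group sizes
    as equal as possible.

    Subjects are shuffled and then dealt out to the groups in turn, so
    the experimenter has no influence on who ends up in which group.
    """
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return {sid: groups[i % len(groups)] for i, sid in enumerate(shuffled)}


# Hypothetical subject IDs.
assignment = assign_groups(["s01", "s02", "s03", "s04", "s05", "s06"], seed=42)
for subject, group in sorted(assignment.items()):
    print(subject, group)
```

Keeping the mapping in a separate file that subjects (and, ideally, whoever measures outcomes) do not see is one way to support the blinding mentioned above.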
Subjects then use their system normally (this is after all a real-world study), and then one or more outcome variables are measured; how this is done depends on what is being measured. I strongly recommend that subjects also be encouraged to give free-text comments and feedback on whatever system they used; if possible this should be done after the outcome variables have been measured.
Finally, from an ethical perspective it is of course essential to get approval from an ethics committee if this is required in your country! Even if it is not legally required, I recommend seeking out an ethical review of your experiment if this is possible.
We use statistical techniques to test whether there is a significant difference in the outcome variable between the NLG and control groups. Which statistical technique is appropriate depends on what we are measuring. For example, a chi-square test can be used for binary outcomes (such as whether subjects stopped smoking), while an ANOVA or General Linear Model can be used for numeric outcomes (such as number of unsafe driving incidents per km driven); see the Analysis section of How to do an NLG Evaluation: Human Ratings in Artificial Context. You should use a Bonferroni correction if testing multiple hypotheses, and you should always report two-tailed p values. If you believe that some subjects may just be “messing around” rather than seriously using the system, you can define an exclusion criterion, such as excluding subjects who used the system for less than a minute. You need to define this criterion *before* you do the experiment!
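As a rough illustration of the analysis steps for a binary outcome, the sketch below applies a pre-defined exclusion rule, runs a Pearson chi-square test on a 2x2 table (without continuity correction), and applies a Bonferroni correction. All function names, field names, and numbers are hypothetical assumptions of mine. In practice you would probably use a statistics package instead (for example `scipy.stats.chi2_contingency`, which applies a continuity correction to 2x2 tables by default).

```python
import math

MIN_SECONDS_USED = 60  # exclusion threshold, fixed BEFORE the experiment


def apply_exclusion(records, min_seconds=MIN_SECONDS_USED):
    """Drop subjects who used the system for less than min_seconds."""
    return [r for r in records if r["seconds_used"] >= min_seconds]


def chi_square_2x2(success_a, total_a, success_b, total_b):
    """Pearson chi-square test for a 2x2 table; returns (statistic, p).

    The p value is the usual (two-tailed) one for 1 degree of freedom,
    where the chi-square survival function equals erfc(sqrt(x / 2)).
    """
    observed = [[success_a, total_a - success_a],
                [success_b, total_b - success_b]]
    row_totals = [total_a, total_b]
    col_totals = [success_a + success_b,
                  (total_a - success_a) + (total_b - success_b)]
    grand_total = total_a + total_b
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("test undefined when a row or column total is zero")
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic, math.erfc(math.sqrt(statistic / 2))


def bonferroni(p_values):
    """Bonferroni-adjusted p values: multiply by the number of tests, cap at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]


# Hypothetical data: one record per subject.
records = [
    {"group": "nlg", "seconds_used": 300, "success": True},
    {"group": "nlg", "seconds_used": 20, "success": False},  # excluded
    {"group": "nlg", "seconds_used": 240, "success": False},
    {"group": "control", "seconds_used": 180, "success": False},
    {"group": "control", "seconds_used": 90, "success": True},
]
kept = apply_exclusion(records)
print(len(records) - len(kept), "subject(s) excluded")

# Hypothetical counts for a larger study: successes / group size.
statistic, p = chi_square_2x2(30, 400, 18, 400)
print(f"chi-square = {statistic:.2f}, p = {p:.4f}")
print(bonferroni([p, 0.20]))  # if a second hypothesis was also tested
```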
If you have collected free-text comments from subjects, you should report on and summarise these. Such comments can give valuable insights as to why a system is or is not effective (statistical analyses tell us what happened, but they often struggle to tell us why things happened).