How to do an NLG Evaluation: Human Ratings in Artificial Context

The quickest, cheapest, and most common type of human NLG evaluation is to ask human subjects to rate NLG texts in an artificial context (ie, the subjects are not actually using the texts for a real-world purpose).  I give advice here on how to conduct such a study. I have elsewhere discussed other types of NLG evaluation, including asking subjects to rate NLG texts in real-world usage.

The basic structure of such an evaluation is to show human subjects NLG texts and control/baseline texts, and ask the subjects to rate the texts, most often on a Likert scale. The results are statistically analysed to determine whether the NLG texts were rated significantly higher than the control texts.  I give advice below on experimental design for such studies, including hypotheses, subjects, materials, procedures, and analysis.  None of this is rocket science, but I hope it will be useful to people who are not experienced at conducting such evaluations.

This material is adapted from a talk I gave at the 2015 NLG Summer School (slides).


Your hypotheses are up to you!  However, make sure you decide what they are *before* you do your experiment; best practice is to write them down in advance.  If you do want to explore alternative hypotheses after the experiment is done, these should be reported in your publications as post-hoc hypotheses which need to be tested in future work.  If you have multiple hypotheses, you should apply a multiple-hypothesis correction to your statistical analysis.  The simplest is Bonferroni, which means setting the statistical significance threshold to .05/N, where N is the number of hypotheses you are testing (eg, if you are testing 5 hypotheses, then the statistical significance threshold is .05/5 = .01).  The need to reduce the significance threshold means that you shouldn't willy-nilly list hundreds of hypotheses; you should focus on a small number of key hypotheses.
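As a minimal sketch of the Bonferroni arithmetic described above: the function names and the five p-values are made up purely for illustration, and the default of .05 is the uncorrected threshold used in the text.

```python
def bonferroni_threshold(num_hypotheses, alpha=0.05):
    """Return the per-hypothesis significance threshold alpha / N."""
    if num_hypotheses < 1:
        raise ValueError("need at least one hypothesis")
    return alpha / num_hypotheses


def significant_after_correction(p_values, alpha=0.05):
    """Flag which p-values fall below the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p < threshold for p in p_values]


# Hypothetical p-values for five hypotheses fixed before the experiment.
example_p_values = [0.004, 0.03, 0.2, 0.009, 0.012]
flags = significant_after_correction(example_p_values)
```

With five hypotheses the threshold is .01, so in this made-up example only the first and fourth results would count as significant, even though four of the five are below .05.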

Post-hoc tweaking of hypotheses is very tempting.  I’m sure we’ve all been in situations where our original hypothesis was not supported by experimental data, but a very plausible variation of the hypothesis is statistically significant.  So why not “tweak” the hypothesis into its plausible variation and report this?  However tempting, this is very bad practice, and indeed is one of the leading causes of “bad science” in medicine and elsewhere.  If you are in this situation, be honest and report your original hypotheses as not supported, but also say that post-hoc analysis suggests alternative hypotheses which should be investigated in future experiments.


In general, experimenters should use subjects who are representative of the user group of the NLG system (algorithm, etc) being developed.  For example, if you want to evaluate a system which generates marine weather forecasts, your subjects should be people who use such forecasts, such as sailors and workers on offshore oil platforms.  Ideally your subjects would be representative of the user population in terms of age, experience, job, and so forth, but this is often very difficult to arrange.  If your subject group is biased (for example, they are all undergraduates at your university), you should clearly report this in your publication.

The fastest and easiest way to recruit subjects is via Mechanical Turk or a similar mechanism.  However, recruiting subjects via Mechanical Turk is problematic if you need subjects with specialist skills or background (such as readers of marine weather forecasts), and/or you want to observe subjects during the experiment (or debrief them after the experiment).  In such cases you will need to recruit subjects directly, which is much more time-consuming.  I have seen some cases where experiments which were initially done through Mechanical Turk (because it was so easy to recruit subjects) had to be redone with directly-recruited subjects.

I am sometimes asked how many subjects are needed in an experiment.  In principle, this question can be answered by doing a statistical power calculation, but such calculations require an estimate of the magnitude of the effect, which we often do not have in NLP.  In my experience, 50 subjects is often a good number to aim for.
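For readers who do have an effect-size estimate, a power calculation can be sketched as below. This is only an illustration under strong assumptions: it uses the textbook normal approximation for a two-sided comparison of two independent groups, and the standardised effect size (Cohen's d) is something you have to supply yourself. It does not model a within-subject design. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def subjects_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate subjects per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2,
    where d is the assumed standardised effect size (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Assumed effect sizes; these are inputs you must estimate, not known values.
n_medium = subjects_per_group(0.5)
n_large = subjects_per_group(0.8)
```

The required number grows quickly as the assumed effect shrinks, which is why the calculation is only as good as the effect-size estimate that goes into it.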


Material in such studies consists of texts which are shown to and evaluated by subjects.  In most cases, we come up with a number of scenarios (input data sets), and for each scenario (data set) create an NLG text (output of our NLG system on this data set) and one or more baseline or human (topline) texts.

Scenarios can either be chosen randomly, or selected to cover a range of important phenomena.  For example, if we want to test a system which generates marine weather forecasts, we could randomly choose days from the previous year and base scenarios on these; or we could look for one very rainy day, one very windy day, etc, and base scenarios on these.  In the latter case, we would typically define criteria for each category (eg, what counts as a windy day), and then randomly choose from the days which meet these criteria.  There are arguments for both approaches.  If you want to know how well a system works on average, then choosing random scenarios is better.  However, if you want to check that a system does reasonably well in a range of circumstances (eg, “worst case” behaviour instead of “average case”), then scenarios which cover different phenomena make sense.
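The two selection strategies above can be sketched as follows. Everything concrete here is an assumption for illustration: the weather records, the wind and rain thresholds, and the function names are all invented.

```python
import random


def choose_random_scenarios(days, k, seed=0):
    """Pick k days uniformly at random (suits average-case evaluation)."""
    rng = random.Random(seed)
    return rng.sample(days, k)


def choose_stratified_scenarios(days, criteria, per_category=1, seed=0):
    """Pick days at random from among those meeting each category's criterion."""
    rng = random.Random(seed)
    chosen = {}
    for name, meets_criterion in criteria.items():
        candidates = [day for day in days if meets_criterion(day)]
        chosen[name] = rng.sample(candidates, min(per_category, len(candidates)))
    return chosen


# Hypothetical records: (date, wind speed in knots, rainfall in mm).
days = [
    ("2015-01-03", 35, 2),
    ("2015-02-11", 8, 22),
    ("2015-03-20", 12, 0),
    ("2015-04-02", 40, 30),
    ("2015-05-15", 5, 1),
]
criteria = {
    "windy": lambda day: day[1] >= 30,  # assumed threshold for a windy day
    "rainy": lambda day: day[2] >= 20,  # assumed threshold for a rainy day
}
random_days = choose_random_scenarios(days, 3)
stratified_days = choose_stratified_scenarios(days, criteria)
```

Fixing the random seed makes the choice of scenarios reproducible, which is useful when reporting how the materials were selected.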

For each scenario, we create our NLG text and one or more alternative versions.  A lot of experiments use two alternative versions: a baseline text produced by the current state-of-the-art system for this task, and a human text written by a domain expert from the data.  We can then tell both whether our system improves on current state-of-the-art, and how it compares to human-written texts.

How many scenarios should we have?  In an ideal world, we would probably have as many as possible, so that we can assess performance on as many different scenarios as possible; this means that we should choose the number of scenarios so that each text is evaluated once. For example, if we have 20 subjects, and each subject evaluates 6 texts, that means we will get 120 evaluations in all.  If we are producing three variants of each text (NLG, baseline, human), then ideally we would have 40 scenarios, since this would produce 120 texts (3 variants for each of 40 scenarios), so each text would be evaluated once.  In practice it's not always possible to do this, and we might instead expect each text to be read by several subjects; for example, with the above number of subjects and variants, if we expected each text to be evaluated 4 times, then we would need 10 scenarios instead of 40.
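The arithmetic in this example can be written as a small helper. This is just a sketch; the function name and parameter names are made up for illustration.

```python
def scenarios_needed(num_subjects, texts_per_subject, num_variants, ratings_per_text=1):
    """Scenarios required so that every text receives ratings_per_text ratings."""
    total_ratings = num_subjects * texts_per_subject
    ratings_per_scenario = num_variants * ratings_per_text
    if total_ratings % ratings_per_scenario != 0:
        raise ValueError("ratings do not divide evenly across scenarios")
    return total_ratings // ratings_per_scenario


# The worked example from the text: 20 subjects, 6 texts each, 3 variants.
one_rating_each = scenarios_needed(20, 6, 3)
four_ratings_each = scenarios_needed(20, 6, 3, ratings_per_text=4)
```

Raising an error when the numbers do not divide evenly is a design choice in this sketch; in a real study you might instead adjust the number of subjects or texts per subject.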


In a human ratings study, subjects are typically asked to rate texts on a Likert scale.  I personally think that a 5-pt Likert scale is usually sufficient, but some researchers prefer 7-pt scales.  Some people like to use magnitude estimation, where users move a slider on a near-continuous scale; however I am not convinced that this provides much benefit over a Likert scale, at least for NLG evaluations.  An alternative which I have used in some studies is to show subjects 2 or more variants of a scenario text (eg, NLG text, baseline text, human text), and ask the subjects to rank these variants in order of quality.

In my studies, I generally ask subjects to rate texts on three dimensions: readability, accuracy, and usefulness.  I don't always use this terminology, though; eg, I might ask subjects to rate the “helpfulness” rather than the “usefulness” of a decision-support system.  These dimensions are not independent, and in particular a text is not generally useful unless it is both readable and accurate.  But overall I think asking for usefulness as well as readability and accuracy gives useful insights, and subjects generally seem happy to rate texts on these three dimensions.

I am a great believer in Latin Square experimental design for evaluations of NLG systems, as this reduces the “bias” or “noise” due to the fact that some scenarios (data sets) are harder to describe than others, and some subjects are more generous in their ratings than others.  If you are not familiar with Latin squares, there are many excellent resources on the web which explain this design.  Another useful way of reducing noise and bias is to start the session with a few practice questions, which don't count towards the evaluation results but serve to get subjects “warmed up” and familiar with the task.
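For readers who have not met the design before, here is a minimal sketch of one way to build a Latin square and use it to decide which variant of each scenario a subject sees. It assumes a simple cyclic square and hypothetical subject and scenario labels; real studies often also randomise the order of rows and columns, which is not shown here.

```python
def latin_square(n):
    """Cyclic n x n Latin square: each value 0..n-1 occurs once per row and column."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def assign_variants(subjects, scenarios, variants):
    """Map each (subject, scenario) pair to the variant that subject will rate."""
    n = len(variants)
    square = latin_square(n)
    plan = {}
    for s_idx, subject in enumerate(subjects):
        for c_idx, scenario in enumerate(scenarios):
            plan[(subject, scenario)] = variants[square[s_idx % n][c_idx % n]]
    return plan


# Hypothetical labels for three subjects, three scenarios, three text variants.
subjects = ["s1", "s2", "s3"]
scenarios = ["A", "B", "C"]
variants = ["NLG", "baseline", "human"]
plan = assign_variants(subjects, scenarios, variants)
```

In this plan every subject rates each variant once and every scenario is rated in each variant once, so differences between subjects and between scenarios are balanced across the text types.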

Finally, from an ethical perspective it is of course essential to get approval from an ethics committee if this is required in your country for experiments with human subjects.  Usually NLG evaluations are pretty straightforward from an ethical perspective.  But do keep in mind that subjects can drop out at any point.  If a subject wishes to drop out, it is ethically unacceptable to pressure him or her to stay in the experiment!


There are of course numerous books, webpages, and other resources on statistical analysis!   My 2009 CL paper expands on much of what I say below.

I like to use a General Linear Model to analyse results of ratings studies (I quite like the SPSS GLM facility).  The dependent variable is the Likert score, and the fixed factors are subject, scenario, and text type (eg, NLG or baseline).  I use a Tukey HSD test to identify significant differences in individual factors (eg, NLG vs baseline).  Strictly speaking, GLM should not be used on Likert data because Likert ratings are ordinal; if this is a concern, you should use a Wilcoxon Signed Rank test (see page 545 of my 2009 CL paper for details).  This issue is discussed in the Wikipedia page on Likert scales.
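To make the ordinal alternative concrete, here is a rough standard-library sketch of a two-sided Wilcoxon signed-rank test on paired ratings (eg, the NLG and baseline ratings of the same scenarios). It is an illustration under assumptions, not the procedure from the paper: it drops zero differences, gives tied differences average ranks, and uses the normal approximation without a continuity correction, which is unreliable for very small samples. The ratings are invented. In practice a statistics package is the safer choice.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus, two-sided p) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # average of 1-based ranks in the tie
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        size = end - start + 1
        tie_term += size ** 3 - size
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(variance)
    p_value = min(1.0, 2 * NormalDist().cdf(z))
    return w_plus, w_minus, p_value


# Hypothetical 5-point ratings of the same six scenarios.
nlg_ratings = [4, 5, 3, 4, 5, 4]
baseline_ratings = [3, 3, 3, 2, 4, 5]
w_plus, w_minus, p_value = wilcoxon_signed_rank(nlg_ratings, baseline_ratings)
```

Because the test works on ranks of the paired differences, it does not assume that the gap between, say, 3 and 4 on the scale is the same size as the gap between 4 and 5.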

I recommend against just performing an ANOVA on the ratings with text type as the sole independent variable.  This is very susceptible to bias/noise due to differences between subjects and data sets.

As mentioned in the Hypotheses section above, you should perform a Bonferroni or other multiple hypothesis correction if you are testing more than one hypothesis.

ALWAYS report two-tailed p-values.  If I read a paper which reports a one-tailed p-value, I throw the paper in the trash and am disinclined to read future papers from the authors, unless there is an excellent justification presented for using a one-tailed p-value (which is rarely the case).

Finally, you may want to exclude outliers, especially if you are using Mechanical Turk and suspect that some of your subjects are not taking the experiment seriously.  This is fine, provided that you decide on exclusion criteria *before* you do the experiment.  Changing exclusion criteria after you have gathered the data invalidates your analysis (it essentially is post-hoc tweaking of hypotheses).  A common exclusion criterion is to drop any subject whose performance is more than two standard deviations from the mean for all subjects.
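A pre-registered two-standard-deviation rule of this kind might look like the sketch below. The performance measure (here a mean rating per subject), the subject labels, and the scores are all invented for illustration; what counts as “performance” is something you need to define for your own study before collecting data.

```python
from statistics import mean, stdev


def excluded_subjects(score_by_subject, num_sd=2.0):
    """Return subjects whose score is more than num_sd sample SDs from the mean."""
    values = list(score_by_subject.values())
    centre = mean(values)
    spread = stdev(values)
    return sorted(
        subject
        for subject, score in score_by_subject.items()
        if abs(score - centre) > num_sd * spread
    )


# Hypothetical per-subject mean ratings; s10 rated everything at the bottom.
score_by_subject = {
    "s1": 3.8, "s2": 4.0, "s3": 4.1, "s4": 3.9, "s5": 4.2,
    "s6": 4.0, "s7": 3.9, "s8": 4.1, "s9": 4.0, "s10": 1.0,
}
to_exclude = excluded_subjects(score_by_subject)
```

Note that the outlier itself inflates the standard deviation, so with very few subjects a rule like this may fail to exclude anyone.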

