Varying Words In NLG Texts

I’ve written a lot of academic blogs recently, so I thought I’d change focus and write something more practical (aimed at people building systems), on varying words in NLG texts.

It is a truism of writing that authors should vary the language they use, in order to keep writing interesting, and many NLG systems try to do likewise. Most commercial NLG toolkits (such as Arria’s NLG Studio) include support for “variation”, “alternatives”, etc. In this blog, I want to look at varying the words used in a text; I assume the goal is to change expression but not core meaning. Of course there are many other kinds of variation and paraphrase; Bhagat and Hovy 2013 give a good summary.

Synonyms

The obvious way to vary words without changing meaning is to replace them with synonyms.  For example, introduce “The stock market will go down” as a possible variation of “The stock market will fall“.

Currently people who build NLG systems do this sort of thing explicitly, ie they create a structure or rule which is analogous to “The stock market will {fall, go down}” (or some equivalent at the level of syntactic structure or concept lexicalisation rules). I am sometimes asked if the process of creating alternatives such as {fall, go down} can be automated, but this is not easy. We can certainly get synonyms from an online thesaurus or similar resource, but usually the synonym list will include a lot of inappropriate words, and miss some appropriate ones.
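To make the “{fall, go down}” idea concrete, here is a minimal sketch of how such a structure might be expanded. The brace syntax and the function names (realise, all_variants) are my own illustration, not the notation or API of any particular NLG toolkit.

```python
import random
import re

# Matches one slot of alternatives, such as "{fall, go down}".
SLOT = re.compile(r"\{([^{}]*)\}")


def _options(match):
    """Split the text inside a slot into its comma-separated alternatives."""
    return [opt.strip() for opt in match.group(1).split(",")]


def realise(template, rng=random):
    """Replace each slot with one alternative chosen at random."""
    return SLOT.sub(lambda m: rng.choice(_options(m)), template)


def all_variants(template):
    """List every text the template can produce, in slot order."""
    match = SLOT.search(template)
    if match is None:
        return [template]
    variants = []
    for opt in _options(match):
        expanded = template[:match.start()] + opt + template[match.end():]
        variants.extend(all_variants(expanded))
    return variants


print(all_variants("The stock market will {fall, go down}"))
```

Listing every variant, rather than only sampling one, makes it easier for a developer to read through all the texts a rule can generate.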

For example, if I look up “fall” (as a verb), I see one synonym is “is overthrown“.  But this is for a different sense of “fall“, as in “the government will fall“; “The stock market will be overthrown” is not an acceptable variation of “The stock market will fall“.   Even if I can identify a specific word sense and look up “fall” in a resource that is sensitive to word sense, such as Wordnet, I will see synonyms such as “descend” which are very rarely used in a stock market context.   “stock market will descend” sounds very strange, and indeed Google n-grams does not find any usages of this phrase in the Google Book corpus.

Synonym lists from thesauri will also miss some acceptable alternatives. It is probably acceptable to say “The stock market will drop” as a variant of “The stock market will fall“, but Wordnet does not list “drop” as a synonym of “fall“.

We can of course try to be more intelligent in proposing appropriate synonyms in context, for example by using resources such as Google n-grams. But I’ve not yet seen an automatic variation system which is reliable enough to be used in real NLG systems, although such systems can be useful in suggesting potential variations to human NLG developers.

Manually Selecting Synonyms

So NLG toolkits generally expect the developer to decide which synonyms (or alternatives) are acceptable in the specific context, perhaps with help from a tool for suggesting variations (as above).   But this is a harder task than it might seem.

One problem is that people use language in idiosyncratic ways (eg, see Reiter and Sripada 2002), which means that a developer may suggest alternatives which are not acceptable to users, especially if the developer is not familiar with the domain. My recommendation is that developers only use alternatives which they observe in corpus texts. If there is no corpus, developers can use Google search (or Google n-grams) to see if a particular wording occurs “in the wild”.
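The “only use what you observe” recommendation can be sketched as a simple filter. Everything here is an assumption for illustration: the four corpus sentences are invented, the function name attested_alternatives is hypothetical, and matching is naive substring counting rather than proper tokenisation. A real project would load its own domain corpus.

```python
# Invented toy sentences standing in for a real domain corpus.
CORPUS = [
    "Analysts expect the stock market will fall next week.",
    "Many fear the stock market will go down again.",
    "The stock market will fall if rates rise.",
    "Some believe the stock market will drop sharply.",
]


def attested_alternatives(frame, candidates, corpus, min_count=1):
    """Keep candidates whose filled-in frame occurs at least min_count times.

    frame is a phrase with one "{}" placeholder, eg "stock market will {}".
    Matching is case-insensitive substring counting (a deliberate simplification).
    """
    kept = []
    for candidate in candidates:
        phrase = frame.format(candidate).lower()
        count = sum(text.lower().count(phrase) for text in corpus)
        if count >= min_count:
            kept.append(candidate)
    return kept


candidates = ["fall", "go down", "drop", "descend", "be overthrown"]
print(attested_alternatives("stock market will {}", candidates, CORPUS))
```

Raising min_count is a crude way of insisting on wordings that are common in the corpus, not merely attested once.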

Another problem is that when there are several places in which words can be varied, the choices may depend on each other. For example, “go down” is usually an acceptable variation of “fall” in a stock market context. However, “The stock market will rise and then fall” or “The stock market will go up and then go down” are probably better than “The stock market will rise and then go down“. This is because in this kind of contrastive sentence, it’s best to use contrastive word pairs such as rise/fall or go up/go down. So if we just write a structure such as “The stock market will {rise, go up} and then {fall, go down}”, we’ll get sub-optimal results. The second choice ({fall, go down}) is influenced by the first choice ({rise, go up}), and the developer needs to both realise this and also tell the NLG system that the choices are linked.
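One simple way to link the two choices is to pick a whole contrastive pair at once instead of choosing each word independently. This is only a sketch under that assumption; the names CONTRAST_PAIRS, rise_then_fall and all_linked_variants are hypothetical, and real toolkits may express linked choices quite differently.

```python
import random

# Each tuple is a contrastive pair whose members must be chosen together.
CONTRAST_PAIRS = [("rise", "fall"), ("go up", "go down")]

TEMPLATE = "The stock market will {up} and then {down}"


def rise_then_fall(rng=random):
    """Choose one pair, so the second verb always matches the first."""
    up, down = rng.choice(CONTRAST_PAIRS)
    return TEMPLATE.format(up=up, down=down)


def all_linked_variants():
    """List every sentence the linked rule can produce."""
    return [TEMPLATE.format(up=up, down=down) for up, down in CONTRAST_PAIRS]


print(all_linked_variants())
```

Because the pair is the unit of choice, the mixed wording “rise and then go down” can never be generated.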

A good quality-assurance process, with independent testers assessing text quality, is really useful in identifying such problems, provided that the testers are able to see the alternatives and variations. This depends on how testing is done, and on how much support the NLG toolkit provides for quality assurance in general and testing variation in particular.

Vary Non-Content Words?

Many years ago, we looked at how human meteorologists modified (post-edited) NLG-generated weather forecasts before the forecasts were released to customers (Sripada et al 2005).  One thing we observed is that a lot of the “word variation” edits were to non-content words such as connectives.  For example, a forecaster might change “then” to “and”; eg, change “temperature rising in the morning, then falling at lunchtime, then rising in the afternoon” to “temperature rising in the morning, then falling at lunchtime, and rising in the afternoon“.    In general, varying connectives and other “non-content” words is probably safer than varying content words such as nouns and verbs; there is less risk of confusing the reader because of the above issues.
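As an illustration of varying a connective rather than a content word, here is a small sketch that reproduces the edited forecast above. The rule it encodes (use “and” before the last of three or more events, “then” elsewhere) is my own assumption for the example, not a rule reported in the study, and join_events is a hypothetical helper.

```python
def join_events(subject, events):
    """Link events with "then", switching to "and" before the last of three or more."""
    if not events:
        return subject
    parts = [events[0]]
    for i, event in enumerate(events[1:], start=1):
        is_last = i == len(events) - 1
        connective = "and" if is_last and len(events) > 2 else "then"
        parts.append(f"{connective} {event}")
    return f"{subject} " + ", ".join(parts)


print(join_events(
    "temperature",
    ["rising in the morning", "falling at lunchtime", "rising in the afternoon"],
))
```

The content words (“rising”, “falling”) are left untouched; only the connective changes.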

Most of the time (at least in my experience), NLG developers focus on looking for alternatives to content words, but the above suggests that at least in some cases it is easier to vary non-content words.
