A lot of current research in NLG is based on gathering a corpus, applying a learning algorithm to the corpus, and evaluating the quality of the result. I have written several blog entries about evaluating NLG systems, which is not always done as well as it could be. But it's also very important to make sure that the corpus is appropriate for the NLG task or hypothesis you are focusing on.
This point was brought home to me recently when I looked at the Weathergov corpus and data set (Liang et al 2009), a collection of point weather forecasts and corresponding data extracted from the weather.gov website. The problem is that these weather.gov forecasts were most likely produced by a computer system, what we would call a rule-based NLG system (see Background section below). In other words, any paper which is based on the Weathergov corpus is probably reverse-engineering a rule-based NLG system, not learning to imitate the behaviour of human forecasters. And quite a few papers (by many different authors) have used Weathergov, including papers published in “high-prestige” venues such as ACL, NAACL, and EMNLP (most of these papers looked at other data sets as well).
I don’t for a moment suggest that the authors of these papers realised that the Weathergov forecasts were probably computer-generated! But still, most of these papers claimed to present techniques for building NLG systems which did not require manually writing domain-specific rules. But it looks like what these authors have actually shown (at least in the weather domain) is that they can reverse-engineer hand-crafted rules which someone else has built, which is a very different claim.
I think there is an important lesson here, which is that we need to ensure that corpora used for training (and evaluation) are fit for purpose. This means that we need to understand how the corpora were created and also how they are normally used. People building rule-based NLG systems usually gain such an understanding automatically, because they work with domain subject matter experts when writing the rules. However, people using ML approaches may need to make an explicit effort to understand the provenance of their corpora.
To put this another way, there is huge potential in using ML to solve NLG problems. But people using ML need to understand the problem they are tackling and the resources (eg, corpora) they are using, as well of course as properly evaluating the result.
Background: Weather Forecasts and NLG
There are of course many types of weather forecasts, for different forecast clients such as farmers, sailors, aircraft pilots, hill climbers, road-gritting engineers, etc. From an NLG perspective, a key distinction is between point and area forecasts. A point forecast describes how the weather (wind speed, temperature, precipitation, etc) will change over time at one geographic point, while an area forecast describes how the weather will vary over a spatial region as well as time.
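In data terms, the distinction can be pictured as follows. This is a purely hypothetical sketch (these are not the actual weather.gov data formats): a point forecast is generated from values indexed by time alone, while an area forecast generator must also handle values indexed by sub-region.

```python
# Hypothetical input shapes for the two forecast types
# (illustrative only; not the real weather.gov formats).

# Point forecast input: one location, values indexed by time.
point_input = {
    "08:00": {"temp_f": 34, "wind_mph": 5, "cloud_cover": 0.1},
    "14:00": {"temp_f": 64, "wind_mph": 7, "cloud_cover": 0.1},
}

# Area forecast input: values indexed by (sub-region, time), so the
# generator must describe spatial as well as temporal variation.
area_input = {
    ("valleys in the Potomac Highlands", "overnight"): {"low_f": 29},
    ("most other locations", "overnight"): {"low_f": 35},
    ("urban centers", "overnight"): {"low_f": 42},
}
```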
For example, my daughter is currently visiting the US and is in Rockville, Maryland. If we look at weather.gov, the point forecast for this location today (ie, the sort of thing which is in the Weathergov corpus) is
Patchy frost before 8am. Otherwise, sunny, with a high near 64. West wind around 7 mph.
Weather.gov also presents a number of area forecasts for the Washington area, for different users (aviation, marine, etc). Below is what I think is the most generic area forecast for the Washington region
Weak high pressure will build overhead tonight. This combined with the dry air mass will provide for great radiational cooling conditions. Expecting lows to range from the upper 20s and lower 30s in colder valleys in the Potomac Highlands…to the 30s for most other locations…to the 40s in the urban centers and along the immediate shorelines of the Potomac/Chesapeake. Therefore, Frost and Freeze headlines remain in effect.
Comparing these two, you can see that the point forecast just describes how the weather changes over time (eg, “before 8am”), while the area/regional forecast also describes how weather varies spatially (eg, “valleys in the Potomac Highlands”, “along the immediate shorelines of the Potomac/Chesapeake”). Sometimes area forecasts also describe spatiotemporal change, eg “rain moving in from the east overnight”. The above area forecast also integrates domain knowledge about the causes of radiational cooling.
From an NLG perspective, point forecasts are relatively easy to generate, and indeed the first commercial generator of point forecasts, FoG (Goldberg et al 1994), was deployed in the early 1990s. Since then many other rule-based weather forecast generators have been built. The quality of the forecasts generated by these systems is high; indeed we showed in 2005 that forecast readers preferred point forecasts produced by the SumTime system to forecasts written by people (Reiter et al 2005). Just to be clear, the comparison was against forecasts written by forecasters with a variety of skill levels working under a lot of time pressure, not against forecasts written by top professionals with unlimited time. Anyway, most large government weather agencies (and many commercial meteorological companies) use “autotext” technology to generate point forecasts, although often human forecasters can post-edit the computer forecasts if they wish (Sripada et al 2005). Some autotext systems are based on simple fill-in-the-blank templates, but the more sophisticated ones use what are effectively rule-based NLG techniques.
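To make the template-vs-rules distinction concrete, here is a minimal hypothetical sketch (not based on FoG, SumTime, or any real autotext system). The template version just drops slot values into a fixed string, while the rule-based version makes content-selection, lexical-choice, and aggregation decisions before realising the text.

```python
# Hypothetical sketch contrasting fill-in-the-blank templates with
# simple rule-based NLG; not taken from any real autotext system.

def template_forecast(data):
    """Fill-in-the-blank: slot values dropped into a fixed string."""
    return (f"{data['sky']}, with a high near {data['high']}. "
            f"{data['wind_dir']} wind around {data['wind_speed']} mph.")

def rule_based_forecast(data):
    """Rule-based: content and wording chosen by domain rules."""
    parts = []
    # Content-selection rule: only mention frost if frost is forecast.
    if data.get("frost_until"):
        parts.append(f"Patchy frost before {data['frost_until']}. Otherwise,")
    # Lexical-choice rule: map numeric cloud cover to a sky-condition word.
    if data["cloud_cover"] < 0.2:
        sky = "sunny"
    elif data["cloud_cover"] < 0.6:
        sky = "partly cloudy"
    else:
        sky = "cloudy"
    parts.append(f"{sky}, with a high near {data['high']}.")
    # Aggregation rule: omit the wind sentence in near-calm conditions.
    if data["wind_speed"] >= 5:
        parts.append(f"{data['wind_dir']} wind around {data['wind_speed']} mph.")
    return " ".join(parts)

data = {"sky": "Sunny", "frost_until": "8am", "cloud_cover": 0.1,
        "high": 64, "wind_dir": "West", "wind_speed": 7}
print(rule_based_forecast(data))
# Patchy frost before 8am. Otherwise, sunny, with a high near 64. West wind around 7 mph.
```

Note that the template version cannot decide to drop the frost or wind sentences, or choose between "sunny" and "cloudy"; those choices have to be made upstream, which is exactly the kind of work the rules do.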
I am not very familiar with the specific details of how the US National Weather Service (ie, weather.gov) uses computer-generated forecasts, but certainly the US NWS has been active in this area for a long time (Ruth 2000). My guess would be that the forecasts in the Weathergov data set (which I believe were collected about ten years ago) are mostly computer-generated but may also have a human contribution (eg, via post-editing). Of course I could be wrong, and will happily defer to anyone who is more knowledgeable about this than I am. I did contact Prof Liang, who collected the Weathergov corpus, and he agrees that in retrospect it is likely that the texts in this corpus were produced by a rule-based NLG system.
Incidentally, if anyone wants to work with a corpus of manually-written point forecasts, you can download the SumTime corpus from this blog. I guarantee these forecasts were manually created, although in many cases by copy-and-edit from a similar forecast or field rather than from scratch. Indeed we even know which meteorologist wrote each forecast; an anonymised version of this information is included in the corpus.
In any case, the above only applies to point forecasts. While there has been some academic work on using NLG to generate area forecasts (eg, Turner et al 2008 and Oliveira et al 2016), I don't believe this technology is yet routinely deployed and used by government and commercial weather agencies.
In short, using NLG to generate point forecasts is a relatively mature and widely used technology. Of course the technology can be improved, but it works and can be used to produce quite high quality forecasts. Generating area forecasts, however, is considerably more challenging, and technological advances are probably needed before this can be done routinely by meteorological agencies. Going back to ML techniques, I personally would be far more interested in using ML to solve the tough problems involved in generating area forecasts such as the one shown above; I am much less excited about using ML to refine a technology (using NLG for point forecasts) which already works pretty well.