I split my time between the University of Aberdeen and Arria NLG, so I have “one foot” in academic NLG and the “other foot” in commercial NLG. I was recently asked about these different perspectives, so I thought I’d write something here. As always, I stick to published information about Arria; I cannot reveal commercially confidential information here.
One of the most striking differences between academic and commercial NLG is in use cases. If you look at Arria’s key use cases, you’ll see there is little overlap with the academic data sets included in GEM. For example, Arria does not list any use cases related to WebNLG or ToTTo, and none of the GEM datasets are about business intelligence or financial reporting. There is a GEM dataset related to weather forecasts, but although Arria has worked on this in the past, weather forecasting is not currently listed as one of Arria’s key use cases.
The big exception is sports reporting, ie generating a narrative about a sports game from sports data. There are several sports data sets in GEM (including baseball, basketball, and hockey), and sports reporting is also an important use case for Arria. Indeed, academics have been working on sports reporting for decades (eg Robin and McKeown 1996) and commercial work dates back at least to 2015.
So sports reporting is an area where academic researchers and commercial companies are (at least at a high level) trying to do something similar. But there are still a **lot** of differences, some of which I explore below. I focus on requirements instead of technology, since it’s difficult for me to say much about Arria’s technology without revealing confidential information.
Use case details
Sports reporting of course is a broad area, and contains numerous use cases; the Arria sports page lists performance analysis (eg for coaches and talent scouts), sports betting, and media as top use cases. These different use cases require different systems (and are sold to different customers), although of course they share underlying technology.
In contrast, academic papers often show little interest in, or knowledge of, how the texts will be used; they just focus on replicating a corpus. For example, one of the most popular sports corpora for academic NLG is Rotowire and its descendants such as SportSett. The Rotowire corpus comes from https://www.rotowire.com/basketball/game-recaps.php, but it is not clear what these texts are intended for, nor have I ever seen this discussed in an academic paper; my hunch is that they are post-game reports for people who place bets, but other researchers believe that they are supposed to help users create fantasy sports teams. Anyways, from a commercial perspective, it’s bizarre (to put it mildly) to build an NLG system without understanding who is going to use the texts, and for what purpose!
The Arria sports NLG team is led by a former professional basketball player, Mustafa Abdul-Hamid, and includes many other people with deep knowledge of basketball and other sports. This is very important for understanding client and industry needs as well as for building good NLG systems.
Academic researchers in this area, in contrast, don’t seem to put much value on domain expertise. I’ve read a number of academic papers on sports NLG, and it’s rare for recent papers to say meaningful things about domain expertise; instead the problem is usually treated as an ML task where all that is needed is a corpus of input data and output texts. This seems bizarre to me, since at minimum a basic level of domain expertise is needed to evaluate the accuracy of generated texts (Thomson and Reiter 2020); ie someone who doesn’t understand what “double-double” means is not going to be able to properly evaluate the quality of a text about a basketball game!
Commercial NLG systems must be robust. They must produce decent-quality texts in all circumstances; it is not acceptable to occasionally produce garbage texts (unless texts are post-edited before being released, but even then garbage texts are definitely undesirable).
Academic NLG researchers, in contrast, mostly don’t seem to care if their systems occasionally produce garbage texts, as long as texts are good “on average” (since academic evaluations focus on the average case). In our 2020 paper, Craig and I give the following example of a nonsensical output from a published neural NLG system:
Markieff Morris also had a nice game off the bench, as he scored 20 points and swatted away late in the fourth quarter to give the Suns a commanding Game 1 loss to give the Suns a 118-0 record in the
Eastern Conference’s first playoff series with at least the Eastern Conference win in Game 5.
The author of this system never mentioned in papers that the system could produce such terrible narratives, and indeed may not have realised this (some authors just look at metrics and never read texts). Producing texts such as the one above, even once in a while, is not acceptable in commercial NLG.
Academic NLG systems are invariably based on freely available public sports data sets. This makes life much easier for researchers, and also enhances replicability. Commercial NLG systems, in contrast, can use proprietary and paid-for data sets which are much richer.
There are good reasons for academics to use public data sets, and the sports data sets used by academic researchers are considerably more complex than most of the other data sets in GEM! The point I’m making here is that these data sets are still simpler than those used in commercial NLG systems, which means that in a sense academics are solving a different problem than commercial developers.
Since sports NLG is of interest to both academic researchers and commercial NLG providers, it’s a good venue for exploring differences in approaches. I have focused on “requirements” rather than “technology” here, but we can see that academic researchers and commercial providers are in many ways working on different problems. Commercial providers target specific use cases, place a lot of importance on domain knowledge, care deeply about robustness, and use rich data sources. Academics, in contrast, seem largely uninterested in use cases, domain knowledge, and robustness, and stick to free data sets. There are many other differences I could list; eg commercial providers care about configurability and performance (run-time speed), while most academics ignore these issues.
Of course there are exceptions! For example, in the Babytalk research project years ago (which was about medical NLG, not sports NLG), we focused on specific use cases, worked closely with domain experts, tried to make our systems robust, and used complex non-public data.
Anyways, I’m not expecting anyone to change the way they work, but I hope the above discussion helps academics understand how their work differs from that of commercial providers, and vice-versa.