A few weeks ago a colleague who works on neural NLG asked me what was known about the amount of time required to develop neural NLG systems versus rule/template NLG systems. I’m glad he actually asked me, because I am sick of seeing neural NLG papers which claim in their introduction that building NLG systems is much faster with “modern” (ie, deep learning) techniques, without offering a shred of evidence to support this claim.
But anyway, my colleague decided to actually investigate this issue rather than make unjustified claims in his paper. So I told him that:
- We have no solid evidence or data about this issue, so no claims can be properly justified.
- The weak evidence that we have suggests that building rule/template NLG systems is no slower and may indeed be *faster* than building neural NLG, at least for data-to-text systems.
It is very hard to experimentally test hypotheses about the impact of different technologies or methodologies on the effort required to develop a software system. We can see this by looking at some of the (very few) papers which actually try to give data about the development effort required to build rule-based NLG and machine-learning NLG systems with similar functionality.
Belz 2008
This paper presented an ML technique for NLG microplanning and realisation, which was applied to weather-forecast generation. Belz compared her system to the rule-based SumTime system, which performed the same task. There was no statistically significant difference in human evaluations of the output quality of Belz’s system and SumTime’s microplanner and realiser. She also tried to compare development effort (Section 4.4.7 of the paper), which is why I mention the paper here. It took Belz one person-month to build a forecast generator for wind statements using her approach, and she estimated that it would take another two months to add the other forecast fields, giving a total of three person-months. She asked me how much time we spent on microplanning and realisation in SumTime, and I told her 12 person-months.
So it seems that the ML system was much faster to develop than the rule-based system (3 person-months vs 12 person-months). But the comparison is flawed, because:
- The SumTime estimate included all of the time spent in the project on microplanning and realisation, including researching word choice, developing tools, wasting time on a dead-end approach, and writing papers. My understanding is that Belz’s estimate did not include any of these activities, only the software development of the final system.
- Belz’s system was a research prototype which was tested on a dataset, whereas SumTime was operationally deployed and used. Hence we had to spend a lot of time in SumTime on robustness (handling edge cases, software testing, fixing bugs, sensible error handling and reporting, etc), which was not needed for Belz’s research prototype.
My best guess is that if we had looked only at software development and aimed at “research prototype” robustness levels, it would have taken 1-2 person-months to develop the SumTime microplanner and realiser. Ie, faster than Belz’s estimate (3 person-months) for building her ML system.
Puzikov and Gurevych 2018
Puzikov and Gurevych participated in the E2E challenge and built two systems: one using neural NLG and one using templates. They submitted the template system to the shared task, and it did very well, coming in second in the human evaluation of quality.
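To give a feel for what a template approach looks like, below is a minimal sketch of a template-based generator for E2E-style meaning representations. The attribute names follow the E2E dataset, but the code is purely illustrative; it is not Puzikov and Gurevych’s actual system.

```python
# Minimal illustrative template generator for E2E-style meaning
# representations (restaurant descriptions). This is NOT Puzikov and
# Gurevych's actual code, just a sketch of the general approach.

def article(noun_phrase: str) -> str:
    """Choose 'a' or 'an' -- even toy templates need rules like this."""
    return "an" if noun_phrase[:1].lower() in "aeiou" else "a"

def realise(mr: dict) -> str:
    """Fill a fixed sentence template from a meaning representation."""
    descr = " ".join(filter(None, [mr.get("food"), mr.get("eatType", "restaurant")]))
    text = f"{mr['name']} is {article(descr)} {descr} in the {mr['area']} area."
    if "priceRange" in mr:
        text += f" Its prices are {mr['priceRange']}."
    return text

print(realise({"name": "The Punter", "food": "Italian", "eatType": "pub",
               "area": "riverside", "priceRange": "moderate"}))
# The Punter is an Italian pub in the riverside area. Its prices are moderate.
```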
In terms of development effort, Puzikov and Gurevych say that it took a “few hours” to develop the template system. They don’t give an effort estimate in the paper for developing the neural NLG system, but when they presented this paper at INLG 2018, they told me that it took a few weeks. Ie, developing the template system was at least an order of magnitude faster than developing the neural NLG system. But again, there are some important confounding factors here:
- Puzikov and Gurevych developed the neural NLG system first, and spent a lot of time understanding the data, domain, and what constituted a good output text. This effort is not included in the time spent developing the template system, since it was developed second.
- Puzikov and Gurevych do not include the time (several person-months) required to build the corpus used to train the neural model.
I won’t attempt to resolve the above into a “fair” comparison, since I don’t have any first-hand knowledge of this work.
Software engineering perspective
Software engineers have been trying for decades to understand what factors influence the cost of creating software. One old model which still provides useful insights is COCOMO (its basic effort equation is sketched after the list below). COCOMO identifies several factors which have a big influence on development cost, including:
- Robustness and reliability (mentioned above in discussion of Belz 2008)
- Skill and expertise of developers (domain expertise was mentioned above in discussion of Puzikov and Gurevych 2018)
- Software methodology and tools
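For concreteness, here is the basic COCOMO effort equation in a few lines of Python. The coefficients are Boehm’s published values for “organic” projects; the cost-driver multipliers in the example are illustrative numbers, not values calibrated for NLG.

```python
# Basic/intermediate COCOMO sketch: effort = a * KLOC^b * EAF, where EAF
# is the product of cost-driver multipliers (reliability, developer
# capability, tools, ...). a=2.4 and b=1.05 are Boehm's organic-mode
# coefficients; the multipliers below are illustrative only.

def cocomo_effort(kloc: float, eaf: float = 1.0,
                  a: float = 2.4, b: float = 1.05) -> float:
    """Estimated development effort in person-months."""
    return a * (kloc ** b) * eaf

# Example: a 10 KLOC system; a very capable team (multiplier 0.7) is
# partly offset by high reliability requirements (multiplier 1.4).
print(round(cocomo_effort(10.0), 1))                 # nominal estimate
print(round(cocomo_effort(10.0, eaf=0.7 * 1.4), 1))  # adjusted estimate
```

Even this crude model makes the point: robustness requirements and developer capability can swing an effort estimate by large factors, independently of the NLG technology used.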
So if we want to do a fair comparison, then in addition to all of the factors mentioned above, we also need to control for differences in developer skill (a really good developer can be 10x more productive than a poor one) and in software methodology and tools.
Summary
In short, it is very hard to do a rigorous and fair comparison of the development effort required to build an NLG system with different technologies (ie, rule-based vs neural). We would need to ensure that:
- Systems produce texts of similar quality.
- Systems are similar from a robustness perspective.
- The effort measured covers just creating the software artefact, not doing research.
- Developers have similar skills, expertise, and domain knowledge.
- Similar software engineering methodologies and tools are used.
We also need to decide whether to include corpus-creation time in the development effort for machine-learning systems.
It might be possible to run such an experiment, but I’m not aware of anyone who has actually done so. If anyone has, please let me know!
So going back to my colleague’s question about whether developing ML NLG is faster than developing rule-based NLG, the careful scientific answer is that (as of May 2020) we don’t know, because no one has done the necessary experiments to answer this question.
Coda: What do I think?
I will “go out on a limb” and hazard a personal opinion: the (weak) data we have so far suggests that building rule-based NLG systems is no slower (and may be faster) than building neural NLG systems, at least for data-to-text applications, even if we ignore corpus-creation costs. This is what Puzikov and Gurevych’s work suggests; there were confounding factors, but the difference they found was so large that there is probably some truth here. Belz’s work, if we adjust for the confounds described above, suggests that building rule-based NLG systems is either faster than or comparable to building ML NLG systems.
The above ignores corpus-creation costs. If we include such costs in the effort of building an ML NLG system, then I suspect that building a neural NLG system takes **more** development time than building a rule-based NLG system.
Please note that this discussion applies to building complete systems. If we look at individual NLG tasks, then I think it is likely that some tasks (such as lexical choice?) can be done faster and better using ML technology. But again, this is a personal opinion and guess; it’s not based on data and evidence.
Nice post again! One small note: the post suggests that corpus creation is not necessary for building rule-based NLG systems. I would disagree with that, since it’s often very insightful to see how humans carry out a particular task. And if you want good coverage, then a sizeable corpus is still useful (though not as essential) for rule-based NLG as well.
I suspect you’d agree with this. So then my question is: what would be a good corpus size for the development of rule-based NLG? And what do you think about hybrid methods, where systems automatically learn to generate templates from a corpus? Of course these templates still have to be checked manually, but given a ‘general-purpose’ template generator, this could save a lot of time. At the same time, such methods also require a large corpus.
Good question. As always with corpora, more texts are always nice! I guess my view on corpus size for developing rule-based NLG systems is:
* essential to have at least one example text
* highly desirable to have at least ten example texts
* very useful to have 100 example texts, especially if there is good coverage of edge cases
More than 100 example texts are useful if they provide better coverage of edge cases or if I am using ML techniques to build part of the system.
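(For readers curious about the hybrid template-learning methods mentioned above, here is a toy sketch of the delexicalisation idea that underlies most template induction. Real hybrid systems are far more sophisticated, and the example text and field names here are invented.)

```python
# Toy sketch of corpus-based template induction via delexicalisation:
# replace known field values in a corpus sentence with slot names,
# yielding a reusable template. The example data here is invented.

def induce_template(text: str, record: dict) -> str:
    """Turn a corpus sentence into a template by slotting in field names."""
    for field, value in record.items():
        text = text.replace(value, "{" + field + "}")
    return text

template = induce_template(
    "The Mill is a pub near the riverside.",
    {"name": "The Mill", "eatType": "pub", "area": "riverside"},
)
print(template)  # {name} is a {eatType} near the {area}.
print(template.format(name="The Swan", eatType="cafe", area="city centre"))
```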
Hi, just a quick addition to what Emiel said earlier about the necessity of a corpus. In my specific case, all the data-to-text systems (narrowing the scope here) I’ve been involved with started as ideas based on the same thought: “It would be great if we could generate a textual description of those graphics or those data”.
Often we would work with the expert (e.g., a meteorologist) to define a standard text in terms of content and style, but there was no corpus to start from, and we did not build one afterwards (mostly because the priorities were different, but it is also true that at the time we were not as aware as we are today of the importance of having a good corpus). Of course, in this context we didn’t have much choice and went rule-based. So, looking back, I guess you don’t need a corpus; but if that is the case, you will spend more time changing your system until the expert is happy enough with the result, because there is no reference or gold standard for comparison. Only then can you start thinking about a proper evaluation.
Thanks Ehud. Another critical aspect to consider is the purpose for which NLG output is used.
NLG is designed to generate text for a human to read.
There are many use cases in which machines read unstructured text in order to extract structured information (typically metadata) from it (NLP). In those use cases, many machine activities are derived from the metadata, and there is tolerance for (statistical) errors: examples include classifying a feed of news stories to measure overall sentiment, or a search that returns a few irrelevant results.
NLG output, in contrast, is typically consumed by individual human readers, and there is no tolerance for error. A machine-learning solution is statistical by nature, and is therefore problematic for most NLG use cases: a human will not accept an incorrect or irrelevant weather forecast, and definitely not an incorrect financial report…
Does this make sense?
Hi, I absolutely agree that for most NLG tasks, it is essential that generated texts be accurate! In fact I’ve written some previous blogs about this (eg, https://ehudreiter.com/2019/09/26/generated-texts-must-be-accurate/ and https://ehudreiter.com/2020/04/27/accuracy-errors-go-beyond-getting-facts-wrong/ ).
In a recent survey paper (https://arxiv.org/pdf/2007.15780.pdf), Cristina Gârbacea and Qiaozhu Mei claim “Compared to the survey of (Gatt and Krahmer, 2018), our overview is a more comprehensive and updated coverage of neural network methods and evaluation centered around the novel problem definitions and task formulations.” They certainly have a long list of references… including:
“Ehud Reiter. 2019. Natural language generation challenges for explainable AI. arXiv preprint arXiv:1911.08794.
Ehud Reiter. 2020. Why do we still use 18-year old BLEU. https://ehudreiter.com/2020/03/02/why-use-18-year-old-bleu.
Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.”