A few weeks ago a colleague who works on neural NLG asked me what was known about the amount of time required to develop neural NLG systems versus rule/template NLG systems. I’m glad he actually asked me, because I am sick of seeing neural NLG papers which claim in their introduction that building NLG systems is much faster with “modern” (ie, deep learning) techniques, without offering a shred of evidence to support this claim.
Anyway, my colleague decided to actually investigate this issue rather than make unjustified claims in his paper. So I told him that
- We have no solid evidence or data about this issue, so no claims can be properly justified.
- The weak evidence that we have suggests that building rule/template NLG systems is no slower and may indeed be *faster* than building neural NLG, at least for data-to-text systems.
It is very hard to experimentally test hypotheses about the impact of different technologies or methodologies on the effort required to develop a software system. We can see this by looking at some of the (very few) papers which actually try to give data about the development effort required to build rules-based NLG and machine learning NLG systems with similar functionality.
Belz 2008: This paper presented a ML technique for NLG microplanning and realisation, which was applied to weather forecast generation. Belz compared her system to the rule-based SumTime system, which did the same task. There was no statistically significant difference in human evaluations of the output quality of Belz’s system and SumTime’s microplanner and realiser. She also tried to compare development effort (Section 4.4.7 of the paper), which is why I mention the paper here. It took Belz 1 person-month to build a forecast generator for wind statements using her approach. She estimated that it would take another two months to add the other forecast fields, giving a total of three person-months. She asked me how much time we spent on microplanning and realisation in SumTime, and I told her 12 person-months.
So it seems that the ML system was much faster to develop than the rules-based system (3 months vs 12 months). But the comparison is flawed, because
- The SumTime estimate included all of the time spent in the project on microplanning and realisation: researching word choice, developing tools, pursuing a dead-end approach, and writing papers. My understanding of Belz’s estimate, in contrast, is that it did not include any of these activities, only the software development of the final system.
- Belz’s system was a research prototype which was tested on a dataset, whereas SumTime was operationally deployed and used. Hence we had to spend a lot of time in SumTime on robustness (handling edge cases, software testing, fixing bugs, sensible error handling and reporting, etc), which was not needed for Belz’s research prototype.
My best guess is that if we looked only at software development and aimed at “research prototype” robustness levels, it would have taken 1-2 person-months to develop the SumTime microplanner and realiser. Ie, faster than Belz’s estimate (3 months) for building her ML system.
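To give readers unfamiliar with rule-based NLG a flavour of what “microplanning and realisation” rules look like for this task, here is a minimal sketch of a rule-based wind-statement realiser. The thresholds, verb choices, and phrasing below are hypothetical illustrations, not SumTime’s actual rules.

```python
# Minimal sketch of a rule-based wind-statement realiser.
# The lexical-choice thresholds and phrasing rules below are
# hypothetical illustrations, NOT SumTime's actual rules.

def verb_for_change(old_speed, new_speed):
    """Choose a change verb from the speed difference (illustrative rule)."""
    diff = new_speed - old_speed
    if abs(diff) < 5:
        return "remaining"
    return "increasing" if diff > 0 else "easing"

def wind_statement(direction, old_speed, new_speed):
    """Realise a wind statement such as 'SSW 16-20 increasing 24-28'."""
    verb = verb_for_change(old_speed, new_speed)
    if verb == "remaining":
        # No significant change: report a single speed range.
        return f"{direction} {old_speed}-{old_speed + 4}"
    return f"{direction} {old_speed}-{old_speed + 4} {verb} {new_speed}-{new_speed + 4}"

print(wind_statement("SSW", 16, 24))  # SSW 16-20 increasing 24-28
print(wind_statement("S", 20, 10))    # S 20-24 easing 10-14
```

Even this toy version shows where the rule-based effort goes: deciding which verbs to use for which speed changes is exactly the kind of word-choice research that consumed much of the SumTime time budget.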
Puzikov and Gurevych 2018: Puzikov and Gurevych participated in the E2E challenge, and built two systems: one using neural NLG and one using templates. They submitted the template system to the shared task, and it did very well, coming in second in the human evaluation of quality.
In terms of development effort, Puzikov and Gurevych say that it took a “few hours” to develop the template system. They don’t give an effort estimate in the paper for developing the neural NLG system, but when they presented this paper at INLG 2018, they told me that it took a few weeks. Ie, developing the template system was at least an order of magnitude faster than developing the neural NLG system. But again, there are some important confounding factors here
- Puzikov and Gurevych developed the neural NLG system first, and spent a lot of time understanding the data, domain, and what constituted a good output text. This effort is not included in the time spent developing the template system, since it was developed second.
- Puzikov and Gurevych do not include the time (several person-months) required to build the corpus used to train the neural model.
I won’t attempt to resolve the above into a “fair” comparison, since I don’t have any first-hand knowledge of this work.
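For readers who have not seen the E2E challenge data, the template approach can be sketched in a few lines: an input meaning representation (a set of attribute-value pairs describing a restaurant) is realised by filling slots in a fixed sentence pattern. The attribute names below follow the E2E format, but the template itself is a simplified illustration, not Puzikov and Gurevych’s actual system.

```python
# Minimal sketch of an E2E-style template system: a meaning
# representation (dict of attribute-value pairs) is realised by
# filling slots in a fixed template.  The template is a simplified
# illustration, NOT Puzikov and Gurevych's actual one.

def realise(mr):
    """Realise an E2E-style meaning representation as one sentence."""
    parts = [f"{mr['name']} serves {mr['food']} food"]
    if "area" in mr:
        parts.append(f"in the {mr['area']} area")
    if "priceRange" in mr:
        parts.append(f"at {mr['priceRange']} prices")
    return " ".join(parts) + "."

mr = {"name": "The Eagle", "food": "French",
      "area": "riverside", "priceRange": "moderate"}
print(realise(mr))
# The Eagle serves French food in the riverside area at moderate prices.
```

It is easy to believe that something of this shape could be built in a “few hours” once the developer already understands the data and what a good output text looks like, which is exactly the confound noted above.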
Software engineering perspective
Software engineers have been trying for decades to understand what factors influence the cost of creating software. One old model which still provides useful insights is COCOMO. COCOMO identifies several factors which have a big influence on development cost, including
- Robustness and reliability (mentioned above in discussion of Belz 2008)
- Skill and expertise of developers (domain expertise was mentioned above in discussion of Puzikov and Gurevych 2018)
- Software methodology and tools
So if we want to do a fair comparison, then in addition to controlling for all of the factors mentioned above, we also need to control for differences in developer skill (a really good developer can be 10x more productive than a poor one) and in software methodology and tools.
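To make concrete how much these factors matter, here is a sketch of a COCOMO-style effort estimate, in the intermediate-COCOMO form effort = a × KLOC^b × EAF, where EAF is the product of cost-driver multipliers. The specific multiplier values below are illustrative, not exact entries from Boehm’s tables.

```python
# Sketch of an intermediate-COCOMO-style effort estimate:
#   effort (person-months) = a * KLOC**b * EAF
# where EAF is the product of cost-driver multipliers.
# The multiplier values used below are illustrative, not
# exact entries from Boehm's published tables.

def cocomo_effort(kloc, drivers, a=3.0, b=1.12):
    """Estimate effort in person-months from size and cost drivers."""
    eaf = 1.0
    for multiplier in drivers.values():
        eaf *= multiplier
    return a * kloc**b * eaf

# The same hypothetical 5 KLOC NLG system, estimated twice:
# a robust deployed system built by average developers, versus
# a research prototype built by highly skilled developers.
deployed = cocomo_effort(5, {"reliability": 1.4, "developer_skill": 1.0})
prototype = cocomo_effort(5, {"reliability": 0.75, "developer_skill": 0.7})
print(round(deployed, 1), round(prototype, 1))  # 25.5 9.6
```

Even with made-up multipliers, the point survives: the same system can plausibly cost 2-3x more person-months when built to deployment standards by average developers, which is larger than some of the rule-vs-neural differences discussed above.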
In short, it is very hard to do a rigorous and fair comparison of the development effort required to build an NLG system with different technologies (ie, rules-based and neural). We would need to ensure that
- Systems produce texts of similar quality.
- Systems are similar from robustness perspective.
- Effort measured is just creating the software artefact, not doing research.
- Developers have similar skills, expertise, and domain knowledge.
- Similar software engineering methodologies and tools are used.
We also need to decide whether to include corpus-creation time in the development effort for machine-learning systems.
It might be possible to do the above experiment, but I’m not aware of anyone who has actually conducted such an experiment. If anyone has done so, please let me know!
So going back to my colleague’s question about whether developing ML NLG is faster than developing rules-based NLG, the careful scientific answer is that (as of May 2020) we don’t know, because no one has done the necessary experiments to answer this question.
Coda: What do I think?
I will “go out on a limb” and hazard a personal opinion, which is that the (weak) data we have so far suggests that building rule-based NLG systems is no slower (and may be faster) than building neural NLG systems, at least for data-to-text applications, even if we ignore corpus-creation costs. This is what Puzikov and Gurevych’s work suggests; there were confounding factors, but the difference they found was so large that it seems likely that there is some truth here. Belz’s work, if we adjust for confounds as described above, suggests that building rule-based NLG systems is either faster or equivalent to building ML NLG systems.
The above ignores corpus-creation costs. If we include such costs in the effort of building an ML NLG system, then I suspect that building a neural NLG system takes **more** development time than building a rule-based NLG system.
Please note that this discussion applies to building complete systems. If we look at individual NLG tasks, then I think it is likely that some tasks (such as lexical choice?) can be done faster and better using ML technology. But again, this is a personal opinion and guess; it’s not based on data and evidence.