I thought I’d write a few comments on the history of the simplenlg open-source package for NLG surface realisation. I occasionally get asked about this, so it seemed worth writing the story down properly.
Simplenlg started in the late 1990s. We were working on the STOP system (using NLG to produce personalised smoking cessation letters). We decided to build STOP in Java (which at the time was still a fairly new programming language), and we needed a library or tool to do simple NLG surface realisation tasks. I had just finished a survey of existing NLG surface realisers (as part of writing my book on Building NLG Systems), which showed that existing realisers (A) would be painful to use from a Java system, and (B) mostly (with some exceptions) were poorly engineered as software artifacts. So I decided to create my own Java library to do simple NLG tasks.
Over the next 10 years, I expanded the library (which was called different things at different times), as it was used in several research projects (including SumTime and SkillSum) and also for teaching NLG. The goal was always to reliably provide “simple” NLG functionality; i.e., to focus on doing the simple things well. But the library was still only used and known within Aberdeen University.
In 2006 we started the Babytalk project, the most challenging NLG project we (or anyone else) had ever attempted. Babytalk clearly needed a good library for surface realisation, so we decided to pull together, reorganise, and enhance the existing library, and release this to the research community as simplenlg 3 (simplenlg 1 was notionally what came out of the work on STOP, and simplenlg 2 was notionally where the library was at before Babytalk). We did this while still maintaining the core principle that the goal was to robustly provide simple/basic NLG functionality in a fast, easy-to-use, well-documented, and in general well-engineered Java library. Simplenlg has always been inferior to other realisers (such as KPML or openccg) in terms of functionality, but then its focus has always been on usability and usefulness in building systems, not on providing the most functionality.
I had personally developed earlier versions of simplenlg, but simplenlg 3 was definitely a team effort, to which numerous people contributed. I am especially grateful to Albert Gatt, who was a research fellow on Babytalk and put a huge amount of effort into writing code for simplenlg; and to Chris Venour, a PhD student who had worked as a technical writer, and did a lot to ensure that simplenlg had a high-quality tutorial and documentation as well as good code.
We released simplenlg 3 to the research community for non-commercial use, via a Google Code site. We specified non-commercial use because John Carroll had kindly agreed to let us use his rules for morphological generation, but only for non-commercial purposes. A number of people outwith Aberdeen started using simplenlg, especially after I mentioned it on the relevant Wikipedia surface realisation page. I wasn’t deliberately trying to publicise simplenlg, but I needed a simple example to show the concept, and simplenlg was the obvious candidate.
By 2009, increasing usage of simplenlg 3 in many different contexts by many different organisations had shown that it had some architectural problems. Also, we were getting inquiries from commercial companies who wanted to use simplenlg (and indeed we were looking at setting up our own spinout company, Data2text, which eventually was absorbed into Arria NLG). So we decided to rebuild and rearchitect simplenlg, based on accumulated feedback, and also to replace Carroll’s morphology rules with a lexicon-based morphology system (using the NIH Specialist Lexicon), so that we could release simplenlg as true open source software which could be commercially used. The result was simplenlg 4, which was released in 2010. We decided that simplenlg 4 should *not* be backwards compatible with simplenlg 3, which caused a lot of hassle in the short term, but gave us much more flexibility in rearchitecting the system, which paid off in the longer term.
Simplenlg 4 has been picked up and used by numerous research labs and commercial companies; I long ago gave up even trying to keep track. It has also inspired the development of similar realisers in other languages, including Filipino, French, German (simplenlg 3), Italian, Portuguese, Spanish, and Telugu. There is also a community of people who monitor the mailing list and answer questions; I am especially grateful to Saad Mahamood for his work in supporting simplenlg, which now officially resides on GitHub.
Simplenlg has come a long way since I threw together some Java code for surface realisation almost 20 years ago. Although I don’t have any hard usage data, I believe it is the most widely used open-source NLG library. Certainly I keep on running across people and organisations who are using it, often in ways I never anticipated when I created the first version almost 20 years ago!