An Architecture for Data-to-Text Systems

I was very happy to be given a Test-of-Time award at INLG for my 2007 paper An Architecture for Data-to-Text Systems. Test-of-Time awards are given to old papers (in this case, published at least 10 years ago in INLG or ENLG) which have had a major impact and continue to be cited in 2022. This is the first Test of Time award from INLG, so I guess the reviewers must have considered this paper to be top of the list!

I must say it’s been a good year for awards. My former student Meg Mitchell won an ACL Test of Time award for work she did on image captioning while she was a PhD student at Aberdeen, my current PhD student Francesco Moramarco won an award at NAACL 2022 for Best paper on human-centred NLP (special theme), and now I’ve won the INLG Test of Time award. Not bad!

Anyways, I thought I’d say a bit about the paper and also the context (which a few people have asked me about).

Context

Aberdeen in 2007 was a really exciting place to do NLG research. Our NLG group was led by Chris Mellish, and included myself, Kees van Deemter, and Yaji Sripada (faculty); Graeme Ritchie, Albert Gatt, and Francois Portet (research fellows); and several PhD students including Saad Mahamood, Nava Tintarev, and Ross Turner. I think it was the biggest NLG group in the world at the time.

I myself was a Reader (similar to Associate Professor in the USA) and focusing on data-to-text, i.e. building NLG systems which summarised, explained, and otherwise communicated complex numeric and symbolic data sets. We had just started working on the Babytalk project, whose goal was to generate summaries of clinical data from babies in neonatal intensive care, for doctors, nurses, and parents. It was the most ambitious data-to-text project attempted to date, and involved a research team from many backgrounds (data analysis, knowledge-based reasoning, NLG, medical informatics, psychology, neonatal care). I wanted to come up with an architecture which integrated the many types of reasoning and knowledge needed for complex data-to-text processing. The architecture also needed to cover both data analytics and NLP and be consistent with previous work in this area.

Paper

In the paper, I essentially proposed that data-to-text systems be treated as a pipeline of four components, each of which did a different type of processing. Identifying the types of reasoning needed was as important as constructing a pipeline. Anyways, the stages were:

  • Signal analysis: Look for patterns in numeric input data, such as spikes and trends. This is essentially data science and is usually done with standard signal analysis, pattern detection, and noise suppression algorithms, perhaps fine-tuned on the dataset.
  • Data interpretation: Extract important messages (insights) from the signal analysis patterns and also symbolic input data. Also look for links between insights/messages, such as causality. This requires domain knowledge about which insights are important and how they relate to each other. In 2007 I considered this to be a knowledge-based reasoning task; in 2022 machine learning techniques could also be used for data interpretation.
  • Document planning: Decide which messages should be communicated and create a document structure (message ordering, rhetorical relations, paragraph and section breaks, etc); the goal is to create a story or narrative about the data. This requires some understanding of what constitutes a good story or narrative, which can either be encoded algorithmically or learned from corpora.
  • Microplanning and realisation: Create an actual text which expresses the document plan in fluent grammatical language. This is essentially linguistic processing, and involves tasks such as lexicalisation, referring expression generation, aggregation, and surface realisation. This paper introduced an early version of the simplenlg package for surface realisation.

The above architecture separates different types of reasoning into different modules, which makes it easier for multidisciplinary teams to collaborate. For example, data scientists can work on signal analysis, subject matter experts can work on data interpretation, and computational linguists can work on microplanning and realisation.
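To make the four stages concrete, here is a toy sketch in Python. It is not code from the paper or from Babytalk, and it does not use simplenlg; every name, threshold, and rule in it (Pattern, Message, the spike test, the alarm level, the sentence template) is invented for illustration. Each stage is a separate function with its own input and output type, which is the property that lets different specialists work on different modules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pattern:
    """A pattern found in the raw signal (hypothetical type)."""
    kind: str
    index: int
    value: float


@dataclass
class Message:
    """An interpreted pattern with a domain-specific importance (hypothetical type)."""
    importance: float
    pattern: Pattern


def signal_analysis(series: List[float], threshold: float) -> List[Pattern]:
    """Stage 1: detect upward spikes, i.e. a rise of more than `threshold`
    between consecutive samples. A real system would use proper signal
    analysis and noise suppression here."""
    patterns = []
    for i in range(1, len(series)):
        if series[i] - series[i - 1] > threshold:
            patterns.append(Pattern("spike", i, series[i]))
    return patterns


def data_interpretation(patterns: List[Pattern], alarm_level: float) -> List[Message]:
    """Stage 2: apply a toy domain rule. Spikes that reach `alarm_level`
    are treated as more important than those that do not."""
    return [
        Message(1.0 if p.value >= alarm_level else 0.5, p)
        for p in patterns
    ]


def document_planning(messages: List[Message], max_messages: int) -> List[Message]:
    """Stage 3: keep the most important messages, then order them
    chronologically so the text reads as a simple narrative."""
    selected = sorted(messages, key=lambda m: m.importance, reverse=True)[:max_messages]
    return sorted(selected, key=lambda m: m.pattern.index)


def microplanning_and_realisation(plan: List[Message], channel: str) -> str:
    """Stage 4: produce text. A fixed template stands in for real
    lexicalisation, aggregation, and surface realisation."""
    if not plan:
        return f"No notable events in {channel}."
    sentences = []
    for m in plan:
        adverb = "sharply " if m.importance >= 1.0 else ""
        sentences.append(
            f"{channel.capitalize()} rose {adverb}to {m.pattern.value:g} "
            f"at sample {m.pattern.index}."
        )
    return " ".join(sentences)


def generate(series: List[float], channel: str, threshold: float = 10.0,
             alarm_level: float = 180.0, max_messages: int = 2) -> str:
    """Run the four stages in order; each consumes the previous stage's output."""
    patterns = signal_analysis(series, threshold)
    messages = data_interpretation(patterns, alarm_level)
    plan = document_planning(messages, max_messages)
    return microplanning_and_realisation(plan, channel)


if __name__ == "__main__":
    print(generate([120, 122, 150, 151, 190, 188], "heart rate"))
```

Because the stages only communicate through the Pattern and Message types, any one of them could be swapped for something more sophisticated (a learned model, say) without touching the others.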

In 2007, we expected machine learning to be used in signal analysis but not elsewhere. In 2022, machine learning could in principle be used in all of the above stages. However, I would strongly recommend that something like the above architecture be used; I believe that a pipeline of focused modules will do a better job than an “end-to-end” approach, especially for more complex data-to-text applications.

If you want to learn more, I suggest you read the paper; people have told me that it is relatively accessible and easy to read, even for non-specialists.

Since 2007

This paper currently (at the time I’m writing this blog) has around 300 citations on Google Scholar, which I suspect makes it one of the most cited ENLG/INLG papers. I like to think that most people working in (complex) data-to-text cite this paper; even if they don’t use the pipeline architecture, their systems still need to perform the above types of reasoning in some fashion.

The Babytalk project also achieved a lot of recognition (with the main publications being in journals, not conferences), and many researchers and indeed developers started using the simplenlg package (especially after we had improved it and released version 4).

I should say that in general I am proudest of my journal papers, not my conference papers. But among my conference papers, An Architecture for Data-to-Text Systems is certainly one of my favourites!

Full Citation

E Reiter (2007).
An Architecture for Data-to-Text Systems.
Proceedings of ENLG-2007, pages 97-104.
URL: https://aclanthology.org/W07-2315/
