Data-to-text systems summarise and present insights from (usually numeric) data sets. Most such systems take the data at face value and assume it is correct. But of course in the real world many data sets are messy: incomplete (missing data), incorrect, inconsistent, etc. Interpreting such data literally will probably lead to false insights and conclusions; hence an important challenge for data-to-text is to generate good summaries and analyses from flawed real-world data.
One of my PhD students, Stephanie Inglis, has been working on this problem. Steph started off looking at data from the 2014 Ebola outbreak, where key data about the progress of Ebola came from remote rural areas of some of the poorest countries in the world, so it had many problems. She then looked at several other data sets which combined data from countries around the world (and in some cases from different time periods). The problem here is not just missing or incorrect data; it's that data is collected in different ways in different countries (and in different time periods), which can easily lead to incorrect insights because of inconsistencies in how the data is collected.
Steph didn't look at the Covid pandemic, but exactly the same issues come up with pandemic data, as has been made clear in a series of excellent articles in the Guardian by David Spiegelhalter and Anthony Masters. Spiegelhalter and Masters have done a great job of explaining the impact of the above issues on Covid data to a general audience, and of drawing out valid insights from the data. Their most recent article cites a great example of an analysis by a well-respected journalist which suggests that vaccinated people seem more likely to get Covid than unvaccinated people. This analysis is flawed because it is based on incorrect population data. The flaw is a subtle one, and a great illustration of the problem of taking data literally without trying to understand underlying data quality issues.
Spiegelhalter and Masters conclude by saying “Data does not speak for itself – it needs people to speak honestly and carefully on its behalf.” It would be fantastic if NLG systems could take on this role of “speaking honestly and carefully” on behalf of data!
In a sense, Steph’s work is an initial attempt at this task; it's clear that we have a long way to go before NLG systems can do this anywhere near as well as Spiegelhalter and Masters. Basically, Steph’s approach is to create texts which (A) communicate insights which seem robust and reliable as well as important, and (B) include explicit warnings and cautions about relevant data quality issues. This approach was motivated by conversations with data journalists about how they perform this task.
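As a very rough sketch of ideas (A) and (B) — this is my own caricature, not Steph's actual system, and all names and thresholds below are invented — one could attach a reliability judgement and a caveat to each candidate insight, and have the generator decide whether to state it plainly, hedge it, or warn about it:

```python
from dataclasses import dataclass

@dataclass
class Insight:
    text: str            # the insight itself
    reliability: float   # 0..1: how robust we judge it to be
    caveat: str = ""     # known data-quality issue, if any

def summarise(insights, min_reliability=0.7):
    """State robust insights (with any caveat noted); explicitly
    flag insights that rest on unreliable data."""
    lines = []
    for ins in insights:
        if ins.reliability >= min_reliability:
            note = f" (Note: {ins.caveat}.)" if ins.caveat else ""
            lines.append(ins.text + "." + note)
        else:
            lines.append(f"Caution: {ins.text}, but this rests on "
                         f"unreliable data ({ins.caveat}), so it should "
                         f"be treated as tentative.")
    return "\n".join(lines)

report = summarise([
    Insight("Case numbers fell by 20% this week", 0.9),
    Insight("Country A has twice the case rate of Country B", 0.4,
            "the two countries count cases differently"),
])
print(report)
```

The hard part, of course, is not the text planning shown here but producing honest reliability judgements and caveats in the first place.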
Another approach is to try to “fix” the data quality problems. We attempted this in the Babytalk project, where we used NLG to generate summaries of clinical data and discovered that some of the human-entered clinical notes had incorrect timestamps, which led to confusing texts because events were not correctly time-ordered. We addressed this in Babytalk by using sensor data to identify when relevant events probably happened (e.g., even minor surgical procedures cause massive disruption to heart rate and other sensor data, so Babytalk looked for such disruptions in sensor data in order to identify the time of surgery). But this sort of reasoning is complex and domain-specific, so this approach does not scale. Perhaps ML techniques could be used, but getting good training data would be hard.
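The heart-rate heuristic can be caricatured as simple change-point detection: scan the signal for the first window where variability spikes. A minimal sketch follows, with simulated data and an arbitrary threshold; the reasoning in the real Babytalk system was far more involved than this:

```python
import statistics

def find_disruption(heart_rate, window=5, threshold=10.0):
    """Return the start index of the first window whose standard
    deviation exceeds `threshold` -- a crude proxy for 'something
    big happened here'.  Returns None if the signal stays stable."""
    for i in range(len(heart_rate) - window + 1):
        if statistics.stdev(heart_rate[i:i + window]) > threshold:
            return i
    return None

# Simulated readings: a stable baseline, then a surgery-like disruption.
hr = [140, 142, 141, 143, 140, 139, 141,   # stable baseline
      170, 110, 165, 120, 158,             # disrupted period
      141, 140, 142]                       # back to stable

print(find_disruption(hr))  # first window touching the disruption
```

Even this toy version shows why the approach is hard to generalise: the window size, the threshold, and the very idea that "surgery disrupts heart rate" are all domain knowledge that had to be hand-crafted.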
If anyone is interested in this problem, by the way, I am currently looking for a PhD student to work on explaining probabilistic reasoning. The PhD is notionally about explaining Bayesian networks, but in practice I’m happy for the student to work on any topic related to explaining probabilistic or statistical reasoning, and summarising messy statistical data would certainly fall within this remit! If you’re interested, please contact me; note that the closing date for the studentship is 16 Jan 2022.