I am teaching an NLG course to our MSc students, along with Yaji Sripada. As usual, it’s an interesting experience for me as well as for the students. This year we are using Arria’s Studio for Students for the first time (previously we used simplenlg). This has mostly gone well, although there were a few glitches; I’m happy to share my experiences with other people who want to teach NLG using Studio for Students.
Anyway, one of the best (toughest?) questions I was asked by the students was about data quality. I gave the students a data set about actors (names, movies they have appeared in or directed/produced, awards, etc), and asked them to create an NLG system which produced a microbiography of the actor’s career. One of the students (who had worked in industry before starting the MSc) came to my office and pointed out that the data set I gave them was flawed, and in particular was missing a lot of information. For example, I asked the students to include in the microbiography how many Academy Awards the actor had won. But the data set was incomplete; for example, it didn’t list any Academy Awards for Sean Connery, despite the fact that Connery had won one award for Best Supporting Actor. So, the student asked, how on earth was he supposed to create an NLG system which produced good microbiographies of Connery and other actors when the data set I gave him was buggy and incomplete?
In this case, I just told him to ignore the data quality issues, and said I would mark his project based on faithfulness to the data set, not to reality. But while this works in an academic context, it doesn’t solve the problem in real-world contexts. If a user gets an inaccurate text from an NLG system, she doesn’t care whether this is because of poor data, buggy programming, bad technology, or whatever; what matters to her is that the text is useless. In other words, it doesn’t matter how wonderful our technology is; we are still subject to the “Garbage in, garbage out” principle.
And I have seen this issue arise in *many* NLG projects, and indeed in AI projects generally. NLG and AI developers and researchers are usually very interested in technology, and developers at least care deeply about robust software. But at least in my experience, one of the biggest (maybe the biggest) sources of poor performance in real NLG/AI systems is problems in the input data.
Techniques for Addressing Data Quality in NLG
So what do we do about this? One approach is to focus the generated text on information we feel confident about, and explicitly acknowledge missing data and other data quality issues if we need to talk about problematic data. One of my PhD students is in fact working on this (Inglis et al 2017).
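To make the idea of acknowledging missing data concrete, here is a minimal sketch for the Academy Awards sentence of the microbiography. It assumes (my assumption, not something in the course data set) that a missing award count is represented as `None`, so that "unknown" can be told apart from "known to be zero"; the function name `awards_sentence` and the wording of the sentences are purely illustrative.

```python
from typing import Optional


def awards_sentence(name: str, academy_awards: Optional[int]) -> str:
    """Describe Academy Award wins, treating None as 'unknown', not as zero."""
    if academy_awards is None:
        # Missing data: say so rather than implying the actor won nothing.
        return f"Our records do not say whether {name} has won any Academy Awards."
    if academy_awards == 0:
        return f"{name} has not won an Academy Award."
    noun = "Academy Award" if academy_awards == 1 else "Academy Awards"
    return f"{name} has won {academy_awards} {noun}."
```

Of course this only helps if the data set actually distinguishes "no awards" from "no information about awards"; if both look the same in the data, the system cannot tell which sentence to produce.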
Another approach is to try to identify and fix data problems. For example, in the Babytalk-Nurse project, which generated summaries from hospital electronic clinical records, we saw that many time stamps in the clinical record were incorrect. This typically happened when a nurse or doctor entered details of an intervention into the electronic record after the fact (which is common, since the focus of clinicians is on patient care, not data entry), and did not accurately specify the time at which they performed the intervention (see section 3.2.5 of Hunter et al 2012). Interventions of course usually impact sensor data (eg, heart rate), which does have accurate time stamps. So we attempted to fix the time stamps for interventions by looking into the sensor data to determine when the intervention actually happened. This was a lot of work, and it was pure analytics (not NLG), but it did substantially improve the quality of many Babytalk-Nurse texts.
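As a rough illustration of this kind of time stamp repair, the sketch below looks for the largest abrupt change in a sensor channel shortly before the recorded time, and proposes that as the likely time of the intervention. This is not the Babytalk-Nurse algorithm; the function name, the 30-minute window and the jump threshold are all assumptions I have made up for the example.

```python
from typing import List, Optional, Tuple


def estimate_intervention_time(
    recorded_time: float,
    sensor: List[Tuple[float, float]],
    window: float = 30.0,
    min_jump: float = 15.0,
) -> Optional[float]:
    """Return the time of the largest sensor jump shortly before recorded_time.

    sensor is a list of (time_in_minutes, value) pairs sorted by time.
    Returns None if no jump of at least min_jump is found, so the caller
    can fall back to the recorded time (or flag it as uncertain).
    """
    best_time, best_jump = None, min_jump
    # Compare each reading with the one before it.
    for (t0, v0), (t1, v1) in zip(sensor, sensor[1:]):
        # Only consider readings in the window leading up to the recorded time,
        # since records are typically entered after the fact.
        if recorded_time - window <= t1 <= recorded_time:
            jump = abs(v1 - v0)
            if jump >= best_jump:
                best_time, best_jump = t1, jump
    return best_time
```

Returning `None` when nothing stands out matters here: a repair step that always produces an answer would just replace one unreliable time stamp with another.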
But we cannot always fix data problems. No amount of clever analytics or ML is going to inform an NLG system that Sean Connery has an Academy Award if this information is not present in the input data.
Data Quality Robustness and Hallucination
On a more speculative note, one of the great advantages of using machine learning techniques to build classifiers is that they can learn how to deal with data quality issues, if they are given large amounts of training data. In other words, the classifier learns how to do the best it can with the available data. And I suspect this robustness to data quality problems is one of the reasons ML technology has been so successful in “classifier” applications such as face recognition and sentiment analysis.
In an NLG context, though, I suspect that such robustness is linked to hallucination. For example, consider missing data, which is the most common data quality issue. When building a classifier, it makes sense to effectively try to reconstruct the likely value of the missing data, since this leads to better classification performance. However, when building an NLG system, trying to reconstruct missing data can easily lead to hallucination. I commented in a recent blog on hallucination that one example seen in the E2E challenge was hallucinating that a coffee shop is cheap, even if price information is not specified for this coffee shop. Since most coffee shops are cheap, this is a reasonable inference for an ML system to make, and probably the right thing to do if we were using this data within a classifier, such as a recommendation system. But in an NLG context where we are generating descriptions of the coffee shop, it is a mistake to explicitly say it is cheap unless we are confident about this; if we don’t know its price range, we should omit this information from the description.
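The contrast between the two ways of handling a missing price can be sketched in a few lines. This is a toy illustration, not the E2E data format or any real system: the record fields (`name`, `price`, `area`), the function names and the sentence template are all assumptions made for the example. `impute_price` does what a classifier pipeline might reasonably do, while `describe` simply leaves out anything it does not know.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def impute_price(record: Record, training: List[Record]) -> str:
    """Classifier-style handling: fill a missing price with the most common value."""
    if record.get("price") is not None:
        return record["price"]
    counts = Counter(r["price"] for r in training if r.get("price") is not None)
    return counts.most_common(1)[0][0]


def describe(record: Record) -> str:
    """NLG-style handling: mention only the attributes that are actually present."""
    parts = [f"{record['name']} is a coffee shop"]
    if record.get("price") is not None:
        parts.append(f"with {record['price']} prices")
    if record.get("area") is not None:
        parts.append(f"in the {record['area']}")
    return " ".join(parts) + "."
```

If the output of `impute_price` were fed into `describe`, the text would assert a price that nobody ever recorded, which is exactly the hallucination described above.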
6 thoughts on “Bad Data Means Bad Output”
Better data, however, comes at a much higher cost. It would be great if one could come up with a universal metric/methodology to quantitatively measure the quality or “purity” of data, so we can somehow “control the circumstances”. I believe such a method could be extremely helpful in industrial scenarios (because *trade-off* is a keyword): during my internship, we were always worried about the quality of data (especially when it served as a test set). Even though a lot of time had already been spent cleaning the data, we were still not confident that it was enough and that we could just stop!
That’s a really good point. In the real world “perfect data” is an ideal which we can approach, but usually not achieve, and getting closer to the ideal costs money. Ie, it’s like engineering reliability, where 100% is impossible, and achieving 99.9% costs a lot more than achieving 99%. So if we are trying to build a real-world NLG system, we need to trade off investment in actually building the system against investment in improving data quality.