At a recent meeting of the Aberdeen NLP/NLG group, I made what I thought was an obvious statement, namely that if we build an NLG system by learning from human-written example texts, we want these texts to be high-quality texts which are accurate, readable, and effective. In other words, we want high-quality training data.
Well, I thought this statement was obvious, but the reaction of some of the PhD students showed that it clearly was not. The students all realised that quantity mattered, and that more training data was better than less. But many of them had never thought about quality, or even realised that it was an issue. After all, if we want to teach a computer to write like a person, then we want lots of examples of how people write, and it may not make much sense to categorise individual examples as “good” or “bad”. I also noticed that there didn't seem to be any mention of data quality issues in a deep learning book which we have been collectively discussing.
The underlying problem is that the usual objective of NLG is not to produce texts which look like they were written by a human writer, but rather to produce texts that are helpful to human readers. And of course we know that some human-written texts are much better than others, in terms of readability, accuracy, effectiveness, reader satisfaction, recall, and so on. So if we want to build an NLG system which produces texts that are effective for human readers, then we should train it on a corpus of such texts. This means we should filter out human-written texts which are confusing, misleading, or otherwise not useful.
Individually testing corpus texts for readability and so on is a lot of work, but we can go a long way just by identifying individual authors who are good writers. Some people write a lot better than others, so if we restrict our training corpus to texts from these good writers, we'll probably have pretty good quality training data.
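To make the idea concrete, here is a minimal sketch of this kind of filtering. The corpus structure (author, text pairs), the vetted-author list, and the length threshold are all assumptions made for illustration, not part of any real pipeline:

```python
# Hypothetical sketch: restrict a training corpus to texts from authors
# judged to be good writers, plus a crude length-based sanity check.
# The corpus format and the author whitelist are assumed for illustration.

GOOD_WRITERS = {"alice", "bob"}  # authors vetted as strong writers (assumed)

def filter_corpus(corpus, min_words=20):
    """Keep only texts by vetted authors that pass a basic length check."""
    kept = []
    for author, text in corpus:
        if author not in GOOD_WRITERS:
            continue  # drop texts from unvetted writers
        if len(text.split()) < min_words:
            continue  # drop fragments too short to be useful examples
        kept.append((author, text))
    return kept
```

In practice the length check would be replaced by whatever quality signals are available (readability scores, fact-checking against source data, reader ratings), but the shape of the filtering step is the same.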
An argument could be made that a large amount of mixed-quality training texts may be more useful than a smaller amount of high-quality training texts. I am willing to believe this in principle, but I would need to see evidence that systems built from large mixed-quality corpora are better than systems built from smaller high-quality corpora. And this evidence needs to be based on reader assessment of quality; i.e., not BLEU or other metrics that assess similarity to human writing!
Anyway, I think we can deal with data quality issues, but we first need to acknowledge that data quality is something we need to think about. The lack of awareness of it, in the deep learning textbook as well as amongst the PhD students, is worrying.