I went to a workshop last month where there was a very interesting talk about how deep-learning MT systems occasionally produce really bad outputs, and in particular translations with high fluency but low adequacy. The speaker cited Koehn and Knowles 2017, who say (page 30)
Note that the output of the NMT systems [when used for out-of-domain translations] is often quite fluent (e.g., Take heed of your own souls.) but completely unrelated to the input … This is of particular concern when MT is used for information gisting — the user will be mislead by hallucinated content in the NMT output.
Which is interesting, because we see exactly the same thing in NLG, where deep learning systems hallucinate texts which are readable but misleading. Perhaps deep learning approaches (at least in 2018) favour readability (fluency) over correctness?
Fluency vs Correctness?
There probably are some contexts where readability is more important than correctness. However, in the NLG projects I have worked on, correctness has always been more important than readability. For example (page 552 of Reiter and Belz 2009), when we evaluated the SumTime weather forecasts system, we asked evaluators to compare two forecasts in terms of accuracy, readability, and appropriateness. When evaluators thought one forecast was more readable but the other was more accurate, they said the more-accurate forecast was overall more appropriate in 55% of cases, and the more-readable forecast was overall more appropriate in only 18% of cases.
From a commercial perspective, accuracy is also of paramount importance to Arria, especially in contexts where inaccurate and misleading texts could lead to poor decisions and perhaps even result in lawsuits.
Indeed, in many contexts accuracy is so important that it is unacceptable for even a small number of generated texts to be inaccurate and misleading. That is, users will prefer a system which generates clunky-but-always-accurate texts over a system which generates very readable texts which are hallucinated/misleading in a few cases.
End-to-End vs Pipeline?
I suspect what we’re seeing is partially a consequence of using end-to-end learning approaches (regardless of the type of learning used) instead of a pipeline. In a pipeline NLG system, accuracy is mostly determined by the content determination module, and fluency/readability by microplanning and realisation. Of course there are exceptions, since content choices affect readability (eg, complex content may decrease readability) and microplanning choices affect accuracy (eg, inappropriate referring expressions may mislead the reader). But accuracy and readability mostly come from different pipeline modules, so there isn’t a direct tradeoff between them.
End-to-end systems, in contrast, make “content” and “expression” choices at the same time, so they can make choices which directly trade off accuracy against readability. On the plus side, this enables optimisations and solutions which cannot be done within a pipelined system. On the minus side, though, this means that text quality may suffer if inappropriate tradeoffs are made. Basically such systems need to be explicitly told, via some kind of error/cost/evaluation/whatever function, of the relative importance of accuracy vs readability. This is less of an issue for pipeline systems, since to some degree they can separately optimise accuracy and readability.
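To make the point about relative importance concrete, here is a minimal sketch of a cost function that combines the two qualities with an explicit weight. Everything in it is an assumption for illustration: the function names, the linear weighting, and the accuracy/readability scores of the two hypothetical candidate texts are invented, and a real system would need some way to actually measure both qualities.

```python
from typing import Dict, Tuple

def combined_score(accuracy: float, readability: float, accuracy_weight: float) -> float:
    """Linear mix of two quality scores in [0, 1]; accuracy_weight says how much accuracy matters."""
    return accuracy_weight * accuracy + (1 - accuracy_weight) * readability

def best_candidate(candidates: Dict[str, Tuple[float, float]], accuracy_weight: float) -> str:
    """Return the name of the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda name: combined_score(*candidates[name], accuracy_weight),
    )

# Hypothetical (accuracy, readability) scores, invented for this example.
candidates = {
    "clunky": (1.0, 0.6),   # always correct, awkward to read
    "fluent": (0.7, 1.0),   # reads well, partly misleading
}

equal_weight_choice = best_candidate(candidates, accuracy_weight=0.5)
accuracy_first_choice = best_candidate(candidates, accuracy_weight=0.8)
print(equal_weight_choice, accuracy_first_choice)
```

With equal weights the fluent-but-misleading text wins; once accuracy is weighted more heavily the clunky-but-correct text wins. The choice of weight is exactly the thing the system has to be told.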
What I am seeing in current research suggests that deep learning systems often prioritise readability/fluency over accuracy, which is the wrong thing to do for most NLG applications, and I suspect also the wrong approach for many MT applications.
Perhaps this is because it is difficult to assess accuracy in an error/etc function? I imagine it’s extremely hard to assess accuracy in most MT contexts. It’s probably easier to assess accuracy in many NLG contexts, but I note that the most successful approaches to date involve an extra post-processing or “beam reranking” step which focuses on accuracy. Ie, we ask the deep learning system to produce a number of possible texts (over-generation), and then use an accuracy-focused metric to filter out texts which are not accurate. This works to some degree, at least in simple applications like the restaurant descriptions in the E-to-E challenge, but it’s striking that the accuracy check is largely separated from the rest of the system (maybe this is because it is domain dependent?).
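The over-generate-and-filter idea can be sketched roughly as follows. This is only an illustration under strong assumptions: the candidate texts are hard-coded rather than produced by a neural generator, and the accuracy check (a hypothetical `slot_errors` function) just tests whether each input value appears verbatim in the text. That catches omitted or replaced values but would not catch extra hallucinated content, and it is not the check used by any particular published system.

```python
from typing import Dict, List

def slot_errors(text: str, meaning: Dict[str, str]) -> int:
    """Count input values that are missing from the text (a crude, domain-specific check)."""
    lowered = text.lower()
    return sum(1 for value in meaning.values() if value.lower() not in lowered)

def filter_accurate(candidates: List[str], meaning: Dict[str, str]) -> List[str]:
    """Keep candidates that mention every input value, preserving the generator's ranking."""
    return [text for text in candidates if slot_errors(text, meaning) == 0]

# Hypothetical input and over-generated candidates, invented for this example.
meaning = {"name": "Blue Spice", "food": "Italian", "area": "riverside"}
candidates = [
    "Blue Spice is a lovely place by the river.",              # fluent, omits the food type
    "Blue Spice serves Italian food in the riverside area.",   # accurate
    "Blue Spice serves Chinese food in the riverside area.",   # fluent, wrong food type
]

accurate = filter_accurate(candidates, meaning)
print(accurate)
```

Note that the check knows about restaurant attributes and nothing else, which is one reason this kind of step tends to sit outside the learned model.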
I’m also not sure how well a postprocessing/reranking approach would work in a complex application such as Babytalk, since it would be difficult to parse/analyse a candidate text to measure its accuracy. In fact, I suspect it might be easier to construct an accurate text from first principles (ie, classic NLG content determination) than to check the accuracy of a proposed text produced by a deep learning system.
I suspect one underlying problem is that most deep learning researchers use evaluation metrics (such as BLEU) which are really bad at assessing accuracy. So if accuracy is as important as I believe it is, researchers need to switch to evaluation techniques which properly measure and prioritise accuracy.
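A toy illustration of why word-overlap metrics struggle with accuracy: the sketch below computes clipped bigram precision, which is one ingredient of BLEU but not BLEU itself (there is no brevity penalty and no combination of several n-gram orders). The function names and the three sentences are made up for this example.

```python
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the wind will be light in the morning"
misleading = "the wind will be strong in the morning"   # one word changed, meaning reversed
paraphrase = "morning winds are light"                  # same meaning, different wording

misleading_score = ngram_precision(misleading, reference)
paraphrase_score = ngram_precision(paraphrase, reference)
print(misleading_score, paraphrase_score)
```

The misleading forecast shares most of its bigrams with the reference and so scores far higher than the accurate paraphrase, which shares none.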
Another point is that almost all NLG researchers (even those who do human evaluations) focus on average case performance, not worst case. But if we need to guarantee a minimal level of accuracy for all texts produced by our NLG system, then we need an evaluation technique which is sensitive to the worst case. We also need test cases which are deliberately designed to be “nasty” and “weird”, in order to ensure acceptable accuracy on boundary (edge) cases.
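The difference between average-case and worst-case reporting can be shown with a few lines of code. This is a sketch with invented numbers: the per-text accuracy scores, the two systems, and the 0.8 acceptability threshold are all hypothetical, and `summarise` is just an illustrative helper.

```python
from statistics import mean
from typing import Dict, List

def summarise(scores: List[float], threshold: float) -> Dict[str, float]:
    """Report average, worst case, and the number of texts below an acceptability threshold."""
    return {
        "average": mean(scores),
        "worst": min(scores),
        "failures": sum(1 for s in scores if s < threshold),
    }

# Hypothetical per-text accuracy scores, invented for this example.
consistent_system = [0.9, 0.9, 0.9, 0.9, 0.9]   # never great, never misleading
erratic_system = [1.0, 1.0, 1.0, 1.0, 0.6]      # usually perfect, one bad text

consistent = summarise(consistent_system, threshold=0.8)
erratic = summarise(erratic_system, threshold=0.8)
print(consistent)
print(erratic)
```

On the average alone the erratic system looks slightly better; the worst-case score and the failure count are what reveal the one misleading text.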
Current deep learning approaches seem to be better at producing readable texts than at producing accurate texts, in both NLG and at least some MT contexts. This is a real problem in NLG, since accuracy is more important than readability in most NLG applications. I’m sure progress can be made in improving this in contexts where accuracy can be measured; probably the first step is introducing evaluation techniques which properly prioritise accuracy and look at worst-case as well as average-case performance. However, in some contexts it may be very difficult to measure accuracy; these may be more challenging to deal with using deep learning approaches.