I went to INLG last week, as usual a very interesting and enjoyable event, with thought-provoking papers, posters and discussions. Anyway, one topic which came up several times was “hallucination” in neural NLG systems. By this I mean the problem that such systems generate texts which say things that are not true, or at least are not in the input data. There were some nice examples of this in the presentation about the E2E challenge (slide 10). For example:
Input data: name[Cotto], eatType[coffee shop], near[The Bakers]
TGEN output: Cotto is a coffee shop with a low price range. It is located near The Bakers.
GONG output: Cotto is a place near The Bakers.
SHEF2 output: Cotto is a pub near The Bakers.
In this case, TGEN has said that Cotto has a low price range even though this is not specified in the input data, GONG has not said that Cotto is a coffee shop, and SHEF2 has said that Cotto is a pub when in fact it is a coffee shop.
Hallucination is a well-known problem in neural approaches to image captioning (eg, Rohrbach et al 2018). The papers presented at INLG suggest it is a problem in neural approaches to other NLG tasks as well. This is important because hallucination (which is much less of a problem in rule-based NLG systems) is unacceptable in many NLG applications. Especially when generating texts that communicate important medical, financial, or engineering information, it is completely unacceptable to hallucinate non-existent or incorrect content (the TGEN and SHEF2 examples), and very undesirable to omit important information (the GONG example). I think this is also unacceptable in many consumer-facing applications. Eg, if I were told that Cotto had a low price range, went there, and discovered that it was expensive, I would not be happy.
Anyway, several ideas (eg beam re-ranking) were presented at INLG for reducing hallucination, which I won't discuss here. What is of more interest to me is evaluation. One point that was made many times was that the common automatic metrics (BLEU, METEOR, etc) ignore this problem. They just check whether words and n-grams in the generated text match those in the reference text(s), and don't care whether a mismatch is due to a paraphrase (minor problem at worst), poor lexicalisation of content (major problem), made-up content (the TGEN example; unacceptable), or incorrect content (the SHEF2 example; completely unacceptable).
For example, let's suppose that the BLEU reference text for the above was “Cotto is a coffee shop near The Bakers”. Then the completely unacceptable “Cotto is a pub near The Bakers” would probably get a better BLEU score than “Cotto is a coffee shop. It is located near The Bakers” (which is OK if not ideal). This is because the surface form “Cotto is a pub near The Bakers” is much closer to the reference text in terms of shared words and n-grams.
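If you want to check this for yourself, here is a minimal sketch using NLTK's sentence_bleu; the choice of toolkit, the hand-done tokenisation, and the smoothing method are my own assumptions, not anything from the E2E evaluation scripts. Since the incorrect “pub” sentence shares more words and n-grams with the reference, it should score at least as well as the correct but rephrased one.

    # Toy check of the BLEU claim above (my own sketch, not the E2E challenge's
    # actual evaluation code). Punctuation is split off by hand.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["Cotto is a coffee shop near The Bakers .".split()]

    incorrect = "Cotto is a pub near The Bakers .".split()                          # wrong content (SHEF2-style)
    rephrased = "Cotto is a coffee shop . It is located near The Bakers .".split()  # right content, different wording

    smooth = SmoothingFunction().method1  # guard against zero n-gram counts on short sentences
    for label, hypothesis in [("incorrect 'pub' sentence", incorrect),
                              ("correct but rephrased sentence", rephrased)]:
        score = sentence_bleu(reference, hypothesis, smoothing_function=smooth)
        print(f"{label}: BLEU = {score:.3f}")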
One result of this is that researchers who rely solely or primarily on metrics such as BLEU for evaluation may not realise that their systems are generating completely unacceptable texts despite having reasonable BLEU scores.
Of course the best way to address this is to do proper human evaluations! But I also think that this is a case where there is potential for better automatic evaluation metrics. The obvious thing to do is to analyse/parse the generated text, extract its content, and then check that all of this content is genuine. This was effectively suggested by Wang et al 2018 for the E2E task, and indeed by Rohrbach et al 2018 for image captioning.
The techniques suggested by these papers are specific to their application domains, because they make assumptions about the kind of content which is present in the input, and the type of sentences which will express this content. But I think they are on the right track, and more generally metrics which assess “content fidelity” of generated texts could be quite useful in NLG, especially neural NLG.
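To make this concrete, here is a minimal sketch of what a “content fidelity” check might look like in the E2E restaurant domain. The slot lexicon, the regex matching and the function names are illustrative assumptions of mine, not the actual techniques of Wang et al 2018 or Rohrbach et al 2018, and a real checker would need far more robust content extraction.

    # Minimal sketch of a domain-specific content-fidelity check for the E2E
    # restaurant domain (illustrative only; the slot lexicon is far from complete).
    import re

    # Phrases we expect to realise each (slot, value) pair from the input MR.
    SLOT_PATTERNS = {
        ("eatType", "coffee shop"): r"\bcoffee shop\b",
        ("eatType", "pub"):         r"\bpub\b",
        ("priceRange", "low"):      r"\b(low price|cheap|inexpensive)",
        ("near", "The Bakers"):     r"\bnear The Bakers\b",
    }

    def extract_slots(text):
        """Return the (slot, value) pairs the generated text appears to express."""
        return {sv for sv, pattern in SLOT_PATTERNS.items()
                if re.search(pattern, text, flags=re.IGNORECASE)}

    def fidelity_report(input_mr, generated_text):
        """Compare the slots in the input MR with the slots expressed in the text."""
        expressed = extract_slots(generated_text)
        expected = set(input_mr)
        return {"hallucinated": expressed - expected,   # said, but not in the input
                "omitted": expected - expressed}        # in the input, but not said

    mr = [("eatType", "coffee shop"), ("near", "The Bakers")]
    print(fidelity_report(mr, "Cotto is a pub near The Bakers."))
    # {'hallucinated': {('eatType', 'pub')}, 'omitted': {('eatType', 'coffee shop')}}

Even a crude check like this would flag the SHEF2 output for hallucinating a pub and omitting the coffee shop, and the GONG output for omitting the eatType, whereas BLEU treats all of these mismatches in the same way.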
As a final note, the above may partially explain why BLEU score is a poor predictor of human evaluations in NLG. However, BLEU does an OK job of predicting human evaluations in machine translation (MT), which I find surprising considering the above. Presumably either:
- Neural MT systems do not hallucinate; and/or
- Neural MT users (or at least human subjects who evaluate NMT systems) do not regard hallucination as a major problem.
Both of the above seem a bit surprising to me, but presumably one or both must be true.
This is just anecdotal, but I see MT systems hallucinating things all the time. I'm living in Denmark at the moment, and Google's MT constantly changes prices in Danish kroner into Swedish or Norwegian kroner, or sometimes even dollars. I've seen Facebook's MT replace the names of composers when translating concert programs. I suspect these things might not occur in MT test sets in large enough quantities to destroy the correlation between human eval and BLEU, but as long as we rely only on BLEU, they'll probably not get resolved.
Re BLEU correlations for MT, see your other post about average- vs. worst-case evaluation 🙂
Good point! Hallucination in neural MT may be something which doesn't happen often, but is really bad when it does happen. If so, it might have a major impact on “worst case” performance, but less of an impact on “average case” performance, which is what BLEU and related metrics evaluate.
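To make that concrete with a toy example (the numbers below are made up, not real metric scores):

    # Toy illustration of average-case vs worst-case aggregation of per-sentence
    # quality scores; the numbers are invented for illustration.
    per_sentence_quality = [0.9, 0.9, 0.9, 0.9, 0.1]  # one rare but severe hallucination

    average_case = sum(per_sentence_quality) / len(per_sentence_quality)  # 0.74, looks fine
    worst_case = min(per_sentence_quality)                                # 0.1, unacceptable
    print(average_case, worst_case)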
I discuss the average vs worst case issue in https://ehudreiter.com/2017/05/03/metrics-nlg-evaluation/
As a matter of fact, many NMT systems suffer from hallucination, at least for ZH-EN. In a lot of such cases, OOV words are involved. For instance, I recently tried to translate “you are such a pxt” (pxt is the acronym of my name, so strictly speaking the sentence is ungrammatical) into Chinese, and Google just returned “你是这样的朋友”, which means “you are such a friend”!
In China, people have created many similar cases using the names of celebrities on various MT platforms. Some of the results are so funny and ridiculous that they have triggered heated discussions on social media.