I’ve seen numerous papers (datasets, shared tasks) on hallucination detection and other hallucination-related topics which treat whether something is a hallucination as a binary feature; i.e., either something is a hallucination or it is not. But in real-world contexts, this is too simplistic. I describe some of the issues I have seen below.
Severity
Some hallucinations do a lot more damage than others. For example, if an NLP system is summarising a doctor-patient consultation, it is a serious error to say that the patient was vomiting if this is untrue, since this could lead to inappropriate patient care. On the other hand, mistakenly saying that the patient’s wife is vomiting probably will not impact patient care and hence is less serious. Moramarco et al (paper) refer to such examples as critical and non-critical errors.
Similarly, in a tourism domain, Schmidtová et al (paper) distinguish between errors with low, medium, and high business impact. In machine translation, the MQM evaluation technique (paper) defines three severity levels: Major, Minor, and Neutral.
So in the real world, some hallucinations cause a lot more damage than others, and reducing the number of Critical/Major/High-Business-Impact hallucinations is much more important than reducing the number of Non-Critical/Minor/Low-Business-Impact hallucinations. This means that System A may be preferable to System B from a hallucination perspective, even if A has more hallucinations overall, provided that A has fewer damaging hallucinations.
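To make this concrete, here is a minimal sketch of severity-weighted scoring in Python. The severity labels, weights, and example errors are illustrative assumptions of mine, not values taken from MQM or the papers above.

```python
# Minimal sketch of severity-weighted hallucination scoring.
# The severity labels and weights are illustrative assumptions,
# not a published standard (MQM defines its own weighting).

SEVERITY_WEIGHTS = {"critical": 10.0, "minor": 1.0, "neutral": 0.0}

def weighted_score(errors):
    """Sum severity weights over (description, severity) pairs."""
    return sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)

# System A: more hallucinations overall, but none of them critical.
system_a = [("patient's wife vomiting (untrue)", "minor"),
            ("consultation date off by one day", "minor"),
            ("misspelled clinic name", "neutral")]

# System B: a single hallucination, but a critical one.
system_b = [("patient vomiting (untrue)", "critical")]

print(weighted_score(system_a))  # 2.0
print(weighted_score(system_b))  # 10.0 -> worse, despite fewer hallucinations
```

Under a raw count, System A (three hallucinations) looks worse than System B (one); under the weighted score, B is worse, which matches the intuition above.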
True but misleading
Hallucinations are usually defined as statements which are not true. However, it is possible for statements to be true but contextually misleading. Thomson et al (paper) call these Context errors, and give as an example “Marc Gasol scored 18 points, leading the Grizzlies. Isaiah Thomas added 15 pts”. In the game this describes, Thomas did score 15 points, but he scored them for the opposing team, not the Grizzlies. Hence the statement about Thomas is literally true but contextually misleading, because most readers will assume that Thomas also played for the Grizzlies.
Schmidtová et al likewise give examples of true-but-misleading statements in a tourism domain, and Moramarco et al give examples of such statements in a medical domain.
I believe that any statement which leads the reader to believe something untrue should be considered a hallucination, regardless of whether it is literally true.
Borderline cases
Many borderline cases exist for hallucinations. For example,
- Different word interpretations: Thomson et al report that the statement “The Bucks’ frontcourt did most of the damage” was considered a hallucination by some readers but not others. This is because some people interpreted frontcourt to mean 3 players (center, power forward, small forward), while others interpreted it to mean 2 players (just center and power forward).
- Subjective statements: Schmidtová et al point out that it can be difficult to assess the truthfulness of subjective statements. For example, if a tourist attraction is described as close when it is 10km away, is this a hallucination?
- Contradictions: Moramarco et al regard contradictions as hallucinations even when the ground truth is not known. For example, if a consultation summary says both “no family history of bowel issues” and “father has history of colon cancer”, then one of these statements must be false, even if we don’t know which one (see the sketch of an automated contradiction check after this list).
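Such internal contradictions can, in principle, be flagged automatically without access to the ground truth. The sketch below runs an off-the-shelf NLI model (roberta-large-mnli, via the Hugging Face transformers library) over every sentence pair; a single NLI call per pair is a crude proxy, not a reliable detector.

```python
# Sketch: flag internal contradictions in a summary with an NLI model.
from itertools import combinations

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def contradicts(a: str, b: str) -> bool:
    """Return True if the NLI model labels the sentence pair a contradiction."""
    inputs = tok(a, b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[logits.argmax(-1).item()] == "CONTRADICTION"

summary = [
    "There is no family history of bowel issues.",
    "The patient's father has a history of colon cancer.",
]

# Check every pair of sentences; a contradiction means at least one
# of the two statements must be false, even without the ground truth.
for a, b in combinations(summary, 2):
    if contradicts(a, b):
        print(f"Internal contradiction: {a!r} <-> {b!r}")
```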
There are many other subtleties to hallucinations! For example, Thomson et al point out that the number of hallucinations in a text can be ambiguous and depend on how the text is analysed, and Moramarco et al point out that generated texts can contain inferred statements which are likely but not guaranteed to be true (are these hallucinations?).
Final thoughts
Almost all of the academic work I see on hallucination assumes that it is a binary criterion: a statement either is or is not a hallucination. But the real world is messy, and if we care about the perspective of actual users and readers, we need to accommodate complexities such as the ones I describe above.
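Concretely, this suggests recording several dimensions per flagged statement rather than a single binary label. The sketch below shows one possible shape for such an annotation record; the field names are my own, not a scheme taken from any of the papers above.

```python
# One possible (assumed, not standardised) shape for a richer
# hallucination annotation than a binary yes/no label.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(Enum):
    CRITICAL = "critical"   # e.g. could affect patient care
    MINOR = "minor"
    NEUTRAL = "neutral"

@dataclass
class HallucinationAnnotation:
    span: str                      # the flagged statement
    severity: Severity
    literally_true: bool           # True for true-but-misleading cases
    misleading_in_context: bool
    borderline: bool = False       # annotators disagreed, or claim is subjective
    note: Optional[str] = None

# The Isaiah Thomas example: literally true, but misleading in context.
ann = HallucinationAnnotation(
    span="Isaiah Thomas added 15 pts",
    severity=Severity.MINOR,
    literally_true=True,
    misleading_in_context=True,
    note="Thomas scored for the opposing team, not the Grizzlies.",
)
print(ann)
```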
References
M Freitag, G Foster, D Grangier, V Ratnakar, Q Tan, W Macherey (2021). Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. TACL (ACL Anthology)
F Moramarco, A Papadopoulos Korfiatis, M Perera, D Juric, J Flann, E Reiter, A Belz, A Savkov (2022). Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. Proc of ACL-2022 (ACL Anthology).
C Thomson, E Reiter, B Sundararajan (2023). Evaluating factual accuracy in complex data-to-text. Computer Speech and Language. (journal link).
P Schmidtová, O Dušek, S Mahamood (2025). Real-World Summarization: When Evaluation Reaches Its Limits. To appear in Findings of EMNLP-2025. (Arxiv).
And another type of hallucination: a factual statement which is true, but whose truth cannot be inferred from the given input.
This type is quite hard to detect.
Example 1
Input contains: Diabetes Type I is mentioned as a comorbidity
Output: Insulin dependency occurs in the output
This is not a hallucination.
Example 2
Input contains: Diabetes Type II is mentioned as a comorbidity
Output: Insulin dependency occurs in the output
This is a hallucination iff insulin dependency (or an additional Type I diabetes diagnosis) is not contained in the input.
Example 3
Input contains: Diabetes Type V is mentioned as a comorbidity
Output: Insulin dependency occurs in the output
This is a hallucination iff insulin dependency (or an additional Type I diabetes diagnosis) is not contained in the input. As Type V diabetes was only introduced in 2025 and is a rare condition in developed countries, it is unlikely even to be in the training data set.
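A toy sketch of this rule: a claim in the output counts as a hallucination iff it is neither stated in the input nor derivable from it via background knowledge. The rule table below is an illustrative stand-in for real clinical knowledge, mirroring Examples 1 and 2.

```python
# Toy sketch of the rule in Examples 1-3: a claim is a hallucination
# iff it is neither stated in the input nor inferable from it.
# DOMAIN_RULES is an illustrative stand-in for real clinical knowledge.

DOMAIN_RULES = {
    "type I diabetes": {"insulin dependency"},  # Type I implies insulin use
    # No rule for Type II: it does not imply insulin dependency.
}

def is_hallucination(input_facts, claim):
    if claim in input_facts:
        return False  # stated directly in the input
    if any(claim in DOMAIN_RULES.get(fact, set()) for fact in input_facts):
        return False  # inferable from the input
    return True       # unsupported by the input -> hallucination

print(is_hallucination({"type I diabetes"}, "insulin dependency"))   # False (Example 1)
print(is_hallucination({"type II diabetes"}, "insulin dependency"))  # True  (Example 2)
```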
3. And finally, there is a blurred line between this type of hallucination and “common sense inferences”, without which you can’t make any meaningful factual statement.
Examples of “common sense inferences”:
While some of these inferences are universally valid (6), the validity of others may change over time and with medical progress (2, 4), and the validity of yet others may depend on the absence of information to the contrary (e.g. 1). Example 3 is even more complicated …