
Don't ignore omissions!

I have an undergraduate student who is interested in AI and Law, and is evaluating how well LLMs do at a legal task. She is working with a company, and is focusing on omissions, because this is what the company thinks is the biggest problem. Just as a reminder, a hallucination occurs when the generated text says something that is not true (of course there are complexities), and an omission occurs when important information is left out of the generated text.

Both of these are important problems, but the NLP community seems much more interested in hallucinations. I did a quick search on ACL 2025 for papers which mentioned “hallucination” and got 96 hits. I then did a search on “omission”, and got 0 hits… EMNLP 2025 was slightly better, with 64 papers about hallucination and 1 paper about omission. This disinterest in omissions is a relatively recent development. I have a dump of the ACL Anthology from early 2024, and this shows around 5x more papers on hallucination than on omission. So there is bias, but at least there are some papers on omissions!

This is a shame, because omissions are a very important real-world problem with language generation systems.

Medicine

Omissions are a big problem in medicine. To take an example from my student Francesco Moramarco’s work (Moramarco et al 2022), if a patient tells a doctor during a consultation that he is feeling hot all the time, this information needs to be included in a summary of the consultation! If it is omitted, it could lead to misdiagnosis and inappropriate care.

Wu et al (2025) wrote a very interesting paper where they evaluated LLM recommendations for 100 real medical cases. In 23% of the cases, the LLMs gave responses that could cause serious harm, and 76% of these were omissions. In other words, LLMs made many mistakes, and most of these were omissions, not hallucinations. In their words, “accuracy-focused benchmarks cannot substitute for explicit measurement of safety, and likely underestimate the risks of live clinical deployment”. If we want to deploy LLMs in medical contexts, we need to ensure that they include all the key information, as well as being accurate.

Oukelman et al (2025) point out that there is little work on detecting omissions in medical applications. In their words,

Despite these advancements, a significant challenge remains unaddressed: the detection of omissions in LLM-generated texts. Existing datasets and evaluation frameworks predominantly focus on hallucinations—instances where the generated text includes incorrect or fabricated information (Li et al., 2024). While hallucination detection is crucial, the issue of omissions, where critical information from the original input is missing, poses a unique and severe risk, especially in the medical field. Omissions can lead to incomplete medical records, potentially jeopardizing patient care and treatment outcomes.

Other domains

Omissions have been reported as serious issues in many other domains, including law (my student mentioned above) and

  • Machine translation (paper)
  • Risk reporting (paper)
  • Summarisation (paper)
  • Weather forecasts (paper)
  • Coding assistants (eg, lack of safety checks in code, paper)
  • etc

So the problem is not unique to the medical domain, although (as far as I know) only in medicine do we have papers such as Wu et al (2025) which claim that omissions are a bigger problem than hallucinations.

Finding omissions

Hallucinations are usually detected by splitting a text into facts (claims, assertions, etc) and checking if these are true. This can be done manually (eg, Thomson et al 2023) or automatically (eg, Tang et al 2024).
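
To make this concrete, here is a minimal sketch of that claim-based pipeline. The `llm()` helper is a hypothetical wrapper around whatever model API you have access to, and the prompts are purely illustrative; this is not the specific protocol of Thomson et al or Tang et al.

```python
# Sketch of claim-based hallucination checking: split the generated text into
# atomic claims, then verify each claim against the source material.
# llm() is a hypothetical helper that sends a prompt to some LLM and returns text.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def extract_claims(generated_text: str) -> list[str]:
    # Ask the model to list the factual claims in the text, one per line.
    response = llm(
        "List each factual claim made in the following text, one per line:\n\n"
        + generated_text
    )
    return [line.strip() for line in response.splitlines() if line.strip()]

def verify_claims(claims: list[str], source: str) -> list[tuple[str, bool]]:
    # A claim counts as hallucinated if the source does not support it.
    results = []
    for claim in claims:
        answer = llm(
            f"Source:\n{source}\n\nClaim: {claim}\n\n"
            "Is this claim supported by the source? Answer yes or no."
        )
        results.append((claim, answer.strip().lower().startswith("yes")))
    return results
```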

There is less agreement on detecting omissions. If we have a “gold standard” list of content that should be included in the text, we can check whether this content is present. An early version of this was the Pyramid technique for evaluating summarisation (Nenkova and Passonneau 2004). This was originally done manually, but automated versions of Pyramid were proposed by later researchers (eg, Hirao et al 2018). Similar approaches can be used in machine translation, if we assume that all content in the source text should be included in the target text; indeed, checking for omissions is part of the MQM evaluation protocol.
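
The checklist part of this is simple once you have the gold content units; the hard part is deciding whether the text actually expresses each unit. Below is a hedged sketch where that decision is delegated to an `expresses()` helper, which in practice might be a human annotator, an entailment model, or an LLM prompt. The helper names and structure are my own illustration, not the original Pyramid procedure.

```python
# Sketch of checklist-style omission detection: given a gold list of content
# units that should appear in the text, report which ones are missing.

def expresses(text: str, content_unit: str) -> bool:
    # Placeholder judgement: does the text convey this content unit?
    # Could be a human annotator, an entailment model, or an LLM call.
    raise NotImplementedError

def find_omissions(generated_text: str, gold_units: list[str]) -> list[str]:
    # A content unit counts as omitted if the text does not express it.
    return [unit for unit in gold_units if not expresses(generated_text, unit)]

def omission_rate(generated_text: str, gold_units: list[str]) -> float:
    # Fraction of gold content units that the text fails to express.
    if not gold_units:
        return 0.0
    return len(find_omissions(generated_text, gold_units)) / len(gold_units)
```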

However, in most language generation tasks which I am familiar with, the generated text is a summary of the input data; it does not communicate all of the input data. In such cases we need to identify the key content which must be present to fulfil the text’s communicative goal and purpose, and check that it is present; we do not care whether unimportant input data is communicated in the text (indeed, this may decrease the text’s utility).

Of course this requires domain knowledge, and an understanding of what is important to the user in the relevant context. We can ask domain experts to manually check if important content is present, which is what Moramarco et al (2022) did. In principle we can ask an LLM to assess this, but I do not know how well this works.
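
For completeness, here is what such an LLM-based omission check might look like, again with a hypothetical `llm()` wrapper; I am not claiming this prompt works well, only illustrating the idea.

```python
# Sketch of an LLM-as-judge omission check: ask the model which information in
# the input is important for the text's purpose but missing from the output.
# llm() is a hypothetical helper around whatever model API is available.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def judge_omissions(input_data: str, generated_text: str, purpose: str) -> str:
    prompt = (
        f"The text below was generated for this purpose: {purpose}\n\n"
        f"Input data:\n{input_data}\n\n"
        f"Generated text:\n{generated_text}\n\n"
        "List any information in the input data that is important for this "
        "purpose but is not communicated in the generated text. "
        "If nothing important is missing, answer 'None'."
    )
    return llm(prompt)
```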

Final thoughts

I am very interested in evaluating NLG systems in health domains, and it is very clear that in this context, omissions are (A) extremely important, (B) largely ignored by the NLP research community, and (C) harder to detect than hallucinations. (B) and (C) are probably related, ie many researchers like to focus on “low-hanging fruit” which is relatively well understood. But omissions in medical NLG are a huge problem, and we are not going to be able to deliver high-quality solutions unless we have a better understanding of omissions!
