I occasionally write blogs about what my students are doing, and thought I’d write about Barkavi Sundararajan, who is exploring the use of LLMs for data-to-text, and in particular trying to reduce hallucinations and other errors. Other people have looked at the impact of models and prompts; Barkavi is looking at whether LLMs do a better job at data-to-text when the input data (which is being summarised) is well structured.
Barkavi will present a paper about her work at NAACL (https://arxiv.org/abs/2404.04103). Unfortunately, she was not able to get a visa for Mexico, so she will be presenting her work remotely, at the first virtual poster session, which runs from 9 to 11 AM CST on Thursday 13 June.
Example
It’s probably easiest to explain what Barkavi is doing with an example. Her NAACL paper focuses on the ToTTo dataset, which is essentially about producing texts from tables in Wikipedia. These tables have a wide range of structures, and some structures work better than others as LLM input.
To take a concrete example, one ToTTo task is summarising the results of some of the US Senate elections in 2014. In the ToTTo dataset, the information is presented in JSON within a single text string, e.g.
"value": "Dan Sullivan (Republican) 48.0% Mark Begich (Democratic) 45.8% Mark Fish (Libertarian) 3.7% Ted Gianoutsos (Independent) 2.0%"
When asked to summarise this data, the LLAMA 2 (7B) model instead regurgitated generic information about the election. However, LLAMA 2 gave an accurate summary when the input data was presented in a structured way, as follows
{"candidate": "Dan Sullivan", "party": "Republican", "% votes": 48.0}, {"candidate": "Mark Begich", "party": "Democratic", "% votes": 45.8}, {"candidate": "Mark Fish", "party": "Libertarian", "% votes": 3.7}, {"candidate": "Ted Gianoutsos", "party": "Independent", "% votes": 2.0}
In short, LLAMA 2 did a much better job of generating a summary when the data was structured in a meaningful way, rather than being thrown together into a single text string.
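To make the restructuring concrete, here is a minimal Python sketch of the kind of transformation involved, going from the flat value string above to the structured records. The field names follow the example above, but the parsing code itself is my own illustration, not Barkavi’s actual pipeline.

import json
import re

# Flat ToTTo-style value string, as in the example above
flat = ("Dan Sullivan (Republican) 48.0% Mark Begich (Democratic) 45.8% "
        "Mark Fish (Libertarian) 3.7% Ted Gianoutsos (Independent) 2.0%")

# Each entry looks like "Name (Party) NN.N%"; parse each one into a
# structured record with named fields instead of a run-on string.
pattern = re.compile(r"(.+?)\s+\((\w+)\)\s+([\d.]+)%\s*")
records = [
    {"candidate": name, "party": party, "% votes": float(votes)}
    for name, party, votes in pattern.findall(flat)
]

print(json.dumps(records, indent=2))

Running this prints the four candidate records as a JSON list, essentially the structured input shown above.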
Other data issues
Barkavi found a number of other cases where restructuring the data, often following database normalisation principles, reduced errors and hallucinations in LLAMA 2 output; details are in her paper. The complete set of restructurings and “fixes” reduced the number of content errors for the ToTTo table summarisation task by 52% for LLAMA 2 (7B) summaries, and by 76% for LLAMA 2 (13B), for the test set that Barkavi used.
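The specific restructurings are in the paper; as a generic illustration of what normalisation-style restructuring looks like (a hypothetical example of mine, not one from the paper), repeated election-level fields can be factored out of the per-candidate rows:

# Hypothetical denormalised rows: election-level fields (state, year) are
# repeated in every per-candidate row, mixing two levels of fact in one table.
denormalised = [
    {"state": "Alaska", "year": 2014, "candidate": "Dan Sullivan",
     "party": "Republican", "% votes": 48.0},
    {"state": "Alaska", "year": 2014, "candidate": "Mark Begich",
     "party": "Democratic", "% votes": 45.8},
]

# Factor the repeated fields into a separate election record, in the spirit
# of database normalisation: each fact is stated once, at the right level.
election = {"id": 1, "state": "Alaska", "year": 2014}
results = [
    {"election_id": election["id"], "candidate": r["candidate"],
     "party": r["party"], "% votes": r["% votes"]}
    for r in denormalised
]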
So well structured data does not eliminate hallucinations, but it does reduce them substantially.
However, Barkavi also found that the number of omissions (where expected information was not present in the summary) *increased* when the data was restructured. It is unclear what caused this.
Discussion
LLMs make mistakes in data-to-text. I wrote a blog about problems I saw in early 2023, and Kasner and Dušek show that models still make semantic errors in 2024 (Barkavi has also seen this). Hopefully errors will be reduced as models get better, and perhaps better prompts will help, but Barkavi’s work shows that improving the structure of the input data can also make a big difference.
This is useful because while models and prompts are constantly evolving (so advice about models and prompts which is useful today may not work next month), the principles of database normalisation have not changed significantly in decades. So if we structure our data according to these principles, this should improve summary quality next year as well as today.