
In 2019 LLM output was fluent but not trustworthy: still true in 2024

In my NLG course a few weeks ago, I tried to give a very high-level description of what LLMs can and cannot do from an NLG perspective, and ended up saying that they produced texts which were very fluent but could not be trusted content-wise. After I said this, I realised that I had said pretty much the same thing 5 years ago. That is, despite all the progress in neural LLMs since 2019, my ultra-high-level description of what they can do has not changed. Does this tell us something fundamental about LLMs?

What has changed since 2019

To me the most amazing change in LLMs is in usability. In 2019, most people were training models from scratch, which required large amounts of data, expertise, and (for the time) compute. We were starting to move to fine-tuning pretrained models, which required less data, expertise, and compute; but significant amounts of these were still needed. In 2021 or so, people started using prompted models trained on Internet-scale data, which further massively reduced the need for training data, ML expertise, and compute. In 2022, RLHF and free access to models over the web meant that Joe Public was able to use LLMs, which would have been unthinkable in 2019!
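
To make this shift concrete, here is a minimal sketch of the prompting workflow, using the Hugging Face transformers library and the small gpt2 checkpoint purely as illustrative stand-ins (the prompt is made up). The point is what is missing compared to 2019: no task-specific training data, no fine-tuning, and essentially no ML expertise beyond installing a library.

```python
# Minimal sketch of the prompting workflow: load a pretrained model and
# generate a continuation of a prompt, with no task-specific training.
# Model name and prompt are illustrative, not from any real system.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The weather forecast for Aberdeen tomorrow is"
output = generator(prompt, max_new_tokens=30, do_sample=False)

print(output[0]["generated_text"])
```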

So improvements in LLM usability have been massive. But what about the quality of output texts in a data-to-text NLG context? Texts were already fluent in 2019; what about accuracy and content quality?

Certainly the content quality of GPT-4 output is much better than the content quality of a typical 2019 academic NLG system. But I don't know how much of this is due to improvements in technology, and how much comes from the fact that GPT-4 was built by hundreds of professional software developers who deeply cared about content quality. In contrast, the typical 2019 academic NLG system was produced in a few months by a PhD student who may have had little interest in content quality as long as BLEU scores were good. My personal suspicion is that content-quality improvements have mostly come from this massive increase in engineering resource and focus on content quality, but I don't have any data to back this up.

What has not changed since 2019

What has not changed since 2019 is that LLMs still make mistakes in content. I continue to see numerous examples of this, and indeed have written many related blogs (example). There are also of course many papers showing this (eg, Kasner and Dušek (2024), to take a fairly random example).

Of course fewer mistakes are made in 2024 than in 2019. But the key thing is that mistakes are still being made. This means that in domains where accurate content is important, LLMs cannot be used without some kind of human supervision, checking, and post-editing. This was true in 2019 and is still true in 2024.

I realise that content accuracy is not essential in all use cases! But it is essential in the ones that I care about.

A fundamental limitation of LLMs?

Does this mean that LLMs fundamentally cannot reliably produce accurate texts? After all, the core of LLMs is predicting the next word in a text based on patterns observed on the internet (what others have called “stochastic parrots”). I think this works well for generating fluent texts, but I suspect that it is fundamentally *not* a good way to robustly generate accurate texts.
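
As a simplified illustration of what "predicting the next word" means, here is a minimal sketch using the Hugging Face transformers library and gpt2 as an illustrative model. The model assigns a probability to every possible next token given the text so far; this is exactly the machinery that produces fluent continuations, but nothing in it checks whether the chosen word is factually correct.

```python
# Minimal sketch of next-token prediction: the model scores every vocabulary
# item as a possible continuation of the prompt. It picks plausible words,
# not verified facts. Model and prompt are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the next token, given the prompt so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>10}  p={prob.item():.3f}")
```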

Of course there are caveats. If the text being generated already exists on the Internet, then the LLM can simply copy it (but perhaps we don't need LLM tech for this; maybe classic web search suffices?). And the huge amount of software engineering effort being invested by AI companies means that LLMs can be augmented with specialised modules to accurately respond to common queries, and indeed detect many potential mistakes.
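
As a toy illustration of what such a specialised checking module might look like in a data-to-text setting (this is purely my own sketch, not a description of how any actual system works; the function and data below are hypothetical), one could flag numbers in the generated text which do not appear in the source data:

```python
# Toy sketch of a post-hoc accuracy check for data-to-text output: flag any
# number mentioned in the generated text that is absent from the source data.
# All names and values here are hypothetical, for illustration only.
import re

def flag_unsupported_numbers(generated_text, source_values):
    """Return numbers mentioned in the text that are absent from the data."""
    mentioned = re.findall(r"-?\d+(?:\.\d+)?", generated_text)
    return [n for n in mentioned if float(n) not in source_values]

source = {12.0, 18.0, 3.5}  # e.g. values taken from a weather data record
summary = "Temperatures will range from 12 to 19 degrees, with 3.5 mm of rain."

print(flag_unsupported_numbers(summary, source))  # ['19'] -- a possible content error
```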

But ultimately I wonder whether the fact that my ultra-high-level description of LLMs has not changed in 5 years, despite the massive advances in LLM technology (and huge investments in LLM engineering), means that this is a fundamental limitation: LLMs are not the right technology for reliably producing accurate texts.
