
The latest/trendiest tech isn’t always appropriate

Many people in NLP seem to think that you need to work with the latest and trendiest technology in order to be relevant, both in research and in applications. I think this is a narrow view of the world – sometimes the latest tech is just what is needed, but sometimes it is not.

LSTMs do not work in data-to-text

Let me start with an example from the late 2010s. I know this is a while ago, but enough time has passed to make solid assessments, and also I feel strongly about this.

At the time, the latest and trendiest NLP technology was LSTM (and variants such as biLSTM). LSTMs worked very well in lots of areas of NLP, including machine translation. The problem was that researchers and (to a lesser extent) commercial developers started assuming that LSTMs were the best solution for *every* NLG task, including data-to-text NLG, which is my special interest.

However, the LSTM approach does not work for building data-to-text systems. LSTM data-to-text systems either (A) address simple tasks which are easy to do with templates (eg, E2E) or (B) produce low-quality, unacceptable output (eg, Rotowire systems). Even in 2024, I have *never* seen an LSTM data-to-text system which is anywhere near being useful. I don’t think I ever will; fundamentally, LSTMs are a poor way of building data-to-text systems.
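
To make the “easy to do with templates” point concrete, here is a minimal sketch of a template-based generator for an E2E-style restaurant record. The attribute names and wording rules are illustrative only, not taken from any particular system or from the E2E data itself:

```python
# Toy template-based data-to-text generator for an E2E-style restaurant
# record. Attribute names and wording rules are illustrative only.

def realise(mr: dict) -> str:
    """Turn an attribute-value meaning representation into a short description."""
    parts = [f"{mr['name']} is a restaurant serving {mr['food']} food"]
    if "area" in mr:
        parts.append(f"in the {mr['area']}")
    if "priceRange" in mr:
        parts.append(f"with {mr['priceRange']} prices")
    text = " ".join(parts) + "."
    if mr.get("familyFriendly") == "yes":
        text += " It is family-friendly."
    return text

print(realise({
    "name": "The Punter",
    "food": "Italian",
    "area": "city centre",
    "priceRange": "moderate",
    "familyFriendly": "yes",
}))
# The Punter is a restaurant serving Italian food in the city centre with moderate prices. It is family-friendly.
```

A deployed system would obviously need more templates, more variation, and better grammar handling, but nothing in this kind of task requires training data or a neural model.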

Note that I am referring to “pure” LSTM systems, not hybrid systems which combine rule-based/symbolic and LSTM approaches. I focus on pure systems because 90% of researchers were pushing “pure” LSTM solutions.

Despite the above, I saw large numbers of papers published about using LSTMs for data-to-text, many of which were of dubious scientific quality (blog). Weird datasets, weak evaluations, limited knowledge of related work – and no one seemed to care! If the paper was about LSTMs, all of the normal scientific quality criteria were ignored.

Most authors also refused to acknowledge the limitations of LSTMs. If people had said “this is an interesting approach which does not work well now but can be improved”, that would have been fine as a research goal. But few people said this; instead I read endless papers which started by claiming that LSTMs were taking over from symbolic data-to-text NLG because they were so much better, without giving any evidence that this was the case. Of course they couldn’t give such evidence, since (as above) LSTMs are a *worse* approach to data-to-text than symbolic NLG, but this didn’t bother the researchers, who presumably saw no need to support claims with evidence. I also heard invited/keynote talks claiming that LSTMs were the best approach for all NLG tasks.

Perhaps worst of all, the ACL community became hostile to non-neural approaches. Papers which used symbolic or non-neural ML approaches to data-to-text had an increasingly hard time getting accepted, with some reviewers explicitly saying that only papers about neural NLG were worth publishing. So much for scientific openness.

Commercial developers were more open-minded, perhaps because they needed to ground themselves in reality in order to build real systems. But even here, in my work with Arria, I had discussions with potential clients who said they were only interested in LSTM solutions. I remember once trying to carefully explain why an LSTM approach was not appropriate for what a potential client wanted to do, and the response was “I’m a techie and I agree with you, but my manager insists that we have to use LSTMs because this is what everyone is talking about.”

BERT/BART/etc can be used in data-to-text, but may not be the best approach

Around 2020, LSTMs were replaced by fine-tuned transformer language models such as BERT and BART. This is a much better way to build data-to-text and other NLG systems, and I know of several production-quality NLG systems built using BART (etc). It requires a lot of engineering effort (if you simply fine-tune on random corpus data, the resulting system will not be robust), but it can be done (blog), especially if data analytics is done by a separate analytics module outside the language model.
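
To illustrate the architecture just described, here is a minimal sketch in which a separate analytics module computes discrete facts and a fine-tuned seq2seq model only verbalises them. The analytics rules, the linearisation, and the "my-org/bart-weather-d2t" checkpoint name are hypothetical; in practice the checkpoint would be a BART (or similar) model fine-tuned on in-domain (facts, text) pairs:

```python
# Sketch of a data-to-text pipeline where analytics happens outside the LM.
# The checkpoint name below is hypothetical; an off-the-shelf BART model
# would need fine-tuning on (facts, text) pairs before its output was usable.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def analyse(temperatures: list[float]) -> list[str]:
    """Separate analytics module: derive discrete insights from raw data."""
    facts = [
        f"max_temp = {max(temperatures):.0f}",
        f"min_temp = {min(temperatures):.0f}",
    ]
    if max(temperatures) - min(temperatures) > 10:
        facts.append("large_daily_range = true")
    return facts

def verbalise(facts: list[str], model_name: str) -> str:
    """Fine-tuned seq2seq LM turns linearised facts into text; it never sees the raw data."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(" | ".join(facts), return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=60)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

facts = analyse([3.2, 8.5, 14.9, 11.0])
print(facts)
# print(verbalise(facts, "my-org/bart-weather-d2t"))  # hypothetical fine-tuned checkpoint
```

Keeping the analytics outside the model means the numbers in the facts are computed deterministically, which makes it much easier to check that the generated text does not contradict the data.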

I was also happy to see researchers using fine-tuned LMs for data-to-text acknowledge the robustness and engineering challenges and try to address them. From a scientific perspective, evaluation got better, with trained metrics such as BLEURT and increased attention to good human evaluation. Better datasets were also released.

Despite this, it was still the case that a symbolic or hybrid LM-symbolic approach was a better way to build most data-to-text systems than a pure LM approach, not least because of the large amount of engineering effort required to build a production system with BART (etc). Safety was also a big issue, and I know of commercial companies which decided against LMs because of the difficulty of ensuring safe outputs.

Anyway, as a researcher it was great to see that NLG research was becoming more scientific! On the negative side, though, it became even harder to publish papers about non-neural approaches to NLG; indeed, I saw reviews which stated that such papers were not relevant to ACL. That is, some reviewers (not all!) seemed to think that ACL was about neural language models, not about natural language processing in the wider sense.

ChatGPT/etc can work well, but not always

Of course in 2024, the focus is on large aligned prompted language models such as GPT-4. The technology is very impressive, and it’s great to see serious attention being paid to safety issues. But on the other hand, GPT (etc) have robustness, reliability, and hallucination issues which power users can work around, but which are not acceptable in systems used by a wide customer base. Many LLM-based products simply don’t work well (paper), and I’m aware of companies which tried to use LLMs for data-to-text but gave up and switched to older approaches.

It’s possible that progress in LLM technology (or indeed better prompt engineering) will address these problems; it’s too soon to make strong claims about what LLMs can and cannot do in data-to-text. But the above experiences suggest that they are unlikely to be a panacea that works everywhere.
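
As one concrete example of the kind of workaround a power user might apply, here is a minimal sketch of prompting an LLM to verbalise pre-computed facts and then rejecting any output that contains a number not present in those facts. The model name, prompt, and check are all illustrative assumptions (using the OpenAI Python client), not a description of any production system:

```python
# Sketch of a prompted LLM with a simple post-hoc numeric consistency check.
# Model name, prompt, and check are illustrative; a real system would need
# far more thorough verification than this.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(facts: list[str]) -> str:
    """Ask the LLM to verbalise only the supplied facts."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Write one short paragraph that states only the facts given, "
                        "with no additional claims or numbers."},
            {"role": "user", "content": "\n".join(facts)},
        ],
    )
    return response.choices[0].message.content

def numbers(text: str) -> set[str]:
    """Extract all numbers mentioned in a piece of text."""
    return set(re.findall(r"-?\d+(?:\.\d+)?", text))

def check(facts: list[str], text: str) -> bool:
    """Reject output containing numbers that do not appear in the input facts."""
    return numbers(text) <= numbers(" ".join(facts))

facts = ["max temperature 15 C", "min temperature 3 C"]
draft = generate(facts)
if not check(facts, draft):
    raise ValueError("possible hallucinated number: " + draft)
print(draft)
```

A check like this only catches the crudest numeric hallucinations; wording and inference errors are much harder to detect automatically, which is part of why serving a wide customer base is so much harder than supporting a few power users.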

Final comments

The most depressing thing about the above-mentioned LSTM episode is that I sometimes felt that I was dealing with religious enthusiasts rather than scientists. It was dogma that LSTMs were the best approach to everything, and anything that agreed with the dogma was great (regardless of scientific quality or unjustified claims), while anything that disagreed with the dogma was wrong and should not be published. I realise this is extreme and unfair, but nonetheless it is how I sometimes felt. Fortunately, things have gotten better since!

I should also say that many other people have pointed out cases where symbolic or non-neural ML techniques work better than the latest neural tech, including Same et al. (2022), Sproat (2022), and Lin et al. (2023). In 2024, I hope that the commercial community is open-minded enough to evaluate effectiveness carefully and choose the approach which works best, even if it is not the latest tech; and that the research community will investigate *all* potentially useful approaches to NLG, even if they don’t use the latest LLMs.
