There is a lot of excitement in the NLP world about prompting approaches, where a large language model such as GPT-3 or BLOOM is not explicitly trained or fine-tuned to do a task, but rather is given a relatively small prompt which defines the task, usually through examples. For example, we can get such models to translate text by giving them a few examples of translated sentences, without any task-specific training or fine-tuning.
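To make this concrete, here is a minimal sketch of what such a few-shot translation prompt might look like (the example sentence pairs and the prompt format are my own illustration, not taken from any particular paper):

```python
# Minimal sketch of few-shot ("prompt-based") machine translation.
# The example sentence pairs are invented for illustration; in practice the
# assembled prompt would be sent to a large language model such as PaLM or GPT-3.

EXAMPLES = [
    ("The weather is nice today.", "Das Wetter ist heute schön."),
    ("Where is the train station?", "Wo ist der Bahnhof?"),
]

def build_prompt(source_sentence: str) -> str:
    """Assemble a few-shot English-to-German translation prompt."""
    blocks = [f"English: {en}\nGerman: {de}" for en, de in EXAMPLES]
    # The model is expected to continue the text after the final "German:".
    blocks.append(f"English: {source_sentence}\nGerman:")
    return "\n\n".join(blocks)

print(build_prompt("How much does this cost?"))
```

The model then continues the text after the final "German:", and that continuation is taken as the translation.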
What is unclear is how effective this approach is; for example, how does the quality of prompt-based translation compare to translation using trained models? I’ve read a number of papers which claim to address this question, but almost all of them have serious experimental flaws, so I do not have confidence in their findings. I describe one such paper in a previous blog; it is perhaps an extreme example, but it illustrates the kinds of problems I have seen in other papers as well.
So I was very pleased when I recently read Vilar et al’s Prompting PaLM for Translation: Assessing Strategies and Performance (https://arxiv.org/abs/2211.09102). This is a solid experimental paper which compares prompting and non-prompting approaches to MT. Even putting aside its findings, I strongly recommend this paper to anyone who is interested in doing high-quality experiments in NLP; it’s a great example of good experimental work and I hope it inspires other researchers to do better experiments!
Characteristics of a solid experiment
In a previous blog I described a set of questions to ask when evaluating the quality of an experiment, and gave an example based on a paper with dubious experiments. Below I respond to these questions for the Vilar et al paper, to show what this looks like for a strong experimental paper. All of the below information is well explained and easy to find in the paper.
- Hypothesis – is this clear: Yes, the hypothesis is that the quality of texts produced by the best prompt-based MT system (PaLM) is lower than the quality of texts produced by state-of-the-art MT systems, for English translated to/from Chinese, French, and German.
- Datasets – are they representative, real data (not synthetic), is test data separate from training data: Mostly. The main data is the WMT 2021 news task data, which is representative and real, with good separation of the test data from the data PaLM is trained on (this is explicitly checked). WMT 2014 news data is used for French (French was not included in WMT 2021), and it has test/train overlap issues. The authors do, however, explicitly measure how much of the WMT 2014 test data was present in PaLM’s training data.
- Baselines – are they actually state of the art: Yes. Baselines were (A) the best-performing systems from WMT 2021 for German and Chinese and (B) Google Translate (commercial state of the art) for all language pairs.
- Metrics – are they appropriate: Yes. The main metric is BLEURT, which is a good metric for MT, although perhaps not the best (Kocmi et al 2021); a minimal scoring sketch is given after this list.
- Human evaluation – is it appropriate: Yes. They use MQM with professional translators as annotators, which is the best human evaluation for MT (Freitag et al 2021).
- Statistical analysis – is this straightforward and appropriate: Yes. Sensible analyses are clearly presented, and supported by a good qualitative analysis. P-values are given for the human evaluation, but not for the automatic metrics.
- Results – does the experimental data support the paper’s claims: Yes.
- Replicability – is enough information provided for replication: Yes, although actually repeating the MQM evaluations requires resources which are not available to all researchers.
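As an aside on the metrics point above, here is a minimal sketch of how BLEURT scores can be computed with Google’s reference implementation (https://github.com/google-research/bleurt); the checkpoint name and sentences below are placeholders, and this is not the paper’s own evaluation code:

```python
# Minimal sketch of scoring MT output with BLEURT. Assumes the
# google-research/bleurt package is installed and a checkpoint
# (e.g. BLEURT-20) has been downloaded to the given directory.
from bleurt import score

references = ["The cat sat on the mat."]        # human reference translations
candidates = ["A cat was sitting on the mat."]  # system outputs to evaluate

scorer = score.BleurtScorer("BLEURT-20")        # path to the checkpoint directory
scores = scorer.score(references=references, candidates=candidates)
print(scores)  # one score per sentence pair; higher is better
```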
Findings
So now that we finally have a solid experimental paper looking at prompting (in the specific context of MT), what does this paper tell us?
The key finding is that the quality of translations produced by PaLM (prompt-based MT) is not as good as the quality of texts produced by state-of-the-art trained MT systems. The human evaluation shows that the key difference is in accuracy. PaLM translations are generally as fluent and stylistically appropriate as translations by non-prompting systems, but they are less accurate, especially with regard to omissions. In other words, PaLM produces translations that read well but do not accurately communicate the information in the source text.
However, the quality of PaLM translations is similar to that of Google Translate (a commercial system) for some language pairs. Of course a commercial MT system must deal with many constraints and issues (eg, fast response time, spelling errors, attacks by hackers) which research systems can ignore. However, it is still impressive that the prompting-based PaLM approach produces texts of this quality in some cases.
Vilar et al also look at which prompts are better. They conclude that the exact format of the prompts does not matter much, and that there is relatively little benefit in using more than 5 examples in the prompts. For the examples used in prompts, they look at the importance of (A) examples being high-quality translations and (B) examples being similar to the sentence being translated, and conclude that (A) is more important than (B). In other words, the most important requirement for examples in prompts is that they are of high quality. Indeed, they also show that PaLM can do well with prompts based on a single fixed example, if this example is of very high quality.
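To make the two example-selection strategies concrete, here is a small sketch (my own illustration, not the paper’s code) contrasting a fixed pool of high-quality examples with naive similarity-based retrieval over a larger, noisier pool; Vilar et al’s finding is that the quality of the examples matters more than their similarity to the input:

```python
# Sketch of two ways to pick few-shot examples for a translation prompt.
# The example pools and the word-overlap similarity measure are invented
# for illustration and are not taken from Vilar et al.

HIGH_QUALITY_POOL = [
    ("The meeting was postponed until Friday.", "Die Sitzung wurde auf Freitag verschoben."),
    ("She published her findings last year.", "Sie veröffentlichte ihre Ergebnisse letztes Jahr."),
]

LARGE_NOISY_POOL = HIGH_QUALITY_POOL + [
    ("the price are too high", "der Preis sind zu hoch"),  # low-quality pair
]

def word_overlap(a: str, b: str) -> int:
    """Crude similarity: number of shared lowercased words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def select_fixed(k: int = 2):
    """Strategy A: always use the same k high-quality examples."""
    return HIGH_QUALITY_POOL[:k]

def select_similar(source: str, k: int = 2):
    """Strategy B: pick the k pool pairs whose source side best matches the input."""
    return sorted(LARGE_NOISY_POOL,
                  key=lambda pair: word_overlap(source, pair[0]),
                  reverse=True)[:k]

source = "The price is too high."
print("Fixed high-quality examples:  ", select_fixed())
print("Similarity-selected examples: ", select_similar(source))
```

With this toy pool, similarity-based selection pulls in the low-quality pair simply because it shares words with the input, which is the kind of trade-off the paper examines.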
Final Thoughts
Vilar et al show that prompt-based MT is not as good as MT based on explicitly trained models, but it is still very good, and indeed equivalent to Google Translate in some contexts. The main problems are with accuracy rather than fluency. All of this makes sense to me; indeed, I was surprised that prompt-based MT did as well as it did.
Vilar et al are good experimentalists, so they make no claims beyond MT in the language pairs they examined. But it would not surprise me if the above findings generalise. That is, prompt-based NLP is not as good as trained models, but it can still reach reasonable quality in many cases, especially in contexts where accuracy is not of paramount importance. In such cases prompt-based NLP has the advantage of being much easier to set up (no need to train models), which can make it an attractive option in many contexts.
And last but not least, I encourage all NLP researchers to read Vilar et al as an example of a really good experimental evaluation!