When working on new evaluation techniques, it's incredibly useful to have a trusted “gold standard” evaluation which predicts real-world utility. Once this is established, we can use it to assess the effectiveness of alternative techniques (cheaper, faster, more appropriate in some niches). In medicine, for example, the community might first agree on 5-year mortality (how many patients die over the next 5 years) as a gold standard, and then look for alternative techniques which do not require waiting 5 years to get evaluation data.
Until recently, I had not seen this kind of thing in text generation. The closest was “direct assessment” in MT, but in all honesty I did not have a huge amount of confidence in these evaluations, which asked monolingual crowdworkers to assess the quality of a translation by comparing it to a reference.
So it is exciting to see the emergence of something which feels like a true gold-standard evaluation in the above sense: the MQM evaluation technique in Machine Translation (Freitag et al 2021). MQM is an annotation-based human evaluation protocol, where human experts (translators) examine a translation and mark individual errors. Errors are categorised and assigned a severity level. MQM analysis can be used qualitatively to assess where a system needs to be improved; it can also be used quantitatively by computing an overall score, usually a severity-weighted error count.
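To make the quantitative side concrete, here is a minimal Python sketch of how a severity-weighted error count could be computed from MQM annotations. The severity weights and the Annotation structure are illustrative assumptions for this sketch (Freitag et al 2021 use a similar scheme, with major errors and non-translations penalised much more heavily than minor ones); this is not the official MQM tooling.

```python
from dataclasses import dataclass

# Illustrative severity weights, loosely following Freitag et al 2021
# (major errors weighted much more heavily than minor ones, and
# "non-translation" treated as catastrophic). The exact numbers are
# an assumption for this sketch.
SEVERITY_WEIGHTS = {
    "minor": 1.0,
    "major": 5.0,
    "non-translation": 25.0,
}

@dataclass
class Annotation:
    """One error marked by a human expert (hypothetical structure)."""
    category: str   # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str   # "minor", "major", or "non-translation"

def mqm_score(annotations: list[Annotation]) -> float:
    """Severity-weighted error count for one segment: lower is better."""
    return sum(SEVERITY_WEIGHTS[a.severity] for a in annotations)

# Example: one major accuracy error and two minor fluency errors
segment_annotations = [
    Annotation("accuracy/mistranslation", "major"),
    Annotation("fluency/grammar", "minor"),
    Annotation("fluency/punctuation", "minor"),
]
print(mqm_score(segment_annotations))  # 7.0
```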
Anyways, the really exciting thing about MQM is that the MT community seems to have accepted it as a “gold standard” evaluation protocol in the above sense (Craig Thomson and I proposed a similar annotation-based technique for data-to-text (Thomson and Reiter 2020), but it has not been widely adopted). Agreeing on a high-quality gold-standard evaluation has enabled lots of very interesting research on better metrics, alternative human evaluations, and what it means to be a good translation. I describe a few such papers below. The key point is not so much the individual papers (although they are all very interesting) as the fact that they were made possible by the adoption of MQM.
Better metrics: Using MQM to guide LLM-as-Judge
In 2024 there is a lot of interest in using LLMs such as GPT to evaluate texts. Doing this requires giving the LLM a prompt which explains the evaluation; one approach is to base the prompt on MQM, since it defines the properties of a good translation.
A good example is GEMBA-MQM (Kocmi et al 2023), which essentially gives GPT-4 a prompt with the input source text, the output target text, a fairly straightforward description of MQM, and a few examples. The authors report that GEMBA-MQM had the best correlation with high-quality human evaluation (ie, MQM) in the WMT23 metrics shared task.
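To illustrate the general idea (this is not the actual GEMBA-MQM prompt, which is given in Kocmi et al 2023), here is a hypothetical sketch of an MQM-based LLM judge; the prompt wording, model choice, and helper function are assumptions.

```python
from openai import OpenAI  # assumes the OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A hypothetical MQM-style judging prompt; the real GEMBA-MQM prompt
# is more carefully worded and includes few-shot examples.
PROMPT_TEMPLATE = """You are an expert {src_lang}-to-{tgt_lang} translator.
Identify all errors in the translation below. For each error, give its
category (accuracy, fluency, terminology, style, locale) and severity
(minor, major, critical). Report "no-error" if the translation is perfect.

Source ({src_lang}): {source}
Translation ({tgt_lang}): {target}
Errors:"""

def judge_translation(source: str, target: str,
                      src_lang: str = "German", tgt_lang: str = "English") -> str:
    """Ask the LLM to list MQM-style errors for one segment."""
    prompt = PROMPT_TEMPLATE.format(src_lang=src_lang, tgt_lang=tgt_lang,
                                    source=source, target=target)
    response = client.chat.completions.create(
        model="gpt-4",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# The returned error list can then be converted into a severity-weighted
# score, as in the earlier MQM scoring sketch.
```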
In short, the existence of MQM supports both creation of better metrics, and also reliable assessment of metric validity.
Alternative human evaluation: Simplified annotation schemes
Most work on evaluation in NLP focuses on metrics, but it is also very valuable to look at alternative human evaluations which are cheaper/faster/etc than the gold standard (eg, MQM). Unfortunately, I've seen little such work in NLP evaluation (one exception in data-to-text evaluation is Garneau and Lamontagne 2021).
So it's exciting to see approaches such as “error span annotation” (Kocmi et al 2024), which combines a simplified annotation scheme with an overall ranking of the text. Human evaluation with this method is faster than MQM and (perhaps even more importantly) requires less expertise and training, but has good correlation with MQM for ranking MT systems.
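Assessing a cheaper protocol against the gold standard can itself be sketched quite simply: given a per-system score from MQM and from the alternative evaluation, we can check how well the alternative reproduces the MQM ranking, for example with a rank correlation. The system names and scores below are made-up numbers for illustration only.

```python
from scipy.stats import kendalltau, pearsonr

# Made-up system-level scores for illustration.
# MQM scores are severity-weighted error counts (lower = better).
mqm_scores = {"sysA": 3.2, "sysB": 5.7, "sysC": 4.1, "sysD": 8.0}
# Scores from a cheaper protocol (e.g. an ESA-style 0-100 rating, higher = better).
cheap_scores = {"sysA": 82, "sysB": 64, "sysC": 75, "sysD": 51}

systems = sorted(mqm_scores)
# Negate MQM so that higher = better for both score lists.
gold = [-mqm_scores[s] for s in systems]
alt = [cheap_scores[s] for s in systems]

tau, _ = kendalltau(gold, alt)  # agreement on the system ranking
r, _ = pearsonr(gold, alt)      # linear agreement on the scores
print(f"Kendall tau = {tau:.2f}, Pearson r = {r:.2f}")
```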
I welcome such research, and would love to see more work on assessing and evaluating alternative human evaluations.
What is a good translation: New quality criteria
The existence of a gold-standard evaluation based on clear quality criteria also allows us to ask whether other quality criteria matter in some contexts. In medicine, for example, 5-year mortality isn't the only thing that matters in the real world; patients may also be concerned about quality of life, 10-year mortality, etc.
MQM is essentially based on annotations for accuracy and fluency, but sometimes this does not work. For example, Zhang et al 2024 evaluate the quality of literary translations, and point out that MQM is not a reliable assessment of quality in this domain. Counting errors is not appropriate here, because human literary translations often include deliberate additions and omissions in order to meet the norms of the target language, and MQM misjudges these operations as errors.
In other words, MQM focuses on accuracy and fluency, but this is not sufficient in many domains. In addition to literary quality, other quality criteria which have been important in my projects include emotional impact (Balloccu et al 2024) and dialect conformance (Sun et al 2024). In any case, it is easier to explore the issue of what quality criteria matter in different contexts if we have a baseline evaluation technique such as MQM which incorporates “standard” quality criteria.
Discussion
A high-quality gold-standard evaluation is essential for performing trusted evaluations, but it is also a great enabler for research into key evaluation topics such as better metrics. I applaud the MT community for accepting MQM as a gold standard, and hope that other communities likewise converge on a high-quality gold-standard evaluation!