A few weeks ago I had an interesting exchange with a woman who works in a hospital in Australia. Over-simplifying to some degree, her hospital has patients who speak 100 different languages, and they want to know if they can use MT to translate material into these languages. The material ranges from daily living (eg, menus and instructions for using the TV) to clinical (eg, medical history, informed consent).
In order to make this decision, they want to know how many minor (does not affect comprehension), major (affects comprehension, could mislead), and critical (offensive, defamatory) mistakes an MT system will make when translating a particular type of document in a particular language pair. This data would allow hospital staff to decide whether it was acceptable to use MT for specific use cases, document types, and language pairs.
I told her that I didn't think this could be done in 2019. Which is a shame, because an algorithm, system, or methodology which produced such information would be incredibly useful to users of NLP technology!
I think the closest we come to this in 2019 is MT quality estimation metrics (eg, Specia et al 2018). The idea behind these metrics is to train a model which predicts the quality of specific MT output texts (usually measured by post-edit effort, but other measures can be used), using training data of MT outputs annotated with quality.
This is really interesting stuff, but it doesn't help my hospital colleague, because there is no way that her hospital could provide the necessary training data to build the QE models. I don't know the exact figures, but I think they are looking at something like 200 language pairs (100 English->X and 100 X->English), around 10 different use cases and document types, and 3 potential MT systems. So 6000 training sets would be needed (200x10x3). A decent QE training set should contain at least 10K sentences, so the hospital would need to provide 60M sentences (6Kx10K), each manually annotated for quality, in order to use current QE models. And this is just the startup cost: they would need to provide additional training sets whenever they wanted to look at a different MT system, add languages, or indeed whenever the texts being translated changed substantially (eg, because of new regulations). This is unrealistic (to put it mildly).
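The back-of-envelope arithmetic above can be written out explicitly; all the figures are the rough estimates from the text, not exact numbers:

```python
# Rough estimate of the QE training data the hospital would need.
language_pairs = 200        # 100 English->X plus 100 X->English
use_cases = 10              # use cases / document types
mt_systems = 3              # candidate MT systems
sentences_per_set = 10_000  # minimum size of a decent QE training set

training_sets = language_pairs * use_cases * mt_systems
annotated_sentences = training_sets * sentences_per_set

print(training_sets)        # 6000
print(annotated_sentences)  # 60000000, ie 60M manually annotated sentences
```

And that is before any re-annotation when systems, languages, or text types change.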
I don't mean to be negative about QE in general; there are lots of great use cases for modern QE techniques, for example on large international ecommerce web sites. But QE does not solve the hospital's problem.
What the hospital would really like is to be able to give examples of the texts they want translated into an “evaluation box”, which would then tell them that MT system S operating on language pair P would translate texts of type T with XX minor mistakes per 1000 words, YY major mistakes per 1000 words, and ZZ critical mistakes per 1000 words.
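To make this concrete, here is a sketch of what the interface to such an "evaluation box" might look like. Everything here is hypothetical (the names, the function, the whole tool): no such box exists today, which is exactly the point.

```python
from dataclasses import dataclass

@dataclass
class ErrorProfile:
    # Predicted mistakes per 1000 words, at each severity level.
    minor: float     # does not affect comprehension
    major: float     # affects comprehension, could mislead
    critical: float  # offensive, defamatory

def evaluation_box(system: str, language_pair: str,
                   sample_texts: list[str]) -> ErrorProfile:
    """Hypothetical: estimate the error rates of MT system `system` on
    `language_pair`, given representative `sample_texts` of the type
    the user wants translated."""
    raise NotImplementedError("nobody knows how to build this yet")
```

The hospital could then compare the returned profile against its own thresholds for each use case, rather than relying on a system-comparison score like BLEU.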
I don't think we can build such a box now. But if we could, and if the box worked for all sorts of NLP technologies (including NLG), it would be a fantastic resource for people who want to use NLP!
As a side note, my hospital colleague initially asked if she could use BLEU for her task. Ie, get the BLEU score of an MT system for a language pair from the latest WMT evaluation (or whatever), and use it to estimate the number of minor, major, and critical mistakes the MT system would make on a specific type of text.
I told her that this wouldn't work, because BLEU doesn't provide this kind of information. I have seen papers on how well BLEU scores correlate with the amount of post-editing required to make an MT text acceptable (Sanchez-Torron and Koehn 2016), but their focus is on using BLEU to compare systems, not to check whether a quality threshold is met.
Indeed, most research on evaluation seems focused on comparing systems. Which makes sense for researchers who want to show that their system is an improvement over the state of the art. But I suspect that a lot of real-world users are less interested in comparison than in knowing whether a system is good enough for their use case.
Another point is that the hospital couldn't just get BLEU scores from WMT, since WMT usually only covers 10-20 language pairs (the hospital needs 200), and furthermore WMT evaluations use different text genres than the ones of interest to the hospital. So the hospital would need to put a lot of effort into creating BLEU reference texts for the relevant language pairs and text genres.
Evaluation Grand Challenge
So my grand challenge for NLP evaluation is to create some kind of box (algorithm, methodology, and/or system) which would allow a user to quickly and easily assess how many minor, major, and critical mistakes an NLP system will make in a particular domain and context (eg, language pair for MT), so she can determine whether the system is good enough for her use case. I don't know how to do this, and maybe it's impossible, but I thought I'd throw this out and see what people think!