I’m currently working on a systematic review of the validity of BLEU. Nowhere near finished, but it has prompted me to wonder what the ultimate goal of evaluation is. In particular, I have usually thought of evaluation as a type of scientific hypothesis testing, which it is. However, we can also look at evaluation from a user or “consumer” perspective, as a way of helping people choose the best tech for their needs.
Below are some initial thoughts along these lines, most (all?) of which are probably not original. Comments are very welcome, especially since I will be teaching AI/NLP/NLG evaluation both within Aberdeen University (CS5063) and externally (eg, I will give a tutorial on evaluation at INLG 2017).
User (Consumer) Perspective
What a user really wants to know is which technology/system/algorithm is most effective for achieving his task. Of course there is huge variety in users and tasks, which means that different technologies are likely appropriate in different contexts. In machine translation, for example, a few of the zillions of use cases are
- Professional translator who is using MT system to generate a draft document, which she will post-edit into a high-quality translation.
- Analyst who reads MT translations to identify interesting/relevant documents which should be properly translated and analysed in depth.
- Member of the public who is selecting a hotel for a foreign holiday, and wants to extract some key information (eg, wheelchair accessibility) from an MT translation of a hotel website.
So the user wants to specify his background, task, and other contextual parameters, and be told which systems (technologies) are best for him.
To make this concrete, let's consider a specific persona (a trick I learnt at Arria). Joe is a professional translator of biomedical texts from Chinese to English, who wants to use an MT system to create an initial draft which he will post-edit into a high-quality text. Which MT system should Joe use?
Ideal: Evaluate systems in user/task context
Of course, the ideal approach is to evaluate the effectiveness of the candidate NLP systems in the specific user/task context. But this is only feasible for regular users; occasional users won't be able to try lots of systems. And even regular users would really appreciate some initial guidance/filtering, to reduce the number of systems they investigate in depth.
In other words, Joe could simply try every plausible Chinese-English translation system, and see which one he thinks is best. If there are a lot of these, though, Joe would love to get some advice on which 2 or 3 systems are likely to be best, so he can focus his energies on evaluating these.
Second-best: Recommend best NLP system using data-based model
So it would be really useful if we had an app which took in information about the user and the use case, and recommended the best NLP system(s) for this context. This could be analogous (at least at a conceptual level) to recommender systems used by e-commerce websites to recommend books and televisions. And like e-commerce recommender systems, the best way to build such a model is to get a lot of real data on task performance (or user satisfaction), and build the model from this data.
In other words, in theory we could
- get 1000 (10,000?) people to use a variety of systems to perform real-world tasks,
- measure how successful these people were at performing tasks (and/or how satisfied they were with the system they used),
- obtain generic attributes of the systems being used, which could include both human measures such as readability of generated texts and also automatic metrics such as BLEU. User-interface attributes may also be very important.
- use standard statistical or machine-learning techniques to build a model which predicts task-performance success (or user satisfaction) of a system applied to a specific user/task context, based on the user/task characteristics and the generic attributes of the systems.
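As a very rough sketch of the last step, the snippet below fits a linear model on simulated data. Everything here is an invented assumption for illustration: the feature set (BLEU, readability, UI quality), the numbers, and the linear form; a real study would use real task-success measurements and very likely a richer model.

```python
import numpy as np

# Hypothetical data: one row per (user, system) pair.
# Feature columns: system BLEU score, human readability rating, UI quality.
# All names and numbers are invented for illustration.
rng = np.random.default_rng(0)
n = 200
features = rng.uniform(0, 1, size=(n, 3))  # [bleu, readability, ui_quality]

# Simulated "ground truth": in this toy world, task success depends mostly
# on readability and UI quality, and only weakly on BLEU.
task_success = (0.1 * features[:, 0]
                + 0.5 * features[:, 1]
                + 0.4 * features[:, 2]
                + rng.normal(0, 0.05, size=n))

# Fit a linear model by ordinary least squares (column of ones = intercept).
X = np.column_stack([np.ones(n), features])
coef, *_ = np.linalg.lstsq(X, task_success, rcond=None)

def predict_success(bleu, readability, ui_quality):
    """Predict task success for a system in a given user/task context."""
    return coef[0] + coef[1] * bleu + coef[2] * readability + coef[3] * ui_quality
```

The point of the exercise is the fitted weights: with enough real-world data, they would tell us how much each generic attribute (including BLEU) actually contributes to task success.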
Of course there are lots of ways of doing this exercise, but I don't want to get into experimental design details here. The most important point is that I don't think such studies have been done in NLP, presumably because gathering the data would be a costly and time-consuming exercise. The PARADISE framework used to evaluate spoken dialogue systems is perhaps the closest the NLP world has come to this model.
Going back to Joe the translator, we could tell Joe that we have done a careful analysis along the above lines, which includes measuring the success of 100 professional translators doing real-world translation, and our model predicts that X and Y will work best for Joe. Joe tries X and isn't satisfied; it just doesn't mesh with the way he does translation. However he loves Y, which gives quality translations and also fits very well into his workflow.
Third choice: Recommend NLP system based on theoretical analysis
Since the NLP community has not collected the data required to build a data-based model which predicts system utility for a specific task and user, we could instead build a model based on theoretical criteria (and again there are analogies in e-commerce recommender systems).
To take a simple (and hypothetical!) NLG example, suppose we have a bunch of weather forecast generators, and for each of these generators we have readability, accuracy, and utility scores (from human ratings) for different aspects of the weather (precipitation, wind, temperature, etc). Then we could hypothesise plausible rules such as “picnickers mostly care about precipitation”, “cyclists mostly care about precipitation and wind”, and “readability is very important for users with limited literacy”, and build a function which predicts utility based on this analysis.
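The hypothetical rules above could be encoded as a simple weighted scoring function. To be clear, everything in this sketch is invented: the generator names, the per-aspect scores, and the weights that encode the rules.

```python
# Per-aspect human ratings (0-1) for each hypothetical forecast generator.
# All names and numbers are invented for illustration.
SYSTEMS = {
    "GenA": {"precipitation": 0.9, "wind": 0.5, "temperature": 0.7, "readability": 0.6},
    "GenB": {"precipitation": 0.6, "wind": 0.9, "temperature": 0.8, "readability": 0.9},
}

# The plausible rules from the text ("picnickers mostly care about
# precipitation", "cyclists care about precipitation and wind",
# "readability matters for limited literacy") encoded as aspect weights.
USER_WEIGHTS = {
    "picnicker": {"precipitation": 0.8, "readability": 0.2},
    "cyclist": {"precipitation": 0.4, "wind": 0.4, "readability": 0.2},
    "limited_literacy": {"precipitation": 0.3, "readability": 0.7},
}

def predicted_utility(system, user_type):
    """Weighted sum of aspect scores: a theoretical prediction of utility."""
    scores = SYSTEMS[system]
    weights = USER_WEIGHTS[user_type]
    return sum(w * scores[aspect] for aspect, w in weights.items())

def recommend(user_type):
    """Recommend the generator with the highest predicted utility."""
    return max(SYSTEMS, key=lambda s: predicted_utility(s, user_type))
```

With these made-up numbers, the picnicker is steered to the generator with the best precipitation ratings and the cyclist to the one that also handles wind well; the model stands or falls on whether the hand-written rules are right.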
Again I don't think much has been done along these lines in NLP. FEMTI in MT perhaps resonates with this approach, but as far as I know it has not been widely used. In NLG, there has been a lot of work on adapting the output of an NLG system based on a user/task model, but I'm not aware of any work on recommending/scoring NLG systems along the lines mentioned above.
Going back to Joe the translator, we could tell Joe that we have accuracy, fluency, and lexical coverage (for biomedical terminology) data on a bunch of systems, and we think (based on an analysis of translator workflow) that coverage is most important, followed by accuracy, with fluency last. So we propose the scoring formula below:
(3*coverage + 2*accuracy + fluency)/6
We then tell Joe which systems score best under this formula. He might take this seriously, but then again he might not.
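As a minimal sketch, here is that formula applied to some invented candidate systems (the system names and scores are made up for illustration):

```python
# Hypothetical scores (0-1) for candidate Chinese-English MT systems;
# the names and numbers are invented for illustration.
systems = {
    "SysX": {"coverage": 0.9, "accuracy": 0.7, "fluency": 0.6},
    "SysY": {"coverage": 0.8, "accuracy": 0.9, "fluency": 0.8},
    "SysZ": {"coverage": 0.5, "accuracy": 0.8, "fluency": 0.9},
}

def joe_score(s):
    # The proposed formula: (3*coverage + 2*accuracy + fluency) / 6
    return (3 * s["coverage"] + 2 * s["accuracy"] + s["fluency"]) / 6

# Rank the candidate systems for Joe, best first.
ranked = sorted(systems, key=lambda name: joe_score(systems[name]), reverse=True)
```

The weights 3/2/1 are, of course, just as much a theoretical guess as the weather-forecast rules above, which is exactly why Joe is entitled to his scepticism.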
Fourth choice: Choose a system based on parameters with weak empirical and theoretical justification
So, if we don't have hard data on how well candidate NLP systems perform in a specific context, the best approach is to use empirical data to build a predictive model of how well a specific NLP system will perform in that context. If we don't have the data to do this empirically, an alternative is to build a predictive model based on a theoretical analysis of the user and task.
Of course there is another alternative, which is to compare systems using a model which is neither empirically nor theoretically justified. This is my current view of the “consumer perspective” on using BLEU to choose NLP systems. There is little empirical data about how well BLEU scores correlate with task performance or user satisfaction in real-world studies. The closest I have seen is correlations between BLEU and translation error rate in MT, and between BLEU and referential success in NLG, but the studies I am aware of have been carried out in artificial rather than real-world contexts. And from a theoretical perspective, I can argue that human readability assessments are relevant in some use cases and contexts, but I struggle to argue on theoretical grounds that n-gram statistics are relevant to real-world users.
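For readers who have not looked inside BLEU, here is a minimal sketch of the statistic at its core, clipped (“modified”) n-gram precision. Real BLEU combines precisions for n = 1..4 with a brevity penalty, and can use multiple references; this sketch shows only the counting core, to make concrete that it measures surface string overlap rather than anything a user directly experiences.

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision over token lists: the core statistic of BLEU.
    Each candidate n-gram is credited at most as many times as it appears
    in the reference ("clipping")."""
    cand_ngrams = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n])
                         for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped / max(1, sum(cand_ngrams.values()))

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
# Unigram precision here is 5/6, bigram precision 3/5.
```

Nothing in this computation refers to the user's task, which is the theoretical gap the consumer perspective keeps running into.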
So we can give Joe the translator recommendations based on BLEU score. However, we should be honest with him and say there is no theoretical justification for BLEU, and that the closest empirical studies look at HTER, that is, measuring how long it takes monolinguals in an artificial context to post-edit MT translations of news stories into minimally acceptable English (remember that Joe is a bilingual professional translator working in a real-world context to create high-quality translations of technical biomedical texts). Joe politely thanks us, and decides he would be better off forgetting about “scientific” evaluations; instead he will just go with the market leader.
Technology (Research) Perspective
Of course, most researchers are looking to advance technology in general, not develop a solution for a specific user and task (use case). Yet if we claim that our technology evaluations assess utility (which is what we usually do), then ultimately we are claiming that at least some real people will find our technology useful in real tasks. So how do we link our tech evaluations to real-world user/task evaluations?
I think at some point someone needs to “bite the bullet” and do the empirical study I mentioned above, where we get a substantial amount of real-world user/task evaluation data (task success and user satisfaction), and correlate this with all of the ways in which we currently evaluate NLP systems (including non-real-world human-based evaluations as well as automatic metrics). Once we have this data, we can solidly assess how meaningful our technology evaluation measures are. Until then, it's just speculation…
An alternative perspective is that any way of “keeping score” will drive technology progress, even if it is very weakly correlated with real-world user/task evaluations. Eg, even though BLEU doesn't mean much, one could argue that chasing ever-higher BLEU scores has nonetheless led to demonstrably better NLP tech in real-world contexts. Perhaps, although I can't help but wonder whether we would have made a lot more progress if we had put more effort into grounding evaluation techniques in real-world studies, as mentioned above. I guess this is one of those hypotheticals which is impossible to answer…