Learning does not require evaluation metrics

At a recent seminar that I gave (outwith Aberdeen), a PhD student asked me “Can we do machine learning without evaluation metrics?”  The answer of course is that we can do ML without metrics; there are plenty of human-based evaluations of ML systems.  But the question bothered me, so I thought I would respond in more detail in my blog, both about the question and (more speculatively) about what it says about the underlying scientific culture in NLP.

Learning does not require metrics

Machine learning is a way of building AI and NLP systems.  The same evaluation options are available for systems built with ML as for systems built with other technologies (eg, rule-based).   We can evaluate the performance of any AI/NLP system (regardless of whether it uses ML) by:

  • comparing its output on specific scenarios against a gold standard, typically with some kind of evaluation metric.
  • assessing its real-world utility and effectiveness (eg, measuring change in patient outcomes if doctors use an AI decision support system).
  • asking people whether they think the system is doing a good job.
  • measuring non-functional performance, such as compute speed.

I probably missed a few things in the above list, but the key thing is that the evaluation options are the same regardless of the technology used to build the AI/NLP system.  So I can evaluate an ML system with metrics, but I also can evaluate an ML system by assessing its real-world utility, asking people their opinion of the system, and measuring compute speed.
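To make the point concrete, here is a minimal sketch in which the evaluation code sees a system only as a function from input text to output text, so it cannot tell how the system was built. Everything in it is hypothetical and invented for illustration: the function names, the toy rule-based system, and the lookup table standing in for a learned model. Exact match is used only because it is the simplest possible comparison against a gold standard.

```python
import time
from typing import Callable, Dict, Sequence

# Hypothetical interface: a "system" is anything that maps input text to output text.
System = Callable[[str], str]


def gold_standard_accuracy(system: System, inputs: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of inputs where the system output exactly matches the gold output."""
    if not inputs or len(inputs) != len(gold):
        raise ValueError("inputs and gold must be non-empty and the same length")
    correct = sum(system(x) == y for x, y in zip(inputs, gold))
    return correct / len(inputs)


def mean_latency_seconds(system: System, inputs: Sequence[str]) -> float:
    """Average wall-clock time per input (a non-functional measure)."""
    if not inputs:
        raise ValueError("inputs must be non-empty")
    start = time.perf_counter()
    for x in inputs:
        system(x)
    return (time.perf_counter() - start) / len(inputs)


def rule_based_system(text: str) -> str:
    """Toy rule-based system: upper-case the input."""
    return text.upper()


def make_lookup_system(table: Dict[str, str]) -> System:
    """Toy stand-in for a learned model: returns memorised outputs, else an empty string."""
    return lambda text: table.get(text, "")


inputs = ["a", "b", "c"]
gold = ["A", "B", "C"]
learned_like_system = make_lookup_system({"a": "A", "b": "B"})

# The same evaluation functions are applied to both systems.
for system in (rule_based_system, learned_like_system):
    accuracy = gold_standard_accuracy(system, inputs, gold)
    latency = mean_latency_seconds(system, inputs)
    print(f"accuracy={accuracy:.2f} latency={latency:.6f}s")
```

Human ratings or real-world outcome measures would slot in the same way, since none of them need to know whether the system uses ML.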

For example, the WMT conferences and shared tasks (such as WMT18) include human-based evaluations of machine translation systems built with a wide range of technology, including deep learning.  All systems submitted to the shared task are evaluated in the same way, regardless of their underlying technology.  And outwith the NLP world, medical decision support systems are usually evaluated on the basis of real-world effectiveness, again regardless of their underlying technology.

One caveat is that many learning systems, including deep learning, work better if hyperparameters can be tuned with a development or validation data set.   Tuning hyperparameters generally requires some kind of error function or quality estimator, which is often based on an evaluation metric.   This is certainly a problem, but it is possible to tune hyperparameters in other ways (not using evaluation metrics).  Personally, I think it would be really interesting to see if we could use pilot experiments with humans to tune hyperparameters; I think in theory this should work (provided there are only a few hyperparameters being tuned), although I’ve not seen this discussed in the research literature.
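As a rough sketch of what tuning with a human pilot might look like (this is my own illustration, not an established procedure from the literature): train one system per candidate hyperparameter setting, ask a few people to rate sample outputs from each, and keep the setting with the highest mean rating. The function name, the setting labels, and the ratings below are all hypothetical, made-up values, not experimental results.

```python
from statistics import mean
from typing import Dict, List


def pick_setting_by_human_ratings(ratings_by_setting: Dict[str, List[float]]) -> str:
    """Return the hyperparameter setting whose pilot ratings have the highest mean.

    ratings_by_setting maps a label for each candidate setting to the ratings
    (e.g. on a 1-5 scale) that human judges gave to outputs produced with it.
    Every setting must have at least one rating.
    """
    if not ratings_by_setting:
        raise ValueError("need at least one candidate setting")
    if any(len(r) == 0 for r in ratings_by_setting.values()):
        raise ValueError("every setting needs at least one rating")
    return max(ratings_by_setting, key=lambda s: mean(ratings_by_setting[s]))


# Invented pilot data: three candidate settings, three human ratings each.
pilot_ratings = {
    "setting_A": [3, 4, 3],
    "setting_B": [4, 5, 4],
    "setting_C": [4, 4, 3],
}

best = pick_setting_by_human_ratings(pilot_ratings)
print(best)  # setting_B (mean rating 4.33, versus 3.33 and 3.67)
```

With so few ratings the differences between settings may well be noise, which is one reason this is only plausible when a small number of hyperparameters (and hence candidate settings) are being tuned.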

However the “standard” NLP research model does require metrics

So “of course” we can do machine learning without metrics.   The fact that this question is even asked makes me wonder if PhD students and other newcomers to NLP are absorbing a somewhat limited perspective on how NLP research is done, along the following lines:

  1. Someone creates a corpus (data set) of “gold standard” inputs and outputs for an NLP task, which can be used as training data for ML.
  2. Someone (maybe the same person) creates an evaluation metric for the task.
  3. Researchers then try out machine learning algorithms (usually focusing on deep learning) and write papers on how well their particular LSTM (or whatever) does on the data set, as measured by the evaluation metric.

While I haven’t explicitly counted papers, I suspect that most papers (probably well above half) in recent ACL-type conferences follow the above model.  This of course means that newcomers to the field, such as PhD students, will probably assume that the above model, which relies on evaluation metrics, is the “right” way to do NLP research.  Hence the concerns from students who have absorbed this model and realise they can’t use it in NLG if there are no good evaluation metrics for NLG.

Of course this analysis is speculative, but it makes me sad. I think the above model is a reasonable way to do research *if* the corpora, data sets, and evaluation metrics are appropriate and high quality.   Unfortunately they often are not, which leads to “research” that is more akin to trying to get the highest score on a fancy computer game than to developing useful technology or meaningful scientific insights.

But also, there are other ways of doing research.  I’m a great fan of deploying systems in the real world (or as close to real world as possible) and seeing what happens; I think this leads to valuable and unexpected insights that you don’t get from tweaking an LSTM to better solve a well-defined problem.  Some of my colleagues are keen on using linguistic, psychological, or even philosophical techniques to understand NLP problems; I think this is great and can lead to valuable insights in both science and technology.

So if you are a PhD student or other newcomer to NLP, please realise that the above “model” is NOT the only way to do NLP research!  The field will be stronger if researchers avoid a “monoculture” and use a variety of scientific methodologies and approaches.

3 thoughts on “Learning does not require evaluation metrics”

  1. Great post. Thanks. I recently moved into industry from academia. I have been reading and comparing research papers for a standard NLP task, in my work context. A complex architecture that results in a 0.01% improvement in F score over previous work gets carried into further research papers as the next benchmark. But what does that 0.01% mean? Across those multiple categories, what is done well, and what is not? No one discusses this, which I find surprising.

    Another thing I noticed is that even the numbers reported sometimes do not match the original papers. That is, the original paper has a 0.02% or so higher number somewhere in the tables, and the paper that cites it seems to conveniently cite only the number that is 0.01% lower, not the one that is 0.02% higher. I am not saying this is a universal trend, but it is not very uncommon either. So, what are all these papers achieving? I am kind of clueless about what exactly the point is.


  2. Many thanks for your comment, and I agree that tiny changes in F score probably don’t mean much. If papers do not correctly report results in previous work, this is very disappointing; definitely poor practice and bad science! In an ideal world reviewers would detect this, but in practice reviewers may not investigate papers in sufficient detail to detect this kind of thing, especially for conference or workshop (as opposed to journal) submissions.
