Mistakes in Evaluating ML

In my role as a university faculty member, I mark student projects and dissertations at many levels. A lot of students try to use up-to-date, sophisticated machine learning (ML) and neural net techniques for many tasks (not just NLG). However, I also see that many students do not properly evaluate the results. Below I try to explain some of the more common problems I see (which unfortunately I sometimes see in published academic papers as well).

I will use a made-up example of a classifier which takes in some medical data and predicts whether a patient has the (mythical) disease Malitis. The type of data doesn’t matter for my purposes, so NLP people can imagine that the data contains free-text notes from doctors, vision enthusiasts can imagine that the data contains scans or camera images, etc.

Testing on Training Data

One of the easiest ways to make an ML classifier look good is to evaluate it on the data you used to train it. If you don’t see why this is a problem, imagine an “ML” system which simply records its training data, and during the testing phase looks up the test input in its memorised training data set. Such a system would score 100% accuracy if tested on its training data, but would be useless in real applications.

So if you want to convince me that your system does a great job of diagnosing Malitis, you need to show this on a separate test data set (or use cross-validation)!
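To make the memorising “ML” system concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `MemorisingClassifier` class, the random patient records and the random labels are my assumptions, not a real diagnostic system. Because the labels are random, there is nothing to learn, yet the training-set score still looks perfect.

```python
import random


class MemorisingClassifier:
    """Toy "ML" system that only memorises its training data (hypothetical)."""

    def fit(self, records, labels):
        self.table = dict(zip(records, labels))
        return self

    def predict(self, record):
        # Unseen records fall back to "healthy" (False).
        return self.table.get(record, False)


def accuracy(model, records, labels):
    correct = sum(model.predict(r) == y for r, y in zip(records, labels))
    return correct / len(records)


rng = random.Random(0)
# Made-up patients: (patient_id, measurement) with a random Malitis label.
records = [(i, rng.random()) for i in range(200)]
labels = [rng.random() < 0.5 for _ in records]

train_x, train_y = records[:150], labels[:150]
test_x, test_y = records[150:], labels[150:]

model = MemorisingClassifier().fit(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)  # perfect, by construction
test_acc = accuracy(model, test_x, test_y)     # same as "always healthy"
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

On the held-out records the system has never seen anything, so it always answers “healthy” and its score is simply the proportion of healthy patients in the test set.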

I see this problem less often now than 5-10 years ago, so I think most people have gotten the message that you need separate testing data (or cross-validation), and should not test on training data. However, I still occasionally see this problem, so some people have not yet gotten this message.

Testing on Synthetic Data

There is growing interest in using techniques such as GANs to generate simulated data; this is especially useful when the data sets we are given are not big enough to train data-intensive deep learning techniques. I personally am not a great fan of this approach, incidentally (my view is that if you don’t have enough data for DL, you should use a different ML technique), but in any case it is scientifically valid **provided** that evaluation is done on real data. Unfortunately I see projects where the simulated data is used for testing as well as training, which is completely inappropriate.

So if you want to convince me that your system does a great job of diagnosing Malitis, you need to demonstrate this on a test data set which contains real medical cases! I won’t be convinced if you test on simulated data.
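One way to keep simulated data out of the evaluation is to set aside the real test cases first and only then generate simulated cases from what remains. The sketch below assumes this ordering; the function `split_then_augment`, the record fields and the `jitter` generator (a crude stand-in for something like a GAN) are all hypothetical names I have made up for illustration.

```python
import random


def split_then_augment(real_cases, make_synthetic, n_synthetic,
                       test_fraction=0.2, seed=0):
    """Hold out real test cases first, then add synthetic cases to training only."""
    rng = random.Random(seed)
    shuffled = real_cases[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_set = shuffled[:n_test]      # real cases only, never augmented
    real_train = shuffled[n_test:]
    # Synthetic cases are derived from training cases only, so nothing
    # from the test set leaks into training.
    synthetic = [make_synthetic(rng.choice(real_train), rng)
                 for _ in range(n_synthetic)]
    return real_train + synthetic, test_set


def jitter(case, rng):
    # Hypothetical generator: a noisy copy of a real training case.
    return {"value": case["value"] + rng.gauss(0, 0.1),
            "has_malitis": case["has_malitis"],
            "synthetic": True}


# Made-up "real" cases for the example.
real_cases = [{"value": float(i), "has_malitis": i % 5 == 0, "synthetic": False}
              for i in range(50)]
train_set, test_set = split_then_augment(real_cases, jitter, n_synthetic=100)
```

Marking each record with a `synthetic` flag makes it easy to check afterwards that no simulated case has ended up in the test set.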

Positive and Negative Examples from Different Sources

Another problem I’ve seen is where the positive and negative examples in data sets come from different sources.   Imagine, for example, that I create a data set which contains healthy patients who are adults, and Malitis patients who are children.   If I train and test a Malitis classifier on this data set, then my classifier may be detecting Malitis, but it may also simply be detecting whether a patient is an adult or a child.    Positive and negative examples should come from the same source if at all possible.

So if you want to convince me that your system does a great job of diagnosing Malitis, tell me how you created your data set, and convince me that there is no bias which would allow an ML system to “cheat” in the above fashion.
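A simple first check for this kind of bias is to cross-tabulate the label against any attribute that describes where a case came from. The sketch below is only an illustration: the field names `age_group` and `has_malitis` and both helper functions are my assumptions, and passing this check for one attribute does not prove that a data set is unbiased.

```python
from collections import Counter


def crosstab(cases, attribute, label="has_malitis"):
    """Count cases for each (attribute value, label) pair."""
    return Counter((case[attribute], case[label]) for case in cases)


def attribute_predicts_label(cases, attribute, label="has_malitis"):
    """True if knowing the attribute alone tells you the label of every case."""
    labels_by_value = {}
    for case in cases:
        labels_by_value.setdefault(case[attribute], set()).add(case[label])
    all_labels = set().union(*labels_by_value.values())
    return len(all_labels) > 1 and all(
        len(seen) == 1 for seen in labels_by_value.values()
    )


# Made-up data set: every healthy patient is an adult, every Malitis patient a child.
biased = (
    [{"age_group": "adult", "has_malitis": False} for _ in range(40)]
    + [{"age_group": "child", "has_malitis": True} for _ in range(10)]
)
# The same data plus two cases that break the pattern.
mixed = biased + [
    {"age_group": "adult", "has_malitis": True},
    {"age_group": "child", "has_malitis": False},
]

print(crosstab(biased, "age_group"))
print(attribute_predicts_label(biased, "age_group"))  # True: a classifier could "cheat"
print(attribute_predicts_label(mixed, "age_group"))   # False
```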

False Positives vs False Negatives

I remember a project where a student was working with an unbalanced data set (many more negatives than positives, i.e. only a few people had Malitis), so he created a cost function which put correspondingly more weight on getting positives right. E.g., if the data set contained ten times more healthy people than Malitis sufferers, his cost function weighted false negatives (missing Malitis) ten times more than false positives (hallucinating Malitis). He explained to me that he needed to do this, otherwise the baseline of “always healthy” would beat anything else.

There is absolutely nothing wrong with a cost function which puts different weights on false positives and false negatives. But such a function should be chosen based on domain knowledge of the real-world costs of incorrect classification. For example, if the cost of missing Malitis is the patient possibly dying, while the cost of hallucinating Malitis is a few days wasted on more detailed and intrusive tests, then we might want to weight false negatives (missing Malitis) a thousand times more heavily than false positives (hallucinating Malitis)! The relative frequency of positives/negatives in the data set is irrelevant to this choice.

In short, if you use a weighted cost function which reflects domain knowledge, I will be impressed!  But I will not be impressed if you use a weighted cost function purely to beat the baseline.
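Here is a small worked example of scoring by real-world cost instead of accuracy. All the numbers are invented: I assume 1000 healthy patients and 100 Malitis patients, a cost of 1000 for each missed Malitis case and a cost of 1 for each healthy person wrongly flagged, and a hypothetical classifier that catches 90 of the 100 Malitis cases while raising 200 false alarms.

```python
# Assumed domain costs (illustrative only): missing Malitis is a thousand
# times worse than wrongly flagging a healthy person.
COST_MISSED_CASE = 1000
COST_FALSE_ALARM = 1


def total_cost(missed_cases, false_alarms):
    return missed_cases * COST_MISSED_CASE + false_alarms * COST_FALSE_ALARM


def accuracy(correct_malitis, correct_healthy, missed_cases, false_alarms):
    total = correct_malitis + correct_healthy + missed_cases + false_alarms
    return (correct_malitis + correct_healthy) / total


# Baseline "always healthy": misses all 100 Malitis cases, no false alarms.
baseline_accuracy = accuracy(0, 1000, 100, 0)
baseline_cost = total_cost(missed_cases=100, false_alarms=0)

# Hypothetical classifier: catches 90 cases, misses 10, raises 200 false alarms.
classifier_accuracy = accuracy(90, 800, 10, 200)
classifier_cost = total_cost(missed_cases=10, false_alarms=200)

print(f"baseline:   accuracy {baseline_accuracy:.3f}, cost {baseline_cost}")
print(f"classifier: accuracy {classifier_accuracy:.3f}, cost {classifier_cost}")
```

Under these assumed costs the baseline wins on accuracy (about 0.909 against 0.809) but costs 100000 against 10200 for the classifier. The weights come from what a mistake costs the patient, not from the ten-to-one ratio in the data.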

No Qualitative Analysis

When you evaluate a classifier (or indeed any AI system), you should qualitatively analyse errors and cases where the system screwed up, as well as providing performance numbers. A lot of students don’t bother with this; their goal is to show that some snazzy new neural technique gets a good score, not to understand why it fails. But understanding why AI systems fail is essential to making progress in real-world tasks!

So if you want me to use your Malitis system, don’t just give me statistics about how accurately it diagnoses the disease; also give me a good qualitative analysis of when and why the system fails.
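Qualitative analysis starts with actually looking at the failures, so a useful first step is to pull the misclassified cases out of the test set instead of only counting them. The sketch below assumes each case is a dictionary; the `marker_level` and `notes` fields, the threshold classifier and the example cases are all invented for illustration.

```python
def collect_errors(cases, predict, label="has_malitis"):
    """Return the misclassified cases, split into missed cases and false alarms."""
    missed, false_alarms = [], []
    for case in cases:
        predicted = predict(case)
        if case[label] and not predicted:
            missed.append(case)
        elif predicted and not case[label]:
            false_alarms.append(case)
    return missed, false_alarms


def threshold_classifier(case):
    # Hypothetical stand-in for a trained model.
    return case["marker_level"] > 5.0


# Made-up test cases.
cases = [
    {"id": 1, "marker_level": 7.2, "notes": "typical presentation", "has_malitis": True},
    {"id": 2, "marker_level": 3.1, "notes": "routine check", "has_malitis": False},
    {"id": 3, "marker_level": 4.8, "notes": "early stage", "has_malitis": True},
    {"id": 4, "marker_level": 6.0, "notes": "recent infection", "has_malitis": False},
    {"id": 5, "marker_level": 2.0, "notes": "on medication", "has_malitis": True},
]

missed, false_alarms = collect_errors(cases, threshold_classifier)
for case in missed:
    print("missed Malitis:", case["id"], case["notes"])
for case in false_alarms:
    print("false alarm:", case["id"], case["notes"])
```

Reading through the listed cases (here, the free-text notes) is where the real analysis happens; the code only gets the failures in front of you.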


I don’t think any of the above points are controversial or “rocket science”; they’re basic things which people evaluating ML should automatically do. But many students (and academics and indeed practitioners) violate the above points. I think the core problem is that some people focus too much on getting better evaluation numbers (i.e., a higher score, in the computer-game sense, than anyone else), instead of on trying to develop and evaluate a useful system. I think it is essential that we as academics emphasise to our students that ML is about solving problems, not getting the highest score, and indeed that we take this perspective in our own research.

To my readers in companies, another lesson is that you should be wary of AI salesmen trying to get you to buy an ML system because it has great performance numbers. Sometimes the system really is that good, but keep in mind that it’s pretty easy for unscrupulous or uninformed people to make a system look much better than it is, using the above tricks (and there are many others!).

2 thoughts on “Mistakes in Evaluating ML”

  1. Can’t agree more on your doubts about utilizing GAN-generated simulated data during training & testing!
    – For training, in information theory, entropy cannot be created from a void, i.e., **there will only be information loss instead of gain through any channel**. Therefore, many papers which claim improvements after GAN augmentation really confuse me… Maybe adding new data works as some kind of regularization in these situations?
    – For testing, using simulated data via GANs **makes no sense**. Let alone the fact that simulated data might be somehow irrelevant to real-world data, it can even introduce unexpected bias (reference: [a psychologists’ ICLR paper](https://openreview.net/pdf?id=Bygh9j09KX)).
    However, in transfer learning, GANs are used to generate data for domain adaptation. From my point of view, it’s a more promising practice!

