Napoleon Bonaparte is supposed to have said “amateurs discuss tactics; professionals discuss logistics”. In AI and machine learning, we should say something similar: amateurs focus on models, professionals focus on data. In the decades that I have been working on AI and NLG, I have seen numerous times that the key to success is good training data; this is usually (at least in my experience) more important than which model or algorithm is used.
And I think most people who build real-world commercial AI systems have a similar attitude. So from this perspective it’s striking that many academic researchers don’t care about data, and worrying that the commercial world has many AI hypesters and gurus who fixate on models and ignore data. I have seen salesmen (and indeed engineers who should know better) promise commercial clients that their magic AI system can learn from 5 examples, so the client doesn’t need to worry about the time, hassle, and effort needed to build proper training datasets. There are a few circumstances where it is possible to learn from 5 examples, but they are pretty rare; most of the time such claims are gross exaggerations at best, and outright lies at worst.
Anyways, below I list a few simple questions about data for people who want to build real-world AI systems. I apologize if these questions seem trivial and obvious; perhaps they are, but nonetheless I have seen time and money wasted by people who did not think about these questions. I’ll illustrate the questions with the NLG example of using ML to build a lexical choice module which chooses an appropriate trend verb (e.g., “inch up” vs “increase” vs “soar”) to describe changes in prices, sales, etc.
Do I have the right data?
The first question is whether your dataset includes the key information which is needed in principle to generate text, make a diagnosis, etc. If it doesn’t, you are unlikely to succeed even if you are using the latest/trendiest ML technology.
If we look at the verb choice problem, for example, you are not going to have much success if your dataset does not include the amount by which profits/sales/etc have changed!
Similarly, if you want to predict the risk that someone is going to get lung cancer, you need to know about their smoking behaviour and family history (or genetics); without this data you will struggle to make good predictions.
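To make this concrete, here is a minimal sketch of a pre-training sanity check: before fitting any model, verify that every record contains the fields the task needs in principle. The field names (“metric”, “old_value”, etc.) are hypothetical, chosen for the trend-verb example.

```python
# Hypothetical required fields for learning trend-verb choice: we need the
# old and new values to know how big the change was.
REQUIRED_FIELDS = {"metric", "old_value", "new_value", "verb"}

def missing_fields(records):
    """Return the set of required fields absent from at least one record."""
    missing = set()
    for record in records:
        missing |= REQUIRED_FIELDS - record.keys()
    return missing

records = [
    {"metric": "profits", "old_value": 100, "new_value": 180, "verb": "soar"},
    {"metric": "sales", "verb": "fall"},  # no amounts -> can't learn verb choice
]
print(missing_fields(records))  # -> {'old_value', 'new_value'}
```

If this check reports missing fields, the fix is to go and get better data, not to try a fancier model.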
Do I have enough data?
Of course the amount of data needed depends on the circumstances and context. But at the most basic level, you need enough examples of the phenomenon you are interested in to model it. If we look at verb choice, for example, you cannot learn a model which predicts the usage of “soar” if your training corpus has no instances of this verb! And one instance of “soar” won’t be enough either. This is fundamental, and true regardless of the AI/ML technology we are using.
So how much data do we need? Zhang et al 2018, in their work on trend verbs, only look at verbs that occur at least 50 times in their corpus; this is probably a good “rule of thumb” for this problem. So if you want to learn how to use trend verbs, you will need a corpus where all of the verbs of interest occur at least 50 times. Which means thousands of sentences with trend verbs if all of the trend verbs occur equally often, and tens or even hundreds of thousands of such sentences if some important trend verbs are relatively uncommon.
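The 50-occurrences rule of thumb is easy to apply mechanically. A minimal sketch (assuming the corpus has already been reduced to a list of trend-verb occurrences, which is of course the hard part):

```python
from collections import Counter

MIN_COUNT = 50  # rule of thumb from Zhang et al 2018

def learnable_verbs(verb_occurrences, min_count=MIN_COUNT):
    """Return the verbs that occur often enough to learn usage rules for."""
    counts = Counter(verb_occurrences)
    return {verb for verb, n in counts.items() if n >= min_count}

# Toy corpus: "soar" is too rare to model reliably.
corpus_verbs = ["increase"] * 120 + ["inch up"] * 60 + ["soar"] * 7
print(learnable_verbs(corpus_verbs))  # -> {'increase', 'inch up'}
```

Any verb this filter drops either needs more data, or needs to be excluded from what the system promises to do.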
And of course this is a relatively simple NLG task. If we want to learn something more complex, we will need more data!
Is my data good enough from a quality perspective?
It is much easier to learn from data if the data is high quality! In an NLG and data-to-text context, this means accurate and complete input data and high-quality corpus narratives. As one of my students once complained to me with regard to an assessment for an NLG course, it is impossible to generate a biography of Sean Connery which mentions his Academy Award if the dataset the student is given is incomplete and doesn’t record Connery’s award for Best Supporting Actor in The Untouchables. Similarly, we are going to struggle to learn how to produce high-quality texts if the training data consists of poorly written texts produced by Mechanical Turkers who don’t care and are trying to do the task as fast as possible (Dusek et al 2019).
Looking at verb choice, for example, we are not going to succeed in learning sensible verb choice rules if the training data is texts written by Turkers who only used “rose” and “fell” because they couldn’t be bothered to think about whether a more specific term such as “soared” might be appropriate in context.
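One cheap way to catch this kind of problem is to check whether the training texts ever use anything beyond the most generic verbs. A minimal, illustrative sketch; the verb lists are hypothetical and a real check would use lemmatisation rather than substring matching:

```python
# Generic fallback verbs vs more specific trend verbs (illustrative lists).
SPECIFIC_VERBS = {"soared", "plummeted", "inched up", "edged down"}

def looks_low_effort(texts):
    """True if no text in the corpus ever uses a specific trend verb,
    which suggests annotators defaulted to "rose"/"fell" throughout."""
    for text in texts:
        lowered = text.lower()
        if any(verb in lowered for verb in SPECIFIC_VERBS):
            return False
    return True

corpus = ["Profits rose sharply.", "Sales fell.", "Revenue rose again."]
print(looks_low_effort(corpus))  # -> True: vocabulary may be impoverished
```

A check like this won’t prove the corpus is good, but it can flag a corpus that is obviously too impoverished to learn from.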
Is my data representative?
Finally, the training data needs to cover the context in which you expect your system to be used. For trend verbs, for example, if you want to choose trend verbs to describe changes in prices, and if your training data comes from a high-inflation period where (nominal) prices are always rising, then your system isn’t going to work well if you use it in a period when some prices are falling.
This came up recently in a project at the university where we are looking to give feedback to drivers (basically an extension of Braun et al 2018). We are looking at different countries, and realised that driving behaviour in (for example) the UK and Pakistan is very different, so a system for Pakistan cannot be trained on UK data.
If you are trying to build a real-world AI system using machine learning, remember that data is the key to success! If you have a lot of high-quality and representative data, which includes the key information needed for the task, then success is likely regardless of the ML technique you use. On the other hand, if you don’t have good training data, then you are unlikely to succeed no matter what ML model you use.