I’m working with a researcher at another university on an explainable AI project. A few months ago we had a discussion about domains, and I suggested that she might want to look at training a model to predict university admissions decisions, and then explain the model’s predictions to prospective students. She agreed, and found what looked like a suitable admissions dataset on Kaggle. She asked me if it was OK, and I said it looked fine.
A month later, though, she came back to me and said that the models built from this dataset were making very strange predictions. I suggested that she check the Kaggle discussion page about the dataset, which she did – and discovered that there were a lot of problems with the dataset, which explained the bizarre behaviour of the models and also made the dataset unsuitable for our purposes.
The unfortunate thing is that we didn’t check the Kaggle discussion page (which took less than an hour) until a month after we started working with this dataset. So a lot of effort was wasted because we didn’t spend a bit of time at the beginning checking out the dataset!
In short, we should have done some “due diligence” on the dataset before investing a lot of time in using it.
Checking out datasets
There are a huge number of datasets available to academic researchers. Some of these are very high-quality, but remember that producing a high-quality dataset requires a lot of time and effort. Perhaps for this reason, there are also a lot of datasets with serious quality problems on Kaggle and other large repositories. I should say that as far as I know, there is no quality control on Kaggle datasets; I suspect Kaggle may block certain types of datasets for legal/ethical reasons, but I don’t think Kaggle checks the quality of the data itself. Similarly, there are some great datasets available on Github, but also some really flawed ones.
So how can we determine which datasets are high-quality and which are not? I don’t have a rigorous procedure for this, but below are some suggestions.
Check discussion forums: Always read the discussion forum if one is present! This may be the single best way to quickly understand quality issues. Similarly, you can do a Google search on the dataset, which may pick up relevant blogs or discussions on Medium, Twitter, etc.
Check meta-data: It may be useful to check meta-data such as how often a dataset has been downloaded and how recently it has been updated. I should say that there are some great datasets from years ago which are not often used in 2021 but are still really useful in some contexts! Nonetheless, if a dataset is not used much, it’s worth trying to understand why this is the case.
Just use “standard” datasets: Another approach is to restrict yourself to “standard” datasets which are widely used in the research community. I have mixed feelings about this, because some of these datasets are flawed, and limiting yourself to standard datasets rules out working in some really interesting and valuable domains. Regardless, the quality of “standard” datasets is almost certainly higher than that of the average Kaggle dataset.
Check the data: Ultimately there is no replacement for checking the dataset yourself. Don’t just assume it is high-quality; eyeball the raw data, and also build some simple models from the data and see if they behave sensibly.
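For example, a few cheap automated checks can surface problems before you invest weeks in modelling. Below is a minimal sketch in Python using pandas; the function name, column names, and the tiny synthetic table are all hypothetical, just to illustrate the kind of checks I mean (duplicates, missing values, constant columns, and label balance):

```python
import pandas as pd

def quick_audit(df: pd.DataFrame, label_col: str) -> dict:
    """Run a few cheap sanity checks on a tabular dataset."""
    issues = {}
    issues["n_rows"] = len(df)
    # Exact duplicate rows often indicate scraping or merging errors
    issues["duplicate_rows"] = int(df.duplicated().sum())
    # Missing values per column
    issues["missing_by_column"] = df.isna().sum().to_dict()
    # A column with a single value carries no signal for a model
    issues["constant_columns"] = [
        c for c in df.columns if df[c].nunique(dropna=False) <= 1
    ]
    # A very skewed label can make "accurate" models meaningless
    issues["label_balance"] = df[label_col].value_counts(normalize=True).to_dict()
    return issues

# Hypothetical admissions-style table for illustration
df = pd.DataFrame({
    "gpa": [3.9, 3.1, None, 2.8],
    "country": ["UK", "UK", "UK", "UK"],  # constant column: suspicious
    "admitted": [1, 0, 0, 0],
})
report = quick_audit(df, label_col="admitted")
print(report["constant_columns"])   # flags "country"
print(report["label_balance"])      # shows the skew towards rejections
```

None of this replaces eyeballing the raw data, but a report like this takes minutes to produce and would have flagged some of the oddities we only discovered a month in.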
Data is the key to most research in AI and NLP. If you collect your own data, you’ll probably be very familiar with its problems and limitations – and if you publish your dataset, please make these problems and limitations clear to people who use the data! If you use existing datasets (which is what most people do), don’t automatically assume that the dataset is of high quality. Check it out (as described above or otherwise); a few hours of “due diligence” can save weeks (months?) of time wasted on trying to work with poor quality data.