Care Needed in Analytics and Data Science!

When we build a data-to-text system, we often need to build data (analytics) components as well as linguistic components.  Building these components usually requires collaboration between data scientists (who understand analytics, machine learning, etc.) and domain experts (who understand which analytical insights are useful, and also potential problems in the data), and also a lot of care and attention to detail.  Testing analytics in particular is often very hard, since it requires careful checking of the results of the analytics against the source data.  This is a pain even in something like sports stories, and a much bigger hassle in contexts where we run sophisticated analytical algorithms on complex medical, engineering, or financial data (there are some good “war stories” about this from Arria, but unfortunately they are commercially confidential).  Anyway, in my experience building the analytics side of a data-to-text system is usually more work than building the linguistic side.

The need for care and attention to detail was brought home to me recently when a student who is building the analytics side of a data-to-text system showed me some preliminary work he had done, which he was very excited about.  I gently pointed out to him that some of his results made no sense (e.g., an average value which was higher than any of the values being averaged) and that he had not properly considered outliers and strange patterns which probably reflected data quality problems (e.g., constant (flat-line) values in a time series for something which should vary over time).  A bit chastened, he went off to redo his analysis with careful checking of his logic and consideration of outliers.  Well, this is part of my job, to educate students about this sort of thing.
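To make these two checks concrete, here is a minimal Python sketch of the kind of sanity tests that would catch both problems. The function names, the tolerance, and the `min_run` threshold are all assumptions made for illustration; sensible thresholds depend on the data and should be agreed with a domain expert.

```python
from statistics import mean


def mean_is_plausible(values, reported_mean, tol=1e-9):
    """Return True if reported_mean lies between min(values) and max(values).

    An average can never fall outside the range of the values averaged,
    so a failure here points to a bug in the analysis logic.
    """
    return min(values) - tol <= reported_mean <= max(values) + tol


def longest_flat_run(series):
    """Return the length of the longest run of consecutive identical values."""
    if not series:
        return 0
    longest = current = 1
    for prev, cur in zip(series, series[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest


def has_flat_line(series, min_run=5):
    """Flag a series with min_run or more identical consecutive values.

    min_run=5 is an arbitrary illustrative threshold, not a recommendation.
    """
    return longest_flat_run(series) >= min_run


# Hypothetical sensor readings that get stuck at 36.9.
readings = [36.5, 36.7, 36.9, 36.9, 36.9, 36.9, 36.9, 37.1]
print(mean_is_plausible(readings, mean(readings)))
print(has_flat_line(readings))
```

Checks like these do not prove an analysis is right, but they are cheap to run on every result and flag the most obvious nonsense before anyone gets excited about it.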

What worries me, though, is that I see a growing attitude in both commercial and academic contexts that you don't need to be careful about data science and analysis.  In commercial contexts you see companies whose marketing literature basically claims that all you need to do is throw your data into a super amazing AI deep learning tool, which will auto-magically give you amazing insights.  These companies know better (and in fact often give quite sensible advice in technical material aimed at engineers), but the top-level marketing pitch is “you don't need data scientists, our tool lets anyone easily build an amazing analytics/ML system”.

On the academic side, I strongly suspect that a lot of academics are not very careful about data science, including statistical analysis of experimental results.  This is perhaps encouraged by a reviewing process in CS/AI/NLP venues (medicine is different) where reviewers tend to assume that all analyses presented in a paper are correct, either because they don't have access to the underlying data, or because carefully checking the analyses would be an enormous amount of work.  This unfortunately means that if an excited author makes the kind of mistakes that my student made, there is a small chance that reviewers will detect the problem and reject the paper on this basis, but probably a larger chance that the paper will be accepted, because reviewers won't notice the mistakes but will be impressed by the unexpected numbers and insights reported in the paper.

So if you are doing data science, whether for building the analytics part of a data-to-text system, analysing experimental results, or any other purpose, make sure you do this carefully!  And if you see amazing and surprising results, your first thought should be “did I make a mistake or ignore data quality issues?”, not “wow, I’ve discovered something amazing, what are its implications?”.
