Do We Encourage Researchers to Use Inappropriate Data Sets?

I have always thought that it was a “no-brainer” that NLP researchers should use appropriate and high-quality data sets for training and evaluation. But I am now beginning to think that the NLP field in fact *encourages* researchers to use poor-quality and inappropriate data sets, which is a depressing thought.

Junior Researcher: Easier to get papers and funding with poor data sets

About a year ago I was contacted by a junior researcher who asked me where he could get the Weathergov corpus. I explained to him that the Weathergov corpus contained the output of a rule-based NLG system, and hence using ML on Weathergov was mostly an exercise in reverse-engineering the rule-based system (i.e., stealing the IP of the people who wrote the rules), not an exercise in NLG as we usually think of it. I suggested that he instead use the SumTime corpus, which contains human-written weather forecasts.

However, this researcher then told me that it was much easier to publish papers in ACL-like venues if he used Weathergov instead of SumTime (and certainly a lot more ACL papers use Weathergov than use SumTime), and also pointed out to me that the first author of a NAACL 2018 paper based on Weathergov had been awarded a fellowship from Google. In other words, it was clear to him that the best way to progress his career, in terms of both publications and funding, was to use Weathergov. So why wouldn’t I help with this?

I can’t blame the researcher who contacted me; he is simply responding to the incentives he is presented with. But I think it is a very bad sign for the field that young researchers see that the way to “get ahead” is to use questionable data sets.

Reviewing: We cannot question a data set if it’s been used before

A recent interaction reinforced this impression. I was reviewing a paper, and was concerned that some of the data sets used by the paper were unrepresentative and otherwise inappropriate. When I raised this concern, though, one of the other reviewers said that since these data sets had been used by previous researchers, it was unfair to reject the paper on this basis. In other words, the other reviewer thought that once a data set had been used a few times in published papers, it was no longer appropriate to question its usage in papers.

I feel really uneasy about this, especially given the mixed quality of reviewing at conferences and (especially) workshops. In my mind, the fact that a data set has been used in a previously published paper does not mean that it is representative and appropriate, since I have seen many papers (even at prestige venues such as ACL) use very inappropriate data sets. I do appreciate that many researchers have a different perspective, and focus on showing that their techniques improve on the state of the art on existing data sets, without worrying about the relevance and appropriateness of these data sets. But in all honesty I think that if we want to make progress in NLP, both practically and theoretically, we need to work with sensible data sets.

Gresham’s Law: Do bad data sets drive out good ones?

The whole thing is very depressing, and I sometimes wonder if there is a sort of “Gresham’s Law” operating with NLP data sets. Creating a good data set is a **lot** of work; it’s so much easier to just grab some random stuff off the internet without worrying about representativeness, quality, diversity, reliability of annotations, etc. So if the NLP community doesn’t distinguish between “good” and “bad” data sets (after all, we can still pump out zillions of papers showing a 0.5% increase on the state of the art, regardless of the quality of the data set), then people are likely to continue creating and using poor-quality data sets. In other words, we can publish more papers if we ignore quality, and reviewers don’t seem to care…

Can we do anything about this?

Can the community do anything to encourage the use of good data sets? I don’t know; certainly what has happened with evaluation metrics is not encouraging. We’ve known about the problems with BLEU and other metrics for 15 years, but we still use them in contexts where they are inappropriate. It would help if reviewers, especially for journals and prestige conferences, insisted on proper data sets and evaluation techniques, but I don’t know how likely this is.

I once published a paper in the British Medical Journal (BMJ), and they had a special reviewer whose job was solely to check the quality of statistical analyses and other evaluation details. I don’t think this is feasible at NLP conferences (too large, too short a time scale for reviewing), but maybe this is something our journals could consider?

On a smaller scale, we should at least make researchers aware of problems with data sets. I’ve seen cases where people use poor data sets (and indeed evaluation techniques) because they don’t realise there are problems with them, since the people who know about these problems do not publish this information. There’s not much I can do about this in general, but in the specific case of SIGGEN’s list of Data Sets for NLG, I will update it if I discover problems with data sets. For example, the SIGGEN list does tell you how to get WeatherGov, but also clearly states that it consists of computer-generated forecasts rather than human-written forecasts.

18 thoughts on “Do We Encourage Researchers to Use Inappropriate Data Sets?”

Xutan Peng says:

Aug 12, 2019 at 9:11 am

Take me for example: another reason why I prefer data sets which previously appeared in papers at “prestige venues” is that I believe they are **reliable**, so I can spend less time on further investigation… It’s really helpful and crucial for SIGGEN’s list to warn the community of the pitfalls and problems with these data sets! On the other hand, do you think it would be a good idea to add more details, e.g., “origin” (how were the corpora constructed? by computer or by human?), “scope” (to avoid overclaiming) and “best practice” (links to papers which use the corpora **properly**)?

1. ehudreiter says:
  
  Aug 13, 2019 at 1:36 pm
  
  This is a really good point. I will discuss with the other SIGGEN board members and see what we can do to provide this information.
  
2. thingsmeta says:
  
  Oct 8, 2019 at 5:16 pm
  
  At the West Coast NLP event the topic of dataset quality came up, although not specific to NLG. Bias was the motivating concern IIRC. Having something like a ‘nutrition label’ that gave basic facts about the dataset, including its origin and collection method, was something I took away from the meeting. But not sure how to agree on what should be in the label or how to verify the values for any particular dataset.
  
  1. ehudreiter says:
    
    Oct 8, 2019 at 8:17 pm
    
    I’ve seen a few suggestions about labelling datasets, or providing “datasheets”. This can’t hurt, but I’m not sure how much it will help. What I think could be more useful is forums where dataset users can post comments on datasets; maybe a bit like reviews on e-commerce sites?
    
Avinash V says:

May 27, 2024 at 10:57 am

It’s fascinating to hear about the discussions on dataset quality at the West Coast NLP event, even if it wasn’t directly focused on NLG. The idea of a ‘nutrition label’ for datasets is definitely intriguing and could be a game-changer in ensuring transparency and trustworthiness in data usage.


Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.
