How can I tell if a paper is scientifically solid?

A PhD student recently complained to me that a lot of the papers he was reading were scientifically worthless – dubious data sets, meaningless evaluations, terrible methodology (eg, training on test data), etc. (I’ve discussed such problems in previous blogs, eg Many Papers on Machine Learning in NLP are Scientifically Dubious.) He further complained that some problems were pretty obvious on first reading, but others only surfaced when he took a close look at the data or code behind a paper. So how, the student asked me, could he avoid wasting time reading worthless papers, so that he could spend his “reading time” on papers which were scientifically solid and meaningful?

I have a lot of sympathy for the student.  When I did my PhD in the late 1980s, we could assume that any paper which appeared in an ACL conference (for example) was probably scientifically solid.  Unfortunately this is no longer true.  There are plenty of excellent papers in recent ACLs, but there are also plenty of scientifically atrocious papers (I described one example in an earlier blog; unfortunately there are many like this).  So we cannot assume that a paper is good if it appears in a “prestige” venue.

There are no foolproof techniques for detecting scientifically flawed papers, but I give some suggestions below on things readers can look for.

Checks on the Paper

I teach a class on “Research Methods” to our final-year undergrads, and a class on “Evaluation of AI” to our MSc students.  In both classes, we look at what to expect to see in a good paper.  I summarise below some things which you should be able to check on an “initial skim read” before reading a paper in detail.  These points apply to full “results” papers which present models, algorithms, or systems; not all of these would apply to short “work-in-progress” papers or to papers on other topics such as data sets and methodology.

  • Understandable: If a paper is hard to understand, this is likely because the authors are confused about what they are trying to do.  I usually bin papers that I struggle to understand and/or that do not present a clear research question/issue/hypothesis.
  • Understands problem: The introduction of the paper should show that the author understands the problem he is trying to solve.  I bin papers which describe the problem in 2 sentences and then dive into how the author’s model gives a 1% increase on some existing dataset.  If people don’t understand the problem they are trying to solve, then their model probably will not generalise beyond the specific data set.
  • Aware of both recent and historical related work: I am suspicious of papers that primarily cite really old (eg 1990s) research; this suggests the author is not up-to-date.  I am also suspicious of papers which only cite very recent papers, especially if they also claim that “modern” methods are obviously superior to previous methods so there is no need to talk about older work.  I have seen a number of such papers which develop complex neural NLG models for problems which can easily be solved by a few fill-in-the-blank templates.
  • Good datasets: Properly checking datasets is a lot of work (see below), but you can check a few things quickly.  I am somewhat dubious about using synthetic data sets (eg, output of a rule-based NLG system) for training, and automatically bin any paper that uses synthetic data for testing.   I also check whether the dataset seems representative; eg TV dialogues are very different from real-world dialogues, so I am suspicious of papers that use TV corpora.
  • Good baselines: Usually we compare models against baselines.  These baselines should be plausible. Eg, if someone is building a neural NLG system to tackle a simple problem which can be solved well with templates, then I expect that the baselines should include a good template system.
  • Meaningful evaluation: NLG papers should include human evaluations, since metrics do not work well in NLG.  I am very suspicious of papers that only evaluate using BLEU (and will automatically bin papers that use ROUGE as their main evaluation).
  • Statistics: I expect to see p-values in papers, and am suspicious if these are not present.
  • Qualitative analysis: I expect to see some kind of qualitative error analysis or case study.  I am not impressed by papers that just give numbers.

Of course the above are guidelines and there are exceptions!   For example, if we are doing a real-world extrinsic evaluation (ie, fielding a live system and assessing its impact on users), then there will be ethical constraints on which baselines we can use.   But such exceptions should be explained and justified in the paper.  If they are not, then the paper is probably not worth reading in detail.
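To make the statistics check concrete, here is a minimal sketch of one way a p-value for "system A beats system B" could be computed: a paired sign-flip permutation test on per-item scores. This is only an illustration under my own assumptions; the function name, the number of resamples, and the example scores are all hypothetical, and a real paper should justify whatever test it uses.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-item scores.

    scores_a[i] and scores_b[i] must be scores of two systems on the SAME
    test item (hypothetical per-item human ratings, for example).
    Returns an estimated p-value for the null hypothesis that the two
    systems are interchangeable.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired item by item")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            at_least_as_extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Hypothetical 1-5 human ratings for two systems on ten shared items.
    system_a = [4, 5, 3, 4, 4, 5, 3, 4, 5, 4]
    system_b = [3, 4, 3, 3, 4, 4, 2, 4, 4, 3]
    print(paired_permutation_test(system_a, system_b))
```

If a paper reports a small improvement on a small test set and gives nothing like this, I have no way of knowing whether the difference is real or just noise.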

Checks on Dataset and Code

Sometimes papers look fine (and meet the above criteria), but the research is still seriously flawed because of problems in dataset, coding, or execution.   Such problems are harder to detect, unfortunately.    My suggestion is to do the following.

  • Datasets: Unfortunately, a lot of NLP data sets are seriously flawed, and indeed I get the impression that much of the research community doesn’t seem to care, which I find incredible.  But if you want to do proper science, you need good data sets!  From a practical perspective, I suggest two steps.
    • Ask a domain expert to check the dataset.  A lot of datasets are misused because they are downloaded or scraped by NLP researchers who don’t really understand what is in the dataset or how it should be used.
    • Use the data set yourself, and investigate anything that seems odd or suspicious to you, such as test data which seems very similar to training data, or amazingly clean and consistent corpora.
  • Code: There is often a mismatch between code and what is reported in the paper, eg preprocessing steps which are important but not reported.  I don’t know of any shortcuts to detecting this; you will need to inspect the code and see what it actually does.
  • Methodology: One of the hardest things to check is research methodology. For example, did the author try 100 test sets and only report the one which worked best?   The only real way to check this is to do a full replication of the author’s experiment, which of course is a lot of work.
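As a starting point for the "test data which seems very similar to training data" check mentioned above, here is a rough sketch that flags test items which are exact or near duplicates of training items. It is an assumption-laden illustration, not a standard tool: the function names, the word-level Jaccard similarity, and the 0.8 threshold are my own choices, and the pairwise comparison is only practical on a sample of a large corpus.

```python
import re


def normalise(text):
    """Lower-case and strip punctuation so trivial variants compare equal."""
    return " ".join(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Word-set overlap between two normalised strings (0.0 to 1.0)."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_overlap(train_texts, test_texts, threshold=0.8):
    """Return (test_index, similarity) for test items that look leaked.

    A test item is flagged if it matches a training item exactly after
    normalisation, or if its best Jaccard similarity to any training item
    reaches the (hypothetical) threshold.
    """
    train_norm = [normalise(t) for t in train_texts]
    train_set = set(train_norm)
    flagged = []
    for index, text in enumerate(test_texts):
        norm = normalise(text)
        if norm in train_set:
            flagged.append((index, 1.0))
            continue
        best = max((jaccard(norm, t) for t in train_norm), default=0.0)
        if best >= threshold:
            flagged.append((index, best))
    return flagged


if __name__ == "__main__":
    # Made-up example sentences, purely for illustration.
    train = ["The cat sat on the mat.", "Rain is expected tomorrow in Aberdeen."]
    test = ["the cat sat on the mat", "Rain is expected tomorrow in Glasgow."]
    for index, similarity in find_overlap(train, test, threshold=0.7):
        print(index, round(similarity, 2), test[index])
```

Anything this flags still needs a human look; a high overlap may be legitimate in some domains, and a low overlap does not prove the split is clean.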

Final Comment

I highly recommend Greenhalgh’s book How to Read a Paper (Amazon or online older version).  This is written for medical practitioners (eg, doctors) who are trying to read medical research papers, so some of what it says does not apply to NLP.  But nonetheless it has a lot of excellent advice and insights which apply to NLP as well as medicine.

