**UPDATE 8-Dec-2022**: A great example of a *good* evaluation of prompting for machine translation is *Prompting PaLM for Translation: Assessing Strategies and Performance*. I analyse this paper using the below framework in *How effective is prompting?*
One of my favourite books on doing good scientific research is Greenhalgh's *How to Read a Paper* (an old version is available online; newer versions can be bought from Amazon). This is a wonderful book whose goal is to help doctors critically read medical research papers, so that they can assess for themselves whether a paper is scientifically solid, especially with regard to whether its experimental results are trustworthy. Of course it's also intended to help researchers write scientifically solid papers, by pointing out common problems. The book is easy to read, and I highly recommend it.
Anyway, being able to read a scientific paper and assess whether its experimental results are trustworthy is an important skill for all scientists, including CS/AI/NLP researchers. I'm not aware of an equivalent book for AI researchers, sadly, but I do encourage my students to critically read papers and look for flaws in their experiments or evaluation. For PhD students, we try to do this in our weekly reading group, so that they get used to critically analysing papers. For MSc students, I teach a class on Evaluation of AI, and we have tutorials (and sometimes assessments) where we read papers and look for problems.
I strongly encourage students elsewhere to do likewise! Critically reading a paper and identifying flaws is a very important skill for researchers.
What to look for
For students who are new to this kind of exercise, I suggest focusing on the following relatively "basic" questions:
- Hypothesis: Does the paper have a clear hypothesis or research question?
- Dataset: Is it clear what datasets are used? Are they appropriate and representative? Is there an ironclad separation between testing and training data? Is the data synthetic or real?
- Baseline/state-of-the-art: If the paper compares against a "state-of-the-art" system, is this system actually state of the art?
- Metrics: If evaluation metrics are used, are they appropriate for the task?
- Human evaluation: If a human evaluation is done, is it well-designed and run with appropriate subjects?
- Statistical analysis: Is the statistical analysis of results standard and straightforward? If not, is there good justification for a more complex analysis? Is statistical significance reported, with multiple-hypothesis correction?
- Results: Are the results/findings claimed by the paper supported by the experiment?
- Replicability: Is enough information provided to enable someone else to replicate the experiment?
None of the above is rocket science! Of course additional questions can also be asked, but in my experience most AI papers will fail some of the above basic questions. In some years I have asked my MSc students to choose a paper from IJCAI and find mistakes in evaluation, and 90% of the papers they investigate have problems in one or more of the above issues. Which is a pretty sad comment on the scientific quality of AI research (and indeed on the quality of reviewing, even for “prestige” events).
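On the multiple-hypothesis point in particular: when a paper reports many comparisons (for example, one per language pair), raw p-values need correcting before significance is claimed. A minimal sketch of two standard corrections, using made-up p-values purely for illustration:

```python
# Bonferroni and Holm corrections for multiple hypothesis testing.
# The p-values below are hypothetical, purely for illustration.

def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value < alpha / number_of_tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm's step-down method: a bit more powerful than Bonferroni,
    while still controlling the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

pvals = [0.003, 0.014, 0.045, 0.300]  # hypothetical results for 4 comparisons
print(bonferroni(pvals))  # [True, False, False, False]
print(holm(pvals))        # [True, True, False, False]
```

Note how the second comparison (p = 0.014) survives Holm but not Bonferroni: correcting for multiple tests matters, and so does which correction you use.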
Recently one of my students got excited by prompting GPT3 language models, so I told him to read the arXiv version of *Language Models are Few-Shot Learners* and examine its evaluation from the above perspective, focusing on machine translation. We identified the following issues:
- Dataset: The paper says it uses data from WMT 2016, but doesn't say which of the three WMT 2016 datasets (news, IT, biomedical) was used. Also, WMT data is on the internet, has been discussed in many papers, and is largely based on published material. Since the training data is Internet Common Crawl, there is a real danger of test data being present in the training data. This is discussed in the paper, but the student and I still had concerns about it.
- Baseline/state-of-the-art: Paper compares against “state-of-the-art” systems which are old, including a French-English MT system from 2014. A 2014 MT system is not state-of-the-art!
- Metrics: Paper uses BLEU (and nothing else) to evaluate MT systems. BLEU is not the best way to evaluate MT systems, especially when hallucination is a danger.
- Statistical analysis: Paper does not give error bars or statistical significance figures. No analysis is given of whether the BLEU differences are large enough to be meaningful.
- Results: Paper makes generic claims, but only a small number of language pairs are investigated, and performance differs widely across these pairs.
- Replicability: We cannot replicate the paper since we don't have access to GPT3. Also, we don't know what prompts were used or how they were chosen, or indeed which WMT dataset was used (as mentioned above).
I realise that some of the above are fixed in the published version of the paper, where for example the authors explicitly acknowledge "due to our unfamiliarity with the literature and the appearance that these are un-competitive benchmarks we do not suspect those results represent a true SOTA." But in my experience people usually refer to and read the arXiv version of this paper, not the published version.
I also appreciate that there are subsequent papers on prompting which address some of the above issues! What I am trying to do here is give an example of finding problems in a paper, using a paper which is reasonably well-known.
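On the statistical-analysis issue above: one standard way to check whether a corpus-level score difference between two MT systems is meaningful is paired bootstrap resampling (Koehn, 2004). A minimal sketch, using hypothetical per-sentence quality scores rather than real BLEU statistics:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Estimate how often system A beats system B on resampled test sets.

    scores_a / scores_b are per-sentence quality scores (hypothetical here;
    a real MT test would aggregate BLEU n-gram statistics per resample).
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Resample the test set with replacement, keeping sentences paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples

# Hypothetical per-sentence scores for two systems on a 50-sentence test set.
rng = random.Random(42)
sys_a = [rng.uniform(0.4, 0.9) for _ in range(50)]
sys_b = [s - rng.uniform(-0.05, 0.10) for s in sys_a]  # B slightly worse on average

p_a_wins = paired_bootstrap(sys_a, sys_b)
print(f"A outscores B in {p_a_wins:.0%} of bootstrap samples")
```

If A wins in, say, 95% or more of the resampled test sets, the difference is conventionally treated as significant; a paper reporting only a raw BLEU gap gives the reader no way to make this judgement.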
Critical reading is an essential skill
I highly recommend the above exercise, especially for students. It's really important to be able to critically read a paper and recognise its experimental shortcomings, and having this skill also helps people write better papers themselves! I think the best approach is to do this in a group, where several people read a paper and discuss the flaws they found. This helps people learn from each other, and also gives people confidence that they are doing a good job.
So give it a go!