Could some NLP research be fraudulent?
Is fraud (e.g. fabricating or falsifying data) a problem in NLP? It certainly is a problem in other scientific areas, and it wouldn't surprise me if it affected NLP as well.
Since commercial researchers dominate the “hot” area of large language models, I’ve seen a number of people ask what academic researchers should focus on. There are of course huge numbers of exciting and valuable scientific research questions which are not of much commercial interest, including long-term work which won’t pay off commercially for 10+ years, high-quality evaluation, socially useful but low-profit applications, and using NLP to research fundamental cognitive science questions.
A reader asked me how accurate chatGPT texts need to be. The answer is that this depends on context, including use case, workflow, and error type.
The CSL journal has just published a paper, “Evaluating factual accuracy in complex data-to-text”, which summarises our work in this area. I strongly recommend it to anyone interested in evaluating the accuracy of texts produced by neural NLG systems.
Last week I played around with using chatGPT for data-to-text, and to be honest overall I was disappointed. A few people have asked me about this, so I’ve written up some of my notes here.
An example from MedPaLM highlighted to me that generated texts can contain information which is factually accurate but still not appropriate, because (in this case) of its negative psychological impact. There are other such cases, and we should ensure that our evaluation criteria are sensitive to them.
I get asked a lot about chatGPT, so I thought I’d write a blog explaining my views, which focus on its impact on data-to-text NLG. Basically I think chatGPT is really exciting science which shows major progress on many of the challenges in neural NLG. However, commercial potential is unclear, and the media hype is annoying…
I thought I’d end 2022 with a summary of the papers written by my students and me in 2022. All of them are about requirements, resources, and/or evaluation of NLG.
I was very impressed by a recent paper that compared prompting-based MT to MT based on trained models. The results are very interesting: prompting-based MT generates fluent texts which nonetheless have accuracy problems. The paper itself is also an excellent example of a high-quality NLP evaluation, and I recommend it to anyone who wants to do good NLP evaluations.
I don’t like academic leaderboards. Poor scientific techniques, poor data, and poor evaluation mean that leaderboard results may not be worth much. I also suspect that the community’s fixation on leaderboards means less research on important topics that do not fit the leaderboard model, such as understanding user requirements.