
Hard to Change Poor Research Culture

When I was at Arria, I worked a bit with Barbara Kendall; Barbara is an Olympic gold medalist who now (amongst other things) consults about organisational culture. Barbara was very passionate about the importance of culture, which includes things like what values people have and how well they collaborate with and support each other. She thought this was the key to doing well in business as well as sports.

I think Barbara was right, and that culture is also very important in research. Certainly in a university department or other research organisation, it makes a huge difference whether people support each other or work in a more isolated fashion. It also matters a lot whether they prioritise scientific rigour or simply pushing out lots of papers, how willing they are to “cheat” (eg use LLMs to fabricate papers), etc.

Barbara also said that once culture is established, it’s hard to change. So it’s possible to establish a “good” culture in a new organisation, or one that is undergoing massive changes, but it’s much harder to improve culture in an established organisation. Unfortunately, making culture worse is easier than making it better; eg trust in leadership is difficult to establish and easy to destroy.

Changing culture is hard

At an institutional level, one small example of the difficulty of changing culture is the Aberdeen Computing Science department. For most of my time at Aberdeen, we had a really good research culture which, amongst other things, encouraged staff to interact with and support each other. Unfortunately, we partially lost this during Covid, when many new staff joined who could not physically interact with other staff. Recently we have been trying to strengthen this aspect of our culture, but it’s hard. No one thinks it is a bad idea for staff to interact more, but people have gotten used to doing their own thing, and changing this is difficult.

At the field level, I have been complaining for years about poor experiments in NLP, including poor data sets (blog), lack of interest in reproducibility (paper), serious data contamination issues (blog), poorly executed experiments (paper), bad benchmarks (blog), lack of interest in real-world impact (paper), etc. Other people have made similar complaints. And yet little seems to change. I think this is because these issues are aspects of a difficult-to-change research culture; people are used to doing experiments in certain ways, and do not see any pressing need to change. Indeed, a senior researcher explicitly told me this many years ago; she agreed in principle with my concerns but said that in practice she was not going to change what she did, because the NLP field accepted it.

Culture can change in new or changing fields

So culture is hard to change once established. But it can change when a field changes. I think one example of this in NLP was the WMT shared task in Machine Translation, which was introduced around 20 years ago when statistical MT was getting established as a research field. Chris Callison-Burch and others insisted that the WMT shared-task evaluation would be based on human evaluation, not on metrics, and ever since the MT community has had a research culture that values good evaluation. Not perfect by any means, but evaluation in MT has been far better than evaluation in summarisation, for example, where the ROUGE metric dominated for decades despite being meaningless (blog).

From this perspective, there may have been a chance to improve evaluation culture in NLP when LLMs took off around 2023. Unfortunately, this did not happen. Indeed, in some ways evaluation became worse, as OpenAI and others pushed a “culture” where evaluation was based on buggy metrics of little relevance to real-world users (blog).

Publication venues can change culture

A strong push by publication venues can also change research culture. A good example is in medicine, where the top journals jointly decided that they would only accept trials that had been preregistered (paper). Ie, authors had to enter the design of their experiment into a public website *before* carrying out their experiment if they wished to publish in one of the top journals. Preregistration is important for experimental rigour, and this decision by the journal editors led to widespread adoption of preregistration; ie, they succeeded in changing research culture in medicine.

Unfortunately, it’s hard to imagine this sort of thing happening in NLP. Most NLP papers go through ARR, and the people who run ARR have their hands full dealing with exploding numbers of submissions, a reviewer pool which is increasingly inexperienced and reluctant, and more and more inappropriate use of LLMs to write papers and reviews. Improving experimental quality in NLP is not high on their list of priorities.

In fairness, there was an effort several years ago by the big ACL conferences to improve reproducibility, for example with reproducibility checklists. But this did not achieve much (I say this having worked on reproducibility) and seems to have largely been abandoned.

It might be easier for smaller venues such as TACL to insist on changes such as better experimental rigour or reproducibility; in an ideal world this could encourage similar actions in ARR. However, I see limited interest in this kind of thing in TACL; the focus once again is mostly on dealing with rapidly rising numbers of submissions.

Final Thoughts

It is very difficult to change the NLP research culture which encourages poor scientific work, which is depressing. But this is the way the world works, unfortunately. Perhaps one “take-home message” is that if an opportunity to change culture arises, we should grab it, because such opportunities do not occur very often.
