Ten tips on doing a good evaluation
I present some suggestions for doing good evaluations, which are based on previous blogs I have written.
A really important and interesting research challenge is how to effectively communicate complex information to patients. At Aberdeen we are working on this topic in several areas of medicine, and are looking for a research fellow to join us.
Data contamination (testing and evaluating LLMs using test data which is already known to the LLM) may be a huge problem in NLP, leading to a lot of invalid scientific claims. Unfortunately, many NLP researchers ignore the problem, which is really worrying.
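To make the problem concrete, here is a minimal sketch of the kind of n-gram overlap heuristic that contamination studies often use. It assumes plain-text access to (a sample of) the training corpus, which is exactly what we usually do not have for commercial LLMs; the function names and the choice of n here are illustrative, not from the post:

```python
# Minimal sketch of an n-gram overlap contamination check.
# ASSUMPTION: we have plain-text access to (a sample of) the training
# corpus -- rarely true for commercial LLMs, which is part of the problem.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(test_item: str, corpus_ngrams: set, n: int = 8) -> bool:
    """Flag a test item if any of its n-grams also occurs in the training corpus."""
    return not ngrams(test_item, n).isdisjoint(corpus_ngrams)

# Illustrative usage with stand-in strings for real documents:
training_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
corpus_ngrams = set()
for doc in training_docs:
    corpus_ngrams |= ngrams(doc)

print(looks_contaminated(
    "we saw the quick brown fox jumps over the lazy dog yesterday", corpus_ngrams))
```

Even this crude check only catches verbatim overlap; paraphrased or translated test data slips straight through, which is one reason the problem is so hard to rule out.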
In 2019 I told students that neural language models produced texts which were fluent but could not be trusted content-wise. In 2024 I told them the same thing. My high-level message hasn't changed despite the huge improvements in the technology; maybe this is a fundamental aspect of LLMs?
Communicating uncertainty to non-experts is very important but also very difficult. Problems in communicating risk probabilities are well known, but additional challenges arise in many real-world use cases, including communicating time-series of probabilities and explaining the impact of features which the model ignores.
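As a tiny illustration of one small piece of this puzzle, here is a hypothetical sketch of mapping a probability onto a calibrated verbal phrase, loosely following the IPCC likelihood scale; real risk communication of course needs far more care than a lookup table:

```python
# Hypothetical sketch: probability -> calibrated verbal phrase,
# loosely following the IPCC likelihood scale. Real risk communication
# needs far more care (framing, audience, context) than a lookup table.

def verbal_probability(p: float) -> str:
    """Map a probability in [0, 1] to a verbal likelihood descriptor."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    scale = [
        (0.99, "virtually certain"),
        (0.90, "very likely"),
        (0.66, "likely"),
        (0.33, "about as likely as not"),
        (0.10, "unlikely"),
        (0.01, "very unlikely"),
    ]
    for threshold, phrase in scale:
        if p >= threshold:
            return phrase
    return "exceptionally unlikely"

print(verbal_probability(0.7))   # -> "likely"
print(verbal_probability(0.05))  # -> "very unlikely"
```

Note that a scheme like this says nothing about the harder cases mentioned above, such as a time-series of probabilities or features the model ignores.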
Systematic literature reviews are a powerful and useful methodology for investigating many research questions. I give a high-level overview for NLP researchers who are not familiar with this technique.
Our latest paper from the ReproHum project discusses experimental flaws we have encountered while reproducing earlier experiments, including code bugs, UI problems, inappropriate exclusion of data, reporting errors, and ethical lapses. Pretty depressing. These types of errors are not detected by usual NLP reviewing practices, so I suspect they may be pretty common…
There is a lot of justified concern about the risks of using LLMs in healthcare. But LLMs can also make medicine safer, if they are used to support doctors and help them make fewer medical errors.
One very positive aspect of 2023 for me was that I saw lots of really interesting research papers, many more than in previous years. Perhaps this is because the emergence of LLMs has encouraged some people to move away from scientifically dubious leaderboard chasing and towards more interesting research on scientific fundamentals? I describe a few of these papers here.
I summarise a few papers I have recently read on what LLMs can and cannot do. One (not surprising) finding is that LLMs' skill profile is very different from humans'. This is good: it means that a human and an LLM working together can do things that neither could do on their own. It also means that it makes little sense to evaluate LLMs using tests and techniques designed to evaluate people.