Cycling from Perth to Preston
This is a personal blog about a recent bike trip I did from Perth (Scotland) to Preston (England).
I suspect we may be reaching the point where the most common type of human evaluation in NLG (ratings/rankings by crowdworkers or students) is less meaningful than evaluations using LLMs. But better forms of human evaluation, based on annotation or impact, are still very useful and give insights which we cannot get from LLMs.
My student Barkavi Sundararajan has shown that LLMs do a better job at data-to-text if the input data is well structured. She will present a paper about this at NAACL.
In human evaluation, it is absolutely essential that subjects understand what they are supposed to do; otherwise evaluations will not be meaningful or replicable. This may sound obvious, but it was repeatedly raised as a concern in the replication shared task in the 2024 Human Evaluation workshop.
People working in AI in Medicine (and indeed AI more generally) should be aware of the long history of previous work in this area. Our technology is much better in 2024, but real-world success is still challenging, as has been the case for the past 70 years (the first claims that models could be better than doctors were made in 1954).
I really liked a recent survey of gen AI in journalism, which looks at issues such as how journalists use/interact with LLMs, and what impact this has on journalists. Some findings were unexpected (to me); for example, the most common ethical concern is that news organisations will use LLMs without human supervision.
I present some suggestions for doing good evaluations, which are based on previous blogs I have written.
A really important and interesting research challenge is how to effectively communicate complex information to patients. At Aberdeen we are working on this topic in several areas of medicine, and are looking for a research fellow to join us.
Data contamination (testing and evaluating LLMs using test data which is known to the LLM) may be a huge problem in NLP, leading to a lot of invalid scientific claims. Unfortunately, many NLP researchers ignore the problem, which is really worrying.
In 2019 I told students that neural language models produced texts which were fluent but could not be trusted content-wise. In 2024 I told them the same thing. My high-level message hasn't changed despite the huge improvements in tech; maybe this is a fundamental aspect of LLMs?