One-day class on NLG evaluation
In early Sept I ran a one-day class on evaluation. I summarise what I did in this class and give links to my presentations, in case this is useful to other people.
Sometimes the latest technology is *not* appropriate for an NLG task. I saw this very strongly in the late 2010s with LSTMs (which do not work well for data-to-text), and continue to see this in 2024 (GPT4 is not always the best approach). Both researchers and developers need to be open-minded about alternative approaches.
I’m now part-way towards retirement, working fewer hours (for less money) while having more time for trips and other personal activities. A few people have asked about this, so I thought I’d explain in a blog.
AI has many promising applications in healthcare, but adoption of AI in healthcare is very slow. One message from a recent workshop I attended is that it would help if AI researchers had a better understanding of the requirements of the health sector, including evaluation, challenges, and business cases.
I list five challenges to evaluating LLMs, which unfortunately seem to be ignored by many researchers; this means that many published LLM evaluations cannot be trusted. This blog is based on a recent workshop talk.
This is a personal blog about a recent bike trip I did from Perth (Scotland) to Preston (England).
I suspect we may be reaching the point where the most common type of human evaluation in NLG (ratings/rankings by crowdworkers or students) is less meaningful than evaluations using LLMs. But better forms of human evaluation, based on annotation or impact, are still very useful and give insights which we cannot get from LLMs.
My student Barkavi Sundararajan has shown that LLMs do a better job at data-to-text if the input data is well structured. She will present a paper about this at NAACL.
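A minimal sketch of the kind of contrast involved, assuming a simple weather record (the fields, data, and JSON format below are my own illustration, not the representation used in the paper):

```python
import json

# Illustrative weather record (hypothetical data, not from the paper).
record = {
    "location": "Aberdeen",
    "date": "Tuesday",
    "temperature_c": 12,
    "conditions": "rain",
    "wind_speed_kmh": 30,
}

# Loosely structured input: the model must guess which number means what.
flat_prompt = (
    "Write a one-sentence weather report from this data:\n"
    "Aberdeen 12 rain 30 Tuesday"
)

# Well-structured input: explicit, labelled fields reduce ambiguity.
structured_prompt = (
    "Write a one-sentence weather report from this data:\n"
    + json.dumps(record, indent=2)
)

print(flat_prompt)
print(structured_prompt)
```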
In human evaluation, it is absolutely essential that subjects understand what they are supposed to do; otherwise evaluations will not be meaningful or replicable. This may sound obvious, but it was repeatedly raised as a concern in the replication shared task in the 2024 Human Evaluation workshop.
People working in AI in Medicine (and indeed AI more generally) should be aware of the long history of previous work in this area. Our technology is much better in 2024, but real-world success is still challenging, as has been the case for the past 70 years (the first claims that models could be better than doctors were made in 1954).