Cycling in the Netherlands
This is a personal blog post about a recent bike trip I did, mostly in the Netherlands.
My student Adarsa Sivaprasad is looking into what questions users of an AI prediction model actually have, and how these should be answered. Amongst other things, users seem to have more questions about what information a model considers than about how the model works.
We have a really nice NLP research group at the University of Aberdeen, with a dozen researchers who work on topics such as evaluation, interpretability, health applications, cognitive aspects, and cross-temporal research. We regularly publish and win awards in top venues. It's exciting!
It's been around 6 months since my new NLG book was released. I summarise what I now think are its key messages, for rule-based NLG, ML and neural NLG, requirements, evaluation, safety/testing/maintainability, and applications.
Most LLM benchmarks and leaderboards are garbage. Unfortunately, it now seems that even the few “good” benchmarks (such as SWEBench and Chatbot Arena) are compromised because they are being gamed by the big LLM vendors, who tweak the benchmarks and rules so that their systems do better.
I describe several papers which measure real-world impact of NLP systems, using different methodologies (A/B test, before/after eval, clinical trial, observational study). I hope these examples inspire and encourage more people to consider evaluating real-world impact!
I suspect that our fixation with LLM benchmarks may be driving us to optimise LLMs for capabilities that are easier to benchmark (such as math problems) even if they are not of much interest to users; and also to ignore capabilities (such as emotional appropriateness) which are important to real users but hard to assess with benchmarks.
People will make much better use of LLMs if they understand what the technology can and cannot do. Unfortunately many people have little understanding of this; I make a few suggestions which perhaps could help a bit.
Nikolay Babakov has recently published several papers on Bayesian networks (BNs), including challenges in reusing BNs, ideas for explaining BNs (work with Jaime Sevilla), and using LLMs to help build BNs. I help to supervise Nikolay, and think BNs can potentially be a useful way to do reasoning with uncertainty which is configurable and explainable.
I would love to see benchmarks which assess whether generated texts are emotionally upsetting. This is a major problem which we frequently encounter in our work on using AI to support patients. It would be challenging to build such a benchmark (nothing like it exists today), but we need a broader range of benchmarks which assess complex real-world quality criteria such as emotional impact.