I saw lots of interesting papers in 2024 (as in 2023). I’ve mentioned some of them in previous blogs (see the list below), but there are others which I have not blogged (much) about; I describe a few of these below.
Obviously these are papers which *I* have found interesting, so they are on topics such as evaluation, experimental rigour, real-world utility, and healthcare applications. I assume most of my blog readers share some of my interests, so hopefully they will also find at least some of the papers below interesting.
A Aggarwal et al (2024). NHS cancer services and systems—ten pressure points a UK cancer control plan needs to address. Lancet Oncology (DOI)
I am very interested in using AI in healthcare, in contexts where it is possible to build systems that are deployable and address real needs. The AI in medicine community sometimes seems to have little awareness of this (blog), so I found it really useful to see what doctors thought was needed in cancer care. The section “Pressure point 9: Technology adoption and value” should be read by anyone who cares about real-world impact in healthcare (and elsewhere). Their number one pressure point was changing demographics and dealing with inequality; can we use AI to help poor people in deprived areas improve their health and deal with cancer? That would make a difference!
D Braun and F Matthes (2024). AGB-DE: A Corpus for the Automated Legal Assessment of Clauses in German Consumer Contracts. Proc of ACL-2024 (ACL Anthology)
This paper presents a corpus of German consumer contracts which has been annotated by legal experts; the resource is released to the community. I think the community badly needs high-quality expert-annotated corpora, especially in languages other than English, so I was very happy to see this. I also loved their example of domain drift, where GPT gave a wrong answer because it used an old law which had been superseded in 2014 (ie GPT could not deal with a domain change that happened 10 years ago). Something to keep in mind in any domain which changes over time!
F Moramarco (2024). Evaluation of medical note generation systems. PhD thesis, University of Aberdeen (Aberdeen library)
It’s unusual for CS researchers to read PhD theses, but I highly recommend Francesco’s thesis to anyone interested in real-world evaluation. Partly because the thesis tells a story which shows how a wide range of evaluation techniques (metrics, ratings, annotation, task-based) were used to assess a medical note generator. And partly because Chapter 7 discusses Francesco’s evaluation of the system in real-world clinical usage; this kind of evaluation is very rare in NLP (and it’s only in the thesis, it’s not described in any of Francesco’s papers).
J Opitz (2024). A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice. Transactions of the ACL (DOI)
I am very worried about the experimental quality and rigour of evaluations in NLG. I loved Opitz’s paper, which is a very thorough and careful analysis of metrics for classification-based evaluations, such as sentiment analysis. The detailed problems and issues that Opitz finds are different from what I see in NLG; for example, he observes that “Macro F1” is ambiguous and means different things in different papers (and some authors don’t say which definition of Macro F1 they are using). But the overall message is similar to that of our own work (eg Thomson et al (2024)): experimental quality in NLP is often poor. A small sketch of the Macro F1 ambiguity is below.
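To make the Macro F1 ambiguity concrete, here is a minimal Python sketch (my own illustration, not code from Opitz’s paper) of two definitions that both circulate under the name “Macro F1”: the arithmetic mean of per-class F1 scores, and the harmonic mean of macro-averaged precision and macro-averaged recall. On imbalanced data these can give different numbers for exactly the same predictions.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, F1 for one class (one-vs-rest counts)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_f1_mean_of_f1s(y_true, y_pred):
    """Variant 1: arithmetic mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_prf(y_true, y_pred, l)[2] for l in labels) / len(labels)

def macro_f1_of_macro_averages(y_true, y_pred):
    """Variant 2: harmonic mean of macro-averaged precision and recall."""
    labels = sorted(set(y_true) | set(y_pred))
    precs, recs = zip(*[per_class_prf(y_true, y_pred, l)[:2] for l in labels])
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    return 2 * p * r / (p + r) if p + r else 0.0

# A toy 3-class sentiment example where the two variants disagree
y_true = ["pos", "pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "neu", "neu"]
print(macro_f1_mean_of_f1s(y_true, y_pred))       # ~0.656
print(macro_f1_of_macro_averages(y_true, y_pred))  # ~0.693
```

A paper reporting “Macro F1 = 0.69” under one definition is thus not comparable to one reporting 0.66 under the other, which is exactly why authors need to say which definition they use.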
A Peppin (2024). The Reality of AI and Biorisk. Arxiv
There was a period when I saw lots of gurus speculating that AI could destroy humanity, with one likely path being that it enables terrorists to build out-of-control bioweapons. I see less of this in Dec 2024, but it is still the case that many AI safety regulators worry much more about this kind of thing than about, say, the risk that a chatbot encourages a child to commit suicide (CNN). So I really liked this careful and technically grounded analysis of the “biorisk” threat, which concludes that (A) there is no immediate risk and (B) serious analyses of biorisk need to look at the bigger picture (“whole-chain risk analysis”), not just the AI system in isolation.
Papers I have described in previous blogs
[Papers I co-authored]
S Balloccu et al (2024). Ask the experts: sourcing high-quality datasets for nutritional counselling through Human-AI collaboration. Findings of EMNLP (ACL Anthology) (blog)
S Balloccu et al (2024). Proc of the Human Evaluation workshop (ACL Anthology) (blog)
E Reiter (2024). Natural Language Generation. Springer (companion site) (blog)
A Sivaprasad and E Reiter (2024). Linguistically Communicating Uncertainty in Patient-Facing Risk Prediction Models. Proc of EACL workshop on Uncertainty-Aware NLP. (ACL Anthology) (blog)
M Sun et al (2024). Effectiveness of ChatGPT in explaining complex medical reports to patients. Arxiv (Arxiv) (blog)
B Sundararajan et al (2024). Improving Factual Accuracy of Neural Table-to-Text Output by Addressing Input Problems in ToTTo. Proc of NAACL-2024 (ACL Anthology) (blog)
C Thomson et al (2024). Common Flaws in Running Human Evaluation Experiments in NLP. Computational Linguistics. (journal link) (blog)
[Papers I did not co-author]
S Balloccu et al (2024). Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs. Proc of EACL-2024 (ACL Anthology) (blog)
A Bavaresco et al (2024). LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. Arxiv (Arxiv) (blog)
N Diakopoulos et al (2024). Generative AI in Journalism: The Evolution of Newswork and Ethics in a Generative Information Ecosystem. (ResearchGate) (blog)
T Kocmi et al (2024). Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies. Proc of ACL-2024 (ACL Anthology) (blog)
T Kocmi et al (2024). Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation. Proc of WMT (ACL Anthology) (blog)
G Leech et al (2024). Questionable practices in machine learning. Arxiv (Arxiv) (blog)
V Magesh et al (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Arxiv (Arxiv) (blog)
J Ruan et al (2024). Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation. Proc of NAACL-2024 (ACL Anthology) (blog)
R Zhang et al (2024). How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs. Arxiv (Arxiv) (blog)