I’m just back from attending INLG2025 in Hanoi. Lots of interesting things, but one topic that came up a few times was user studies to understand what NLG users want and expect (ie user requirements). This is not much discussed in the academic NLP and ML communities, but it’s essential to building useful tech. It is also something which will remain interesting and important even if implementing NLG systems becomes trivial in some cases (eg, write a few LLM prompts and hook together some modules).
Of course companies have much better data than academics on how users use NLG and what they would like to see in the future, but they are usually not able to share this data (I had a good discussion about this with one of the invited speakers, Hadas Kotek from Apple). So if academics want data on what users want, they need to collect it themselves (which Hadas encouraged me to do).
I share some thoughts about this below. There is also a chapter on Requirements in my 2025 book, which covers some (but not all) of what is below; note that the book can be read for free on Arxiv.
Aberdeen work on user studies
We are already doing quite a bit of work on understanding user requirements in my group at Aberdeen, and in particular we are trying to do this in a structured and replicable manner. Techniques include:
- HCI techniques (Knoll et al 2022): We used HCI techniques (mockups, Wizard-of-Oz, etc) to understand what clinicians wanted and expected from a medical scribe system which writes notes that summarise doctor-patient consultations (previous blog).
- Qualitative analysis (Sun et al 2024, Sivaprasad et al 2025): We used qualitative techniques (structured interviews, focus groups, etc) to understand what users wanted from health apps which supported IVF users and cancer patients (previous blog).
- Surveys (not yet published): We used surveys with drivers and other stakeholders to understand what they were looking for in systems which provided feedback to drivers in the UK and Nigeria (previous blog).
- User feedback (Sivaprasad et al 2025): Last but not least, we analysed feedback comments from users of a deployed IVF success prediction model (users of the model had the option to leave comments) in order to understand what extra help they wanted (previous blog).
All of the above activities led to interesting and valuable insights about what users wanted. For example, we discovered that users of the medical scribe system wanted summaries to be generated incrementally as the doctor-patient discussion progressed; we found out that people using an IVF success prediction model wanted to know why the model ignored features which they thought were important; and we realised that a Nigerian driving-feedback app should emphasise drunk driving information.
The above insights are important research contributions in their own right (which will help other people working on related topics), and they also made our systems much more useful. So trying to understand what users wanted was definitely worth doing!
Issues and advice
Participants: The first thing I tell people in this space is that it is essential to work with people who understand and care about the application. We always work with either domain experts or committed users who care about the app. We do not use crowdworkers or random students.
Numbers: In part because of these constraints on participants, the work described above was done with relatively small numbers of users (50 participants or fewer). Because of this, we often use qualitative techniques to analyse data, which is fine as long as it is done rigorously (blog). Of course more participants would be better, not least because this would give a better understanding of functionality which is only important to a limited number of users. A related issue is that in our user studies, participants were typically asked to look at a small number of scenarios; more scenarios would give better coverage of unusual or edge-case situations.
User feedback on deployed systems: One way to get feedback from more people and scenarios is to look at feedback from users of deployed systems. This usually requires getting permission from the system’s owners, which is not a problem for employees (eg, Hadas has access to feedback from Apple users) but can be very difficult for outsiders such as academics. We have managed to do this on two occasions, although on one of these we were not allowed to fully disclose what we learned.
Publication: Unfortunately the academic NLP community often does not seem to place much value on studies of user needs and requirements, which seems somewhat strange to me; surely these are more interesting and valuable than experiments comparing different prompts or commercial LLMs to see which works better? Sometimes it is easier to publish in domain venues, such as medical journals for analyses of user needs in medical contexts.
Final thoughts
If we want to make NLP applications more useful, then a deep understanding of what users want (and do not want) is essential, and often has much more impact than model fine-tuning, prompt engineering, etc (of course we can do both). It is also very interesting, and can reveal real insights about the application. So I encourage both researchers and developers to take this seriously!