
Questions from readers of my book

I had a really nice discussion session last week with a group of people who are reading my book on NLG. They had a lot of questions, both at the session and afterwards, so I thought I would share my responses in a blog. Note that I have simplified and reworded the questions, and also in some cases combined related questions.

Introduction (Chapter 1)

Q: Looking back, are there NLG ideas that were underappreciated at the time but feel relevant today?

A: Absolutely. A good example is language grounding (connecting language to the world). I did some work on this in the early 2000s, and ran an NAACL-2003 workshop and journal special issue. Interest from the NLP community was minimal; I think we had more computer vision than NLP people at our (thinly attended) NAACL workshop. Whereas now language grounding and language/vision are popular topics. Perhaps one difference is that in 2003 this was a topic only of scientific interest, whereas now it is also of commercial interest.


Q: What types of NLG-system-as-model-of-human-language-production experiments do you find interesting?

A: This is not my core interest, but I did have a small involvement in a project which explored whether LLMs exhibited human-like language behaviour. To be honest, I am not sure that such experiments teach us much about how humans process language, because of the black-box nature of LLMs. I think it is more interesting to investigate human-like behaviour of rule-based NLG systems, because the rules can then give us insights about human language processing.


Q: This book seems to focus on applications. Does it overlook non-application-oriented areas of NLG? Is there space for complementary books which are less applied?

A: I have always been interested in applications, in part because I think working on real-world NLG problems gives us valuable scientific insights about language, NLG, and computational linguistics (some of these are described in blog posts). But of course other people have different interests! It would be great if other people wrote books about NLG, and I would be happy to help if I can.


Rule-based NLG (Chapter 2)

Q: Where are rule-based (or hybrid rule+neural) NLG systems used?

A: Usually in mission-critical or safety-critical applications where generated texts must be correct, and where quality assurance is essential. LLMs fail in unexpected and unanticipated ways, and quality assurance is very hard for complex black-box stochastic systems. I see many hybrid systems where rules or “white-box” ML are used to select content and LLMs are used to express content; or where rules are used to generate critical portions of a document which must be correct, and LLMs are used to generate supporting material which is not safety-critical.
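The hybrid division of labour described above can be sketched as a small pipeline. This is an illustrative toy, not a real system: all names are invented, the rules are simplistic, and call_llm is a stand-in for an actual LLM API call.

```python
# Minimal sketch of a hybrid rule+LLM NLG pipeline.
# Rules handle safety-critical content; the LLM only adds supporting text.

def select_content(record: dict) -> list[str]:
    """Rule-based content selection: deterministic and auditable."""
    facts = []
    if record["heart_rate"] > 160:
        facts.append(f"Heart rate elevated at {record['heart_rate']} bpm.")
    if record["spo2"] < 90:
        facts.append(f"Oxygen saturation low at {record['spo2']}%.")
    return facts

def realize_critical(facts: list[str]) -> str:
    """Safety-critical text comes straight from templates, so it can be QA'd."""
    return " ".join(facts)

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call that adds non-critical background."""
    return "Background: recent trends are available on request."

def generate_report(record: dict) -> str:
    critical = realize_critical(select_content(record))   # must be correct
    supporting = call_llm(f"Add context for: {critical}")  # not safety-critical
    return f"{critical}\n{supporting}"
```

The key design point is that the LLM output is additive: a mistake in the supporting text cannot corrupt the rule-generated critical facts.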


Q: Since omitting important patient information is a content error, should content planning remain explicit to prevent omissions?

A: Explicit content planning is possible in limited patient-information domains, but very difficult in open-ended dialogues about a wide variety of problems. If explicit content planning is feasible, I suggest that it be used. If not, I suggest considering RAG approaches; these are not a panacea, but they can help.
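A RAG approach of the kind mentioned above can be sketched very simply: retrieve approved source passages and instruct the model to answer only from them. Everything here is a toy stand-in (the keyword-overlap retriever, the two-passage knowledge base); a real system would use proper embedding-based retrieval.

```python
# Toy RAG sketch: ground answers in an approved knowledge base.

KNOWLEDGE_BASE = [
    "Paracetamol dosing for adults: maximum 4g per 24 hours.",
    "Ibuprofen should be taken with food to reduce stomach irritation.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank passages by crude keyword overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(KNOWLEDGE_BASE,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str) -> str:
    """Assemble a prompt that restricts the LLM to retrieved context."""
    context = "\n".join(retrieve(question))
    return (f"Answer using ONLY the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {question}")
```

Even this simple structure illustrates why RAG helps with omission errors: the content that must be conveyed is pulled from a curated source rather than left to the model's parametric memory.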


ML and Neural NLG (Chapter 3)

Q: In Neural NLG, to what extent does adoption and success depend on data quality versus architecture-related factors?

A: I think data issues are paramount (and I stress them in this chapter). Modern LLMs are trained on the internet, which means that their training data is full of mistakes, quackery, marketing material, obsolete information, etc. A lot of the failures I see in LLMs can be traced to these problems. In ML generally, data quality is usually more important than the ML algorithm, model, or architecture.


Q: Does lack of data make AI less effective in less developed countries?

A: Absolutely! One well-known aspect is limited linguistic data in under-resourced languages. But there are many other data problems as well. For example, my student Thompson worked on a safe-driving app for Nigeria (blog), and he had to create his own dataset of Nigerian driving data, because no such dataset existed.


Requirements (Chapter 4) and Evaluation (Chapter 5)

Q: What, in general, are the things we need to evaluate for an NLG system? How do we choose evaluation techniques for a system or use case?

A: This should be based on requirements. We first decide which quality criteria matter in this context, at both the text and system level (this is described in Chapter 4 of my book), and then choose appropriate evaluation techniques for these quality criteria (this is described in Chapter 5). In many cases we will end up evaluating some form of accuracy, fluency and utility (blog).
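The two-step process above (decide which quality criteria matter, then choose techniques for each) can be sketched as a simple mapping. The criteria and techniques listed here are illustrative examples, not a complete taxonomy from the book.

```python
# Sketch of requirements-driven evaluation planning: map the quality
# criteria a use case requires to candidate evaluation techniques.

TECHNIQUES = {
    "accuracy": ["expert fact-checking", "error annotation"],
    "fluency": ["automatic metrics", "reader ratings"],
    "utility": ["task-based user study", "A/B deployment test"],
}

def plan_evaluation(required_criteria: list[str]) -> dict[str, list[str]]:
    """Select evaluation techniques for the criteria that matter here."""
    return {c: TECHNIQUES[c] for c in required_criteria if c in TECHNIQUES}
```

The point is simply that the evaluation design falls out of the requirements analysis, rather than being chosen first.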


Q: Is evaluation based on requirements too narrow a concept? Should we instead characterise systems and models in a way which lets us predict utility in new use cases?

A: I love the concept, but I don't know how to operationalise it. With LLMs, fluency is almost always high, and utility by definition depends on use case. Perhaps accuracy could be characterised in general terms (eg, expected type and frequency of hallucinations); this would be worth exploring.


Q: What should we report alongside evaluations? There are several frameworks (eg, HEDS, EvalCards); which should we choose?

A: Unfortunately there is no agreement in this space, not least because the different frameworks are aimed at different kinds of evaluation. For human evaluation, which is my main interest, the best framework at the moment is HEDS. It would be very useful if we had agreed and widely used frameworks for reporting evaluations!


Q: Can human-in-the-loop evaluation introduce new forms of bias?

A: Human evaluators of course have their own biases, and more generally some things are best evaluated automatically (eg, spelling mistakes, and perhaps fluency more generally) and others are best evaluated by humans (eg, utility). I would love to see carefully designed evaluation protocols which combine human and automatic evaluation in a complementary fashion.


Safety, Testing, and Maintenance (Chapter 6)

Q: How difficult is it to maintain rule-based systems and adapt them to changing domains?

A: Maintenance (including domain shift) is challenging for both rule-based and neural NLG systems (blog). For rule-based systems, maintenance is conceptually similar to maintenance of other software systems; it is a lot of work (most of the life-cycle cost of software is maintenance), but we understand what needs to be done. It is less clear how to maintain LLMs; in particular, we can add new knowledge to an LLM, but how do we remove obsolete knowledge (we do not want LLMs following obsolete medical guidelines, recommending bankrupt companies, giving legal advice based on superseded laws, etc)? I have seen very little research on maintaining NLG systems of any type, which is a real shame.


Applications (Chapter 7)

Q: Are inconsistencies in regulatory frameworks a risk to NLG systems?

A: Absolutely. The problem is not just differences between countries; it's also that rules within countries frequently change. A stable, internationally accepted regulatory framework would really help, but it may take years or even decades for this to emerge.


Q: How should we document NLG systems (beyond standard software documentation)?

A: If using LLMs, it is of course essential to document the exact version of the LLM used. It's also important to be clear about quality assurance processes and the types of mistakes that the system can make.


Q: If Babytalk were implemented today, how might a 2026 version leverage LLMs while still preserving its safety and utility goals?

A: Probably we would use a hybrid approach (as described above), where rule-based NLG was used for core content, while LLMs were used to adapt content to individual users and also add additional background and contextual information. Actually, one of the biggest differences in 2026 in this space is that there is much more data available, and the data is also starting to become more standardised; this makes it much easier to build scalable systems which can be widely deployed.


Other: NLG academic life

Q: Many scientific papers have mistakes. How do you deal with this?

A: Scientific errors are a huge problem, which we explored in part in ReproHum (eg, paper). I personally am skeptical when I initially read papers, and usually assume they are scientifically flawed unless the authors convince me otherwise. Unfortunately, the AI and NLP communities do not have a culture of acknowledging mistakes and correcting or retracting papers, which makes the problem much worse. It would really help to change this culture, but this is difficult (blog).


Q: How should CS people approach people from other fields to get their inputs and/or actively involve them in projects?

A: You need to understand where the other people are coming from; not just domain knowledge but also perspective (eg, medical researchers have a very different perspective on evaluation). You should also make sure that your collaborators benefit from the collaboration; for example, publish papers in medical journals as well as NLP venues.

