
Ehud Reiter's Blog

Ehud's thoughts and observations about Natural Language Generation


Blog Index

Note: I have added a (***) to blogs which have been viewed 1000+ times

Building NLG Systems

Accuracy Errors Go Beyond Getting Facts Wrong
AI professionals also focus on change management
Amateurs focus on models; professionals focus on data
Bad Data Means Bad Output
Boring uses of language models
Can ChatGPT do Data-to-Text? (***)
Care Needed in Analytics and Data Science!
Challenges are Same for Neural and Rule NLG
Challenges of Surface Realisation
Check Out a New Dataset Before Using It
Dealing with Edge Cases in NLG
Difficult Words for Neural NLG Systems
Does Deep Learning Prefer Readability over Accuracy?
Does Quality Matter in Training Data?
Election results: Lessons from a real-world NLG system
Embedding Machine Learning in a Rules-Based NLG System (***)
Generated Texts Must Be Accurate! (***)
Hallucination in Neural NLG (***)
How do I Build an NLG System: Requirements and Corpora (***)
How do I Build an NLG System: Testing and Quality Assurance
How do I Build an NLG System: Tools? (***)
How Should Different NLG Components Add Value?
INLG: What real-world NLG users want
Is building neural NLG faster than rules NLG? No one knows, but I suspect not.
Is GPT3 Useful for NLG? (***)
Language is diverse!
Learning does not require evaluation metrics
Lexical Choice Needs Machine Learning!
LLMs and Data-to-text
ML is Used More if it Does Not Limit Control
Natural Language Generation and Machine Learning (***)
NLG Systems Must be Customisable
NLG vs Templates: Levels of Sophistication in Generating Text (***)
NLG=Task+Data+Model/Alg+Eval
Pain Points in Health NLG: Data, Evaluation, Safety
Pragmatic correctness is a challenge for NLG
Problems in using LLMs in commercial products (***)
Real-World Neural NLG
Should I Use Deep Learning?
Simple vs Complex Models
Sports NLG: Commercial vs Academic Perspective
Skills Required to Use Different NLG Technologies
Summarisation datasets should contain summaries!
Testing Multiple Hypotheses
Texts should be adapted to users
The story of simplenlg (***)
Use Good Engineering Methodology When Building NLG Systems!
Varying Words In NLG Texts
We Need Robust Ways to Select Content of NLG Texts
We need to understand what users want!
What are the Problems with Rule-Based NLG?
What Makes a Good Narrative?
You Need to Understand your Corpora! The Weathergov Example (***)

Evaluating NLG Systems

A bad way to measure hallucination
A Consumer Perspective on Evaluation
Accuracy, Fluency, and Utility
Are Experts Needed in Human Evaluation?
BLEU-Human Correlation is Increasing: What does this Mean?
BLEU in Different Languages: Dont use it for German
Evaluating Accuracy
Evaluating chatGPT (***)
Evaluating factual accuracy in complex data-to-text
Evaluation: Plan ahead, details matter, keep it simple, pilot, be careful
Evaluation Grand Challenge: Is NLP System Good Enough for a Use Case?
Evaluation in Medicine and NLG/NLP
Exercise: Find Problems in an Evaluation
Future of NLG evaluation: LLMs and high quality human eval?
How to do an NLG Evaluation: Metrics (***)
How to do an NLG Evaluation: Human Ratings in Artificial Context (***)
How to do an NLG Evaluation: Human Ratings in Real-World Context
How to do an NLG Evaluation: Task-Based (Extrinsic) Performance in Real-World Context
How to Validate Metrics (***)
How Would I Automatically Evaluate NLG Systems?
Humans make mistakes too
I’m very worried about data contamination
Is BLEU valid? First observations and concerns
Keep Good Records of Your Experiments
Lets use error annotations to evaluate systems!
Mistakes in Evaluating ML
MSc Course on Evaluating AI
My Guidelines for Evaluating AI Systems
My MSc students evaluate chatGPT
Objective evaluation of NLG texts
Please Use Two-Tailed P Values!
Real-world utility is based on many things
Research Ethics of A/B Testing
Regression to Mean
Shared Task on Evaluating Accuracy?
Small differences in BLEU are meaningless
Study Design for Systematic Review of BLEU Validity: Comments Welcome!
Ten tips on doing a good evaluation
Texts can be accurate but still not appropriate
There are many types of human evaluation!
Types of NLG Evaluation: Which is Right for Me? (***)
Use Proper Baselines!
We need more extrinsic (task) evaluation!
We should evaluate real-world impact!
Why do we still use 18-year old BLEU? (***)
Why doesnt BLEU work for NLG?
Why is ROUGE so popular?

Academic Life

Academic NLG should not fixate on end-to-end neural
Academic Researchers Should be Scouts and Explorers
Academic Teaching vs Commercial Training Courses
ACL vs TACL Reviewing
Apologies to my students for limited feedback!
Best Papers I Read in 2020
Can I present my paper twice?
Challenging NLG datasets and tasks
Commercial and Academic Perspectives on NLG (and AI?)
Common Flaws in NLP Evaluation Experiments
Could there be fraud in NLP Research?
Does chatGPT make leaderboards less meaningful?
Doing Less
Good Papers are Hard to Publish
Engineering Perspective: Understand Issues, Find Simple Solution
How can I tell if a paper is scientifically solid?
How I Review Papers
I dont like leaderboards
I enjoy reviewing for TACL
I’m Impressed by Capetown Uni’s Diversity
Limits of pre-publication reviewing
Managing Research Projects is Painful but Necessary
More discussion, fewer papers at conferences?
My PhD Students: Where Are They Now (June 2017) (***)
NLP has become much more interesting!
Our 2022 Publications: NLG Evaluation, Requirements, Resources
Peer Review Has Improved My Papers
Please check the boring details in your paper!
Publication Requirements for PhD Students
Publish in Journals!
Real-World Impact of Academic Research
Reviewing has changed over the years; conferences need to change as well
What are Benefits of Physical Conferences?
What Should Academic NLP Researchers Focus on? (***)
Why I do not Want to be a Co-author on Your Paper (***)
“Will I Pass my PhD Viva” (***)

Other Topics

Adding Narrative to a Covid Dashboard
Bayesian vs Neural Networks (***)
Can LLMs make medicine safer?
ChatGPT error or human error?
chatGPT in Health: Exciting if we ignore the hype
chatGPT: Great science, unclear commercials, hate the hype (***)
Come Join Us in Aberdeen!
Conversational data-to-text
Could NLG systems injure or even kill people?
Do people “cheat” by overfitting test data (***)
Do We Encourage Researchers to Use Inappropriate Data Sets? (***)
Exciting NLG Research Topics (June 2017)
Farewell to Richard Kittredge, pioneer in applied NLG
Get Your Hands Dirty!
Google: Please Stop Telling Lies About Me (***)
Has Neural NLG Become More Scientific?
How accurate do chatGPT texts need to be?
How do I Learn about NLG? (***)
How do Users React to NLG?
Human editing of NLG texts
In 2019 LM output was fluent but not trustworthy: still true in 2024
Language Grounding and Context (***)
Lessons from 25 Years of Information Extraction
Lets Use ML for Insights! (***)
LLM hype brings memories of IBM Watson
Lots about evaluation and methodology at INLG – Great!
Many Papers on Machine Learning in NLP are Scientifically Dubious
My Vision for SIGGEN
New book on NLG?
New project PhilHumans: Better interaction in personal health apps
NLG and Explainable AI (***)
NLG texts should not upset people
Non-Experts Struggle with Information Graphics
Notes from a Dev Conf: Sensible Attitude to Trendy AI Tech, Arria Presentations
PhD on using AI/NLG to help cancer patients at home
Product Descriptions
Project and Research Fellow Position in Reproducibility of Human Evaluations
Real-world usage of LLMs in Journalism
Summarising Messy Data
Tableau buys Narrative Science
Text or Graphics?
Response to Goldberg’s Blog on Deep Learning for NLG (***)
Vision: NLG Can Help Humanise Data and AI
We can learn from the past in AI/Medicine
What LLMs cannot do (***)
Where is NLG Most Successful Commercially? (***)
Why isnt Research Software such as BabyTalk Used?
Why isnt there More Open-Source NLG Software? (***)
Working in Universities vs Companies (***)
Writing NLG Pages for Wikipedia

Personal

Climate change makes me angry
Cycling through Northern England
Cycling through Southwest England
Cycling Through Wales, England, and my Wife’s Family History
Goodbye to a Synagogue
Life is “Flat” under Lockdown
My Father Takes Me to Mexico
My Son Visits Home
The Brexit Mess

Arria blogs written by me (these are intended for non-specialists)

Chatbots are a great way to present insights!
Choosing words to clearly describe data
Corpus Analysis: A great way to understand what your NLG system needs to do
Finding creative solutions to detect mistakes in neural-NLG narratives
Humans post-editing NLG-generated narratives
NLG drives consistent narratives
NLG is different from other language technologies
OpenAI GPT System: What does it do?
The Grand Challenge for NLG: Making Data Accessible to Humans
The power of words
This 22-year-old book on NLG is still relevant
Why neural language models don’t work well in NLG

Other blogs

“Lying” in computer-generated texts: hallucinations and omissions (OUP)
