I recently tried to explain to a graduate student why I was concerned that regression to the mean might affect the evaluation of his NLG system. This was not something he was very familiar with, as it doesn't seem to be discussed much in the NLP or AI literature. So I will try to explain the problem here, using an anonymised and simplified version of what my student was trying to do. Regression to the mean isn't usually an issue in NLG evaluations, but it can be important in some cases, so people should be aware of it.
Imagine building an NLG system which gives people weekly advice and feedback on their diet, using data collected from a diet/fitness app. The NLG system analyses the user’s diet in light of weight, gender, age, and so forth, and identifies areas in which the diet could be improved, such as calorie, sugar, or salt intake. It then gives feedback and motivational messages to the user to identify problem areas and encourage them to change their diet. Partially in order to keep the weekly feedback varied (instead of repeating the same message again and again), each week the system focuses on a specific behaviour which seems to be getting worse, and produces text such as “Usually your salt intake is OK, but last week it was much higher than usual. Remember that too much salt can cause high blood pressure!”
So far so good; indeed, researchers have been experimenting with systems for dietary behaviour change for at least 20 years. But how should we evaluate the effectiveness of such a system?
What my student effectively wanted to do was to measure whether the behaviour mentioned in the week N feedback report had improved in week N+1. For example, if someone with normally sensible salt consumption had too much salt during week 10 and received the above message, my student wanted to measure whether salt intake was lower in week 11 than in week 10; if so, he would consider this to be evidence that his system was working.
And this is where regression to the mean becomes an issue. Suppose the user had high salt intake in week 10 because of unusual circumstances; perhaps he went to a party where lots of his friends were eating salty crisps, and he ate the crisps as well in order to be sociable. But this was a one-off event, and in week 11 he reverted to his normal eating habits. If so, his salt consumption would be lower in week 11 than in week 10 even without any feedback or advice from a dietary NLG system, simply because he was reverting to his usual behaviour. This means it makes no sense to evaluate the NLG system by testing whether the user reduced salt intake from week 10 to week 11, because we expect this reduction to happen anyway, regardless of the NLG system.
In other words, things like salt intake, sugar intake, etc. fluctuate from week to week because “stuff happens”. And if one of these measures is unusually high one week, the laws of probability mean it will probably be lower the following week, simply because it will probably be closer to its typical (mean) value. So while it makes perfect sense from a health perspective to target feedback at unusually bad or worsening behaviour, we cannot evaluate the effectiveness of that feedback just by testing whether the behaviour in question improved the next time we measured it, because this would probably happen in any case.
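A tiny simulation makes this concrete (a sketch with made-up numbers, not real dietary data): even for a user whose underlying habits never change at all, the weeks flagged as unusually high are almost always followed by a lower week.

```python
import random

random.seed(0)

# Simulate 10,000 weeks of salt intake (grams/day) for a user whose
# habits never change: a stable mean of 6 g plus week-to-week noise.
MEAN, SD, HIGH = 6.0, 1.5, 7.5
weeks = [random.gauss(MEAN, SD) for _ in range(10_000)]

# "Flag" the unusually high weeks -- the ones a feedback system would
# target -- and look at what happens the following week.
pairs = [(w, nxt) for w, nxt in zip(weeks, weeks[1:]) if w > HIGH]
improved = sum(nxt < w for w, nxt in pairs)

print(f"Flagged {len(pairs)} high weeks; "
      f"intake fell the next week in {improved / len(pairs):.0%} of them")
```

Despite zero intervention, the large majority of flagged weeks are followed by an “improvement”, which is exactly the trap a week-N-versus-week-N+1 evaluation falls into.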
Daniel Kahneman tells a great story about regression to the mean in his book Thinking, Fast and Slow. He describes talking to flight instructors who claimed that shouting at trainees was effective and praising them was useless, contrary to psychological findings that praise is more effective than punishment. The instructors had observed that when a trainee did badly and was shouted at, he did better the next time, whereas when a trainee did well and was praised, he did worse the next time. Kahneman pointed out that this was just regression to the mean: a trainee who did badly was probably going to do better the next time regardless, because he would most likely return to performing at his usual level. Similarly, a trainee who did well was probably going to do worse the next time simply because he was returning to his usual level.
Anyway, returning to my student: I told him that the best way to assess the effectiveness of his system was to measure long-term change in behaviour, not week-by-week changes. Long-term change is, after all, what we care about in behaviour-change contexts, and it is also less influenced by “noise” (e.g., one-off events such as the user going to a party with lots of salty crisps).
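One simple long-term measure (a sketch with invented numbers, not my student's actual analysis) is to compare average intake at the start and end of the study period: averaging over several weeks washes out one-off events, so genuine change separates cleanly from no change.

```python
import random
from statistics import mean

random.seed(1)

# Simulate a year of weekly salt intake (grams/day). `start` and `end`
# set the user's true underlying mean in the first and last week.
def simulate(start, end, n_weeks=52, sd=1.5):
    return [random.gauss(start + (end - start) * t / (n_weeks - 1), sd)
            for t in range(n_weeks)]

# A simple long-term measure: compare the last two months with the
# first two. Averaging over 8 weeks smooths out one-off events
# like a party with salty crisps.
def long_term_change(weeks):
    return mean(weeks[-8:]) - mean(weeks[:8])

# Average over many simulated users so the remaining noise washes out.
improving = mean(long_term_change(simulate(7.0, 6.0)) for _ in range(500))
unchanged = mean(long_term_change(simulate(7.0, 7.0)) for _ in range(500))

print(f"genuinely improving users: {improving:+.2f} g/day")
print(f"unchanged users:           {unchanged:+.2f} g/day")
```

Unlike the week-over-week comparison, this measure shows a clear reduction only for the users whose underlying behaviour actually improved, which is the effect a behaviour-change evaluation should be trying to detect.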