I recently attended a fascinating webinar by Cynthia Rudin on simple and complex models for tasks such as medical diagnosis, predicting criminal recidivism, and loan approval, where a model essentially produces a categorical output from structured input data. Prof Rudin argued that in her experience there are usually a large number of models with similar performance, some complex and some simple; she gave great examples of real-world cases where a complex black-box model could be replaced by a simple model (rules, decision trees, scoring systems, etc) with minimal (if any) loss in accuracy.
For example, in Learning Certifiably Optimal Rule Lists for Categorical Data, Rudin and her colleagues show (amongst other examples) that the following very simple rule list for predicting criminal recidivism is as good as or better than the complex COMPAS model:
if (age = 18 − 20) and (sex = male) then predict yes
else if (age = 21 − 23) and (priors = 2 − 3) then predict yes
else if (priors > 3) then predict yes
else predict no
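To make the rule list concrete, here is a minimal sketch of it as a Python function (the function name and input encoding are my own; the boundaries follow the rules exactly as printed above):

```python
def predict_recidivism(age, sex, priors):
    """Sketch of the rule list above: returns True for a 'yes'
    (predicted recidivism), False for a 'no'."""
    if 18 <= age <= 20 and sex == "male":
        return True
    elif 21 <= age <= 23 and 2 <= priors <= 3:
        return True
    elif priors > 3:
        return True
    else:
        return False
```

A model this small can be read, audited, and debated in its entirety, which is exactly the point Rudin makes.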
I actually talk about COMPAS in my class on evaluating AI, since it’s a classic example of bias/fairness problems in AI systems, so it was really interesting to see the above. Indeed, I ran across Prof Rudin’s work while I was updating the relevant lecture in my course; I tweeted a comment, and someone on Twitter responded that I should attend the above webinar, which I did. It shows that academic Twitter can be really helpful!
Anyway, simple white-box models of course have great real-world advantages because they are easy to explain, audit (including for bias/fairness), and modify. Also, because there are usually many simple models with similar performance, Rudin pointed out that we can combine ML and domain expertise by asking domain experts to choose the simple model which is most plausible to them.
Simple and complex models in NLG and NLP
Prof Rudin was not talking about NLP or NLG, but her talk made me wonder if there is anything similar in our field. One example that occurred to me is summarisation of news articles. We’ve known since the 1990s that a really good way to construct summaries of news articles is simply to take the first few sentences of the article; this “lead” baseline is simple, strong, and hard to beat. But academic NLG/NLP ignores this. I’ve read lots of summarisation papers that compare new neural models to existing neural baselines; I don’t think I have read any papers on summarisation published in the last 5 years which compare a new neural model against the “simple” baseline above (there probably are some, but I doubt there are many).
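The lead baseline really is this simple. A minimal sketch in Python (the regex sentence splitter is a naive assumption of mine; a real system would use a proper sentence tokenizer):

```python
import re

def lead_summary(article_text, n_sentences=3):
    """Lead baseline: summarise a news article by returning
    its first few sentences, joined back into one string."""
    # Naive split on whitespace following sentence-final punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", article_text.strip())
    return " ".join(sentences[:n_sentences])
```

A handful of lines, no training data, no GPU, and yet on news text it remains a surprisingly competitive comparison point.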
This seems a real shame! NLG/NLP academics seem fixated on complex neural models and show little interest in improving simple approaches, even when they work very well. I think people working in commercial labs are more open to this kind of thing, which is one difference between academic and commercial NLP.
Given the power of simple techniques such as decision trees (especially when built with modern algorithms like Rudin’s GOSDT, rather than 30-year-old algorithms like C4.5), it’s also surprising how rarely they are mentioned in NLG papers. Sameen Maruf had a paper on explaining decision trees (which I co-authored) at INLG 2021, but I don’t remember seeing other recent NLG papers which talk about decision trees.
In the real world, simple white-box models are preferred over complex black-box models if they have similar performance. So it’s important that researchers work on improving simple white-box models! It’s great to see Rudin’s work in this area, and I’d love to see more work with this focus in NLP and NLG.