Response to Goldberg’s Blog on Deep Learning for NLG

I wrote a comment in response to Yoav Goldberg’s An Adversarial Review of “Adversarial Generation of Natural Language” on Medium, which essentially critiques some research using deep learning in NLG, focusing on papers published in less prestigious venues. Quite a few people seem to have read this comment, so I am reposting it on this blog.


I saw that Mike White mentioned my name, so I thought I would comment directly. A lot of the discussion is about papers published in second-tier venues, but from my perspective there are also major problems with DL NLG papers published in top venues. Perhaps less drastic, but it's a question of degree.

This was brought home to me last year when I attended NAACL 2016 (in order to give an invited talk on NLG evaluation), which was the first time I had been to an ACL event in several years. I went to listen to a NAACL paper about using DL for NLG, and was absolutely horrified.

(1) The evaluation was weak, because the authors just used BLEU, which is a questionable way to evaluate NLG systems.

(2) One of the main training corpora used was the output of a rule-based NLG system. So were the authors trying to show that they could use DL to reverse engineer a rule-based system and steal the IP of someone who spent a lot of time carefully hand-crafting NLG rules?

(3) The presenting author was completely unaware of previous work in the NLG community on the problems he was solving (this was apparent in the Q&A session as well as in the paper). He claimed his system was better than state-of-the-art, but to me his output texts looked considerably worse than stuff we were producing 15 years ago.

I am willing to be convinced that DL is a good approach for NLG, but I need to see experiments and papers with solid evaluation, sensible and appropriate corpora, and good awareness of NLG state-of-the-art. Papers like the above NAACL one don't leave me with a good impression of DL for NLG.

I’d also like someone to explain to me how we can evaluate the worst-case (as well as the average case) performance of DL systems, because this is really important.

Finally, to echo some of the other opinions which people have expressed, there is a caricature of a DL (or indeed ML) NLP researcher as someone who just wants some corpora and a way to keep score, and has no interest in whether the “score” means anything and also no interest in the provenance or suitability of the corpora. I realise this is a caricature, but I think it has some truth, and I don't think this is the right attitude for making progress in NLP.
