If you are considering using a new dataset from a repository such as Kaggle, you should first check that its data is of high quality and appropriate for your needs. A bit of “due diligence” at the beginning can stop you wasting lots of time and effort on an unsuitable dataset.
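As a rough illustration of what a first pass at this due diligence might look like, here is a minimal sketch that counts missing values and exact duplicate rows. The function name `quick_quality_report` and the list of missing-value markers are my own assumptions for illustration; a real check would also need to ask where the data came from and whether it matches your use case.

```python
from collections import Counter


def quick_quality_report(rows, missing_markers=("", "NA", "N/A", "null", "None")):
    """Summarise basic quality signals for rows given as dicts.

    Hypothetical helper: the missing_markers defaults are assumptions,
    not a standard; adjust them to the dataset being checked.
    """
    rows = list(rows)
    columns = sorted({key for row in rows for key in row})

    # Count cells that are absent or hold a recognised "missing" marker.
    missing = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() in missing_markers:
                missing[col] += 1

    # Count rows that exactly repeat an earlier row.
    seen = Counter(
        tuple(sorted((key, str(value)) for key, value in row.items()))
        for row in rows
    )
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "n_rows": len(rows),
        "columns": columns,
        "missing_per_column": {col: missing[col] for col in columns},
        "duplicate_rows": duplicates,
    }
```

The rows could come from `csv.DictReader` in the standard library. Large counts of missing values or duplicates are only warning signs that deserve a closer look, not proof that a dataset is unusable.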
Texts produced by NLG systems need to communicate valuable, useful, and accurate information. I would love to see more research on content production and selection in NLG.
If we want to use NLG to communicate information to all sorts of different people, then it would be really helpful if the NLG system could adapt its language to the reading skill, domain knowledge, emotional state, etc. of the user. I think this kind of user adaptation is essential to achieving my vision of using NLG to humanise data.
Users want to be able to modify and customise NLG systems on their own, without needing to ask developers to make changes. Academic researchers mostly ignore this, which is a shame, since there are a lot of interesting and important challenges.
I was very impressed by a recent paper from a team at Facebook about a production-ready end-to-end neural NLG system. Especially interesting to me was the “engineering” approach to key issues such as accuracy, data collection, and latency.