15 years ago, I wrote a “grand challenge” for UKCRC about using NLG, and indeed AI more broadly, to communicate information to the general public in “their language”, so that they can better understand and use data to make decisions and otherwise improve their lives. The amount of data available to us is increasing exponentially, and it’s supposed to help us understand what is going on and make sensible decisions. But the reality is that an awful lot of people feel drowned rather than empowered by the amount of data available to them, especially if it is technical (e.g. medical or financial).
Empowering non-specialists to use data would be a fantastic achievement and use of NLG and AI, but unfortunately progress on this challenge has been slow and indeed disappointing. My new PhilHumans project is more or less in this space, which got me thinking again about the challenge more broadly. What areas do we need to make progress on, especially if our goal is a deep understanding of the underlying issues (which is certainly my perspective as a researcher)? I list some of these areas below; there are others.
Text and Graphics
Data can often be communicated using words, information graphics, or a combination of the two. We know that the choice depends on the user, the type of data, and the context, and we have some high-level guiding principles. We have also known for decades that understanding information graphics requires training, and that visually appealing graphics may not be effective decision aids. In the context of my challenge, this means that non-specialists in particular may struggle to make decisions based on graphics, because of lack of training, a focus on what is visually appealing and salient, and mixed levels of numeracy. We also have some interesting case studies in specific areas such as medicine and weather.
But we don’t have good theories and models which developers can use to make this decision (text vs graphics vs a mixture) in specific contexts. If we could start developing such theories, this would really help to achieve my vision!
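To make the gap concrete, here is a minimal sketch (in Python) of the kind of ad-hoc heuristic a developer might write today in the absence of proper theories. The factors (numeracy, graph literacy, data size, task) reflect the high-level principles above, but the scales, thresholds, and rules are illustrative assumptions, not established results.

```python
from dataclasses import dataclass
from enum import Enum

class Modality(Enum):
    TEXT = "text"
    GRAPHIC = "graphic"
    MIXED = "text + graphic"

@dataclass
class Context:
    user_numeracy: float    # 0.0 (low) to 1.0 (high); hypothetical scale
    graph_literacy: float   # 0.0 to 1.0; hypothetical scale
    n_data_points: int      # how many values need to be communicated
    task_is_trend: bool     # does the user need to judge a trend over time?

def choose_modality(ctx: Context) -> Modality:
    """Hand-written heuristic, NOT an established theory.

    Roughly encodes the principles above: graphics demand graph
    literacy, trend tasks suit graphics, and small simple data can
    often just be described in words.
    """
    if ctx.user_numeracy < 0.3 and ctx.graph_literacy < 0.3:
        # Neither number-heavy text nor charts alone work well here:
        # combine a plain-language summary with a very simple graphic.
        return Modality.MIXED
    if ctx.graph_literacy < 0.3:
        # Untrained readers: lean on words, adding a chart only if the
        # data is too big to summarise verbally.
        return Modality.TEXT if ctx.n_data_points < 10 else Modality.MIXED
    if ctx.task_is_trend and ctx.n_data_points >= 10:
        # Trend-spotting over many points is where graphics shine.
        return Modality.GRAPHIC if ctx.graph_literacy > 0.7 else Modality.MIXED
    # Default: short textual summary plus a supporting graphic.
    return Modality.MIXED

print(choose_modality(Context(0.5, 0.2, 30, True)))  # Modality.MIXED
```

The point of the sketch is what it lacks: every threshold is a guess, and nothing justifies why these factors (and not others) should drive the choice. A proper theory would tell us which factors matter and how to weight them.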
Dialogue and Interactivity
My research has focused on “report generation”, where an NLG system generates a report that may be customised for a user, but is not interactive or part of a dialogue system. If we are going to build systems which successfully help people understand and use data, however, these systems will almost certainly need to be interactive. And if the goal is to help the general public, then we almost certainly need to include dialogue as part of the solution!
So dialogue is not something I have worked on personally, but it will undoubtedly be an essential part of the solution!
Narrative
People understand information best when it’s presented as a story or narrative, which is why the commercial world places such a huge emphasis on story-telling in business intelligence and decision making. So it is really important, especially when communicating with the general public, that our NLG systems produce narratives and stories. Unfortunately, our understanding of how to do this is pretty patchy, and tends to rely more on imitating the structure of human-written narratives and stories than on generating stories based on a deep understanding of what makes a good story.
Imitating human structures (by machine learning or otherwise) often works OK in practice, but I find this unsatisfactory. A proper understanding and good theories/models of generating stories and narratives would really help, especially in producing robust systems that work well in unusual circumstances.
Data Quality
In my experience, one of the biggest problems in understanding and using data is dealing with data quality issues, including missing data, noisy data, and incorrect data. This is especially true for non-specialists; indeed, one sign of a domain expert is that he or she is pretty good at dealing with data quality problems. Data quality matters because most real-world data sets have such problems! This issue has come up countless times in my work, including recently in a personal health context, where we were trying to communicate noisy health data to a user, and it was a real struggle to decide what to say and what not to say. It’s also a real problem for numerical presentations; for example, I remember a diabetes doctor complaining to me a few years ago that many of his patients had glucose monitors and would panic and over-react to an unexpected reading, even if the reading was just noise.
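To illustrate the detection half of the glucose problem, here is a minimal Python sketch that flags a reading as unexpected relative to recent history. The window size, threshold, and data values are made up for exposition and are certainly not clinical guidance.

```python
from statistics import median

def is_unexpected(history: list[float], new_reading: float,
                  window: int = 6, rel_threshold: float = 0.3) -> bool:
    """Flag a reading that deviates sharply from the recent median.

    Sketch only: the 6-reading window and 30% threshold are made-up
    values for exposition, not clinical guidance.
    """
    recent = history[-window:]
    if len(recent) < 3:   # too little context to judge what is "expected"
        return False
    baseline = median(recent)
    return abs(new_reading - baseline) / baseline > rel_threshold

# Hypothetical glucose readings in mmol/L.
history = [5.8, 6.1, 5.9, 6.0, 6.2, 5.9]
print(is_unexpected(history, 11.5))  # True: sharp deviation from baseline
print(is_unexpected(history, 6.4))   # False: within normal variation
```

Detecting the anomaly is the easy half. The hard NLG problem is deciding what to say once a reading is flagged: report it as a possible sensor artefact, stay silent until it recurs, or raise an alarm.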
Communicating problematic data sets is a real challenge in NLG, but it is essential to achieving the vision of empowering people to use data effectively!
PS: Come Join Me!
As a PS, if the above interests and excites you, let me know! I have collaborations with people around the world on related topics. If you’re interested in joining my academic research group, the Aberdeen University CS Dept is hiring new faculty at all levels, and also expects to have some funding for (UK and EU) MSc and PhD students. If you’re interested in commercial NLG, Arria’s Aberdeen office (which is located on the university campus) is very interested in talking to people with the right skills. So lots of opportunities at Aberdeen; again, feel free to contact me if you’re interested!