At Aberdeen, we have an NLP reading group which meets weekly. Most of the time we read published papers from elsewhere, although occasionally participants ask the group to review drafts of a paper or proposal they are writing. The group meets at 4PM on Wednesdays, and visitors are welcome! If we're reading a published paper, I usually tweet the paper to @EhudReiter (I don't do this if we are discussing a draft or unpublished paper).
Last month we read a paper which really made an impression on me, "25 Years of Information Extraction" by Ralph Grishman. In this paper, Grishman (who has been around even longer than me, he got his PhD in 1973) summarises what has happened between 1994 and 2019 in the NLP subfield of information extraction (IE), that is, extracting structured information from natural language texts. Below I list a few of the points from this paper that I thought were really interesting; I encourage people to read the paper so they can benefit from its other insights.
Significant but not amazing progress
Grishman says (page 686)
performance (F score) after more than 25 years of development has only advanced from the low 60s to the low 70s on standard event classification benchmarks
In other words, all of the generic developments in NLP over the past 25 years (machine learning, deep learning, corpora, massive increases in computational power), plus 25 years of focused research on IE by lots of very smart and dedicated researchers, have resulted in significant performance improvements in IE. They have not, however, led to the kind of “order of magnitude” improvements which we have seen over this period in speech recognition and machine translation.
NLP is a broad area, and the set of techniques which the NLP community has developed over the past 25 years (including deep learning NLP) have had a huge impact in some parts of NLP, but have not had this kind of impact in other areas. If our goal is to "crack" NLP as a whole, we need to keep looking for new ideas, and avoid assuming that the latest trendy idea (fancy grammars in 1994, deep learning in 2019) will solve all problems.
Researchers don’t like complex evaluations
Information extraction has traditionally been evaluated based on precision, recall, and F-measure. On page 685, Grishman describes the fate of the ACE evaluation model, which was an attempt by the US government funding body to introduce an evaluation metric which was more closely aligned with real-world utility. However researchers refused to use ACE except in formal reports to the US government; in academic papers they stuck to recall, precision, and F-measure. Grishman speculates that this is because ACE was complex and hence not intuitive to researchers, and also perhaps because “the raw value scores were so low for events—below 15%—and participants felt embarrassed to report such a score”.
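For readers less familiar with the standard metrics mentioned above, here is a minimal sketch of how precision, recall, and F-measure are computed for an extraction task. The event IDs are made up for illustration; real IE scoring also has to decide what counts as a "match", which this sketch sidesteps by using exact set membership.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of extracted items against a gold-standard set
    (exact-match scoring; real IE scorers use more elaborate matching)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: items both extracted and in gold
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: the system extracts 4 events, 3 of which
# match the 5 events in the gold standard.
pred = {"e1", "e2", "e3", "e7"}
gold = {"e1", "e2", "e3", "e4", "e5"}
p, r, f = precision_recall_f1(pred, gold)
# precision = 0.75, recall = 0.6, F1 ≈ 0.667
```

Part of the appeal of these metrics is visible even in this tiny sketch: they need no human judgement at scoring time, which (as discussed below) is exactly why researchers prefer them to more meaningful but more labour-intensive evaluations.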
This reminded me of other attempts by US government funders, such as DARPA, to get researchers to use more complex and realistic evaluation measures. They tried to get summarisation researchers to use Pyramid evaluation; this is a complex evaluation involving some human annotation, which attempts to measure the quality of the content (not just the surface form) of the summary. I liked Pyramid, but the summarisation community did not, and my understanding is that Pyramid is rarely used in 2019, while the simplistic ROUGE metric is still going strong. A perhaps similar story could be told about attempts by funders to get machine translation researchers to evaluate MT systems on the amount of effort required to post-edit MT texts into acceptable translations (ie, an extrinsic task-based measure). I think the TER and HTER measures are still used a bit, but BLEU is used much more.
In short, even funding agencies such as DARPA struggle to get academic researchers to use evaluation techniques such as ACE, Pyramid, and HTER which are complex and often require human effort, but give results that are better predictors of real-world utility. There is a strong bias in NLP towards simple, easy, cheap evaluations which do not require human annotation, such as BLEU, ROUGE, and F-measure, even if these evaluations are less meaningful than the alternatives.
Corpora vs Rules
The last observation I'll mention here is a comparison of building systems with rules vs machine learning (ML), when there is no existing corpus, so that corpus-building must be included in the ML approach. Grishman says (page 683)
Preparing patterns by hand requires considerable skill and insight but may yield a relatively clean (high precision) system. The preparation of an annotated corpus may require less skill but more time.
In short, writing rules is quicker and results in a better system, but it requires access to highly skilled individuals who can write rules. Creating a corpus for ML requires a lot more time and results in a buggier system, but we can do this with relatively unskilled labour.
NLG is another area where corpora are rarely available. I’ve usually thought that creating a corpus in such contexts was silly, since someone who knows what he is doing (like me) can write the necessary rules much faster than he can annotate a sufficiently large corpus for ML. But Grishman is right that there are a lot of contexts where skilled NLP labour is a scarce resource but unskilled annotators are cheap and available in large numbers, either via Mechanical Turk or (if we want higher quality) via a commercial annotation service which outsources the annotation work to a low-wage country.
There are lots of other interesting insights and observations in this paper; I encourage people to look at it themselves! Maybe I should write something similar for NLG?? But I think I'll wait until 2025, so I can review 25 years since my book was published.