Language Grounding and Context
Some thoughts on language grounding, especially choosing words to express data, and how this depends on context.
Unfortunately I suspect many researchers make their results look better by using poor baselines. I give some thoughts on this, based on a recent discussion with a PhD student.
Some thoughts about when I feel comfortable being a coauthor on a paper, expressed as a letter to someone who put me on a paper as a co-author without asking me first.
Some musings on principled and theoretically sound techniques for automatically evaluating NLG systems.
My advice on how to perform a high-quality validation study, which assesses whether a metric (such as BLEU) correlates well with human evaluations.
BLEU works much better for MT systems than for NLG systems. In this blog I present some speculations as to why this is the case.
My structured survey of BLEU suggests that BLEU-human correlations are worse for German than for many other languages. But there are many caveats, so we need to be cautious in interpreting this result.
The correlation between BLEU and human evaluations of MT systems seems to be increasing over time. Since BLEU has not changed, how is this possible, and what does it mean?
In response to a previous blog, many people expressed concerns to me about the quality of many papers they saw on ML in NLP. I summarise some of these concerns, which are worrying.
I was recently asked if machine learning requires evaluation metrics. The answer is no, and the fact that people are asking such questions suggests that some newcomers to the field may have a limited perspective on NLP research methodology.