evaluation

There are many types of human evaluation!

Many people assume that “human evaluation” means asking people to rate or rank outputs. However, there are many other types of human evaluation, most of which give more meaningful results than rating or ranking! I discuss some of these, including task-based evaluation, annotation-based evaluation, and real-world evaluation.

evaluation

Evaluating ChatGPT

I love getting questions about how to evaluate ChatGPT; they are much more constructive than speculation about whether it is a threat to humanity. We need to understand what LLM technology can and cannot do, and rigorous experiments are the best way to do this. I give some advice and caveats about evaluating ChatGPT in this blog, and am happy to answer questions from people who want to do high-quality evaluations.

evaluation

How effective is prompting?

I was very impressed by a recent paper that compared prompting-based MT to MT based on trained models. The results are very interesting: prompting-based MT generates fluent texts, which nevertheless have accuracy problems. The paper itself is also an excellent example of a high-quality NLP evaluation, and I recommend it to anyone who wants to do good NLP evaluations.