Peer Review Has Improved My Papers
At its best, peer review can significantly improve the quality of papers; it's not just an accept/reject gate. I describe a few examples where peer review has led to major improvements in the quality of my papers.
Generated texts can (unnecessarily and unintentionally) upset people. I describe several examples I’ve seen of this. This blog is based on a talk I gave at a SigDial workshop.
Many people assume that “human evaluation” means asking people to rate or rank outputs. However, there are many other types of human evaluation, most of which give more meaningful results than rating or ranking! I discuss some of these, including task-based evaluation, annotation-based evaluation, and real-world evaluation.
I see lots of big-picture talk about what LLMs can do, but at a practical level there are real challenges in using them in commercial applications. These include cost, stability, and the need for a human in the loop, as well as use-case-specific challenges.
Four of my MSc students did projects evaluating ChatGPT in different use cases: developer assistance, translation, exam taking, and emotional support. In every use case, ChatGPT was often very impressive, but also had issues and limitations.
In 2011 the IBM Watson question-answering system was presented as breakthrough AI technology that “changes everything”. There was a huge amount of hype, but in the end little real-world success. Are there lessons for LLMs?
An ACL paper from the PhilHuman project looks at using experts vs non-experts in human evaluation. It concludes that the agreement between experts and non-experts is worse for texts from GPT3 than texts from GPT2; in other words, non-expert evaluation is less useful for high-quality texts produced by recent LLMs.
At this moment in time, ChatGPT and other LLMs seem to be much better at the “language” side of data-to-text than the “content” side. Even on the language side, there are important caveats about real-world usage. Of course, the above may change as the technology improves.
Evaluation advice in one sentence: plan your experiments in advance, including details; keep your experiment as simple and standard as possible; do a pilot experiment first to make sure everything works; and be very careful when you run the main experiment.
This year I was both a TACL Action Editor and an ACL Senior Area Chair. This experience has reinforced my belief that the journal review process is better!