evaluation

My MSc students evaluate chatGPT

Aug 16, 2023 (updated Aug 18, 2023) ehudreiter

Four of my MSc students did projects evaluating chatGPT in different use cases: developer assistance, translation, exam taking, and emotional support. In every use case, chatGPT was often very impressive, but also had issues and limitations.

other

LLM hype brings memories of IBM Watson

Jul 31, 2023 ehudreiter 5 Comments

In 2011 the IBM Watson question-answering system was presented as breakthrough AI technology that “changes everything”. Huge amounts of hype, but in the end little real-world success. Are there lessons for LLMs?

evaluation

Are Experts Needed in Human Evaluation?

Jul 10, 2023 ehudreiter 3 Comments

An ACL paper from the PhilHuman project looks at using experts vs non-experts in human evaluation. It concludes that the agreement between experts and non-experts is worse for texts from GPT3 than for texts from GPT2; in other words, non-expert evaluation is less useful for high-quality texts produced by recent LLMs.

building NLG systems

LLMs and Data-to-text

Jun 29, 2023 ehudreiter 1 Comment

At this moment in time, chatGPT and other LLMs seem to be much better at the “language” side of data-to-text than the “content” side. Even on the language side, there are important caveats about real-world usage. Of course, the above may change as the technology improves.

evaluation

Evaluation: Plan ahead, details matter, keep it simple, pilot, be careful

Jun 21, 2023 ehudreiter 1 Comment

Evaluation advice in one sentence: plan your experiments in advance, including details; keep your experiment as simple and standard as possible; do a pilot experiment first to make sure everything works; and be very careful when you run the main experiment.

academics

ACL vs TACL Reviewing

Jun 6, 2023 ehudreiter 2 Comments

This year I was both a TACL Action Editor and an ACL Senior Area Chair. This experience has reinforced my belief that the journal review process is better!

evaluation

Future of NLG evaluation: LLMs and high quality human eval?

May 22, 2023 ehudreiter 3 Comments

We may see a big change in NLG evaluation over the next few years, with LLM-based evaluation replacing metrics such as BLEU and BLEURT, and a renewed emphasis on high-quality human evaluation to assess semantic and pragmatic correctness. It would be a step forward if this happens!

academics

Limits of pre-publication reviewing

May 9, 2023 ehudreiter 4 Comments

Many problems in NLP papers can *not* be detected by reviewers who are checking submissions to conferences and journals. In medicine and many other fields of science, people can raise concerns about papers *after* they are published, and authors are expected to take this seriously. This is not the practice in NLP, which is a shame.

academics

Unresponsive Authors and Experimental Flaws

May 3, 2023 ehudreiter 8 Comments

In our ReproHum project, we have found that many NLP experiments are flawed, and many authors do not respond to requests for more information about their work. This is depressing and hinders scientific progress in NLP.

Uncategorized

chatGPT in Health: Exciting if we ignore the hype

Apr 9, 2023 (updated Apr 11, 2023) ehudreiter

I think there is a lot of potential in using chatGPT in healthcare, provided that we focus on real use cases instead of trying to debate whether chatGPT is somehow better than a doctor.

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.
