Are Experts Needed in Human Evaluation?
An ACL paper from the PhilHuman project compares experts and non-experts as raters in human evaluation. It concludes that agreement between experts and non-experts is worse for texts from GPT-3 than for texts from GPT-2; in other words, non-expert evaluation is less useful for the high-quality texts produced by recent LLMs.
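To make "agreement" concrete, here is a minimal sketch of one common way to measure it between two raters, Cohen's kappa. This is an assumption for illustration only: the summary above does not say which agreement measure the paper uses, and the function name and the toy ratings below are invented, not taken from the paper.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same items.

    Illustrative only; the paper's own agreement measure may differ.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)

    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented toy ratings for six texts (hypothetical, not data from the paper).
expert_ratings = ["good", "good", "bad", "bad", "good", "bad"]
non_expert_ratings = ["good", "bad", "bad", "bad", "good", "good"]

print(round(cohen_kappa(expert_ratings, non_expert_ratings), 3))
```

A value near 1 means the two raters almost always agree beyond what chance would produce, and a value near 0 means their agreement is no better than chance. Under this reading, the paper's finding would correspond to a lower value for GPT-3 texts than for GPT-2 texts.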