Future of NLG evaluation: LLMs and high-quality human eval?
We may see a big change in NLG evaluation over the next few years, with LLM-based evaluation replacing metrics such as BLEU and BLEURT, and a renewed emphasis on high-quality human evaluation to assess semantic and pragmatic correctness. It would be a step forward if this happens!