Do LLM benchmarks ignore NLG?
I was very disappointed to realise that the evaluation suite for Amazon Nova (and I assume for other LLMs) has poor coverage of NLG tasks. This is surprising, since LLMs are largely used to generate text; shouldn't they be evaluated, at least in part, on their ability to do this well?