We need better LLM benchmarks
Current benchmarks and benchmark suites for evaluating LLMs are disappointing. I describe the properties that I think good benchmarks and benchmark suites should have but often lack, such as being correct, challenging, diverse, and real-world.