Do LLM coding benchmarks measure real-world utility?
LLM coding benchmarks come closer to real-world use than other LLM benchmarks, but they still do not measure real-world utility. I explain why by contrasting what SWE-bench measures with what a recent study of real-world utility in software development measures.