
Of everything happening with AI, the most bizarre thing, for me, is that these tools are $20 away from being tested first-hand. Yet, to form an idea of their actual real-world usefulness, many folks seek some kind of indirect proxy.

This combines with the widespread assumption that automatic programming can be evaluated as if it produced the same results regardless of who is using it. That is basically true only of benchmarks. Benchmarks are useful metrics because, even if weak, we need some guidance; but the real-world dynamic is that AI completely changes what it is capable of doing based on the programmer using it.

Maybe never in the history of programming have diverse programming skills been as important as they are today (though this may change as AI evolves).



Benchmarks do a few things: 1. Help choose a model from the hundreds out there, or at least help create a shortlist to try. 2. Quantify progress/improvements (or lack thereof) over time. 3. Inform about relative strengths and weaknesses.


Assuming the benchmark can't be gamed.


> automatic programming can be evaluated as producing the same results regardless of the user using it.

That's something I've argued here several times, yet it's rarely done. Namely, it's totally different when a non-developer uses such a tool for programming versus when a (senior) SWE does. That's a fundamental distinction, which IMHO marks the line between (non-risk-free) augmentation and replacement. Replacement makes for an excellent narrative (if not a scapegoat), yet if the tool is "productive" (with KPIs to be agreed upon) only in the hands of skilled staff, then replacement is not the reality, just a "wish".


I'm about to put up the $20 to see what everyone is raving about. But the real cost is time: if this doesn't work out, I'm worse off than if I had never tried.



