
Of everything happening with AI, the most bizarre thing, for me, is that these tools are $20 away from being tested first-hand. Yet, to form an idea of their actual real-world usefulness, many folks seek some kind of indirect proxy.

This combines with the widespread assumption that automatic programming can be evaluated as if it produced the same results regardless of who is using it. That is basically true only of benchmarks. Benchmarks are useful metrics because, even if weak, we need some guidance; but the real-world dynamic is that AI completely changes what it is capable of doing based on the programmer using it.

Maybe never in the history of programming have diverse programming skills been as important as they are today (though this may change as AI evolves).



Benchmarks do a few things: 1. Help choose a model from the hundreds out there, or at least help create a shortlist to try. 2. Quantify progress/improvements (or lack thereof) over time. 3. Inform about relative strengths and weaknesses.


Assuming the benchmark can't be gamed.


> automatic programming can be evaluated as producing the same results regardless of the user using it.

That's something I've argued here several times, yet it's rarely done. Namely, it's totally different when a non-developer uses such a tool for programming versus when a (senior) SWE does. That's a fundamental distinction, which IMHO marks the line between (non-risk-free) augmentation and replacement. Replacement makes for an excellent narrative (if not a scapegoat), yet if the tool is "productive" (with KPIs to be agreed upon) only in the hands of skilled staff, then replacement is not the reality, just a "wish".


I'm about to put up the $20 to see what everyone is raving about. But the real cost is time: if this doesn't work out, I'm worse off than if I had never tried.



