Great to know, but what was the cost both in terms of $$ and tokens used?
Not to invalidate these benchmark results, because they are useful, but the real measure is what these models are capable of doing when real people interact with them at scale.
Regardless, this is good news, because now that Microsoft is basically giving up on its all-in strategy with GitHub's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time they got pressed into not turning this AI world into a divide between the haves and the have-nots.
You’d be surprised with some long running complex tasks. I’ve seen Kimi spend 8 minutes (total) thinking on a task that Claude got done in 30 seconds. They both ultimately got it right, but Kimi spent ~$2.25 to Claude’s ~$0.20
I'm a paying customer and I did not receive ANY communication about this. Was using Opus this afternoon and then it disappeared.
Microsoft really can't stop being Microsoft. I don't dispute the need to charge more for those models, but there is a basically decent way to do things, and as usual the Big Tech fuckery and complete lack of morals makes them do it in a way that generates total mistrust where there could have been mere annoyance.
I'll see how Sonnet handles the most difficult problems, but I foresee a subscription cancellation soon.
It appears what really ended their little scam was the $421 million of reported revenue based on complete lies.
Because lying to investors about product hasn't really been an issue lately. Even Intel, ~5 years ago, gave presentations that were complete fantasy, back when they were desperate to keep their stock value up but could not produce a chip smaller than 14nm.
If they prosecuted CEOs for lies to investors beyond accounting fraud, almost every AI startup would go down.
CEOs can say basically anything when talking about the future. They just have to include a safe harbor disclaimer about forward-looking statements.
The more I live, the more I believe people at the top operate in some sort of cult mentality. The level of gullibility and temporary lack of critical thinking is only matched by their sociopathy and Machiavellianism.
I'm sure it's a great big model, but the level of hype and dishonesty is something straight out of Sam Altman's playbook.
Of course it's because of the upcoming IPO, but that's the endgame. For now, it's critical to get those private equity guys and banking institutions to believe the gospel and hold the bag; only then will the suckers in the secondary markets be allowed to be suckers too.
> A good percentage of cybersecurity has always been theater
It is great to be in a "best-effort" business where there are no consequences for bad things happening. Cybersecurity is one of those businesses. Web search, feeds and ads are another.
Imagine you are selling locks to secure homes. A thief breaks the lock. The lock-maker is not held liable. In fact, they now start selling stronger locks, and lock sales actually improve with more thefts.
I'm definitely optimistic that the long-term trajectory is positive. All important software can undergo extensive penetration testing with cutting-edge vulnerability research techniques before launch? Sounds great. The problem is what goes wrong on the pathway to there.
There's a serious problem with being very popular/prominent/powerful: you become surrounded by sycophants through a sort of survival of the fittest, and then develop a progressively more distorted view of reality as a result. When everything can be made to appear to work for the person at the center, they start making progressively worse decisions, which are consequence-free because of the sway they already have. (This is a big reason why "disruptor" startups work.)
Or, you're wrong. And the smartest AI Research Scientists and the top banking officials are both correctly worried about the ramifications. That's what you'd expect if there really was an issue here. Are you aware of the deep seated bugs in critical software that were already uncovered with Mythos? Are you able to steelman the issue here at all?
> Are you aware of the deep seated bugs in critical software that were already uncovered with Mythos
This. 100% this.
A large portion of the industry is under NDA right now, but most of the F500 have already deployed, or started deploying, foundation models for AppSec use cases all the way back in 2023.
Sev1 vulns have already been detected using "older" foundation models like Opus 4.x
Of course the noise is significant, but that's something you already faced with DAST, SAST, and other products, and is why most security teams are also pairing models with experienced security professionals to adjudicate and treat foundation model results as another threat intel feed.
Historically bad security that people just got by with, now matched with powerful tools that aren't any better than the best people but can be deployed by mediocre people.
Which is exactly what Anthropic understands the situation to be. They state at the beginning of the Glasswing blogpost that Mythos is not better than the best vulnerability researchers. But it doesn't have to be to become a tremendously big deal.
There is not just a lower barrier to entry. The best use of a tool will still be made by the most knowledgeable users. So we’re looking at lowering the bar some, but another big deal is the scale at which the top experts can work. That might actually be the longer lever. Imagine a top expert burning tokens across whole repo histories of a few dozen projects looking for likely but unconfirmed flaws, then having the model flag and rank those suspects for their own review in triaged order.
People, and by people I mean architects and lead devs at big-account orgs ($$$), have been using S3 as a filesystem as one of the backbones of their usually wacky, mega-complex projects.
So there has always been pressure on AWS to make it work like that. I suspect the volume of support tickets AWS receives along the lines of "My S3-backed project is slow/fails sometimes/runs into AWS limits (like the max number of buckets per account)", plus the "Why don't you.." questions in the design phase where AWS people are often in the room, served as enough long-applied pressure to overcome S3's technical limitations.
I'm not a fan of this type of "let's put a fresh coat of paint on it and pretend it's something it fundamentally is not" abstraction. But I suspect this is a case of social pressure turbocharged by $$$.
I'm of the opinion that while e.g. xAI is in a pump game, OpenAI is at least trying to make money. But even if they're not, even if the DCs are as you say "a financial/political vehicle to pump the markets", they can still be physically real things.
That said, I have no idea how close to complete the Stargate UAE site is.
Totally an organic and transparent marketplace that joins together publishers and consumers huh?
It has been going down since the COVID boom for obvious reasons, and since then it has dropped even more.. Google needing billions to pour into the AI burner is just an unfortunate coincidence..
Please remind me: is there any legitimate business venture that can operate outside the laws of the country where it is registered?
If there is, why don't these people who write blog posts and comments about how "this is all a scam!!", "It's a psyop! 'They' control it all!" build it? If it's all black and white, if there's no real difference between a company like Proton and Google or Microsoft, then why don't they create a business that provides a service where there's no way for any government to know anything at all, ever? They'd be printing money..
But perhaps the conspiracy realm and public broadcast of ideals is more attractive than a real business.
Yes, you shouldn't put 100% trust in a person, let alone a group of people that form a company. Grow up.
Relax. While mentioning the real world without any criticism of the soundness of the solution is absolute nonsense, some would say idiotic, thinking only of the absolute best solution from your narrow world view is not any better.
While I agree that my view is narrow, the "best solution" in question is what we used to do, and it was fine. There are still many places that manually manage dependencies. Fundamentally, automatic software versioning is an under-developed area in need of attention, and technologies like semantic versioning, which are ubiquitous, are closer to suggestions than true indicators of breaking changes. My personal view is that fully automatic dependency version management is an ongoing experiment and should be treated as such.
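To make the "suggestions, not indicators" point concrete, here is a minimal sketch (hypothetical helper names, not from any real tool) of what a semver compatibility check can actually tell you: it only verifies what the version number *claims*, since nothing enforces that a maintainer actually bumped the major version for a breaking change.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into integer components."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def upgrade_claims_compatibility(current: str, candidate: str) -> bool:
    """Return True if semver *declares* the upgrade non-breaking.

    Under the semver convention, any upgrade within the same major
    version is supposed to be backward compatible. Whether the
    maintainer actually honored that is unverifiable from the version
    string alone -- which is why this check is closer to a suggestion
    than a guarantee.
    """
    cur = parse_semver(current)
    cand = parse_semver(candidate)
    return cand[0] == cur[0] and cand >= cur


# An upgrade from 1.4.2 to 1.9.0 "claims" compatibility;
# 1.4.2 to 2.0.0 explicitly does not.
print(upgrade_claims_compatibility("1.4.2", "1.9.0"))  # True
print(upgrade_claims_compatibility("1.4.2", "2.0.0"))  # False
```

Automatic dependency updaters effectively run this check and trust the answer; manual management replaces that trust with a human reading the changelog.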