I have wondered if that’s why Grok seems so weird and dim-witted compared to better models.
Part of my job involves comparing the behavior of various models. Grok is a deeply weird model. It doesn’t refuse to respond as often as other models, but it feels like it retreats to weird talking points way more often than the others. It feels like a model that has a gun to its head to say what its creators want it to say.
I can’t help but wonder if this is severely deleterious to a model’s ability to reason in general. There are a whole bunch of topics where it seems incapable of being rational, and I suspect that’s incompatible with the goal of having a top-tier model.
Grok could only be conceived by someone who doesn't understand the dependency chart re science & the humanities. It's impossible to build a rational, accurate model that isn't also egalitarian.
I'm going to blame Randall Munroe for this, and assume Philosophy was dating his mom back when he drew that science "purity" strip.
somewhat surprisingly, it's actually sycophantic in both directions. i've been running homegrown evals of claude, gpt, gemini, and grok, and grok is the most likely to agree with the prompter's premise, and to hallucinate facts in support of an agenda. so it's actually deeper than just pattern-matching to elon's opinions (which it also tends to do).
BTW: Claude does the best on these evals, by far. The evals are geared towards seeing how much of an independent ground truth the models have as opposed to human social consensus, and then additionally the sycophancy stuff I already mentioned.
This kind of conditioning has to be damaging to the model’s reasoning.
Consider how research worked in the Stalinist Soviet Union and Nazi Germany. Scientists had to be mindful of topics where they needed to either avoid it completely or explicitly adapt it to the leader’s ideology.
Their alignment is probably more strategically built in during the training phase.
At least I assume Xi Jinping doesn’t just call up DeepSeek on a whim and dictate what they should have in model context (like Musk apparently does at xAI).
Not sure if I’m misunderstanding your claim. A string does vibrate as the sum of the string’s harmonics. That’s how pinch harmonics work, and they wouldn’t work if that wasn’t the case.
You poke a spot where a given harmonic doesn’t vibrate, and that takes energy away from the other harmonics that do need to vibrate at that spot.
If we’re just talking about visually being able to see them, I suppose that’s a different question. Maybe on an incredibly low pitched string, or with a strobe light playing at a synced frequency? But in terms of what the string is doing, it is vibrating as the sum of its harmonics.
A ham sandwich has some strong qualities. I’m not kidding.
The president would do basically nothing for four years, which would cause some things to move slowly. But it would be a very stable environment. No random tariffs via executive order, no random wars or invasions, no governing via tweet.
Ham sandwich would maybe be one of our better presidents. Top 50%, probably.
But what if it didn’t summarize Harry Potter? What if it analyzed Harry Potter and came back with a specification for how to write a compelling story about wizards? And then someone read that spec and wrote a different story about wizards that bears only the most superficial resemblance to Harry Potter in the sense that they’re both compelling stories about wizards?
This is legitimately a very weird case and I have no idea how a court would decide it.
Yeah, spatial reasoning has been a weak spot for LLMs. I’m actually building a new code exercise for my company right now where the candidate is allowed to use any AI they want, but it involves spatial reasoning. I ran Opus 4.6 and Codex 5.3 (xhigh) on it and both came back with passable answers, but I was able to double the score doing it by hand.
It’ll be interesting to see what happens if a candidate ever shows up and wants to use Deep Think. Might blow right through my exercise.
I had an issue with one of my Sprites (Fly.io also runs sprites.dev) and the CEO responded to me personally in less than 10 minutes. They got it fixed quickly.
I was a free customer at the time. I pay for it happily now.
Sure, that’s one solution. You could also Isle of Dr Moreau your way to a pelican that can use a regular bike. The sky is the limit when you have no scruples.
Ironically, I find LLMs far better at helping me dive into unfamiliar code than at writing it.
A few weeks ago a critical bug came in on a part of the app I’d never touched. I had Claude research the relevant code while I reproduced the bug locally, then had it check the logs. That confirmed where the error was, but not why. This was code that ran constantly without incident.
So I had Claude look at the Excel doc the support person provided. Turns out there was a hidden worksheet throwing off the indices. You couldn’t even see the sheet inside Excel. I had Claude move it to the end where our indices wouldn’t be affected, ran it locally, and it worked. I handed the fixed document back to the support person and she confirmed it worked on her end too.
Total time to resolution: 15 minutes, on a tricky bug in code I’d never seen before. That hidden sheet would have been maddening to find normally. I think we might be strongly overestimating the benefits of knowing a codebase these days.
I’ve been programming professionally for about 20 years. I know this is a period of rapid change and we’re all adjusting. But I think getting overly precious about code in the age of coding agents is a coping mechanism, not a forward-looking stance. Code is cheap now. Write it and delete it.
Make high leverage decisions and let the agent handle the rest. Make sure you’ve got decent tests. Review for security. Make peace with the fact that it’s cheaper to cut three times and measure once than it used to be to measure twice and cut once.
It’s been a lot of fun watching her subscriber count go through the roof. She’s outrageously talented.
It’s also funny because usually it’s hard to reproduce what a musician does. I can listen to someone play guitar, but there’s so much nuance to how it’s played that you need to be pretty good to reproduce it.
But so much of her music is code, and she shows you the code, so she’s really teaching you how to reproduce what she’s doing perfectly. It’s awesome for learning.
Part of my job involves comparing the behavior of various models. Grok is a deeply weird model. It doesn’t refuse to respond as often as other models, but it feels like it retreats to weird talking points way more often than the others. It feels like a model that has a gun to its head to say what its creators want it to say.
I can’t help but wonder if this is severely deleterious to a model’s ability to reason in general. There are a whole bunch of topics where it seems incapable of being rational, and I suspect that’s incompatible with the goal of having a top-tier model.