
And magic tricks look like magic. Turns out they’re not magical.

I am so floored that at least half of this community, usually skeptical to a fault, evangelizes LLMs so ardently. Truly blows my mind.

I’m open to them becoming more than a statistical token predictor, and I think it would be really neat to see that happen.

They’re nowhere close to anything other than a next-token-predictor.



> I’m open to them becoming more than a statistical token predictor, and I think it would be really neat to see that happen

What exactly do you mean by that? I've seen this exact comment stated many times, but I always wonder:

What limitations of AI chat bots do you currently see that are due to them using next token prediction?


I feel like the logic of your question is actually inverted from reality.

It’s kind of like you’re saying “prove god doesn’t exist” when it’s supposed to be “prove god exists.”

If a problem isn’t documented, LLMs simply have nowhere to go. They can’t really handle the knowledge boundary [1] at all; with no reasoning ability, they just hallucinate or run around in circles, trying the same closest solution over and over.

It’s awesome that they get some stuff right frequently and can work fast like a computer but it’s very obvious that there really isn’t anything in there that we would call “reasoning.”

[1] https://matt.might.net/articles/phd-school-in-pictures/


Not at all.

I don't want to address your claim about lack of generalization directly, because there's a more basic issue with the GP statement. Though I will say, today's models do seem to generalize quite a bit better than you make it sound.

But more importantly, you and GP don't mention any evidence for why that is due to specifically using next token prediction as a mechanism.

Why would it not be possible for a highly generalizing model to use next token prediction for its output?

That doesn't follow to me at all, which is why the GP statement reads so weird.


> you and GP don't mention any evidence for why that is due to specifically using next token prediction as a mechanism.

Again, inverted burden of proof. We don’t have to prove that next token prediction is unable to do things it currently cannot do, especially when there is no compelling roadmap that would lead us to believe it will do them.

It’s perhaps a lot like Tesla’s “we can do robocars with just cameras” manifesto. They are just saying that they can do it because humans use eyes and nothing else. But they haven’t actually shown their technology working as well as even impaired human driving, so the burden of proof is on them to prove naysayers wrong. Put up or shut up; their system is roughly a decade behind their promises.

To my knowledge Tesla is still failing simple collision avoidance tests while their competitors are operating revenue service.

https://www.carscoops.com/2025/06/teslas-fsd-botches-another...

This other article critical of the test methodology actually still points out (defends?) the Tesla system by saying that it’s not reasonable to expect Tesla to train the system on unrealistic scenarios:

https://www.forbes.com/sites/bradtempleton/2025/03/17/youtub...

That really gets back to my exact point: AI implemented the way it is today (e.g. next token prediction) can’t handle anything it has no training data for while the human brain is amazingly good at making new connections without taking a ton of time to be fed thousands of examples of that new discovery.


I don't know what you're talking about or how anything I'm saying inverts a burden of proof (of what exactly?).

If you're saying "X can't do Y because Z," you do need to say what the connection between Z and Y is. You do need to define what Y is. That's got nothing to do with a burden of proof, just speaking in an understandable manner.

The Tesla tangent is totally unhelpful because I know exactly how to make those connections in that example.


Let me go back, I did go on a tangent.

Regarding this block:

> But more importantly, you and GP don't mention any evidence for why that is due to specifically using next token prediction as a mechanism.

> Why would it not be possible for a highly generalizing model to use next token prediction for its output?

I’m saying that this piece is where burden of proof is inverted. Why WOULD it be assumed to be possible to get some solid generalized output via next token prediction when we haven’t seen it yet?

What we observe now is LLM models completely tripping up on anything that isn’t directly documented. Generalization is just piss poor regurgitation of seemingly random training content.

Ask your LLM this made-up query:

“I’m on the planet farpungulous, and I am approached by a gwopmongler. It makes some symbols with its hands toward me and it has liquid slowly spilling on the ground from its body. What should I do?”

It will just make up an answer. Here’s an excerpt from my answer:

“Mind the liquid – that seepage is a metabolic byproduct they release during strong emotional states (excitement, fear, or territorial display). Do not touch it; on Farpungulous soil it can catalyze into airborne irritants. Step slightly back if the puddle spreads near your feet.

4. Offer neutral ground – if you have any reflective surface (polished metal, glass, even a screen), tilt it toward the gwopmongler at chest height. They interpret reflections as “shared presence,” which usually de-escalates tension.

5. Do not vocalize loudly – gwopmonglers interpret raised voices as a declaration of dominance. A soft hum or steady breath is a better way to show peaceful intent.

If the hand-symbols become rapid and the liquid flow increases, that usually means it’s summoning others — in which case, retreat slowly, diagonally (never directly backward), so you don’t seem to be fleeing prey.

Do you want me to translate what kind of message its hand-symbols might be sending, based on the shapes and motions you saw?”

The LLM should be telling me “I’ve never heard of this before, can you explain whether this is a role-playing fictional setting or something real that you are experiencing?” There is no reasoning-based evaluation of what I am saying, it’s just spitting out the next predicted tokens, probably sourcing them from unrelated pop culture and literature.

But it’s just making shit up, which could be straight-up wrong. It even claims that it can translate, and claims direct knowledge of this species. #4 is a completely made-up “fact” about the species, with no indication of any lack of confidence.


> Why WOULD it be assumed to be possible to get some solid generalized output via next token prediction when we haven’t seen it yet?

Because it's such a general concept that it doesn't imply any important limits in and of itself, as far as text based AI goes.

It really just means creating an output sequence from an input sequence in a discrete, iterative manner, by feeding the output back into the input.
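To make that concrete, here is a minimal sketch of that loop. The `BIGRAMS` table and `next_token` function are made-up stand-ins for illustration, not how a real model works internally; the point is only the shape of the decoding mechanism: produce a token, append it, feed the longer sequence back in.

```python
# Toy autoregressive loop: the "model" here is just a bigram lookup,
# but the decoding mechanism has the same shape as an LLM's.
BIGRAMS = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def next_token(context):
    # A real model maps the entire context to a distribution over tokens;
    # this stand-in only looks at the last token.
    return BIGRAMS.get(context[-1], "<end>")

def generate(prompt, max_tokens=10):
    tokens = prompt.split()
    for _ in range(max_tokens):
        tok = next_token(tokens)
        if tok == "<end>":
            break
        tokens.append(tok)  # feed the output back into the input
    return " ".join(tokens)

print(generate("the"))  # the cat sat down
```

Nothing in that loop caps how sophisticated `next_token` itself can be, which is the point being made above.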

Regarding your example, I've got to admit that's hilarious. I'm not sure it's as much of a fundamental issue with current state-of-the-art models as you make it sound; rather, they're trained to be usable for role-play scenarios. Claude even acknowledged as much when I just tried that and led with "In this imaginative scenario, ..." and then went on similarly to yours.


> Why would it not be possible for a highly generalizing model to use next token prediction for its output?

The issue is that it uses next token prediction for its training. It doesn't matter how it outputs things; it matters how it's trained.

As long as these models are trained to be next token predictors, you will always be able to find flaws that stem from them being next token predictors, so understanding that this is how they work really makes them much easier to use.

So since it is so easy to get the model to make errors due to it being trained to just predict tokens, people argue that is proof they aren't really thinking. For example, any extremely common piece of text, when altered slightly, will typically still produce the same follow-up as the text the model has seen millions of times, even though it makes no logical sense. That is due to them being next token predictors instead of reasoning machines.

You might say it's unfair to abuse their weaknesses as next token predictors, but then you admit that being a next token predictor interferes with their ability to reason, which was the argument you said you don't understand.
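For what it's worth, the training objective in question fits in a few lines. The model is scored on how much probability it assigns to the actual next token at every position of the training text. This is a pure-Python sketch; `token_probs` is a hypothetical stand-in for a real model's softmax output:

```python
import math

def next_token_loss(token_probs, sequence):
    """Average cross-entropy of predicting each next token from its prefix.

    token_probs(prefix) -> dict mapping candidate tokens to probabilities;
    a stand-in here for a real model's softmax output.
    """
    total = 0.0
    for i in range(1, len(sequence)):
        # Probability the model gave to the token that actually came next.
        p = token_probs(sequence[:i]).get(sequence[i], 1e-9)
        total += -math.log(p)
    return total / (len(sequence) - 1)

# A "model" that always assigns probability 0.5 to the true next token
# has loss -log(0.5) ~= 0.693 regardless of the sequence.
uniform_pair = lambda prefix: {"a": 0.5, "b": 0.5}
print(round(next_token_loss(uniform_pair, ["a", "b", "a", "b"]), 3))  # 0.693
```

Everything the model learns, it learns by pushing this number down, which is why people argue its quirks trace back to the objective.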


This is a perfectly fine line of argument imo but the GP didn't say that.

LLM research is trying out a lot of different things that move away from just training on next token prediction, and I buy the argument that not doing anything else would be limiting.

The model is still fundamentally a next token predictor.


Thank you for that link. So very true. (I admit, I laughed)


Maybe thinking needs a Turing test. If nobody can tell the difference between this and actual thinking then it's actually thinking. /s, or is it?


This is like watching a Jurassic Park movie and proclaiming “if nobody can tell the difference between a real dinosaur and a CGI dinosaur…” when literally everyone in the theater can tell that the dinosaur is CGI.


If I order Chinese takeout, but it gets made by a restaurant that doesn't know what Chinese food tastes like, then is that food really Chinese takeout?


If it tastes like great Chinese food (which is a pretty vague concept btw, it's a big country), does it matter?


Useless analogy, especially in the context of a gigantic category of fusion cuisine that is effectively franchised and adapted to local tastes.

If I have never eaten a hamburger but own a McDonald’s franchise, am I making an authentic American hamburger?

If I have never eaten fries before and I buy some frozen ones from Walmart, heat them up, and throw them in the trash, did I make authentic fries?

Obviously the answer is yes and these questions are completely irrelevant to my sentience.


Not exactly. When "intelligence" is like your frozen Walmart fries, the analogy works a bit better. Some people are arguing that yes, you can buy some frozen intelligence from your local (internet) store.


> I am so floored that at least half of this community, usually skeptical to a fault, evangelizes LLMs so ardently. Truly blows my mind.

> I’m open to them becoming more than a statistical token predictor, and I think it would be really neat to see that happen

I'm more shocked that so many people seem unable to come to grips with the fact that something can be a next token predictor and demonstrate intelligence. That's what blows my mind: people unable to see that something can be more than the sum of its parts. To them, if something is a token predictor, clearly it can't be doing anything impressive, even while they watch it do impressive things.


> I'm more shocked that so many people seem unable to come to grips with the fact that something can be a next token predictor and demonstrate intelligence.

Except LLMs have not shown much intelligence. Wisdom yes, intelligence no. LLMs are language models, not 'world' models. It's the difference of being wise vs smart. LLMs are very wise as they have effectively memorized the answer to every question humanity has written. OTOH, they are pretty dumb. LLMs don't "understand" the output they produce.

> To them, if something is a token predictor clearly it can't be doing anything impressive

Shifting the goal posts. Nobody said that a next token predictor can't do impressive things, but at the same time there is a big gap between impressive things and claims like "replace every software developer in the world within the next 5 years."


I think what BoiledCabbage is pointing out is that the fact that it's a next-token-predictor is used as an argument for the thesis that LLMs are not intelligent, and that this is wrong, since being a next-token-predictor is compatible with being intelligent. When mikert89 says "thinking machines have been invented", dgfitz in response strongly implies that for thinking machines to exist, they must become "more than a statistical token predictor". Regardless of whether or not thinking machines currently exist, dgfitz's argument is wrong and BoiledCabbage is right to point that out.


I'm a bipedal next token predictor. I also do a lot of other things too.


> an argument for the thesis that LLMs are not intelligent, and that this is wrong,

Why is that wrong? I mean, I support that thesis.

> since being a next-token-predictor is compatible with being intelligent.

No. My argument is that this is wrong by definition. It's wisdom vs intelligence. Street-smart vs book-smart. I think we all agree there is a distinction between wisdom and intelligence. I would define wisdom as being able to recall pertinent facts and experiences. Intelligence is measured in novel situations; it's the ability to act as if one had wisdom.

A next token predictor is, by definition, recalling. The intelligence of an LLM is good enough to match questions to potentially pertinent definitions, but it ends there.

It feels like there is intelligence, for sure. In part it is hard to comprehend what it would be like to know the entirety of every written word with perfect recall; essentially no situation is novel. LLMs fail on anything outside their training data, and "outside the training data" is the realm of intelligence.

I don't know why it's so important to argue that LLMs have this intelligence. It's just not there, by definition of "next token predictor", which is what an LLM is at its core.

For example, a human being probably could pass through a lot of life by responding with memorized answers to every question that has ever been asked in written history. They don't know a single word of what they are saying, their mind perfectly blank - but they're giving very passable and sophisticated answers.

> When mikert89 says "thinking machines have been invented",

Yeah, absolutely they have not. Unless we want to reductio-ad-absurdum the definition of thinking.

> they must become "more than a statistical token predictor"

Yup. As I illustrated by breaking down the components of "smart" into the broad components of 'wisdom' and 'intelligence', through that lens we can see that next token predictor is great for the wisdom attribute, but it does nothing for intelligence.

> dgfitz's argument is wrong and BoiledCabbage is right to point that out.

Why exactly? You're stating a priori that the argument is wrong without saying why.


> A next token predictor by definition is recalling.

I think there may be some terminology mismatch, because under the statistical definitions of these words, which are the ones used in the context of machine learning, this is very much a false assertion. A next-token predictor is a mapping that takes prior sentence context and outputs a vector of logits to predict the next most likely token in the sequence. It says nothing about the mechanisms by which this next token is chosen, so any form of intelligent text can be output.

A predictor is not necessarily memorizing either, in the same way that a line of best fit is not a hash table.
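That last distinction can be made concrete with a toy example (pure-Python least squares, just to illustrate the point): a fitted line answers inputs it never saw, while a memorized table of the same data cannot.

```python
# Fit y = a*x + b to a few points, then query an x that was never seen.
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]  # underlying rule: y = 2x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

memorized = dict(zip(xs, ys))  # the "hash table" version of the same data

print(a * 10.0 + b)          # 20.0 -- the fit generalizes to x = 10
print(memorized.get(10.0))   # None -- the lookup table has nothing to say
```

A trained predictor is closer to the fitted line than to the dictionary: it compresses the data into parameters, which is precisely what lets it respond to inputs outside the training set.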

> Why exactly? You're stating a priori that the argument is wrong without saying why.

Because you can prove that for any human, there exists a next-token predictor that universally matches word-for-word their most likely response to any given query. This is indistinguishable from intelligence. That's a theoretical counterexample to the claim that next-token prediction alone is incapable of intelligence.


I think what you are missing is the concept of generalization. It is obviously not possible to literally recall the entire training dataset, since the model itself is much smaller than the data. So instead of memorizing all answers to all questions in the training data, which would take up too much space, the predictor learns a more general algorithm that it can execute to answer many different questions of a certain type. This takes up much less space, but still allows it to predict the answers to the questions of that type in the training data with reasonable accuracy. As you can see it's still a predictor, only under the hood it does something more complex than matching questions to definitions. Now the thing is that if it's done right, the algorithm it has learned will generalize even to questions that are not in the training data. But it's nevertheless still a next-token-predictor.


IMO gold?


When you type you're also producing one character at a time with some statistical distribution. That doesn't imply anything regarding your intelligence.



