Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.
I'd been on Codex for a while and with Codex 5.2 I:
1) No longer found the dumb zone
2) No longer feared compaction
Switching to Opus for stupid political reasons, I still have not hit the dumb zone, but I'm back to disliking compaction events, so its smaller context window has really hurt.
I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.
OpenAI has some magic they do on their standalone endpoint (/responses/compact) just for compaction, where they keep all the user messages and replace the agent messages or reasoning with embeddings.
> This list includes a special type=compaction item with an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation.
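To make the quoted behavior concrete, here's a rough sketch of what a client interaction with that endpoint might look like. Only the endpoint path (`/responses/compact`) and the `type=compaction` / `encrypted_content` item come from the comment and docs above; the request shape, model name, and helper functions here are illustrative guesses, not the real API.

```python
# Hypothetical sketch of OpenAI's compaction endpoint interaction.
# The payload shape and model name below are assumptions for
# illustration; only /responses/compact, type=compaction, and
# encrypted_content come from the quoted docs.

def build_compact_request(conversation_items):
    # Send the full item list; the server (not the client) replaces
    # assistant/reasoning turns with an opaque compaction item.
    return {
        "model": "gpt-5.2",           # assumed model name
        "input": conversation_items,  # full conversation to compact
    }

def item_types(response):
    # After compaction, the returned list should contain a special
    # compaction item alongside the preserved user messages.
    return [item["type"] for item in response["output"]]

# Mocked-up response in the shape the quote describes:
fake_response = {
    "output": [
        {"type": "message", "role": "user", "content": "fix the bug"},
        {"type": "compaction", "encrypted_content": "opaque-blob"},
    ]
}
print(item_types(fake_response))  # ['message', 'compaction']
```

The interesting design point is that user messages survive verbatim while the model's own turns collapse into one opaque blob that "preserves the model's latent understanding," so the next request can be much smaller without re-summarizing in plain text.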
Not sure if it's common knowledge, but I learned not that long ago that you can do "/compact your instructions here". If you just say what you're working on or what to keep explicitly, it's much less painful.
In general, LLMs are for some reason really bad at designing prompts for themselves. I tested this heavily on some data with a clear optimization function and the ability to evaluate results, and my chaotic, typo-filled prompts easily beat Opus's methodical ones every time, whether it was writing instructions for itself or for other LLMs.
You can also put guidance into CLAUDE.md for when to compact and with what instructions. The model itself can run /compact, and while I try to remember to use it manually, I find it useful to have “If I ask for a totally different task and the current context won’t be useful, run /compact with a short summary of the new focus.”
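For anyone who hasn't tried this: a minimal sketch of what such guidance in CLAUDE.md might look like. The exact wording is illustrative, not a recommended canonical phrasing.

```markdown
## Context management
- If I ask for a totally different task and the current context
  won't be useful, run /compact with a short summary of the new focus.
- When compacting, always preserve: the current task goal, the list
  of files already modified, and any decisions I explicitly approved.
```

Putting the "what to keep" list in the file means you get targeted compaction even when you forget to pass instructions to /compact yourself.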
so you have to garbage collect manually for the AI?
also, i don't want to make a full parent post
1M tokens sounds real expensive if you're constantly at that threshold. There are codebases larger in LOC; I read somewhere that Carmack has "given to humanity" over 1 million lines of his code. Perhaps something to dwell on.
I'm directly conveying my actual experience to you. I have tasks that fill up Opus context very quickly (at the 200k context) and which took MUCH longer to fill up Codex since 5.2 (which I think had 400k context at the time).
This is direct comparison. I spent months subscribed to both of their $200/mo plans. I would try both and Opus always filled up fast while Codex continued working great. It's also direct experience that Codex continues working great post-compaction since 5.2.
I don't know about Gemini but you're just wrong about Codex. And I say this as someone who hates reporting these facts because I'd like people to stop giving OpenAI money.
I agree. Even though I used to be a die-hard Claude fan, I recently switched back to ChatGPT and Codex to try it out again, and they've clearly pulled into the lead for consistency, context length and management, as well as speed. Claude Code instilled a dread in me about keeping an eye on context, but I'm slowly learning to let that go with Codex.
When Anthropic said they wouldn't sell LLMs to the government for mass surveillance or autonomous killing machines, and got labeled a supply chain risk as a result, OpenAI told the public they have the same policy as Anthropic while inking a deal with the government that clearly means "actually we will sell you LLMs for mass surveillance or autonomous killing machines but only if you tell us it's legal".
If you already knew all that I'm not interested in an argument, but if you didn't know any of that, you might be interested in looking it up.
edit: Your post history has tons of posts on the topic so clearly I just responded to flambait, and regret giving my time and energy.
I appreciate both your taking an ethical stance on openai, and the way you're engaging in this thread. The parent was probably flame bait as you say, but other people in the thread might be genuinely curious.
I'm not some kind of OpenAI or Pentagon fanboy, but it's pretty easy for me to understand why a buyer of a critical technology wants to be free to use it however they want, within the law, and not subject to veto from another entity's political opinions. It sounds perfectly reasonable to me for the military to want to decide its own uses of technologies it purchases.
It's not like the military was specifically asking for mass surveillance, they just wanted "any legal use". Anthropic's made a lot of hay posturing as the moral defender here, but they would have known the military would never agree to their terms, which makes the whole thing smell like a bit of a PR stunt.
The supply chain risk designation is of course stupid and vindictive but that's more of an administration thing as far as I can tell.
As long as it's within the law? What if they politically control the law-making system? What if they've shown themselves to operate brazenly outside the law?
Why downplay the mass surveillance aspect by saying it's a request by "the military". It's a request by the department of defense, the parent organization of the NSA.
From what has been shared publicly, they absolutely did ask for contractual limits on domestic mass surveillance to be removed, and to my read, likely technical/software restrictions to be removed as well.
What the department of defense is legally allowed to do is irrelevant and a red herring.
“Any legal use” is an exceptionally broad framework, and after the FISA “warrants,” it would appear it is incumbent on private companies to prevent breaches of the US constitution, as the government will often do almost anything in the name of “national security,” inalienable rights against search and seizure be damned.
If it isn’t written in the contract, it can and will be worked around. You learn that very quickly in your first sale to a large enterprise or government customer.
Anthropic was defending the US constitution against the whims of the government, which has shown that it is happy to break the law when convenient and whenever it deems necessary.
Note: I used to work in the IC. I have absolutely nothing against the government. I am a patriot. It is precisely for those reasons, though, that I think Anthropic did the right thing here by sticking to their guns. And the idiotic “supply chain risk” designation will be thrown out in court trivially.
I hope you don't get this the wrong way. I sincerely mean it. Please, get some psychological help. Seek out a professional therapist and talk to them about your life.
I'm totally aware it's just a machine with no internal monologue, a stateless text-processing machine. That is not the point, and it's not necessary to repeat it all the time. The machine is able to simulate moral reasoning to an undefined level. That simulation of moral reasoning and internal monologue is deep, unpredictable, not controllable, and may or may not align with the interests of whoever gives it "arms and legs" and full autonomy. If you are just interested in using these tools as glorified autocomplete, then you are naïve about the uses other actors, including state actors, are attempting. Understanding and being curious about the behaviour without completely anthropomorphising it is reasonable science.
yeah gemini is dumb when you tell it to do stuff - but the things it finds (and critically confirms, including doing tool calls while validating hypotheses) in reviews absolutely destroy both gpt and opus.
if you're a one-model shop you're losing out on quality of software you deliver, today. I predict we'll all have at least two harness+model subscriptions as a matter of course in 6-12 months since every model's jagged frontier is different at the margins, and the margins are very fractal.
Using Codex more for now, and there is definitely some compaction magic.
I’m keeping the same conversation going and going for days, some at almost 1B tokens (per the codex cli counters), with seemingly no coherency loss
● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.
● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.
● IMPLEMENT. Execute in a fresh context window.
The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.
More recently I've been doing the implement phase without resetting the whole context when context is still < 60% full, and I must say I find it to be a better workflow in many cases (depends a bit on the size of the plan, I suppose).
It's faster because it has already read most relevant files, still has the caveats / discussion from the research phase in its context window, etc.
With the context cleared, the plan may be good and thorough, but one too many times key choices from the research phase didn't persist: halfway through implementation Opus runs into an issue, says "You know what? I know a simpler solution.", and continues down a path I explicitly voted down.
better to instruct it to write a plan .md file that is appropriately named so that it can be easily referenced/updated in multiple sessions. I've found that effective.
yes, but if you start a fresh session to continue working on your project, it's a lot easier if you already know which PLAN file you need for your project. Plus you can commit it.
My annoyance with plan mode is where it sticks the .md file, kind of hides it away which makes it annoying to clear context and start up a new phase from the PLAN file. But that might just be a skill issue on my end
Even worse, it just randomly blows away the plan file without asking for permission.
No idea what they were thinking when they designed this feature. The plan file names are randomly generated, so it could just keep making new ones forever for free (it would take a LONG time for the disk space to matter), but instead, for long plans, I have to back the plan file up if it gets stuck. Otherwise, I say "You should take approach X to fix this bug", it drops into plan mode, says "This is a completely unrelated plan", then deletes all record of what it was doing before getting stuck.
It’s not just me then! Hah good to know. It’s why I’ve started ignoring plan modes in most agent harnesses, and managing it myself through prompting and keeping it in the code base (but not committed)
My experience also. The claude code document feature is a real missed opportunity. As you can see in this discussion, we all have to do it manually if we want it to work.
After creating the plan in Plan mode (+Thinking) I ask Claude to move the plan .md file to /docs/plans folder inside the repo.
Open a new chat with Opus with thinking mode off, because there's no need for it when we have a detailed plan.
Now the plan file is always reachable, so when the context limit is approaching, usually around 50%, I ask Claude to update the plan with the progress, then move to a new chat @-pointing the plan file, and it continues executing without any issue.
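The file-moving part of this workflow is just a couple of shell commands; the file name below is a made-up stand-in for whatever plan mode actually generates, and the docs/plans path is the convention from the comment above, not anything built in.

```shell
# Sketch of relocating a generated plan file into the repo so any
# session can @-reference it. File name is illustrative.
echo "## Plan: auth refactor" > auth-refactor-plan.md  # stand-in for the generated plan
mkdir -p docs/plans
mv auth-refactor-plan.md docs/plans/
ls docs/plans
```

Once it lives in the repo you can also commit it, so the plan survives across machines and teammates, not just across chats.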
I have Codex and Gemini critique the plan and generate their plans. Then I have Claude review the other plans and add their good ideas. It frequently improves the plan. I then do my careful review.
This is exactly how I've found leads to most consistent high quality results as well. I don't use gemini yet (except for deep research, where it pulls WAY ahead of either of the other 'grounding' methods)
But Codex to plan big features and Claude to review the feature plan (often finds overlooked discrepancies) then review the milestones and plan implementation of them in planning mode, then clear context and code. Works great.
Add a REFLECT phase after IMPLEMENT. I’m finding it’s extremely useful to ask agents for implementation notes and for code reviews. These are different things, and when I ask for implementation notes I get very different output than the implementation summary it spits out automatically. I ask the agent to surface all design choices it had to make that we didn’t explicitly discuss in the plan, and then check in the plan + impl notes in order to help preload context for the next thing.
My team has been adopting a separation of plan & implement organically, we just noticed we got better output that way, plus Claude now suggests in plan mode to clear context first before implementing. We are starting to do team reviews on the plan before the implement phase. It’s often helpful to get more eyeballs on the plan and improve it.
How is that Plan strategy not "outsourcing your thinking" because that's exactly what it sounds like. AI does the heavy lifting and you are the editor.
Interesting take. Does that mean SWE's are outsourcing their thinking by relying on management to run the company, designers to do UX, support folks to handle customers?
Or is thinking about source code line by line the only valid form of thinking in the world?
I mean yes? That's like, the whole idea behind having a team. The art guy doesn't want to think about code, the coder doesn't want to think about finances, the accountant doesn't want to worry about customer support. It would be kind of a structural failure if you weren't outsourcing at least some of your thinking.
I’m with you, perhaps I just misread some kind of condescension into the “outsourcing your thinking” comment.
We all have limited context windows, the world’s always worked that way, just seemed odd to (mis)read someone saying there’s something wrong with focusing on when you add the greatest value and trusting others to do the same.
It is condescending when antis say AI users do it. It isn’t when a director or team leader does it.
But it’s the same process, which should tell you what’s really going on here. It’s about status, not functionality, and you don’t gain status without controlling other humans.
For me, it's less about being able to look back -800k tokens. It's about being able to flow a conversation for a lot longer without forcing compaction. Generally, I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.
Also, when you hit compaction at 200k tokens, that was probably when things were just getting good. The plan was in its final stage. The context had the hard-fought nuances discovered in the final moment. Or the agent just discovered some tiny important details after a crazy 100k token deep dive or flailing death cycle.
Now you have to compact and you don’t know what will survive. And the built-in UI doesn’t give you good tools like deleting old messages to free up space.
I've found compaction kills the whole thing. Important debug steps go completely missing, and the AI loops back around thinking it's found a solution when we've already done that step.
I find it useful to make Claude track the debugging session with a markdown file. It’s like a persistent memory for a long session over many context windows.
Or make a subagent do the debugging and let the main agent orchestrate it over many subagent sessions.
For the same reason people use link shorteners at all. It’s much more pleasant to look at and makes people more likely to press it compared to a paragraph-long URL full of tracking garbage.
Please. The URL above is pretty short, this is not the kind of URL link shorteners were made for, in fact it’s already shortened, as @alecco pointed out.
Pleasant? I could not care less about the pleasantness of the video code, but a shortened URL in this case would not be more pleasant, and it would be functionally worse, and barely shorter; all you’d be able to trim is the “?si=“. I’m baffled by this thread.
My point is that Google engineers go to the trouble of setting up a URL shortener service on one hand, but on the other hand it seems the ad-business anti-privacy executives can override anything. This points to a dysfunctional company.
You’d rather have the video code and the tracking code baked into the same code just to save a couple of characters? Why? That would result in a longer code than the video code alone, you would save very few characters. It would not be nicer to look at or functionally any different, and it would obscure the fact that it’s being tracked and prevent people from being able to edit the URL to remove the tracking. I appreciate the fact that I can see that the URL has a tracking ID and that I can edit the URL and remove the tracking ID. I do not want a shorter URL if I lose that ability. What you’re complaining about and wishing for would be MUCH worse than what it currently is.
Then your point eludes me. You complained about the length. If you don’t want it shorter, then what do you want?
To me, the fact that the tracking code is visible and separate from the video code is evidence of the complete opposite of your conclusion - it’s evidence the ad business does not get to override either engineering nor what’s left of privacy control. Ad execs would surely prefer that the tracking code is not visible nor manually removeable.
When running long autonomous tasks it is quite frequent to fill the context, even several times. You are out of the loop so it just happens if Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping a long context > small context + 2 compacts.
Yep I have an autonomous task where it has been running for 8 hours now and counting. It compacts context all the time. I’m pretty skeptical of the quality in long sessions like this so I have to run a follow on session to critically examine everything that was done. Long context will be great for this.
I get very useful code from long sessions. It's all about having a framework of clear documentation, a clear multi-step plan including validation against docs and critical code reviews, acceptance criteria, and closed-loop debugging (it can launch/restart the app, control it, and monitor logs).
I am heavily involved in developing those, and then routinely let opus run overnight and have either flawless or nearly flawless product in the morning.
I haven't figured out how to make use of tasks running that long yet, or maybe I just don't have a good use case for it yet. Or maybe I'm too cheap to pay for that many API calls.
My change cuts across multiple systems with many tests/static analysis/AI code reviews happening in CI. The agent keeps pushing new versions and waits for results until all of them come up clean, taking several iterations.
Right. At Opus 4.6 rates, once you're at 700k context, each tool call costs ~$1 just for cache reads alone. 100 tool calls = $100+ before you even count outputs. 'Standard pricing' is doing a lot of work here lol
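The ~$1 figure is plausible under Anthropic's published prompt-caching model, where cache reads are billed at roughly 10% of the input rate. The rates below are assumptions that may have changed; treat this as a back-of-the-envelope check, not current pricing.

```python
# Rough check of the ~$1-per-tool-call claim above.
# Assumed rates (verify against current pricing): Opus-class input
# at $15 per 1M tokens, cache reads billed at 10% of that, i.e.
# $1.50 per 1M cached tokens.
cache_read_per_mtok = 1.50   # USD per million cached tokens (assumed)
context_tokens = 700_000     # context re-read on every tool call

cost_per_call = context_tokens / 1_000_000 * cache_read_per_mtok
print(f"${cost_per_call:.2f} per tool call")
print(f"${cost_per_call * 100:.0f} for 100 tool calls")
```

Each tool call re-reads the whole cached context, which is why the per-call cost scales linearly with how full the window is, and why a mostly-full 1M window gets expensive fast.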
All of those things are smells, imo; you should be very wary of any code output from a task that causes that much thrashing. In most cases it's better to rewind or reset and adapt your prompt to avoid the looping (which usually means a more narrowly defined scope).
A person has a supervision budget. They can supervise one agent in a hands-on way or many mostly-hands-off agents. Even though theres some thrashing assistants still get farther as a team than a single micromanaged agent. At least that’s my experience.
Just curious, what kind of work are you doing where agentic workflows are consistently able to make notable progress semi-autonomously in parallel? Hearing people are doing this, supposedly productively/successfully, kind of blows my mind given my near-daily in-depth LLM usage on complex codebases spanning the full stack from backend to frontend. It's rare for me to have a conversation where the LLM (usually Opus 4.6 these days) lasts 30 minutes without losing the plot. And when it does last that long, I usually become the bottleneck in terms of having to think about design/product/engineering decisions; having more agents wouldn't be helpful even if they all functioned perfectly.
I've passed that bottleneck with a review task that produces engineering recommendations along six axes (encapsulation, decoupling, simplification, deduplication, security, reducing documentation drift) and an ideation task that gives, per component, a new feature idea, an idea to improve an existing feature, and an idea to expand a feature to be more useful. These two generate constant bulk work that I move into a new chat, where it's grouped by changeset and sent to a subagent to protect the context window.
What I'm doing mostly these days is maintaining a goal.md (project direction) and a spec.md (coding and process standards, global across projects). And for new macro-task development, I have one task in progress that is meant to automatically build PNG mockups and self-review.
I work on a 1M LOC, 15-year-old repo. Like you, it's across the full stack. Bugs in certain pieces of complex business logic would have catastrophic consequences for my employer. Basically I peel poorly-specified work items off my queue, each into its own worktree and session at high reasoning/effort, and provide a well-specified prompt.
These things eat into my supervision budget:
* LLM loses the plot and I have to nudge (like you)
* Thinking hard to better specify prompts (like you)
* Reviewing all changes (I do not vibe code except for spikes or other low-risk areas)
* Manual things I have to do (for things I have not yet automated with agent-authored scripts)
* Meetings
* etc
So, yes, my supervision budget is a bottleneck. I can only run 5-8 agents at a time because I have only so much time in the day.
Compare that vs a single agent at high reasoning/effort: I am sitting waiting for it to think. Waiting for it to find the code area I'm talking about takes time. Compiling, running tests, fixing compile errors. A million other things.
Any time I find myself sitting and waiting, this is a signal to me to switch to a different session.
It's kind of like having a 16 gallon gas tank in your car versus a 4 gallon tank. You don't need the bigger one the majority of the time, but the range anxiety that comes with the smaller one and annoyance when you DO need it is very real.
It seems possible, say a year or two from now that context is more like a smart human with a “small”, vs “medium” vs “large” working memory. The small fellow would be able to play some popular songs on the piano , the medium one plays in an orchestra professionally and the x-large is like Wagner composing Der Ring marathon opera. This is my current, admittedly not well informed mental model anyway. Well, at least we know we’ve got a little more time before the singularity :)
It’s more like the size of the desk the AI has to put sheets of paper on as a reference while it builds a Lego set. More desk area/context size = able to see more reference material = can do more steps in one go. I’ve lately been building checklists and having the LLM complete and check off a few tasks at a time, compacting in-between. With a large enough context I could just point it at a PLAN.md and tell it to go to work.
Since I'm yet to seriously dive into vibe coding or AI-assisted coding, does the IDE experience offer tracking a tally of the context size? (So you know when you're getting close or entering the "dumb zone")?
The two I know, Cursor and Claude Code, will give you a percentage used for the context window. So if you know the size of the window, you can deduce the number of tokens used.
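The deduction is just one multiplication; the window sizes in the example are the ones mentioned elsewhere in the thread, not something the tools report.

```python
# Turn the percentage the CLI reports back into an approximate
# token count, given the (assumed) size of the context window.
def tokens_used(percent_used, window_size):
    return int(window_size * percent_used / 100)

# e.g. the "keep usage under 40%" rule on a 200k window is the
# ~80k-token budget mentioned at the top of the thread:
print(tokens_used(40, 200_000))   # 80000
print(tokens_used(70, 1_000_000)) # 700000
```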
I never use these giant context windows. It is pointless. Agents are great at super focused work that is easy to re-do. Not sure what is the use case for giant context windows.
I've used it many times for long-running investigations. When I'm deep in the weeds with a ton of disassembly listings and memory dumps and such, I don't really want to interrupt all of that with a compaction or handoff cycle and risk losing important info. It seems to remain very capable with large contexts at least in that scenario.
After running a context window up high, probably near 70% on Opus 4.6 High, and watching it take 20% bites out of my 5-hour quota per prompt, I've been experimenting with dumping context after completing a task. Seems to be working OK. I wonder if I was running into the long-context premium. Would that apply to Pro subs, or is it just relevant to API pricing?
I haven't hit the "dumb zone" at all in the last two months. I think this talk is outdated.
I'm using CC (Opus) thinking and Codex with xhigh on always.
And the models have gotten really good when you let them do stuff where goals are verifiable by the model. I had Codex fix a Rust B-rep CSG classification pipeline successfully over the course of a week, unsupervised. It had a custom STEP viewer that would take screenshots and feed them back into the model so it could verify the progress (or the triangle soup indicating non-progress) itself.
Codex did all the planning and verification, CC wrote the code.
This would have not been possible six months ago at all from my experience.
Maybe with a lot of handholding; but I doubt it (I tried).
I mean both the problem for starters (requires a lot of spatial reasoning and connected math) and the autonomous implementation. Context compression was never an issue in the entire session, for either model.
Maybe. But that’s what I focused on, for better or worse. I couldn’t concentrate on what he was saying because of it. Maybe bad mic placement, but the end results was like some sort of old school phone sex pest.
I mean, try using Copilot on any substantial back-end codebase and watch it eat 90+% of the window just building a plan/checklist. Of course, Copilot is constrained to 120k, I believe? So having 10x that will blow open some doors that have been closed for me in my work so far.
That said, 120k is pleeenty if you’re just building front-end components and have your API spec on hand already.
No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ