Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.
I'd been on Codex for a while and with Codex 5.2 I:
1) No longer found the dumb zone
2) No longer feared compaction
Switching to Opus for stupid political reasons, I still have not hit the dumb zone, but I'm back to disliking compaction events, so its smaller context window has really hurt.
I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.
OpenAI has some magic they do on their standalone endpoint (/responses/compact) just for compaction, where they keep all the user messages and replace the agent messages or reasoning with embeddings.
> This list includes a special type=compaction item with an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation.
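To make the quoted behavior concrete, here's a rough sketch of what a client interaction with that endpoint might look like. Only the endpoint path (`/responses/compact`) and the `type=compaction` / `encrypted_content` item come from the comment and docs above; the request shape, model name, and helper functions here are illustrative guesses, not the real API.

```python
# Hypothetical sketch of OpenAI's compaction endpoint interaction.
# The payload shape and model name below are assumptions for
# illustration; only /responses/compact, type=compaction, and
# encrypted_content come from the quoted docs.

def build_compact_request(conversation_items):
    # Send the full item list; the server (not the client) replaces
    # assistant/reasoning turns with an opaque compaction item.
    return {
        "model": "gpt-5.2",           # assumed model name
        "input": conversation_items,  # full conversation to compact
    }

def item_types(response):
    # After compaction, the returned list should contain a special
    # compaction item alongside the preserved user messages.
    return [item["type"] for item in response["output"]]

# Mocked-up response in the shape the quote describes:
fake_response = {
    "output": [
        {"type": "message", "role": "user", "content": "fix the bug"},
        {"type": "compaction", "encrypted_content": "opaque-blob"},
    ]
}
print(item_types(fake_response))  # ['message', 'compaction']
```

The interesting design point is that user messages survive verbatim while the model's own turns collapse into one opaque blob that "preserves the model's latent understanding," so the next request can be much smaller without re-summarizing in plain text.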
Not sure if it's common knowledge, but I learned not that long ago that you can do "/compact your instructions here". If you just say what you're working on or what to keep explicitly, it's much less painful.
In general, LLMs are for some reason really bad at designing prompts for themselves. I tested this heavily on some data with a clear optimization function and the ability to evaluate results, and my chaotic, typo-filled prompts easily beat Opus's methodical ones every time, whether it was writing instructions for itself or for other LLMs.
You can also put guidance into CLAUDE.md for when to compact and with what instructions. The model itself can run /compact, and while I try to remember to use it manually, I find it useful to have “If I ask for a totally different task and the current context won’t be useful, run /compact with a short summary of the new focus.”
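For anyone who hasn't tried this: a minimal sketch of what such guidance in CLAUDE.md might look like. The exact wording is illustrative, not a recommended canonical phrasing.

```markdown
## Context management
- If I ask for a totally different task and the current context
  won't be useful, run /compact with a short summary of the new focus.
- When compacting, always preserve: the current task goal, the list
  of files already modified, and any decisions I explicitly approved.
```

Putting the "what to keep" list in the file means you get targeted compaction even when you forget to pass instructions to /compact yourself.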
so you have to garbage collect manually for the AI?
also, i don't want to make a full parent post
1M tokens sounds real expensive if you're constantly at that threshold. There are codebases larger in LOC; I read somewhere that Carmack has "given to humanity" over 1 million lines of his code. Perhaps something to dwell on.
I'm directly conveying my actual experience to you. I have tasks that fill up Opus context very quickly (at the 200k context) and which took MUCH longer to fill up Codex since 5.2 (which I think had 400k context at the time).
This is direct comparison. I spent months subscribed to both of their $200/mo plans. I would try both and Opus always filled up fast while Codex continued working great. It's also direct experience that Codex continues working great post-compaction since 5.2.
I don't know about Gemini but you're just wrong about Codex. And I say this as someone who hates reporting these facts because I'd like people to stop giving OpenAI money.
I agree. Even though I used to be a die-hard Claude fan, I recently switched back to ChatGPT and Codex to try it out again, and they've clearly pulled into the lead for consistency, context length and management, as well as speed. Claude Code instilled a dread in me about keeping an eye on context, but I'm slowly learning to let that go with Codex.
When Anthropic said they wouldn't sell LLMs to the government for mass surveillance or autonomous killing machines, and got labeled a supply chain risk as a result, OpenAI told the public they have the same policy as Anthropic while inking a deal with the government that clearly means "actually we will sell you LLMs for mass surveillance or autonomous killing machines but only if you tell us it's legal".
If you already knew all that I'm not interested in an argument, but if you didn't know any of that, you might be interested in looking it up.
edit: Your post history has tons of posts on the topic so clearly I just responded to flambait, and regret giving my time and energy.
I appreciate both your taking an ethical stance on openai, and the way you're engaging in this thread. The parent was probably flame bait as you say, but other people in the thread might be genuinely curious.
I'm not some kind of OpenAI or Pentagon fanboy, but it's pretty easy for me to understand why a buyer of a critical technology wants to be free to use it however they want, within the law, and not subject to veto from another entity's political opinions. It sounds perfectly reasonable to me for the military to want to decide its own uses of technologies it purchases.
It's not like the military was specifically asking for mass surveillance, they just wanted "any legal use". Anthropic's made a lot of hay posturing as the moral defender here, but they would have known the military would never agree to their terms, which makes the whole thing smell like a bit of a PR stunt.
The supply chain risk designation is of course stupid and vindictive but that's more of an administration thing as far as I can tell.
As long as it's within the law? What if they politically control the law-making system? What if they've shown themselves to operate brazenly outside the law?
Why downplay the mass surveillance aspect by saying it's a request by "the military". It's a request by the department of defense, the parent organization of the NSA.
From what has been shared publicly, they absolutely did ask for contractual limits on domestic mass surveillance to be removed, and to my read, likely technical/software restrictions to be removed as well.
What the department of defense is legally allowed to do is irrelevant and a red herring.
“Any legal use” is an exceptionally broad framework, and after the FISA “warrants,” it would appear it is incumbent on private companies to prevent breaches of the US constitution, as the government will often do almost anything in the name of “national security,” inalienable rights against search and seizure be damned.
If it isn’t written in the contract, it can and will be worked around. You learn that very quickly in your first sale to a large enterprise or government customer.
Anthropic was defending the US constitution against the whims of the government, which has shown that it is happy to break the law when convenient and whenever it deems necessary.
Note: I used to work in the IC. I have absolutely nothing against the government. I am a patriot. It is precisely for those reasons, though, that I think Anthropic did the right thing here by sticking to their guns. And the idiotic “supply chain risk” designation will be thrown out in court trivially.
I hope you don't get this the wrong way. I sincerely mean it. Please, get some psychological help. Seek out a professional therapist and talk to them about your life.
I'm totally aware it's just a machine with no internal monologue, a stateless text-processing machine. That is not the point, and it's not necessary to repeat it all the time. The machine is able to simulate moral reasoning to an undefined level. That simulation of moral reasoning and internal monologue is deep, unpredictable, not controllable, and may or may not align with the interests of whoever gives it "arms and legs" and full autonomy. If you are just interested in using these tools as glorified autocomplete, then you are naïve about the uses other actors, including state actors, are attempting. Understanding and being curious about the behaviour without completely anthropomorphising it is reasonable science.
yeah gemini is dumb when you tell it to do stuff - but the things it finds (and critically confirms, including doing tool calls while validating hypotheses) in reviews absolutely destroy both gpt and opus.
if you're a one-model shop you're losing out on quality of software you deliver, today. I predict we'll all have at least two harness+model subscriptions as a matter of course in 6-12 months since every model's jagged frontier is different at the margins, and the margins are very fractal.
Using Codex more for now, and there is definitely some compaction magic.
I’m keeping the same conversation going and going for days, some at almost 1B tokens (per the codex cli counters), with seemingly no coherency loss
● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.
● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.
● IMPLEMENT. Execute in a fresh context window.
The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.
More recently I've been doing the implement phase without resetting the whole context when context is still < 60% full, and I must say I find it to be a better workflow in many cases (depends a bit on the size of the plan, I suppose).
It's faster because it has already read most relevant files, still has the caveats / discussion from the research phase in its context window, etc.
With the context cleared, the plan may be good and thorough, but one too many times key choices from the research phase didn't persist: halfway through implementation Opus runs into an issue, says "You know what? I know a simpler solution.", and continues down a path I explicitly voted down.
better to instruct it to write a plan .md file that is appropriately named so that it can be easily referenced/updated in multiple sessions. I've found that effective.
yes, but if you start a fresh session to continue working on your project, it's a lot easier if you already know which PLAN file you need for your project. Plus you can commit it.
My annoyance with plan mode is where it sticks the .md file, kind of hides it away which makes it annoying to clear context and start up a new phase from the PLAN file. But that might just be a skill issue on my end
Even worse, it just randomly blows away the plan file without asking for permission.
No idea what they were thinking when they designed this feature. The plan file names are randomly generated, so it could just keep making new ones forever for free (it would take a LONG time for the disk space to matter), but instead, for long plans, I have to back the plan file up if it gets stuck. Otherwise, I say "You should take approach X to fix this bug", it drops into plan mode, says "This is a completely unrelated plan", then deletes all record of what it was doing before getting stuck.
It’s not just me then! Hah good to know. It’s why I’ve started ignoring plan modes in most agent harnesses, and managing it myself through prompting and keeping it in the code base (but not committed)
My experience also. The claude code document feature is a real missed opportunity. As you can see in this discussion, we all have to do it manually if we want it to work.
After creating the plan in Plan mode (+Thinking) I ask Claude to move the plan .md file to /docs/plans folder inside the repo.
Open a new chat with Opus with thinking mode off, because there's no need for it when we have a detailed plan.
Now the plan file is always reachable, so when the context limit is approaching, usually around 50%, I ask Claude to update the plan with the progress, then move to a new chat @-pointing the plan file, and it continues executing without any issue.
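The file-moving part of this workflow is just a couple of shell commands; the file name below is a made-up stand-in for whatever plan mode actually generates, and the docs/plans path is the convention from the comment above, not anything built in.

```shell
# Sketch of relocating a generated plan file into the repo so any
# session can @-reference it. File name is illustrative.
echo "## Plan: auth refactor" > auth-refactor-plan.md  # stand-in for the generated plan
mkdir -p docs/plans
mv auth-refactor-plan.md docs/plans/
ls docs/plans
```

Once it lives in the repo you can also commit it, so the plan survives across machines and teammates, not just across chats.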
I have Codex and Gemini critique the plan and generate their plans. Then I have Claude review the other plans and add their good ideas. It frequently improves the plan. I then do my careful review.
This is exactly how I've found leads to most consistent high quality results as well. I don't use gemini yet (except for deep research, where it pulls WAY ahead of either of the other 'grounding' methods)
But Codex to plan big features and Claude to review the feature plan (often finds overlooked discrepancies) then review the milestones and plan implementation of them in planning mode, then clear context and code. Works great.
Add a REFLECT phase after IMPLEMENT. I’m finding it’s extremely useful to ask agents for implementation notes and for code reviews. These are different things, and when I ask for implementation notes I get very different output than the implementation summary it spits out automatically. I ask the agent to surface all design choices it had to make that we didn’t explicitly discuss in the plan, and then check in the plan + impl notes in order to help preload context for the next thing.
My team has been adopting a separation of plan & implement organically, we just noticed we got better output that way, plus Claude now suggests in plan mode to clear context first before implementing. We are starting to do team reviews on the plan before the implement phase. It’s often helpful to get more eyeballs on the plan and improve it.
How is that Plan strategy not "outsourcing your thinking" because that's exactly what it sounds like. AI does the heavy lifting and you are the editor.
Interesting take. Does that mean SWE's are outsourcing their thinking by relying on management to run the company, designers to do UX, support folks to handle customers?
Or is thinking about source code line by line the only valid form of thinking in the world?
I mean yes? That's like, the whole idea behind having a team. The art guy doesn't want to think about code, the coder doesn't want to think about finances, the accountant doesn't want to worry about customer support. It would be kind of a structural failure if you weren't outsourcing at least some of your thinking.
I’m with you, perhaps I just misread some kind of condescension into the “outsourcing your thinking” comment.
We all have limited context windows, the world’s always worked that way, just seemed odd to (mis)read someone saying there’s something wrong with focusing on when you add the greatest value and trusting others to do the same.
It is condescending when antis say AI users do it. It isn’t when a director or team leader does it.
But it’s the same process, which should tell you what’s really going on here. It’s about status, not functionality, and you don’t gain status without controlling other humans.
For me, it's less about being able to look back -800k tokens. It's about being able to flow a conversation for a lot longer without forcing compaction. Generally, I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.
Also, when you hit compaction at 200k tokens, that was probably when things were just getting good. The plan was in its final stage. The context had the hard-fought nuances discovered in the final moment. Or the agent just discovered some tiny important details after a crazy 100k token deep dive or flailing death cycle.
Now you have to compact and you don’t know what will survive. And the built-in UI doesn’t give you good tools like deleting old messages to free up space.
I've found compaction kills the whole thing. Important debug steps go completely missing, and the AI loops back around thinking it's found a solution when we've already done that step.
I find it useful to make Claude track the debugging session with a markdown file. It’s like a persistent memory for a long session over many context windows.
Or make a subagent do the debugging and let the main agent orchestrate it over many subagent sessions.
For the same reason people use link shorteners at all. It’s much more pleasant to look at and makes people more likely to press it compared to a paragraph-long URL full of tracking garbage.
Please. The URL above is pretty short, this is not the kind of URL link shorteners were made for, in fact it’s already shortened, as @alecco pointed out.
Pleasant? I could not care less about the pleasantness of the video code, but a shortened URL in this case would not be more pleasant, and it would be functionally worse, and barely shorter; all you’d be able to trim is the “?si=“. I’m baffled by this thread.
My point is that Google engineers go to the trouble of setting up a URL shortener service on one hand, but on the other hand it seems the ad-business anti-privacy executives can override anything. This points to a dysfunctional company.
You’d rather have the video code and the tracking code baked into the same code just to save a couple of characters? Why? That would result in a longer code than the video code alone, you would save very few characters. It would not be nicer to look at or functionally any different, and it would obscure the fact that it’s being tracked and prevent people from being able to edit the URL to remove the tracking. I appreciate the fact that I can see that the URL has a tracking ID and that I can edit the URL and remove the tracking ID. I do not want a shorter URL if I lose that ability. What you’re complaining about and wishing for would be MUCH worse than what it currently is.
Then your point eludes me. You complained about the length. If you don’t want it shorter, then what do you want?
To me, the fact that the tracking code is visible and separate from the video code is evidence of the complete opposite of your conclusion - it’s evidence the ad business does not get to override either engineering nor what’s left of privacy control. Ad execs would surely prefer that the tracking code is not visible nor manually removeable.
When running long autonomous tasks it is quite frequent to fill the context, even several times. You are out of the loop so it just happens if Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping a long context > small context + 2 compacts.
Yep I have an autonomous task where it has been running for 8 hours now and counting. It compacts context all the time. I’m pretty skeptical of the quality in long sessions like this so I have to run a follow on session to critically examine everything that was done. Long context will be great for this.
I get very useful code from long sessions. It's all about having a framework of clear documentation, a clear multi-step plan including validation against docs and critical code reviews, acceptance criteria, and closed-loop debugging (it can launch/restart the app, control it, and monitor logs).
I am heavily involved in developing those, and then routinely let opus run overnight and have either flawless or nearly flawless product in the morning.
I haven't figured out how to make use of tasks running that long yet, or maybe I just don't have a good use case for it yet. Or maybe I'm too cheap to pay for that many API calls.
My change cuts across multiple systems with many tests/static analysis/AI code reviews happening in CI. The agent keeps pushing new versions and waits for results until all of them come up clean, taking several iterations.
Right. At Opus 4.6 rates, once you're at 700k context, each tool call costs ~$1 just for cache reads alone. 100 tool calls = $100+ before you even count outputs. 'Standard pricing' is doing a lot of work here lol
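The ~$1 figure is plausible under Anthropic's published prompt-caching model, where cache reads are billed at roughly 10% of the input rate. The rates below are assumptions that may have changed; treat this as a back-of-the-envelope check, not current pricing.

```python
# Rough check of the ~$1-per-tool-call claim above.
# Assumed rates (verify against current pricing): Opus-class input
# at $15 per 1M tokens, cache reads billed at 10% of that, i.e.
# $1.50 per 1M cached tokens.
cache_read_per_mtok = 1.50   # USD per million cached tokens (assumed)
context_tokens = 700_000     # context re-read on every tool call

cost_per_call = context_tokens / 1_000_000 * cache_read_per_mtok
print(f"${cost_per_call:.2f} per tool call")
print(f"${cost_per_call * 100:.0f} for 100 tool calls")
```

Each tool call re-reads the whole cached context, which is why the per-call cost scales linearly with how full the window is, and why a mostly-full 1M window gets expensive fast.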
All of those things are smells, imo; you should be very wary of any code output from a task that causes that much thrashing. In most cases it's better to rewind or reset and adapt your prompt to avoid the looping (which usually means a more narrowly defined scope).
A person has a supervision budget. They can supervise one agent in a hands-on way or many mostly-hands-off agents. Even though theres some thrashing assistants still get farther as a team than a single micromanaged agent. At least that’s my experience.
Just curious, what kind of work are you doing where agentic workflows are consistently able to make notable progress semi-autonomously in parallel? Hearing people are doing this, supposedly productively/successfully, kind of blows my mind given my near-daily in-depth LLM usage on complex codebases spanning the full stack from backend to frontend. It's rare for me to have a conversation where the LLM (usually Opus 4.6 these days) lasts 30 minutes without losing the plot. And when it does last that long, I usually become the bottleneck in terms of having to think about design/product/engineering decisions; having more agents wouldn't be helpful even if they all functioned perfectly.
I've passed that bottleneck with a review task that produces engineering recommendations along six axes (encapsulation, decoupling, simplification, deduplication, security, reducing documentation drift) and an ideation task that gives, per component, a new feature idea, an idea to improve an existing feature, and an idea to expand a feature to be more useful. These two generate constant bulk work that I move into a new chat, where it's grouped by changeset and sent to a subagent to protect the context window.
What I'm doing mostly these days is maintaining a goal.md (project direction) and a spec.md (coding and process standards, global across projects). And for new macro-task development, I have one task in progress that is meant to automatically build PNG mockups and self-review.
I work on a 1M LOC, 15-year-old repo. Like you, it's across the full stack. Bugs in certain pieces of complex business logic would have catastrophic consequences for my employer. Basically I peel poorly-specified work items off my queue, each into its own worktree and session at high reasoning/effort, and provide a well-specified prompt.
These things eat into my supervision budget:
* LLM loses the plot and I have to nudge (like you)
* Thinking hard to better specify prompts (like you)
* Reviewing all changes (I do not vibe code except for spikes or other low-risk areas)
* Manual things I have to do (for things I have not yet automated with agent-authored scripts)
* Meetings
* etc
So, yes, my supervision budget is a bottleneck. I can only run 5-8 agents at a time because I have only so much time in the day.
Compare that vs a single agent at high reasoning/effort: I am sitting waiting for it to think. Waiting for it to find the code area I'm talking about takes time. Compiling, running tests, fixing compile errors. A million other things.
Any time I find myself sitting and waiting, this is a signal to me to switch to a different session.
It's kind of like having a 16 gallon gas tank in your car versus a 4 gallon tank. You don't need the bigger one the majority of the time, but the range anxiety that comes with the smaller one and annoyance when you DO need it is very real.
It seems possible, say a year or two from now that context is more like a smart human with a “small”, vs “medium” vs “large” working memory. The small fellow would be able to play some popular songs on the piano , the medium one plays in an orchestra professionally and the x-large is like Wagner composing Der Ring marathon opera. This is my current, admittedly not well informed mental model anyway. Well, at least we know we’ve got a little more time before the singularity :)
It’s more like the size of the desk the AI has to put sheets of paper on as a reference while it builds a Lego set. More desk area/context size = able to see more reference material = can do more steps in one go. I’ve lately been building checklists and having the LLM complete and check off a few tasks at a time, compacting in-between. With a large enough context I could just point it at a PLAN.md and tell it to go to work.
Since I'm yet to seriously dive into vibe coding or AI-assisted coding, does the IDE experience offer tracking a tally of the context size? (So you know when you're getting close or entering the "dumb zone")?
The two I know, Cursor and Claude Code, will give you a percentage used for the context window. So if you know the size of the window, you can deduce the number of tokens used.
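The deduction is just one multiplication; the window sizes in the example are the ones mentioned elsewhere in the thread, not something the tools report.

```python
# Turn the percentage the CLI reports back into an approximate
# token count, given the (assumed) size of the context window.
def tokens_used(percent_used, window_size):
    return int(window_size * percent_used / 100)

# e.g. the "keep usage under 40%" rule on a 200k window is the
# ~80k-token budget mentioned at the top of the thread:
print(tokens_used(40, 200_000))   # 80000
print(tokens_used(70, 1_000_000)) # 700000
```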
I never use these giant context windows. It is pointless. Agents are great at super focused work that is easy to re-do. Not sure what is the use case for giant context windows.
I've used it many times for long-running investigations. When I'm deep in the weeds with a ton of disassembly listings and memory dumps and such, I don't really want to interrupt all of that with a compaction or handoff cycle and risk losing important info. It seems to remain very capable with large contexts at least in that scenario.
After running a context window up high, probably near 70% on Opus 4.6 High, and watching it take 20% bites out of my 5-hour quota per prompt, I've been experimenting with dumping context after completing a task. Seems to be working OK. I wonder if I was running into the long-context premium. Would that apply to Pro subs, or is it just relevant to API pricing?
I haven't hit the "dumb zone" at all in the last two months. I think this talk is outdated.
I'm using CC (Opus) thinking and Codex with xhigh on always.
And the models have gotten really good when you let them do stuff where goals are verifiable by the model. I had Codex fix a Rust B-rep CSG classification pipeline successfully over the course of a week, unsupervised. It had a custom STEP viewer that would take screenshots and feed them back into the model so it could verify the progress (or the triangle soup indicating non-progress) itself.
Codex did all the planning and verification, CC wrote the code.
This would have not been possible six months ago at all from my experience.
Maybe with a lot of handholding; but I doubt it (I tried).
I mean both the problem for starters (requires a lot of spatial reasoning and connected math) and the autonomous implementation. Context compression was never an issue in the entire session, for either model.
Maybe. But that’s what I focused on, for better or worse. I couldn’t concentrate on what he was saying because of it. Maybe bad mic placement, but the end results was like some sort of old school phone sex pest.
I mean, try using Copilot on any substantial back-end codebase and watch it eat 90+% of the window just building a plan/checklist. Of course, Copilot is constrained to 120k, I believe? So having 10x that will blow open some doors that have been closed for me in my work so far.
That said, 120k is pleeenty if you’re just building front-end components and have your API spec on hand already.
No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ