Hacker News

It is 100% ARC-AGI-3 specific, though; just read through the prompts: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...


What a dick move. Open-sourcing that prompt probably means every other model that doesn't want to cheat will scrape it anyway and accidentally cheat in its next version.


(Disclaimer: I worked on early versions of agentica_sdk, but wasn't involved in recent developments or the ARC solver.)

As other comments point out, this is about harness development and harness efficiency. Agentica SDK is a sort of meta-harness that makes things easy: plug any "internal API" (as defined natively in your codebase) directly into your agent. Agentica SDK itself is not application-specific, but the APIs of your application are... application specific.

Re: the linked prompt. A harness is a set of tools, descriptions of how to best use those tools, and sometimes some external control flow based on the outcomes of using those tools. How to "best use the tools" should always be part of the prompt (as in this case).
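That three-part shape (tools, usage descriptions, external control flow) can be sketched in a few lines of Python. Everything below is illustrative and hypothetical, not the Agentica SDK's actual API: `call_llm` stands in for whatever model call you use, and the game methods (`press`, `frame`, `won`) are made up for the example.

```python
# Hypothetical sketch of a harness: tools, descriptions of how to use
# them, and an outer control loop. None of these names are real APIs.

def make_tools(game):
    """Expose game controls as callable tools, each with a usage description."""
    return {
        "press": {
            "fn": lambda key: game.press(key),
            "doc": "Press a key: one of 'up', 'down', 'left', 'right', 'act'.",
        },
        "observe": {
            "fn": lambda: game.frame(),
            "doc": "Return the current grid state as a 2D list of ints.",
        },
    }

def run_harness(game, call_llm, max_steps=100):
    """External control flow: feed tool results back until the game is won."""
    tools = make_tools(game)
    history = []
    for _ in range(max_steps):
        # The prompt carries the "how to best use the tools" guidance
        # alongside the tool docs and the transcript so far.
        action = call_llm(tool_docs={k: v["doc"] for k, v in tools.items()},
                          history=history)
        result = tools[action.name]["fn"](*action.args)
        history.append((action, result))
        if game.won():
            return True
    return False
```

The point is that the application-specific knowledge lives in the `doc` strings and the prompt, while the loop itself is generic.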

So this work tries to answer: "short of telling the agent any solutions, make a simple but efficient API to play the games, hand it to the agent, and see how it does". In the world of harness development I think that's an interesting question to answer!


>In the world of harness development I think that's an interesting question to answer!

The challenge isn't about harness development though, and a sufficiently complex harness can solve these tasks rather easily.

And presenting it as if you've made a novel development in solving ARC-AGI-3 leads me to believe you're willing to waste everyone's time for your own benefit at every future step.


> a sufficiently complex harness can solve these tasks rather easily.

I claim this is not so easily done, and earlier iterations of ARC-AGI did not even have this constraint. You want something that generalizes across all the puzzles (hopefully even the private ones), and these puzzles are extremely diverse... and hard. Telling the model the controls and some basic guidelines for the game is the only "obvious" thing you can do.

The other point of my reply was efficiency, both in creating and in using the harness. The discussed solution is something anyone (in fact, likely even an LLM itself) could cook up in a few minutes; it's not much more than a game-control wrapper so the agent can play around with the game in live Python, plus some generalities as laid out in the prompt.
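For concreteness, here is the kind of few-minutes game-control wrapper being described: a tiny class the agent can drive from a live Python session. The backend callable, action names, and result fields are all assumptions for the sketch; the real ARC-AGI-3 client differs, and nothing below is Symbolica's code.

```python
# Hypothetical game-control wrapper. `step_fn` stands in for whatever
# client function the real game API provides; it is assumed to take an
# action string and return a dict like {"frame": ..., "done": bool}.

class GamePad:
    """Let an agent poke at a game interactively and keep a transcript."""

    ACTIONS = ("up", "down", "left", "right", "act")

    def __init__(self, step_fn):
        self.step_fn = step_fn
        self.transcript = []

    def act(self, action):
        """Send one action and record the result."""
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action {action!r}; try one of {self.ACTIONS}")
        result = self.step_fn(action)
        self.transcript.append((action, result))
        return result

    def replay(self):
        """Return the action sequence so far, for the agent to reflect on."""
        return [a for a, _ in self.transcript]
```

From a live session the agent just calls `pad.act("up")`, inspects the returned frame, and iterates; the transcript gives it something to reason over between moves.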

(But I'm always happy to be proven wrong. What harnesses did you have in mind?)


This is so disingenuous on Symbolica's part. These insincere announcements just make it harder for genuine attempts and novel ideas.


Um, yes, this is extremely specific as a benchmark harness. It has a ton of knowledge about the tasks at hand encoded in it. The tweet is dishonest even in the best light.

The hard part of these tests isn't purely reasoning ability ffs.




