I’m curious about how you landed “git gud; prompt better” and not “maybe the domain I work in is a better fit for LLM code”. Or, to be a bit less generous, consider the possibility that the code you’re generating is boilerplate, marshaling, and/or API calls. A facade of perceived complexity over something that’s as complex as a filter-map or two.
In the past 2 months I've been using all the SOTA models to help me design a new DSL for narrative scripting (such as game storytelling) and a C# runtime implementation of the script player engine.
The language spec and design is about 95% authored by me up to this point; I have the LLMs work on the 2nd layer (the implementation specs/guidelines) and the 3rd layer (the concrete C# implementation).
Since it's a new language, I consider it a somewhat new/novel task for LLMs (at least, not boilerplate stuff like an HTTP API or a CRUD service). I'd say these LLMs have been very helpful - you can tell they sometimes get confused and have trouble to comply to the foreign language spec and design - but they are mostly smart enough to carry out the objectives, and they got better and better once the project was on track and had plenty of files/resources to read and reference.
And I'd also say "prompt better" is an important factor, just much more nuanced/complicated. I started with 0 experience with LLM agents, have learned a lot about how to tame them, and developed a protocol to collaborate with agents. This all comes from countless rounds of trial and error, but in the end it boils down to "prompt better".
I wonder if my intuition here is correct; I would posit that “PL implementation” is a far more popular and well-explored field than it seems. How many toy/small/labor-of-love langs make it to Show HN? How many more simply don’t?
I’ve never personally caught the language implementation bug. I appreciate your perspective here.
I totally agree, and I was fully aware of how commonly people make languages for fun when I replied.
But I feel like the rationale still stands: considering LLMs' nature, common boilerplate tasks are easy because they can kind of just "decompress" from training data. But for a new language design, unless the language is almost identical to some other language captured by the model, "decompression" would just fail.
As someone who has implemented a fair few DSLs, lexical and syntactic analysis is pretty much the same anywhere, and the structure of the lexer/parser does not really depend on the grammar of the language.
And even semantic analysis is at least very similar in most PLs. Even DSLs. Assuming you're using concepts like variables and functions.
When it comes to codegen / interpreter runtimes, things start to diverge. But this also depends on the use case. More often than not a DSL is a one-to-one map to an existing language, with syntactic sugar on top.
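To make the lexer point concrete, here is a minimal sketch (not taken from any real DSL in this thread) of a table-driven tokenizer. The names `make_lexer`, `ARITHMETIC`, and `ASSIGNMENT` are made up for illustration; the only thing that changes between the two toy languages is the token table, while the lexing loop stays the same.

```python
import re

def make_lexer(token_spec):
    """Build a lexer from a list of (TOKEN_NAME, regex) pairs.

    The loop below never changes; only token_spec is language-specific.
    Assumes the regexes contain no capture groups of their own.
    """
    pattern = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in token_spec))

    def lex(text):
        tokens = []
        pos = 0
        while pos < len(text):
            match = pattern.match(text, pos)
            if match is None:
                raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
            if match.lastgroup != "SKIP":  # drop whitespace tokens
                tokens.append((match.lastgroup, match.group()))
            pos = match.end()
        return tokens

    return lex

# Two hypothetical toy languages sharing the same lexer structure.
ARITHMETIC = [("NUMBER", r"\d+"), ("OP", r"[+\-*/()]"), ("SKIP", r"\s+")]
ASSIGNMENT = [("NAME", r"[A-Za-z_]\w*"), ("EQUALS", r"="),
              ("STRING", r'"[^"]*"'), ("SKIP", r"\s+")]

lex_arithmetic = make_lexer(ARITHMETIC)
lex_assignment = make_lexer(ASSIGNMENT)

if __name__ == "__main__":
    print(lex_arithmetic("2 * (3 + 41)"))
    print(lex_assignment('say = "hello"'))
```

A real lexer would also track line/column positions and handle escapes, but those additions do not depend on the grammar either.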
The points you brought up all are valid. Lexer, parser and general concepts are not language-specific, yes, and I wasn't talking about how the implementation is different.
When I said "you can tell they sometimes get confused and have trouble to comply to the foreign language spec and design", I was thinking about the many times they just fail to write in my language even when provided with full language specs. LLMs don't "think", and boilerplate is easy for LLMs because highly similar syntax structures, even identical code, exist in their training data; they are kind of just copying stuff. But that doesn't work that well when they are tasked to write in an original language that is... too creative.
I am prompting better. It doesn't help the LLM be more productive than me on a regular Tuesday.
Sure, I can get the task done by delegating everything to an agentic workflow, but it just adds a bunch of useless overhead to my work.
I still need to know what the code does at the end of the day, so I can document it and reason about it. If I write the code myself, it's easy. If an LLM does it, it's a chore.
And even without those concerns, the LLM is still slower than me. Unless it's trivial boilerplate, in which case other tools serve me better and cheaper.
I'll note that a compiler is one of the most well understood and implemented software projects, much of it open source, which means the LLM has a lot of prior art that it can copy.
When web search first arrived, the same thing happened. That is, some people didn't like using the tool because it wasn't finding what they wanted. This is still true for a lot of folks today, actually.
It's less "git gud; prompt better", and more, "be able to explain (well) what you want as the output". If someone messages the IT guy and says "hey my computer is broken" - what sort of helpful information can the IT guy offer beyond "turn it on and off again"?
So how do you reconcile your anecdotal experience with the experiences reported by public figures in the industry who we can all agree are at least pretty good engineers? I think that's important because if we want to stay ~anonymous, neither you nor I can verify the reputation of one another (and therefore, one another's relative mastery of the "Craft").
Here are some well known names who are now saying they regularly use LLM's for development. For many of these folks, that wasn't true 1-2 years ago:
My point being - some random guy on the internet says LLMs have never been useful for them and they only output garbage vs. some of the best engineers in the field using the same tools, and saying the exact opposite of what you are.
>Here are some well known names who are now saying they regularly use LLM's for development. For many of these folks, that wasn't true 1-2 years ago:
This is a huge overstatement that isn't supported by your own links.
- Donald Knuth: the link is him acknowledging someone else solved one of his open problems with Claude. Quote: "It seems that I’ll have to revise my opinions about “generative AI” one of these days."
- Linus Torvalds: used it to write a tool in Python because "I know more about analog filters—and that’s not saying much—than I do about python" and he doesn't care to learn. He's using it as a copy-paste replacement, not to write the kernel.
- John Carmack: he's literally just opining on what he thinks will happen in the future.
You are overstating those sources. That alone makes me doubt that you're engaging in this discussion in good faith.
I read them all, and in none of them do any of the three say that they "regularly use LLMs for development".
Carmack is speculating about how the technology will develop. And Carmack has a vested interest in AI, so I would not put any value on this as an "engineer's opinion".
Torvalds has vibe coded one visualizer for a hobby project. That's within what I might use to test out LLM output: simple, inconsequential, contained. There's no indication in that article that Linus is using LLMs for any serious development work.
Knuth is reporting about somebody else using LLMs for mathematical proofs. The domain of mathematical proofs is much more suitable for LLM work, because the LLM can be guided by checking the correctness of proofs.
And Knuth himself only used the partial proof sent in by someone else as inspiration for a handcrafted proof.
I don't mind arguing this case with you, but please don't fabricate facts. That's dishonest.
> I’m curious about how you landed “git gud; prompt better” and not “maybe the domain I work in is a better fit for LLM code”.
1. Personal experience. Lazy prompting vs careful prompting.
2. They're coincidentally good at things I'm good at, and shit at things I don't understand.
3. Following from 2, when used by somebody who does understand a problem space which I do not, they easily succeed. That dog vibe coding games succeeded in getting Claude to write games because his master knew a thing or two about it. I on the other hand have no game dev experience, even almost no hobby experience with games specifically, so I struggle to get any game code that even remotely works.
Irrespective of the domain you specifically listed in 3 (game dev is, believe it or not, one of the “more complex” domains), you have completely missed the point.
> 2. They're coincidentally good at things I'm good at, and shit at things I don't understand.
This may well be! In the perfect world this would be balanced with the knowledge that maybe “the things you’re good at” are objectively* easier than “things you don’t understand”. Speaking for myself, I’m proficient in many more easy things than hard things.
I have definitely considered the possibility that I'm simply good at easy things and the LLM is good at easy things, and that hard things are hard for both of us. And there certainly must be some element of that going on, but I keep noticing that different people get different quality results for the same kind of problems, and it seems to line up with how good they themselves would be at that task. If you know the problem space well, you can describe the problem (and approaches to it) with a precision that people unfamiliar with the problem space will struggle with.
I think you can observe this in action by making a vague request, seeing how it does, then rolling back that work, making a more precise request using relevant jargon, and comparing the results. For example, I asked Claude to make a system that recommends files with similar tags. It gave me a recommender that just orders files by how many tags they have in common with the query file. This is the kind of solution that somebody may think up quickly, but it doesn't actually work great in practice. Then I reverted all of that and instead specified that it should use a vector space model with cosine similarity. It did pretty well, but there was something subtly off. That is, however, about the limit of my expertise in this direction, so I tabbed over to a session with ChatGPT and discussed the problem on a high level for about 20 minutes, then asked ChatGPT to write up a single terse, technically precise paragraph describing the problem. I told ChatGPT to use no bullet points and write no pseudocode, telling it the coding agent was already an expert in the codebase, so let it worry about the coding. I gave that paragraph to Claude and suddenly it clicked; it banged out a working solution without any drama. So I conclude the quality of the prompting determined the quality of the results.
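For anyone who wants to see the gap between the two requests, here is a rough sketch of both scoring rules. This is not the code Claude produced and it does not attempt the later fix, which the comment above doesn't describe; the file names, tags, and function names are invented for illustration. It treats each file as a binary tag vector, so cosine similarity reduces to shared tags divided by the square root of the product of the two tag counts.

```python
import math

def overlap_score(query_tags, file_tags):
    """Naive rule: count the tags two files have in common."""
    return len(query_tags & file_tags)

def cosine_score(query_tags, file_tags):
    """Cosine similarity between binary tag vectors.

    dot product = number of shared tags; each norm = sqrt(number of tags).
    """
    if not query_tags or not file_tags:
        return 0.0
    shared = len(query_tags & file_tags)
    return shared / math.sqrt(len(query_tags) * len(file_tags))

def rank(query_tags, files, score):
    """Order file names by descending score, breaking ties by name."""
    return sorted(files, key=lambda name: (-score(query_tags, files[name]), name))

# Hypothetical data: one file tagged with everything, one exact match, one partial.
FILES = {
    "exact.md": {"rust", "parser"},
    "kitchen_sink.md": {"rust", "parser", "web", "sql", "css", "go", "ml", "ops"},
    "partial.md": {"rust"},
}
QUERY = {"rust", "parser"}

if __name__ == "__main__":
    print(rank(QUERY, FILES, overlap_score))
    print(rank(QUERY, FILES, cosine_score))
```

With overlap counting, the file tagged with everything ties with the exact match; cosine similarity penalizes it for its many unrelated tags and ranks it below the partial match.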
The parent is specifically talking about producing boilerplate code (a domain in which LLMs excel) and not having had any success at that. It's therefore not a leap of logic to assume they haven't put (enough) effort into getting better at prompting first, which is perfectly fine per se but leans towards a skill issue and not an immutable property of gen AI.
The uncomfortable fact remains that one cannot really expect to get much better results from an LLM without putting in some work themselves. They aren't magical oracles.
It is straightforward to build systems which derive their state from the audit trail instead of building the audit trail in parallel. That is what event sourcing is.
I was attempting to emphasize the absurdity of any software system being “absolutely correct at all times”. I don’t believe such a system can exist, at least not in such strong terms.
What's important is that the audit trail can be replayed to derive the state of the system - and preferably in such a way that investigators can determine what _would_ have been seen by someone using it on a specific day at a specific time. Whether the system is free from bugs is a different matter - no system is, which is why deriving state from the audit trail instead of a parallel process which is guaranteed to diverge is so important!
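A minimal sketch of what that replay looks like, assuming a toy account-balance domain (the `Event` type, the event kinds, and `state_as_of` are all invented for illustration, not from any particular event sourcing library). The log is the only thing stored; the state is always computed by folding over it, so asking "what would someone have seen at time T" is just a replay that stops at T.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    at: datetime   # when the event was recorded
    kind: str      # "deposited" or "withdrew" (hypothetical event kinds)
    amount: int

def apply(balance, event):
    """Pure transition function: previous state + event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")

def state_as_of(log, when):
    """Replay the audit trail up to and including `when`."""
    balance = 0
    for event in sorted(log, key=lambda e: e.at):
        if event.at > when:
            break
        balance = apply(balance, event)
    return balance

LOG = [
    Event(datetime(2024, 1, 1, 9, 0), "deposited", 100),
    Event(datetime(2024, 1, 2, 9, 0), "withdrew", 30),
    Event(datetime(2024, 1, 3, 9, 0), "deposited", 5),
]

if __name__ == "__main__":
    print(state_as_of(LOG, datetime(2024, 1, 2, 12, 0)))  # balance seen on Jan 2 at noon
    print(state_as_of(LOG, datetime(2024, 1, 31)))        # current balance
```

If `apply` has a bug, the derived state is wrong, but it is wrong consistently with the log, and fixing `apply` and replaying corrects it. There is no second copy of the truth to drift out of sync.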