An LLM company using *regexes* for sentiment analysis? That's like a truck compa...

lopsotronic · 2026-03-31T14:29:16 1774967356

The difference in response time - especially versus a regex running locally - is really difficult to express to someone who hasn't made much use of LLM calls in their natural language projects.

Someone said 10,000x slower, but that's off - in my experience - by about four orders of magnitude. And that's average, it gets much worse.

Now personally I would have maybe made a call through a "traditional" ML widget (scikit, numpy, spaCy, fastText, sentence-transformer, etc) but - for me anyway - that whole entire stack is Python. Transpiling all that to TS might be a maintenance burden I don't particularly feel like taking on. And on client facing code I'm not really sure it's even possible.

cyanydeez · 2026-03-31T14:33:48 1774967628

So, think of it as a business man: You don't really care if your customers swear or whatever, but you know that it'll generate bad headlines. So you gotta do something. Just like a door lock isn't designed for a master criminal, you don't need to design your filter for some master swearer; no, you design it good enough that it gives the impression that further tries are futile.

So yeah, you do what's less intensive to the cpu, but also, you do what's enough to prevent the majority of the concerns where a screenshot or log ends up showing blatant "unmoral" behavior.

true_religion · 2026-03-31T14:39:32 1774967972

This door lock doesn’t even work against people speaking French, so I think they could have tried a mite harder.

ben_w · 2026-03-31T15:37:54 1774971474

The upside of the US market is (almost) everyone there speaks English. The downside is, that includes all the well-networked pearl-clutchers. Europe (including France) will have the same people, but it's harder to coordinate a network of pearl-clutching between some saying "Il faut protéger nos enfants de cette vulgarité!" ("We must protect our children from this vulgarity!") and others saying "Η τηλεόραση και τα μέσα ενημέρωσης διαστρεβλώνουν τις αξίες μας!" ("Television and the media distort our values!") even when they care about the exact same media.

For headlines, that's enough.

For what's behind the pearl-clutching, for what leads to the headlines pandering to them being worth writing, I agree with everyone else on this thread saying a simple word list is weird and probably pointless. Not just for false-negatives, but also false-positives: the Latin influence on many European languages leads to one very big politically-incorrect-in-the-USA problem for all the EU products talking about anything "black" (which includes what's printed on some brands of dark chocolate, one of which I saw in Hungary even though Hungarian isn't a Latin language but an Ugric language and only takes influences from Latin).

sebastiennight · 2026-03-31T17:59:57 1774979997

I just went through quite an adventure trying to translate back and forth from/to Hungarian to/from different languages to figure out which Hungarian word you meant, and arrived at the conclusion that this language is encrypted against human comprehension.

bandie91 · 2026-03-31T19:18:57 1774984737

dark chocolate is "étcsokoládé" literally edible-chocolate in Hungarian.

I heard that the throat-clearing "Negró" candy (marketed by a chimney sweep with a soot-covered face) was usually the one that hurt English-speaking people's self-deprecating sensitivities.

sebastiennight · 2026-03-31T15:19:03 1774970343

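En toute honnêteté, je pense avoir dit "damn it" plus d'une fois à chat gépété avant de fermer la fenêtre dans un accès de rage (In all honesty, I think I've said "damn it" to ChatGPT more than once before closing the window in a fit of rage)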

tomaskafka · 2026-03-31T18:46:09 1774982769

Nom de dieu de putain de bordel de merde de saloperie de connard d'enculé de ta mère.

z500 · 2026-03-31T18:54:13 1774983253

It's like wiping your arse with silk.

bigbuppo · 2026-03-31T16:31:46 1774974706

There are only Americans on the internet.

themafia · 2026-03-31T19:10:38 1774984238

Yea.. but.. in English only.

Fortunately I can swear pretty well in Spanish.

senderista · 2026-04-01T05:34:32 1775021672

Only a native speaker can tell if you swear well in a foreign language.

themafia · 2026-04-01T06:13:49 1775024029

And Claude can't tell at all.

jacquesm · 2026-03-31T20:31:16 1774989076

That's like saying you can use a chisel for woodworking.

wcrossbow · 2026-03-31T16:46:56 1774975616

If it’s good enough it’s good enough, but just like there are many more options than going full-blown LLM or just using a regex, there are more options than transpiling a massive Python stack to TS or giving up.

mlmonkey · 2026-03-31T16:02:43 1774972963

> Someone said 10,000x slower, but that's off - in my experience - by about four orders of magnitude.

You do know that 10,000x _is_ four orders of magnitude, right? :-D

jonbwhite · 2026-03-31T16:12:27 1774973547

OP is saying that in their experience it is more like eight orders of magnitude
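
Read together, the two comments above amount to simple exponent arithmetic. A minimal sketch, using only the rough figures quoted in the thread (these are commenters' impressions, not measurements):

```python
claimed_slowdown = 10 ** 4   # "10,000x slower" is itself four orders of magnitude
correction = 10 ** 4         # "off by about four orders of magnitude"
revised_slowdown = claimed_slowdown * correction

# Count the orders of magnitude by counting the zeros after the leading 1.
orders = len(str(revised_slowdown)) - 1

print(revised_slowdown, orders)
```

So the original poster's estimate is closer to a hundred-million-fold difference, which is the eight orders of magnitude mentioned above.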

mlmonkey · 2026-03-31T16:57:13 1774976233

I guess I need reading glasses ... :-D

noprof6691 · 2026-03-31T16:21:56 1774974116

They're sending it to an llm anyway tho? Not sure why they wouldn't just add a sentiment field to the requested response shape.

FuckButtons · 2026-03-31T16:28:24 1774974504

because a regex on the client is free vs gpu compute is absolutely not.

noprof6691 · 2026-04-01T08:15:31 1775031331

BUT THEY'RE ALREADY RUNNING IT THROUGH THE LLM.

stingraycharles · 2026-03-31T11:05:59 1774955159

Because they want it to be executed quickly and cheaply without blocking the workflow? Doesn’t seem very weird to me at all.

_fizz_buzz_ · 2026-03-31T11:47:19 1774957639

They probably have statistics on it and saw that certain phrases happen over and over so why waste compute on inference.

crem · 2026-03-31T14:53:28 1774968808

More likely their LLM Agent just produced that regex and they didn't even notice.

mycall · 2026-03-31T12:08:29 1774958909

The problem with regex is multi-language support and how much the regex will bloat if you want to support even 10 languages.

doublesocket · 2026-03-31T12:51:29 1774961489

Supporting 10 different languages in regex is a drop in the ocean. The regex can be generated programmatically and you can compress regexes easily. We used to have a compressed regex that could match any placename or street name in the UK in a few MB of RAM. It was silly quick.

cogman10 · 2026-03-31T14:18:41 1774966721

I think it will depend on the language. There are a few non-latin languages where a simple word search likely won't be enough for a regex to properly apply.

mycall · 2026-04-01T13:22:12 1775049732

Exactly this. Unicode is a big beast to consider in regex concats.
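
The concern about non-Latin scripts can be illustrated with word boundaries. This is a small sketch assuming Python's `re` module, where `\w` is Unicode-aware for `str` patterns; the one-word lists are hypothetical and other regex engines may treat `\b` differently.

```python
import re

# Hypothetical one-word lists; a real list would be much longer.
english = re.compile(r"\b(?:awful)\b")
chinese = re.compile(r"\b(?:糟糕)\b")

# English separates words with spaces, so \b isolates the keyword.
english_hit = english.search("this is awful code")

# Chinese is written without spaces. Python's \w covers CJK characters, so
# there is no word boundary between 太 and 糟 and the pattern cannot match.
chinese_hit = chinese.search("这太糟糕了")

# Dropping \b finds the substring, at the price of substring false
# positives in languages that do separate words.
unbounded_hit = re.search("糟糕", "这太糟糕了")

print(english_hit is not None, chinese_hit is None, unbounded_hit is not None)
```

In other words, the same "wrap every keyword in `\b`" recipe silently stops working for scripts without explicit word separators, and such languages would need a tokenizer or a different matching strategy.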

astrocat · 2026-03-31T14:29:05 1774967345

woah. This is a regex use I've never heard of. I'd absolutely love to see a writeup on this approach - how it's done and when it's useful.

benlivengood · 2026-03-31T15:09:42 1774969782

You can literally | together every street address or other string you want to match in a giant disjunction, and then run a DFA/NFA minimization over that to get it down to a reasonable size. Maybe there are some fast regex simplification algorithms as well, but working directly with the finite automata has decades of research and probably can be more fully optimized.

doublesocket · 2026-03-31T18:22:32 1774981352

This was many moons ago, written in perl. From memory we used Regexp::Trie - https://metacpan.org/release/DANKOGAI/Regexp-Trie-0.02/view/...

We used it to tokenize search input and combined it with a solr backend. Worked really remarkably well.
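
The general idea in this sub-thread, folding a large word list into one compact pattern, can be sketched in Python. This is not Regexp::Trie itself, nor the original Perl code: `build_trie_regex` and the sample place names are made up for illustration, and it only shares common prefixes (a trie), which is weaker than a full DFA minimization.

```python
import re

def build_trie_regex(words):
    """Fold literal strings into one regex that shares common prefixes."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks the end of a complete word

    def to_pattern(node):
        # A node holding only the end marker contributes nothing further.
        if "" in node and len(node) == 1:
            return ""
        optional = "" in node  # a word ends here while longer ones continue
        branches = [re.escape(ch) + to_pattern(child)
                    for ch, child in sorted(node.items()) if ch != ""]
        if len(branches) == 1:
            return "(?:" + branches[0] + ")?" if optional else branches[0]
        body = "(?:" + "|".join(branches) + ")"
        return body + "?" if optional else body

    return to_pattern(trie)

# Hypothetical sample data standing in for a real gazetteer.
places = ["Bath", "Bathgate", "Batley", "Bedford"]
pattern = build_trie_regex(places)
matcher = re.compile(r"\b" + pattern + r"\b")

print(pattern)
print(matcher.findall("Bathgate to Bath via Batley"))
```

Because shared prefixes are emitted once, the pattern grows with the number of distinct branches rather than with the total length of the word list, which is one reason large lists can stay manageable.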

TeMPOraL · 2026-03-31T12:19:18 1774959558

We're talking about Claude Code. If you're coding and not writing or thinking in English, the agents and people reading that code will have bigger problems than a regexp missing a swear word :).

MetalSnake · 2026-03-31T12:35:17 1774960517

I talk to it in non-English. But I have rules to keep everything in code and documentation in English. Only speaking with me should use my native language. Why would that be a problem?

ekropotin · 2026-03-31T13:26:21 1774963581

Because 90% of training data was in English and therefore the model performs best in this language.

foldr · 2026-03-31T13:34:41 1774964081

In my experience these models work fine using another language, if it’s a widely spoken one. For example, sometimes I prompt in Spanish, just to practice. It doesn’t seem to affect the quality of code generation.

adamsb6 · 2026-03-31T13:57:51 1774965471

They literally just have to subtract the vector for the source language and add the vector for the target.

It’s the original use case for LLMs.

curioussquirrel · 2026-03-31T18:24:04 1774981444

Thank you. +1. There are obviously differences and things getting lost or slightly misaligned in the latent space, and these do cause degradation in reasoning quality, but the decline is very small in high resource languages.

ekropotin · 2026-03-31T15:11:10 1774969870

It’s just a subjective observation.

It just can’t be the case, simply because of how ML works. In short, the more diverse, high-quality, reasoning-rich example texts there were in the training set, the better the model performs in a given language.

So unless the Spanish subset had much more quality-dense examples, to make up for volume, there is no way the quality of reasoning in Spanish is on par with English.

I apologise for the rambling explanation; I'm sure someone with ML expertise here can explain it better.

omcnoe · 2026-03-31T18:52:16 1774983136

I saw a curious post recently that explored this idea, and showed that it isn’t really the case. The internal layers of the model aren’t really reasoning in English, or in any human language.

Translation in/out of human languages only happens at the edges of the model.

Internal layer activations for the same concept are similar regardless of language, while activations at the top/bottom layers diverge. Meanwhile the pattern is reversed for same language different content.

ekropotin · 2026-03-31T19:13:31 1774984411

So we do at least agree on the fact that the quality of the human language <-> embeddings transition depends on how well the target language is represented in the training dataset?

Even if it happens at the edge, on every conversation turn, I would assume small uncaptured subtleties of meaning can accumulate over time into significant error.

foldr · 2026-03-31T18:00:44 1774980044

The agent may be translating to English behind the scenes. There are all kinds of hacks layered on top of the underlying models. But however exactly it works, it’s clearly not the case that these things only make use of Spanish training data when you prompt them in Spanish.

Let's take a language with orders of magnitude less training data, like Nahuatl. The following is what happens if I ask Claude Code in Nahuatl to give me an example of a simple program that I can execute inside a web browser (avoiding any direct keyword prompts like 'JavaScript' or 'Hello World'). Interestingly, it responds in Spanish. I doubt that it has any relevant training data in Nahuatl. (Apologies if the Nahuatl is wrong – I just used Google Translate.)

    ¿Uelis techmakas se neskayotl tlen se programa tlen amo ouij tlen ueliskia nijchiuas ipan se navegador web?

    -----

    Entiendo tu pregunta en náhuatl. Aquí un ejemplo sencillo — una calculadora en HTML puro que puedes abrir directo en el navegador, sin servidor:

    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>Calculadora</title>
        <style>
        body { font-family: sans-serif; display: flex; justify-content: center; padding: 40px; }
        .calc { border: 1px solid #ccc; padding: 20px; border-radius: 8px; width: 200px; }
        input { width: 100%; margin-bottom: 10px; font-size: 1.2em; text-align: right; }
        button { width: 22%; margin: 2px; padding: 8px; font-size: 1em; cursor: pointer; }
        .wide { width: 47%; }
        </style>
    </head>
    <body>
    <div class="calc">
        <input type="text" id="pantalla" readonly>
        <div>
        <button onclick="tecla('7')">7</button>
        <button onclick="tecla('8')">8</button>
        <button onclick="tecla('9')">9</button>
        <button onclick="tecla('/')">÷</button>
    
        <button onclick="tecla('4')">4</button>
        <button onclick="tecla('5')">5</button>
        <button onclick="tecla('6')">6</button>
        <button onclick="tecla('*')">×</button>
    
        <button onclick="tecla('1')">1</button>
        <button onclick="tecla('2')">2</button>
        <button onclick="tecla('3')">3</button>
        <button onclick="tecla('-')">−</button>
    
        <button onclick="tecla('0')">0</button>
        <button onclick="tecla('.')">.</button>
        <button onclick="tecla('+')">+</button>
        <button onclick="limpiar()">C</button>
    
        <button class="wide" onclick="calcular()">=</button>
        <button class="wide" onclick="borrar()">⌫</button>
        </div>
    </div>
    
    <script>
        const pantalla = document.getElementById('pantalla');
    
        function tecla(val) {
        pantalla.value += val;
        }
    
        function calcular() {
        try {
            pantalla.value = eval(pantalla.value);
        } catch {
            pantalla.value = 'Error';
        }
        }
    
        function limpiar() { pantalla.value = ''; }
        function borrar() { pantalla.value = pantalla.value.slice(0, -1); }
    </script>
    </body>
    </html>
    
    Guarda esto como calculadora.html y ábrelo en cualquier navegador — no necesita servidor ni dependencias. Es un buen punto de partida para aprender HTML,
    CSS y JavaScript.

ekropotin · 2026-03-31T18:26:54 1774981614

> it’s clearly not the case that these things only make use of Spanish training data when you prompt them in Spanish.

It’s not! And I’ve never said that.

Anyways, I’m not even sure what we are arguing about, as it’s 100% fact that SOTA models perform better in English. The only interesting question here is how much better: is it negligible, or does it actually make a difference in real-world use cases?

foldr · 2026-03-31T18:35:51 1774982151

It’s negligible as far as I can tell. If the LLM can “speak” the language well then you can prompt it in that language and get more or less the same results as in English.

formerly_proven · 2026-03-31T12:55:29 1774961729

In my experience agents tend to (counterintuitively) perform better when the business language is not English / does not match the code's language. I'm assuming the increased attention mitigates the higher "cognitive" load.

cryptonector · 2026-03-31T15:42:45 1774971765

Claude handles human languages other than English just fine.

crimsonnoodle58 · 2026-03-31T13:06:45 1774962405

They only need to look at one language to get a statistically meaningful picture into common flaws with their model(s) or application.

If they want to drill down to flaws that only affect a particular language, then they could add a regex for that as well/instead.

b112 · 2026-03-31T12:15:57 1774959357

Did you just complain about bloat, in anything using npm?

Foobar8568 · 2026-03-31T11:47:49 1774957669

Why do you need to do it on the client side? You are leaking so much information on the client side. And considering the speed of Claude Code, if you really want to do it on the client side, a few seconds won't be a big deal.

plorntus · 2026-03-31T12:59:50 1774961990

Depends what it's used by. If I recall, there's an `/insights` command/skill (built in, whatever you want to call it) that generates an HTML file. I believe it gives you stats on when you're frustrated with it and (useless) suggestions on how to "use claude better".

Additionally, after looking at the source, it looks like a lot of Anthropic's own internal test tooling/debug (i.e. stuff stripped out at build time) is in this source mapping. There's one part that prompts their own users (or whatever) to use a report issue command whenever frustration is detected. It's possible it's using it for this.

matkoniecz · 2026-03-31T12:11:20 1774959080

> a few seconds won't be a big deal

it is not that slow

orphea · 2026-03-31T11:58:31 1774958311

It looks like it's just for logging, why does it need to block?

jflynn2 · 2026-03-31T12:16:46 1774959406

Better question - why would you call an LLM (expensive in compute terms) for something that a regex can do (cheap in compute terms)

Regex is going to be something like 10,000 times quicker than the quickest LLM call, multiply that by billions of prompts

orphea · 2026-03-31T12:55:52 1774961752

This is assuming the regex is doing a good job. It is not. Also you can embed a very tiny model if you really want to flag as many negatives as possible (I don't know anthropic's goal with this) - it would be quick and free.

gf000 · 2026-03-31T13:17:14 1774963034

I think it's a very reasonable tradeoff, getting 99% of true positives at a fraction of the cost (both runtime and engineering).

Besides, they probably do a separate analysis on server side either way, so they can check a true positive to false positive ratio.

nojs · 2026-03-31T15:37:37 1774971457

Oh it’s worse than that. This one ended up getting my account banned: https://github.com/anthropics/claude-code/issues/22284

lanbin · 2026-03-31T16:04:02 1774973042

This is a tricky problem, I mean, Pinyin also uses the English alphabet.

foodevl · 2026-03-31T18:17:43 1774981063

It is not a tricky problem because it has a simple and obvious solution: do not filter or block usage just because the input includes a word like "gun".

cryptonector · 2026-03-31T15:45:41 1774971941

Wow, that's horrible.

toraway · 2026-03-31T21:26:34 1774992394

... and closed for inactivity like basically every issue in the repo, of course.

blks · 2026-03-31T11:56:08 1774958168

Because they actually want it to work 100% of the time and cost nothing.

mohsen1 · 2026-03-31T13:55:14 1774965314

Maybe hard to believe but not everyone is speaking English to Claude

orphea · 2026-03-31T12:02:31 1774958551

Then they made it wrong. For example, "What the actual fuck?" is not getting flagged, neither is "What the *fuck*".

arcfour · 2026-03-31T13:42:51 1774964571

It is exceedingly obvious that the goal here is to catch at least 75-80% of negative sentiment and not to be exhaustive and pedantic and think of every possible way someone could express themselves.

Zamaamiro · 2026-03-31T13:55:25 1774965325

Classic over-engineering. Their approach is just fine 90% of the time for the use case it’s intended for.

orphea · 2026-03-31T14:19:55 1774966795

75-80% [1], 90%, 99% [2]. In other words, no one has any idea.

I doubt it's anywhere that high because even if you don't write anything fancy and simply capitalize the first word like you'd normally do at the beginning of a sentence, the regex won't flag it.

Anyway, I don't really care, might just as well be 99.99%. This is not a hill I'm going to die on :P

[1]: https://news.ycombinator.com/item?id=47587286

[2]: https://news.ycombinator.com/item?id=47586932

zwirbl · 2026-03-31T14:56:02 1774968962

It compares to lowercase input, so doesn't matter. The rest is still valid

morkalork · 2026-03-31T15:23:22 1774970602

Except that it's a list of English keywords. Swearing at the computer is the one thing I'll hear devs switch back to their native language for constantly

vntok · 2026-03-31T13:33:44 1774964024

They evidently ran a statistical analysis and determined that virtually no one uses those phrases as a quick retort to a model's unsatisfying answer... so they don't need to optimize for them.

codegladiator · 2026-03-31T11:15:02 1774955702

what you are suggesting would be like a truck company using trucks to move things within the truck

argee · 2026-03-31T11:25:01 1774956301

That’s what they do. Ever heard of a hand truck?

eadler · 2026-03-31T11:44:43 1774957483

I never knew the name of that device.

Thanks

freedomben · 2026-03-31T12:21:23 1774959683

Depending on the region you live in, it's also frequently called a "dolly"

SmellTheGlove · 2026-03-31T15:05:09 1774969509

Isn’t a dolly a flat 4 wheeled platform thingy? A hand truck is the two wheeled thing that tilts back.

eszed · 2026-03-31T15:43:45 1774971825

Ha! Where I'm from a "dolly" was the two-wheeled thing. The four-wheeler thing wasn't common before big-boxes took over the hardware business, but I think my dad would have called it a "cart", maybe a "hand-cart".

sethammons · 2026-04-01T09:16:12 1775034972

Grew up with two wheeled: dolly and four wheeled: piano dolly. Was an adult when I heard hand-truck. I prefer dolly. Nicer mouth feel.

istoleabread · 2026-03-31T11:48:04 1774957684

Do we have a hand llm perchance?

svnt · 2026-03-31T14:54:35 1774968875

Yeah it’s called a regex. With a lot of human assistance it can do less but fits in smaller spaces and doesn’t break down.

apgwoz · 2026-03-31T15:02:47 1774969367

It’s also deterministic, unlike llms…

floralhangnail · 2026-03-31T13:02:27 1774962147

Well, regex doesn't hallucinate....right?

raw_anon_1111 · 2026-03-31T15:51:35 1774972295

I just went to expertSexChange.com…

geon · 2026-03-31T15:23:28 1774970608

buttbuttination

mmh0000 · 2026-03-31T15:51:26 1774972286

The Clbuttical problem[1]

[1] https://en.wikipedia.org/wiki/Scunthorpe_problem
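
The "clbuttic" and "buttbuttination" jokes refer to what happens when a filter rewrites a substring wherever it appears. A minimal sketch of that failure mode, with hypothetical function names, next to a word-boundary variant:

```python
import re

def naive_censor(text):
    # Plain substring replacement: rewrites "ass" wherever it appears.
    return text.replace("ass", "butt")

def bounded_censor(text):
    # Same substitution, but only when "ass" stands alone as a word.
    return re.sub(r"\bass\b", "butt", text)

print(naive_censor("classic assassination"))
print(bounded_censor("classic assassination"))
print(bounded_censor("a stubborn ass"))
```

The naive version mangles innocent words, while the bounded version leaves them alone and still rewrites the standalone word.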

draxil · 2026-03-31T11:17:51 1774955871

Good to have more than a hammer in your toolbox!

lazysheepherd · 2026-03-31T13:50:40 1774965040

Because they are engineers? The difference between an engineer and a hobbyist is an engineer has to optimize the cost.

As they say: any idiot can build a bridge that stands, only an engineer can build a bridge that barely stands.

raw_anon_1111 · 2026-03-31T15:48:58 1774972138

Cloud hosted call centers using LLMs is one of my specialties. While I use an LLM for more nuanced sentiment analysis, I definitely use a list of keywords as a first level filter.

nitekode · 2026-03-31T16:22:59 1774974179

A lot of things don't make sense until you involve scale. Regex could be good enough to give a general gist.

j45 · 2026-03-31T14:13:17 1774966397

Asking a non deterministic software to act like a deterministic one (regex) can be a significantly higher use of tokens/compute for no benefit.

Some things will be much better with inference, others won’t be.

__alexs · 2026-03-31T13:42:43 1774964563

Using some ML to derive a sentiment regex seems like a good idea, actually?

ldobre · 2026-03-31T16:33:00 1774974780

It's more like a truck company using people to transport some parts. I could be wrong here, but I bet this happens in Volvo's factories a lot.

makeitrain · 2026-03-31T15:12:57 1774969977

Don’t worry, they used an llm to generate the regex.

irthomasthomas · 2026-03-31T14:13:12 1774966392

This just proves it's vibe coded because LLMs love writing solutions like that. I probably have a hundred examples just like it in my history.

irthomasthomas · 2026-03-31T17:27:00 1774978020

Actually, this could be a case where it's useful. Even if it only catches half the complaints, that's still a lot of data, far more than ordinary telemetry used to collect.

throwaw12 · 2026-03-31T12:27:56 1774960076

because impact of WTF might be lost in the result of the analysis if you solely rely on LLM.

parsing WTF with regex also signifies the impact and reduces the noise in metrics

"determinism > non-determinism" when you are analysing the sentiment, why not make some things more deterministic.

Cool thing about this solution, is that you can evaluate LLM sentiment accuracy against regex based approach and analyse discrepancies
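
The comparison suggested above could look roughly like this. It is a sketch only: the keyword list is hypothetical, the "LLM labels" are invented sample values rather than real results, and `discrepancies` is an illustrative helper name.

```python
import re

PATTERN = re.compile(r"\b(?:wtf|ffs)\b")  # hypothetical keyword list

def regex_flag(prompt):
    return PATTERN.search(prompt.lower()) is not None

def discrepancies(samples):
    """samples: list of (prompt, llm_says_negative) pairs.
    Returns (prompt, regex_flag, llm_label) where the two disagree."""
    return [(prompt, regex_flag(prompt), llm_label)
            for prompt, llm_label in samples
            if regex_flag(prompt) != llm_label]

# Invented sample data for illustration only; not real measurements.
samples = [
    ("wtf, that deleted my file", True),
    ("this is completely wrong again", True),  # negative, but no keyword
    ("thanks, works now", False),
]

print(discrepancies(samples))
```

Each disagreement is either a keyword the list is missing or a label worth double-checking, which is what makes the deterministic pass useful as a baseline.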

arnarbi · 2026-03-31T18:17:50 1774981070

It's more like workers on a large oil tanker using bicycles to move around it, rather than trying to use another oil tanker.

harikb · 2026-03-31T14:10:20 1774966220

Not everything done by claude-code is decided by LLM. They need the wrapper to be deterministic (or one-time generated) code?

ojr · 2026-03-31T11:46:36 1774957596

I used regexes in a similar way but my implementation was vibecoded, hmmm, using your analysis Claude Code writes code by hand.

intended · 2026-03-31T14:15:05 1774966505

The amount of trust and safety work that depends on google translate and the humble regex, beggars the imagination.

apgwoz · 2026-03-31T15:01:36 1774969296

> That's like a truck company using horses to transport parts. Weird choice.

Easy way to claim more “horse power.”

mghackerlady · 2026-03-31T13:08:46 1774962526

More like a car company transporting their shipments by truck. It's more efficient

pdntspa · 2026-03-31T15:50:45 1774972245

LLMs cost money, regular expressions are free. It really isn't so strange.

scotty79 · 2026-03-31T19:01:27 1774983687

As far as I can tell they do nothing with it. They just log it.

pfortuny · 2026-03-31T12:22:11 1774959731

They had the problem of sentiment analysis. They use regexes.

You know the drill.

artrockalter · 2026-03-31T20:39:01 1774989541

LLMs are good at writing complex regex, from my experience

kjshsh123 · 2026-03-31T12:30:37 1774960237

Using regex with LLMs isn't uncommon at all.

slashdave · 2026-04-01T05:41:57 1775022117

Maybe. Could just be a pre filter.

lou1306 · 2026-03-31T11:02:01 1774954921

They're searching for multiple substrings in a single pass, regexes are the optimal solution for that.

noosphr · 2026-03-31T11:09:28 1774955368

The issue isn't that regex are a solution to find a substring. The issue is that you shouldn't be looking for substrings in the first place.

This has buttbuttin energy. Welcome to the 80s I guess.

lou1306 · 2026-03-31T16:20:17 1774974017

> The issue is that you shouldn't be looking for substrings in the first place.

Why? They clearly just want to log conversations that are likely to display extreme user frustration with minimal overhead. They could do a full-blown NLP-driven sentiment analysis on every prompt but I reckon it would not be as cost-effective as this.

noosphr · 2026-03-31T20:14:00 1774988040

>Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The only time to use a regex is when searching with a human in the loop. All other uses are better handled some other way.

>They could do a full-blown NLP-driven sentiment analysis on every prompt but I reckon it would not be as cost-effective as this.

Every conversation is sent to an llm at least a thousand times the size of gpt2 which could one shot this nearly a decade ago.

lou1306 · 2026-04-01T06:58:44 1775026724

> Every conversation is sent to an llm at least a thousand times the size of gpt2 which could one shot this nearly a decade ago.

Yes, but that is _what the product does_. What we are talking about is _telemetry_.

8cvor6j844qw_d6 · 2026-03-31T11:37:41 1774957061

Very likely vibe coded.

I've seen Claude Code go with a regex approach for a similar sentiment-related task.

mr_00ff00 · 2026-03-31T15:34:11 1774971251

My understanding of vibe coding is when someone doesn’t look at the code and just uses prompts until the app “looks and acts” correct.

I doubt you are making regex and not looking at it, even if it was AI generated.

rdiddly · 2026-03-31T14:44:33 1774968273

Clbuttic!

BoppreH · 2026-03-31T11:10:45 1774955445

It's fast, but it'll miss a ton of cases. This feels like it would be better served by a prompt instruction, or an additional tiny neural network.

And some of the entries are too short and will create false positives. It'll match the word "offset" ("ffs"), for example. EDIT: no it won't, I missed the \b. Still sounds weird to me.

hk__2 · 2026-03-31T11:16:25 1774955785

It’s fast and it matches 80% of the cases. There’s no point in overengineering it.

NitpickLawyer · 2026-03-31T14:49:53 1774968593

> There’s no point in overengineering it.

I swear this whole thread about regexes is just fake rage at something, and I bet it'd be reversed had they used something heavier (omg, look they're using an LLM call where a simple regex would have worked, lul)...

vharuck · 2026-03-31T11:18:29 1774955909

The pattern only matches if both ends are word boundaries. So "diffs" won't match, but "Oh, ffs!" will. It's also why they had to use the pattern "shit(ty|tiest)" instead of just "shit".
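
The word-boundary behaviour described above can be checked directly. The pattern below is a hypothetical two-entry stand-in built only from the fragments quoted in this thread, not the actual list from the source; lowercasing the input first follows what another commenter reports.

```python
import re

# Hypothetical two-entry pattern; the real list is longer and not reproduced here.
FRUSTRATION = re.compile(r"\b(ffs|shit(ty|tiest))\b")

def is_flagged(prompt):
    # Lowercasing first makes the match case-insensitive without a regex flag.
    return FRUSTRATION.search(prompt.lower()) is not None

print(is_flagged("Oh, ffs!"))           # keyword stands alone as a word
print(is_flagged("show me the diffs"))  # "ffs" is inside a longer word
print(is_flagged("FFS, not again"))     # capitalised input is lowercased first

# With \b on both sides, a bare stem does not cover its inflections,
# which is why variants have to be spelled out in the pattern.
print(re.search(r"\bshit\b", "shitty code") is None)
```

The boundaries avoid the substring false positives, at the cost of having to enumerate every inflection that should count.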

BoppreH · 2026-03-31T11:21:08 1774956068

You're right, I missed the \b's. Thanks for the correction.

feketegy · 2026-03-31T14:05:20 1774965920

It's all regex anyways

make3 · 2026-03-31T15:28:05 1774970885

it's like a faster than light spaceship company using horses. There's been infinite solutions to do this better even CPU only for years lol.

sumtechguy · 2026-03-31T11:56:42 1774958202

hmm not a terrible idea (I think).

You have a semi-expensive process, but you want to keep particular known context out, so you put a quick and dirty search just in front of the expensive process. Instead of 'figure sentiment (20 seconds)', you have 'quick check sentiment (<1 second)' and then do the 'figure sentiment v2 (5 seconds)'. Now if it is just pure regex then your analogy would hold up just fine.

I could see me totally making a design choice like that.
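
The two-stage design described above, a cheap check in front of an expensive one, might be sketched as follows. The keyword list is hypothetical and `fake_model` is a stand-in stub for whatever slower classifier would sit behind the filter; nothing here reflects how the actual product is wired.

```python
import re
from typing import Callable, Optional

# Hypothetical keyword list for the cheap first pass.
QUICK_CHECK = re.compile(r"\b(?:wtf|ffs|useless)\b")

def classify(prompt: str, expensive_sentiment: Callable[[str], str]) -> Optional[str]:
    """Return a sentiment label, or None when the cheap check finds nothing."""
    if QUICK_CHECK.search(prompt.lower()) is None:
        return None  # most prompts stop here and never reach the slow path
    return expensive_sentiment(prompt)

# Stand-in for a slower model call; it records its inputs so usage is visible.
calls = []

def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return "frustrated"

first = classify("please add a test", fake_model)
second = classify("WTF is this", fake_model)
print(first, second, calls)
```

Only the second prompt reaches the stub, so the expensive step runs on the small fraction of inputs the regex lets through.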

sfn42 · 2026-03-31T14:37:33 1774967853

It's almost as if LLMs are unreliable