
I guess I have the opposite experience. I have a post-graduate level of mathematical education and I am dismayed at how little there is to be gained from it when it comes to AI/ML. Diffusion models and geometric deep learning are the only two fields where there's any math at all. Many math grads are struggling to find a job at all; they aren't outclassing programmers with their leet math skillz.


The real use is in actually seeing connections. Every field has its own math, its own terminology, its own assumptions for theorems, and so on.

More often than not this is duplicated work (mathematically speaking), and there is a lot to be gained by sharing advances across fields by running them through a "translation". This has happened many times historically - a lot of the "we met at a cafe and worked it out on a napkin" inventions are exactly that.

Math proficiency helps a lot with that; the level of abstraction you deal with is naturally high.

Recently, the problem of knowing every field well enough, even if only cursorily, to make connections has become easier with AI. Modern LLMs do approximate retrieval but still need a planner and a verifier; the mathematician can be that.

This is somewhat adjacent to what Terry Tao spoke about, and the setup is sort of what AlphaEvolve does.

You get that impression because such advances are high impact and rare (because they are difficult). Most advances come as a sequence of field-specific assumptions, field-specific empirical observations, field-specific theorems, and so on. We only see the advances that are actually made, which leads to an observation bias.


Don't worry: when stochastic grads get stuck, math grads get going.

(One of) The value(s) that a math grad brings is debugging and fixing ML models when training fails. Many would have no idea how to even begin debugging why a trained model is not working well, let alone how to explore fixes.


Debugging ML models (a large part of my job) requires very little math. Engineering experience and an engineering mindset are a lot more relevant for debugging. Complicated math is typically needed when you want to invent new loss functions, or new methods for regularization, normalization, or model compression.


You are perhaps talking about some simple plumbing bugs. There are other kinds:

Why didn't the training converge?

Validation/test errors are great, but why is performance in the wild so poor?

Why is the model converging so soon?

Why is this all zero?

Why is this NaN?

Model performance is not great; do I need to move to something more complicated, or am I doing something wrong?

Did the nature of the upstream data change?

Sometimes this feature is missing; how should I deal with that?

The training set and the data on which the model will be deployed are different. How to address this?

The labelers labelled only the instances that are easy to label, not ones chosen uniformly from the data. How to train with such skewed label selection?

I need to update the model with a few thousand new data points, but not train from scratch. How do I do it?

The model is too large; which doubles (float64) can I replace with float32?

So on and so forth. Many times models are given up on prematurely because the team lacks the expertise to investigate lackluster performance.
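
For the "all zero" / "NaN" items above, the first localization step can at least be made mechanical. A minimal sketch (assuming PyTorch; the function name and printouts are illustrative, not a library API):

  import torch

  def attach_nan_probes(model: torch.nn.Module):
      # Register forward hooks that flag the first module whose output
      # turns NaN/Inf or collapses to all zeros during a forward pass.
      handles = []
      def make_hook(name):
          def hook(module, inputs, output):
              if isinstance(output, torch.Tensor):
                  if torch.isnan(output).any() or torch.isinf(output).any():
                      print(f"{name}: NaN/Inf in output")
                  elif (output == 0).all():
                      print(f"{name}: output is all zeros")
          return hook
      for name, module in model.named_modules():
          handles.append(module.register_forward_hook(make_hook(name)))
      return handles  # call .remove() on each handle when done

Finding where it breaks is the easy half; the deeper "why" - the root cause upstream of the overflow or the dead activations - is where the real analysis starts.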


Literally every single example you provided does not require much in the way of math fundamentals - just basic ML engineering knowledge. Are you saying that understanding things like numerical overflow or exploding gradients requires a sophisticated math background?


Numerical overflow, mostly no; but in the case of exploding gradients, yes - especially when it comes to devising a way to handle it on your own, from scratch. After all, it took the research community some time to figure out a fix for that.
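
For reference, the fix that eventually became standard is gradient clipping. A minimal sketch of one training step (assuming PyTorch; model, opt, loss_fn, and the batch are placeholders):

  import torch

  def training_step(model, opt, loss_fn, x, y, max_norm=1.0):
      # One optimization step with gradient-norm clipping: rescale the
      # global gradient norm to at most max_norm before stepping, the
      # standard remedy once gradients start exploding.
      opt.zero_grad()
      loss = loss_fn(model(x), y)
      loss.backward()
      torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
      opt.step()
      return loss.item()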

But the examples you quoted were not my examples, at least not their primary movers (the NaNs could be caused by overflow, but that overflow can have a deeper cause). The examples I gave have/had very different root causes at play, and the fixes required some facility with math - not to the extent that you have to be capable of discovering new math, or anything as complicated as the geometry and topology of strings, but nonetheless math at the level of grad school, or of an advanced and gifted undergrad.

Coming back to the numeric overflow you mention: I can imagine a software engineer eventually figuring out that overflow was the root cause (sometimes they will not). However, there is quite a gap between recognizing the overflow and having the numerical-analysis knowledge that will guide a fix.
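
A concrete instance of that gap: overflow inside a softmax or a log-likelihood. Recognizing the overflow is one thing; knowing the log-sum-exp shift, a standard numerical-analysis identity, is what guides the fix. A sketch in NumPy:

  import numpy as np

  def log_sum_exp(z):
      # Naive np.log(np.sum(np.exp(z))) overflows once any z_i is
      # around 700 in float64. Shifting by m = max(z) uses the identity
      #   log sum_i exp(z_i) = m + log sum_i exp(z_i - m)
      # so every exponent is <= 0 and cannot overflow.
      z = np.asarray(z, dtype=float)
      m = np.max(z)
      return m + np.log(np.sum(np.exp(z - m)))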

You say > "literally every single example"... can be dealt with without much math. I would be very keen to learn from you how to deal with this one, say, without much math:

   The labelers labelled only
   the instances that are
   easy to label, not chosen
   uniformly from the data.
   How to train with such
   skewed label selection 
   (without relabeling properly)
This is not a gotcha but genuine curiosity, because it is always useful to understand a solution different from one's own (mine).


Maybe I don’t understand this data labeling issue - are you talking about an imbalanced classification dataset? Are hard classes under-represented, or are their labels missing completely?


None of those (but they could be added to the mix to complicate matters).

Consider the case where the labeler creates the labelled training set by cherry-picking the examples that are easy to label. They label many items, but select which ones to label according to their own preference.

First question: is this even a problem? Yes, most likely. But why? How to fix it? When are such fixes even possible?


Yes, this is a problem - the most challenging samples might not even be present in your training data. This means your model will not perform well if the real-world data contains lots of challenging samples.

This can be partially solved if we make some assumptions about the labeller:

1. they have still picked enough challenging samples;

2. their preferences are based on features you care about;

3. they labelled the challenging samples correctly.

And probably some other assumptions should hold for the distribution of labels, etc. But what we can do in this situation is first try to model the labeller's preferences by training a binary classifier: how likely is it that they would choose this sample for labelling, given the real-world distribution? With that classifier trained, we can derive a sample weight from its predicted probability when preparing our training dataset (less likely samples get more weight, i.e., weights inversely proportional to the predicted selection probability). This would force our main classifier to pay more attention to the challenging samples during training.
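
A minimal sketch of that scheme (assuming scikit-learn; X_labeled is what the labeller picked, X_wild an unlabelled sample from the real-world distribution; all names are illustrative):

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  def selection_weights(X_labeled, X_wild, eps=1e-3):
      # Binary classifier: was this point selected for labelling (1)
      # or drawn from the wild (0)? Its predicted probability estimates
      # the labeller's selection propensity.
      X = np.vstack([X_labeled, X_wild])
      s = np.concatenate([np.ones(len(X_labeled)), np.zeros(len(X_wild))])
      clf = LogisticRegression(max_iter=1000).fit(X, s)
      p = clf.predict_proba(X_labeled)[:, 1].clip(eps, 1 - eps)
      # Inverse-propensity weights: rarely selected regions get
      # upweighted; pass these as sample_weight to the main model.
      w = (1 - p) / p
      return w / w.mean()

The clipping at eps is a crude variance control; without it, rarely selected regions can receive enormous weights.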

This could help somewhat if all the assumptions hold, but in practice I would not expect much improvement, and the scheme above can easily make things worse - this problem really needs to be solved by better labelling.

How did you solve it?


By using the (estimated) Radon-Nikodym derivative between the two measures -- the measure from which the labelers sample and the deployment measure from which the on-deployment items are presumably drawn.

For this to work, the deployment measure needs to be absolutely continuous with respect to the labeling measure.

This is close to your antepenultimate paragraph, and that's mathy enough. Done right, this can take care of the bias, but it may do so at the expense of variance, so the Radon-Nikodym derivative has to be estimated under appropriate regularization in the function space.
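
In symbols (a sketch, writing P for the deployment measure and Q for the labelling measure, with P absolutely continuous with respect to Q):

  \mathbb{E}_{P}[\ell(x)] = \int \ell(x)\, dP(x)
                          = \int \ell(x)\, \frac{dP}{dQ}(x)\, dQ(x)
                          = \mathbb{E}_{Q}\!\Big[\frac{dP}{dQ}(x)\, \ell(x)\Big]

so minimizing the loss on labelled data, reweighted by the estimated dP/dQ, targets the deployment risk in expectation.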

Thinking of the solution in these terms requires mathematical thinking.

Now let's consider the case where some features may be missing on instances at deployment time but are always present in training, and the features are uncorrelated with each other (by construction).



