The properties that the universal approximation theorem proves are not unique to neural networks.
Any model that operates in an infinite-dimensional Hilbert space, such as SVMs with RBF or polynomial kernels, Gaussian process regression, gradient-boosted decision trees, etc., has the same property (though proven via a different theorem, of course).
So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.
Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression.
@hodgehog11 The grokking phenomenon (Power et al. 2022) is a puzzle for the compression view: models trained on algorithmic tasks like modular arithmetic memorize training data first (near-zero training loss, near-random test accuracy) and then, after many more gradient steps, suddenly generalize. The transition happens long after any obvious compression pressure would have fired. Do you think grokking is consistent with implicit regularization as compression, or does it require a separate mechanism - something more like a phase transition in the weight norms or the Fourier frequency structure?
>Do you think grokking is consistent with implicit regularization as compression
Pretty sure it's been shown that grokking requires L1 regularization, which pushes model parameters towards zero. This can be viewed as compression in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.
Couldn't have said it better, although this is only for grokking with the modular addition task on networks with suitable architectures. L1 regularization is absolutely a clear form of compression. The modular addition example is one of the best cases to see the phenomenon in action.
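Here's a minimal sketch of that L1-as-compression point, using a toy numpy linear regression rather than the modular-arithmetic setup (the data, constants, and the proximal-update choice are all just illustrative assumptions):

```python
# Toy illustration (not the modular-arithmetic setup): gradient descent on a
# linear model with an L1 penalty. The penalty drives most weights to exactly
# zero, i.e. the fitted model can be described with fewer bits.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = rng.normal(size=5)           # only 5 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=n)

def fit(l1_strength, steps=5000, lr=0.01):
    w = rng.normal(size=d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n      # squared-error gradient
        w -= lr * grad
        # proximal step for the L1 term: soft-threshold weights toward zero
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1_strength, 0.0)
    return w

for lam in [0.0, 0.1]:
    w = fit(lam)
    print(f"lambda={lam}: {np.count_nonzero(np.abs(w) > 1e-8)} nonzero weights of {d}")
```

With no penalty all 50 weights stay nonzero; with the L1 term the spurious ones collapse to exactly zero, which is the "fewest bits" picture in miniature.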
I don't think that this is true. You need an infinite number of dimensions for this (think Taylor expansions, Fourier expansions, infinitely wide or deep NNs...)
I'll use 1NN as the interpolation strategy instead since I think it illustrates the same point and saves a few characters.
Recap: 1NN says that given a query Q, you choose from your learned "model" M (a finite set of (X, Y) pairs) the pair (X, Y) minimizing |Q-X|. Your output is Y.
The following kind of argument works for linear interpolation too (you can even view 1NN as 1-point interpolation), but it's ever so slightly messier: definitions vary a fair bit, you potentially need to talk about the existence of more than one discrete "nearest" or "enclosing" set of neighbors, and proving that you can get away with fewer points than 1NN, or achieve lower error than 1NN, is itself messier.
Pick your favorite continuous function on a compact domain embedded in some Euclidean space. For any target error you'd like to hit, the uniform continuity of that function guarantees that if your samples cover the domain well enough (no point in the domain is farther than some fixed distance from a point in your model, with smaller distances needed for lower errors), then the maximum error of the 1NN strategy is bounded by the error given by uniform continuity (which, again, you can make as small as you'd like by increasing the sampling resolution). Compactness of the domain means you can actually achieve those error bounds with finite sample sizes.
For a simple example, imagine fitting more and more, smaller and smaller, line segments to y=x^2 on [-1,1].
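A rough numerical version of this argument (plain numpy, with a dense query grid standing in for "all points in the domain" as an assumption): as the sample spacing shrinks, the worst-case 1NN error shrinks with it.

```python
# Minimal sketch of the 1NN argument above: approximate f(x) = x^2 on [-1, 1]
# with a finite set of (x, f(x)) samples, predicting via nearest neighbour.
# Doubling the sample count keeps roughly halving the worst-case error.
import numpy as np

def max_1nn_error(n_samples, n_queries=10_000):
    xs = np.linspace(-1.0, 1.0, n_samples)   # the finite "model": (X, Y) pairs
    ys = xs ** 2
    q = np.linspace(-1.0, 1.0, n_queries)    # dense grid of query points
    nearest = np.argmin(np.abs(q[:, None] - xs[None, :]), axis=1)
    return np.max(np.abs(ys[nearest] - q ** 2))

for n in [5, 10, 20, 40, 80]:
    print(n, max_1nn_error(n))
```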
> unlike f.e. which side of P/NP divide the problem is on
Actually, the P/NP divide is a similar case in my opinion. In practice a quadratic algorithm is sometimes unacceptably slow, while an NP-complete problem can often be solved for all practical purposes. E.g. SAT problems are routinely solved at scale.
An NP-complete problem can contain instances that are nowhere near the worst case.
It's similar to the gap between pushdown automata and Turing machines. You can check whether a pushdown automaton will terminate; you can't do that for Turing machines in general, but that doesn't stop you from running a pushdown-automaton algorithm on a Turing machine and still having decidable termination for that particular program.
It's very much necessary but not sufficient. In real life the sample complexity matters a lot too; it's also an asymptotic statement, but a more informative one. Compare how the central limit theorem is far more powerful than the law of large numbers.
I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.
It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there are a lot of things one can do to help ensure they work.
A funny thing is, in very high-dimensional spaces, with millions or billions of parameters, the chance that you get stuck in a local minimum is extremely small. Think about it like this: to be stuck in a local minimum in 2D, you only need 2 gradient components to be zero; in higher dimensions you'd need every single one of them, millions upon millions, to be zero simultaneously. It only takes one non-zero gradient component for SGD to get you out. And SGD is a stochastic walk on that manifold, not entirely random but rather noisy, so the chance that you walk into a local minimum and stay there is very low, unless it's a "really good" local minimum, in the sense that it dominates all the other local minima in its neighborhood.
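A toy numerical illustration of that dimensionality intuition, under the strong simplifying assumption that curvature at a critical point looks like a random symmetric matrix rather than a real loss Hessian:

```python
# Toy illustration of the dimensionality intuition above: to trap you, a
# critical point needs every direction to curve upward. Random symmetric
# "Hessians" are almost never positive definite once the dimension grows.
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(dim, trials=2000):
    count = 0
    for _ in range(trials):
        a = rng.normal(size=(dim, dim))
        h = (a + a.T) / 2                      # random symmetric matrix
        if np.all(np.linalg.eigvalsh(h) > 0):  # all directions curve upward?
            count += 1
    return count / trials

for dim in [1, 2, 4, 8, 16]:
    print(dim, frac_positive_definite(dim))
```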
You are essentially correct, which is why stochastic gradient optimizers induce a low-sharpness bias. However, there is an awful lot more that complicates things. There are plenty of wide minima that it can get stuck in far away from where people typically initialise, so the initialisation scheme proves extremely important (but is mostly done for you).
Perhaps more importantly, just because it is easy to escape any local minimum does not mean that there is necessarily a trend towards a really good optimum, as it can just bounce between a bunch of really bad ones for a long time. This actually happens almost all the time if you try to design your entire architecture from scratch, e.g. highly connected networks. People who are new to the field sometimes don't seem to understand why SGD doesn't just always fix everything; this is why. You need very strong inductive biases in your architecture design to ensure that the loss (which is data-dependent, so you cannot ascertain this property a priori) exhibits a global bowl-like shape (we often call this a 'funnel') to provide a general trajectory for the optimizer toward good solutions. Sometimes this only works for some optimizers and not others.
This is why architecture design is something of an art form, and explaining "why neural networks work so well" is a complex question involving a ton of parts, all of which contribute in meaningful ways. There are often plenty of counterexamples to any simpler explanation.
OK, but it's already known that you shouldn't initialize your network parameters to a single constant and should instead initialize them with random numbers.
Both you and the comment above are correct; initializing with iid elements ensures that correlations are not disastrous for training, but strong correlations are baked into the weights during training, so pretty much anything could potentially happen.
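As a small sketch of why constant initialization fails (a toy two-layer numpy net with made-up sizes and data, not how any particular framework does it): with identical starting weights every hidden unit receives the identical gradient, so the units never differentiate, whereas random initialization breaks that symmetry.

```python
# Minimal sketch: constant initialization keeps all hidden units identical
# through training, random initialization breaks the symmetry.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = rng.normal(size=(64, 1))

def train(W1, W2, lr=0.01, steps=200):
    for _ in range(steps):
        h = np.tanh(X @ W1)                 # hidden layer
        out = h @ W2
        d_out = 2 * (out - y) / len(X)      # squared-error gradient
        d_W2 = h.T @ d_out
        d_h = d_out @ W2.T * (1 - h ** 2)   # backprop through tanh
        d_W1 = X.T @ d_h
        W1 -= lr * d_W1
        W2 -= lr * d_W2
    return W1

const_W1 = train(np.full((3, 4), 0.5), np.full((4, 1), 0.5))
rand_W1 = train(rng.normal(size=(3, 4)) * 0.5, rng.normal(size=(4, 1)) * 0.5)
print("constant init, hidden columns all equal:", np.allclose(const_W1, const_W1[:, :1]))
print("random init,   hidden columns all equal:", np.allclose(rand_W1, rand_W1[:, :1]))
```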
Not a mathematician so I'm immediately out of my depth here (and butchering terminology), but it seems, intuitively, like the presence of a massive number of local minima wouldn't really be relevant for gradient descent. A given local minimum would need to have a "well" at least as large as your step size to reasonably capture your descent.
E.g. you could land perfectly on a local minimum, but you won't stay there unless your step size is minute or the minimum is quite substantial.
I believe what was meant was that, assuming local minima large enough to capture your probe and a sufficiently high density of them, you become extremely likely to get stuck. A counterpoint regarding dimensionality is made by the comment adjacent to yours.
or we could build like 200 more fission plants and rewild all that ethanol land. this factoid isn't really a compelling point in favor of solar land use...
That, however, would be vastly more expensive. Maybe worth it from an overall ecological PoV, but I doubt power companies have an appetite for the CAPEX involved.
wiki article states "Up to 10,000 TWh/yr of power could be generated from OTEC without affecting the ocean's thermal structure", which converts to about 500 GW, which... isn't that much
10,000 TWh/yr = 1e7 GWh/yr; divide by 365.25 days/yr for a daily output of 27,379 GWh/day, then by 24 h/day to get an average power of 1,141 GW. That's still more than a terawatt, roughly three orders of magnitude more than the largest nuclear reactors.
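Quick sanity check of that arithmetic (using 365.25 * 24 ≈ 8766 hours per average year):

```python
# TWh/yr -> GWh/yr -> average GW
twh_per_year = 10_000
gw_average = twh_per_year * 1e3 / (365.25 * 24)
print(round(gw_average))  # ~1141 GW
```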
I would not expect someone without a good knowledge of both javascript and the Ising model of ferromagnetism to make that in one hour. Especially now that google search is more and more crap, just looking for info would take longer.
Is this not just a cellular automaton? That's well within the usual range of college sophomore lab assignments.
To be honest the student may not necessarily care what the Ising model is, but they don't have to and neither does an LLM. It takes a very modest amount of code to apply some rules and update a grid of pixels (rough sketch below). At least when I was in school it was totally normal to expect students to make something like this in an hour.
It's actually kind of ironic that in this case such a simple project now means the opposite of what it did back then. Students got these assignments as a form of encouragement to show that their skills were immediately useful and that more "serious" science need not be so scary.
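For what it's worth, the grid-update rule really is only a handful of lines. Here's a rough sketch of a single-spin Metropolis update in Python (not JavaScript, and without the pixel rendering), with the lattice size and temperature picked arbitrarily:

```python
# Bare-bones 2D Ising model with single-spin Metropolis updates; printing the
# magnetization stands in for drawing pixels.
import numpy as np

rng = np.random.default_rng(0)
N, beta = 64, 0.6                        # lattice size, inverse temperature
spins = rng.choice([-1, 1], size=(N, N))

def sweep(spins):
    for _ in range(N * N):               # one Monte Carlo sweep
        i, j = rng.integers(0, N, size=2)
        # sum of the four nearest neighbours (periodic boundaries)
        nb = (spins[(i + 1) % N, j] + spins[(i - 1) % N, j] +
              spins[i, (j + 1) % N] + spins[i, (j - 1) % N])
        dE = 2 * spins[i, j] * nb        # energy change if we flip this spin
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1

for step in range(20):
    sweep(spins)
    print(step, spins.mean())            # magnetization drifts toward +/-1 below T_c
```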
the better question is why gradient descent works for them