The properties that the universal approximation theorem proves are not unique to neural networks.
Any model that operates in an infinite-dimensional Hilbert space, such as SVMs with RBF or polynomial kernels, Gaussian process regression, gradient-boosted decision trees, etc., has the same property (though proven via a different theorem, of course).
So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.
Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression.
@hodgehog11 The grokking phenomenon (Power et al. 2022) is a puzzle for the compression view: models trained on algorithmic tasks like modular arithmetic memorize training data first (near-zero training loss, near-random test accuracy) and then, after many more gradient steps, suddenly generalize. The transition happens long after any obvious compression pressure would have fired. Do you think grokking is consistent with implicit regularization as compression, or does it require a separate mechanism - something more like a phase transition in the weight norms or the Fourier frequency structure?
>Do you think grokking is consistent with implicit regularization as compression
Pretty sure it's been shown that grokking requires L1 regularization, which pushes model parameters towards zero. This can be viewed as compression in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.
Couldn't have said it better, although this is only for grokking with the modular addition task on networks with suitable architectures. L1 regularization is absolutely a clear form of compression. The modular addition example is one of the best cases to see the phenomenon in action.
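Here's a minimal sketch of that L1-as-compression point, using a toy numpy linear regression rather than the modular-arithmetic setup (the data, constants, and the proximal-update choice are all just illustrative assumptions):

```python
# Toy illustration (not the modular-arithmetic setup): gradient descent on a
# linear model with an L1 penalty. The penalty drives most weights to exactly
# zero, i.e. the fitted model can be described with fewer bits.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = rng.normal(size=5)           # only 5 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=n)

def fit(l1_strength, steps=5000, lr=0.01):
    w = rng.normal(size=d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n      # squared-error gradient
        w -= lr * grad
        # proximal step for the L1 term: soft-threshold weights toward zero
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1_strength, 0.0)
    return w

for lam in [0.0, 0.1]:
    w = fit(lam)
    print(f"lambda={lam}: {np.count_nonzero(np.abs(w) > 1e-8)} nonzero weights of {d}")
```

With no penalty all 50 weights stay nonzero; with the L1 term the spurious ones collapse to exactly zero, which is the "fewest bits" picture in miniature.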
I don't think that this is true. You need an infinite number of dimensions for this (think Taylor expansions, Fourier expansions, infinitely wide or deep NNs...)
I'll use 1NN as the interpolation strategy instead since I think it illustrates the same point and saves a few characters.
Recap: 1NN says that given a query Q, you choose from your learned "model" M (a finite set of (X, Y) pairs) the pair (X, Y) minimizing |Q-X|. Your output is Y.
The following kind of argument works for linear interpolation too (you can even view 1NN as 1-point interpolation), but it's ever so slightly messier: definitions vary a fair bit, you potentially need to talk about the existence of more than one discrete "nearest" or "enclosing" set of neighbors, and proving that you can get away with fewer points than 1NN, or achieve lower error than 1NN, is itself messier.
Pick your favorite continuous function on a compact domain embedded in some Euclidean space. For any target error you'd like to hit, the uniform continuity of that function guarantees that if your samples cover the domain well enough (no point in the domain is farther than some fixed distance from a point in your model, with smaller distances needed for lower errors), then the maximum error of the 1NN strategy is bounded by the error given by uniform continuity (which, again, you can make as small as you'd like by increasing the sampling resolution). Compactness of the domain means you can actually achieve those error bounds with finite sample sizes.
For a simple example, imagine fitting more and more, smaller and smaller, line segments to y=x^2 on [-1,1].
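A rough numerical version of this argument (plain numpy, with a dense query grid standing in for "all points in the domain" as an assumption): as the sample spacing shrinks, the worst-case 1NN error shrinks with it.

```python
# Minimal sketch of the 1NN argument above: approximate f(x) = x^2 on [-1, 1]
# with a finite set of (x, f(x)) samples, predicting via nearest neighbour.
# Doubling the sample count keeps roughly halving the worst-case error.
import numpy as np

def max_1nn_error(n_samples, n_queries=10_000):
    xs = np.linspace(-1.0, 1.0, n_samples)   # the finite "model": (X, Y) pairs
    ys = xs ** 2
    q = np.linspace(-1.0, 1.0, n_queries)    # dense grid of query points
    nearest = np.argmin(np.abs(q[:, None] - xs[None, :]), axis=1)
    return np.max(np.abs(ys[nearest] - q ** 2))

for n in [5, 10, 20, 40, 80]:
    print(n, max_1nn_error(n))
```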
> unlike f.e. which side of P/NP divide the problem is on
Actually, the P/NP divide is a similar case in my opinion. In practice a quadratic algorithm is sometimes unacceptably slow, while an NP-complete problem can often be solved for all practical purposes. E.g. SAT problems are routinely solved at scale.
An NP-complete problem can contain instances that are nowhere near the worst case.
It's similar to the gap between pushdown automata and Turing machines. You can check whether a pushdown automaton will terminate; you can't do that for Turing machines in general, but that doesn't stop you from running a pushdown-automaton algorithm on a Turing machine and still having decidable termination for that particular program.
It's very much necessary but not sufficient. In real life the sample complexity matters a lot too; it's also an asymptotic statement, but a more informative one. Compare how the central limit theorem is far more powerful than the law of large numbers.
I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.
It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there are a lot of things one can do to help ensure they work.
A funny thing is, in very high-dimensional spaces, with millions or billions of parameters, the chance that you get stuck in a local minimum is extremely small. Think about it like this: to be stuck in a local minimum in 2D, you only need 2 gradient components to be zero; in higher dimensions you'd need every single one of them, millions upon millions, to be zero simultaneously. It only takes one non-zero gradient component for SGD to get you out. And SGD is a stochastic walk on that manifold, not entirely random but rather noisy, so the chance that you walk into a local minimum and stay there is very low, unless it's a "really good" local minimum, in the sense that it dominates all the other local minima in its neighborhood.
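A toy numerical illustration of that dimensionality intuition, under the strong simplifying assumption that curvature at a critical point looks like a random symmetric matrix rather than a real loss Hessian:

```python
# Toy illustration of the dimensionality intuition above: to trap you, a
# critical point needs every direction to curve upward. Random symmetric
# "Hessians" are almost never positive definite once the dimension grows.
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(dim, trials=2000):
    count = 0
    for _ in range(trials):
        a = rng.normal(size=(dim, dim))
        h = (a + a.T) / 2                      # random symmetric matrix
        if np.all(np.linalg.eigvalsh(h) > 0):  # all directions curve upward?
            count += 1
    return count / trials

for dim in [1, 2, 4, 8, 16]:
    print(dim, frac_positive_definite(dim))
```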
You are essentially correct, which is why stochastic gradient optimizers induce a low-sharpness bias. However, there is an awful lot more that complicates things. There are plenty of wide minima that it can get stuck in far away from where people typically initialise, so the initialisation scheme proves extremely important (but is mostly done for you).
Perhaps more importantly, just because it is easy to escape any local minimum does not mean that there is necessarily a trend towards a really good optimum, as it can just bounce between a bunch of really bad ones for a long time. This actually happens almost all the time if you try to design your entire architecture from scratch, e.g. highly connected networks. People who are new to the field sometimes don't seem to understand why SGD doesn't just always fix everything; this is why. You need very strong inductive biases in your architecture design to ensure that the loss (which is data-dependent, so you cannot ascertain this property a priori) exhibits a global bowl-like shape (we often call this a 'funnel') to provide a general trajectory for the optimizer toward good solutions. Sometimes this only works for some optimizers and not others.
This is why architecture design is something of an art form, and explaining "why neural networks work so well" is a complex question involving a ton of parts, all of which contribute in meaningful ways. There are often plenty of counterexamples to any simpler explanation.
OK, but it's already known that you shouldn't initialize your network parameters to a single constant and should instead initialize them with random numbers.
Both you and the comment above are correct; initializing with iid elements ensures that correlations are not disastrous for training, but strong correlations are baked into the weights during training, so pretty much anything could potentially happen.
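As a small sketch of why constant initialization fails (a toy two-layer numpy net with made-up sizes and data, not how any particular framework does it): with identical starting weights every hidden unit receives the identical gradient, so the units never differentiate, whereas random initialization breaks that symmetry.

```python
# Minimal sketch: constant initialization keeps all hidden units identical
# through training, random initialization breaks the symmetry.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = rng.normal(size=(64, 1))

def train(W1, W2, lr=0.01, steps=200):
    for _ in range(steps):
        h = np.tanh(X @ W1)                 # hidden layer
        out = h @ W2
        d_out = 2 * (out - y) / len(X)      # squared-error gradient
        d_W2 = h.T @ d_out
        d_h = d_out @ W2.T * (1 - h ** 2)   # backprop through tanh
        d_W1 = X.T @ d_h
        W1 -= lr * d_W1
        W2 -= lr * d_W2
    return W1

const_W1 = train(np.full((3, 4), 0.5), np.full((4, 1), 0.5))
rand_W1 = train(rng.normal(size=(3, 4)) * 0.5, rng.normal(size=(4, 1)) * 0.5)
print("constant init, hidden columns all equal:", np.allclose(const_W1, const_W1[:, :1]))
print("random init,   hidden columns all equal:", np.allclose(rand_W1, rand_W1[:, :1]))
```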
Not a mathematician so I'm immediately out of my depth here (and butchering terminology), but it seems, intuitively, like the presence of a massive number of local minima wouldn't really be relevant for gradient descent. A given local minimum would need to have a "well" at least as large as your step size to reasonably capture your descent.
E.g. you could land perfectly on a local minimum, but you won't stay there unless your step size is minute or the minimum is quite substantial.
I believe what was meant was that, assuming local minima large enough to capture your probe and a sufficiently high density of them, you become extremely likely to get stuck. A counterpoint regarding dimensionality is made by the comment adjacent to yours.
or we could build like 200 more fission plants and rewild all that ethanol land. this factoid isn't really a compelling point in favor of solar land use...
That, however, would be vastly more expensive. Maybe worth it from an overall ecological PoV, but I doubt power companies have an appetite for the CAPEX involved.
wiki article states "Up to 10,000 TWh/yr of power could be generated from OTEC without affecting the ocean's thermal structure", which converts to about 500 GW, which... isn't that much
10,000 TWh/yr = 1e7 GWh/yr; divide by 365.25 days/yr for a daily output of 27,379 GWh/day, then by 24 h/day to get an average power of 1,141 GW. That's still more than a terawatt, roughly three orders of magnitude more than the largest nuclear reactors.
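Quick sanity check of that arithmetic (using 365.25 * 24 ≈ 8766 hours per average year):

```python
# TWh/yr -> GWh/yr -> average GW
twh_per_year = 10_000
gw_average = twh_per_year * 1e3 / (365.25 * 24)
print(round(gw_average))  # ~1141 GW
```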
I would not expect someone without a good knowledge of both javascript and the Ising model of ferromagnetism to make that in one hour. Especially now that google search is more and more crap, just looking for info would take longer.
Is this not just a cellular automaton? That's well within the usual range of college sophomore lab assignments.
To be honest the student may not necessarily care what the Ising model is, but they don't have to and neither does an LLM. It takes a very modest amount of code to apply some rules and update a grid of pixels (rough sketch below). At least when I was in school it was totally normal to expect students to make something like this in an hour.
It's actually kind of ironic that in this case such a simple project now means the opposite of what it did back then. Students got these assignments as a form of encouragement to show that their skills were immediately useful and that more "serious" science need not be so scary.
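For what it's worth, the grid-update rule really is only a handful of lines. Here's a rough sketch of a single-spin Metropolis update in Python (not JavaScript, and without the pixel rendering), with the lattice size and temperature picked arbitrarily:

```python
# Bare-bones 2D Ising model with single-spin Metropolis updates; printing the
# magnetization stands in for drawing pixels.
import numpy as np

rng = np.random.default_rng(0)
N, beta = 64, 0.6                        # lattice size, inverse temperature
spins = rng.choice([-1, 1], size=(N, N))

def sweep(spins):
    for _ in range(N * N):               # one Monte Carlo sweep
        i, j = rng.integers(0, N, size=2)
        # sum of the four nearest neighbours (periodic boundaries)
        nb = (spins[(i + 1) % N, j] + spins[(i - 1) % N, j] +
              spins[i, (j + 1) % N] + spins[i, (j - 1) % N])
        dE = 2 * spins[i, j] * nb        # energy change if we flip this spin
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1

for step in range(20):
    sweep(spins)
    print(step, spins.mean())            # magnetization drifts toward +/-1 below T_c
```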
the better question is why gradient descent works for them