Revisiting the Volume Hypothesis
Abstract
Modern deep neural networks often contain far more parameters than needed to fit their training data, yet they achieve impressive generalization. A common explanation for this success is the implicit bias of stochastic gradient descent (SGD). An alternative volume hypothesis posits that, within low training-loss regions, loss-landscape basins leading to strong generalization occupy much larger regions of weight space than basins that generalize poorly, and therefore SGD is simply more likely to land in the former. Recent experimental explorations of this idea present seemingly contradictory results: in one set of experiments, randomly sampling the network weights until achieving zero training error yielded poor generalization, whereas molecular dynamics density estimates supported the volume hypothesis. We observe that these experiments were performed in different dataset-size regimes, and explore an intermediate regime using the Replica Exchange Wang–Landau algorithm to estimate the joint density of states over training and test accuracies in binary networks. Across several architectures and datasets, we show that the generalization advantage of gradient-based learning over training by random sampling generally diminishes as the training data size grows, suggesting a resolution of the paradox.
1 Introduction
From the perspective of classical learning theory (Shalev-Shwartz & Ben-David, 2014), it is surprising that modern neural networks generalize so well to new data. These models usually contain many more parameters than are necessary to fit the training set, and making them even larger tends to improve rather than harm generalization (Hestness et al., 2017). In an over-parameterized network, countless parameter configurations can drive the training loss to zero—some generalize, others do not. Yet, for reasons that remain unclear, stochastic gradient descent (SGD) routinely lands on parameter configurations that do generalize (Zhang et al., 2017).
The prevailing view attributes this success to an implicit bias that SGD introduces when it trains an over-parameterized model (Soudry et al., 2018; Gunasekar et al., 2018; Arora et al., 2019; Vardi, 2023). However, a recent work by Chiang et al. (2022) proposed a different explanation, the volume hypothesis, according to which within the low training error regions of the weight space, strong generalization regions simply occupy a much larger volume than those that generalize poorly. This idea echoes related proposals in (Pérez et al., 2019; Berchenko, 2024) and has been explored in several theoretical works in simplified cases (Hanin & Zlokapa, 2023; Buzaglo et al., 2024; Harel et al., 2024; Alexander et al., 2025). If correct, one could argue that architectural bias is the main driver of generalization, while SGD’s implicit bias is secondary.
This claim has recently been tested through two different approaches, with opposite outcomes. On the one hand, Peleg & Hein (2024) used a Guess and Check (G&C) procedure: randomly sampling weight values until achieving zero training error. The resulting networks generalize worse than those trained with SGD. (Footnote 1: Peleg & Hein (2024) argue that earlier claims to the contrary by Chiang et al. (2022) can be explained away by proper initialization and loss normalization.) On the other hand, Yang et al. (2026) use molecular dynamics techniques to estimate the density of the network parameters as a function of (i) the training loss and (ii) the generalization loss (for regression) or accuracy (for classification). For low training loss, in many classification and regression models the density has a peak at generalization values similar to or better than those reached by highly-optimized SGD. This implies the validity of the volume hypothesis, dubbed high-entropy advantage by Yang et al. (2026), though in one of their examples the phenomenon disappears for wide enough networks.
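The Guess and Check idea can be illustrated with a minimal sketch. It assumes a toy binary-weight linear classifier on ±1 inputs, which is far simpler than the networks studied by Peleg & Hein (2024); the names `predict` and `guess_and_check`, the toy data and the guess budget are hypothetical choices made only for illustration.

```python
import random

def predict(weights, x):
    """Sign of the dot product for a binary-weight linear classifier."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= 0 else -1

def guess_and_check(train_x, train_y, n_weights, max_guesses=100_000, seed=0):
    """Sample +/-1 weights uniformly until the training error is zero.

    Returns (weights, number_of_guesses), or (None, max_guesses) if the
    budget is exhausted. The budget is an assumption of this sketch.
    """
    rng = random.Random(seed)
    for guess in range(1, max_guesses + 1):
        weights = [rng.choice((-1, 1)) for _ in range(n_weights)]
        if all(predict(weights, x) == y for x, y in zip(train_x, train_y)):
            return weights, guess
    return None, max_guesses

# Toy training set labeled by the (hypothetical) teacher weights (1, 1, -1).
train_x = [(1, 1, -1), (-1, 1, 1), (1, -1, -1), (-1, -1, 1)]
train_y = [1, -1, 1, -1]
weights, n_guesses = guess_and_check(train_x, train_y, n_weights=3)
```

The number of guesses grows quickly with the number of training constraints, which is why the procedure is limited to small training sets.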
To reconcile these diverging results, we note that the experiments reported in Peleg & Hein (2024) were restricted to binary classification tasks in the low sample regime (up to 32 training samples). Extending G&C to multiclass classification or larger training datasets is infeasible due to the high computational cost, which increases quickly with training sample size and the number of classification categories. On the other hand, the experiments in Yang et al. (2026) were performed in the high sample regime. This suggests that the validity of the volume hypothesis might depend strongly on the data size. In this work we explore this idea using the Wang-Landau algorithm (Wang & Landau, 2001), a technique developed in the statistical physics community to compute the probability density of the energy or other macroscopic quantities in systems in which this density can vary across many orders of magnitude. In our case, we use it to estimate the joint density of states over training accuracy and test accuracy in selected ranges of interest, for multiclass classification networks and training datasets of up to 600 samples. Across three architectures and two datasets, we find that the generalization advantage of SGD over random sampling generally diminishes as the training data size grows. This result bridges the different results previously reported. Our results demonstrate a data-dependent transition in which optimization-induced bias dominates in small-data regimes, while architectural volume effects emerge and concentrate with increasing data. This clarifies the respective roles of optimization, architecture, and data in overparameterized generalization.
2 Related works
Wang-Landau in machine learning.
An estimation of the density of states in binary neural networks using the Wang-Landau method was recently performed in (Mele et al., 2025). However, this work only considered the density of the training error, and thus does not yield insights on generalization performance. Another recent use of the Wang-Landau method in machine learning is the work (Liu et al., 2023), which computed the density of output values of a network over the input space, for fixed, previously-trained weights. The volume hypothesis was recently explored by Yang et al. (2026) using a variant of the Wang-Landau method that combines molecular dynamics with non-parametric density estimation. The same technique was recently applied in (Zhang et al., 2025) to the grokking phenomenon.
Overparameterization and generalization.
The question of why deep neural networks generalize well despite severe overparameterization has been investigated for several decades from a variety of theoretical and empirical perspectives (Wolpert, 1995; Bartlett & Mendelson, 2001; Hoffer et al., 2017; Jakubovitz et al., 2019). Classical learning theory relates generalization to capacity control via complexity measures such as the VC dimension (Vapnik & Chervonenkis, 1971), suggesting that highly expressive models should overfit (Shalev-Shwartz & Ben-David, 2014; Hastie et al., 2001). From this viewpoint, networks with enough parameters to interpolate arbitrary labels would be expected to generalize poorly. Empirically, however, modern neural networks defy this prediction. Even models capable of perfectly fitting the training data often exhibit strong test performance (Haeffele & Vidal, 2017; Nguyen & Hein, 2018), and increasing model size can further improve generalization (Belkin et al., 2019; Neal et al., 2019; Bartlett et al., 2020). A striking demonstration by Zhang et al. (2017) showed that convolutional networks can memorize random labels while still generalizing well on structured data, highlighting a disconnect between expressivity and generalization. Additional work indicates that neural networks tend to learn simple or low-frequency patterns before memorizing noise (Arpit et al., 2017) and that both bias and variance may decrease as model size grows (Neal et al., 2019).
Implicit bias induced by optimization.
A prominent line of research attributes successful generalization in deep learning to the implicit bias of gradient-based optimization methods (Neyshabur et al., 2015). For linearly separable problems, Soudry et al. (2018) showed that gradient descent converges to maximum-margin solutions, and Arora et al. (2019) argued that the regularization induced by gradient-based training cannot be captured by explicit penalties alone. Subsequent studies have explored how optimization dynamics shape learned representations, including the influence of batch size (Galanti & Poggio, 2022), gradient noise (Liu et al., 2020), effective dimensionality reduction during training (Advani et al., 2020), and simplified dynamics in wide networks (Lee et al., 2020). More recently, Andriushchenko et al. (2023) demonstrated that large learning rates in SGD promote low-rank feature learning.
Loss landscape and flatness viewpoints.
Another family of approaches links generalization to geometric properties of the loss landscape, particularly notions of flatness or sharpness (Dziugaite & Roy, 2017; Keskar et al., 2017; Jiang et al., 2020; Foret et al., 2021). While such measures are often correlated with generalization, their causal relevance remains debated (Andriushchenko et al., 2023). Related ideas appear in the Bayesian literature, where wide optima are associated with higher posterior mass and improved predictive performance (Izmailov et al., 2018; Wilson & Izmailov, 2020).
Architectural bias and the volume hypothesis.
Beyond optimization, several works emphasize biases intrinsic to the network architecture itself. Huang et al. (2020) hypothesized that poorly generalizing minima occupy comparatively small regions in parameter space. Building on this intuition, Mingard et al. (2021) argued that, under strong assumptions such as infinite width, SGD behaves similarly to Bayesian sampling, suggesting that architectural bias dominates optimization effects. However, these approximations do not directly apply to finite, practical networks. Other work has questioned the necessity of SGD stochasticity altogether, showing that deterministic gradient descent with explicit regularization can achieve comparable performance (Geiping et al., 2022).
Most directly related to our study, Chiang et al. (2022) proposed the volume hypothesis, arguing that generalization is primarily governed by the relative volume of well-generalizing solutions induced by the architecture, with the implicit bias of SGD playing only a secondary role. Our definition of the volume hypothesis states that, among weights with zero training error, the regions with low generalization error occupy a large volume. This differs from the definitions espoused recently in (Scherlis & Belrose, 2025; Fan et al., 2025), which focus on the volume of low training error basins.
Simplicity bias.
A recurring theme in recent work is that modern overparameterized architectures do not behave as if they were selecting a hypothesis uniformly at random from an enormous function class; instead, common parameterizations and initialization/training pipelines induce a highly non-uniform distribution over functions that is sharply skewed toward simple (structured, compressible, low-complexity) predictors. Pérez et al. (2019) make this point concrete by analyzing the parameter-function map and the induced function-space prior: this map is many-to-one with the probability of realizing a given function varying by orders of magnitude, and empirical as well as theoretical evidence supports an exponential relation between function probability and descriptional complexity, yielding an explicit simplicity bias mechanism. Teney et al. (2024) reinforce this picture from an architectural standpoint, showing that “random networks are not random functions”: already at initialization, an overwhelming fraction of parameter space corresponds to functions of characteristic (often low) complexity. Mingard et al. (2025) connect these observations to an Occam/algorithmic information theory perspective, by studying the distribution of functions induced by random networks and showing that probabilities can decay approximately exponentially with suitable complexity proxies, while also demonstrating that modifying the regime (e.g., toward “chaotic” behavior) weakens the bias and harms generalization.
In particular, both Berchenko (2024) and Buzaglo et al. (2024) make the Guess-and-Check mechanism explicit and connect it to a classic PAC-style interpretation (Berchenko (2024) studies a “naive algorithm”, which is equivalent to Guess-and-Check). The grand picture from both is the following. Let $p$ denote the success probability of a single guess to come up with the target function. Then the stopping time is bounded by a geometric random variable with mean $1/p$. Under simplicity bias induced by parameter redundancy, “simple” target functions have comparatively large prior mass, while complex hypotheses have exponentially smaller mass; hence $p$ can be non-negligible when the target is simple, implying only a few trials until interpolation. Consequently, this converts the analysis into a standard finite-class template: the expected number of distinct hypotheses tried until Guess-and-Check stops is on the order of $1/p$, so one may view the procedure as implicitly selecting from an effective hypothesis class of size $1/p$, yielding familiar PAC sample-complexity behavior scaling as $\log(1/p)$. In this sense, the simplicity-biased induced prior supplies the analogue of a finite hypothesis class, and generalization bounds follow the same logic as PAC-learning with a finite class, only with complexity controlled by the probability mass assigned by the construction-induced distribution rather than by the raw number of parameters.
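The finite-class template can be made concrete with a small numerical sketch. It assumes the textbook realizable-case PAC bound for a finite hypothesis class, with the class size replaced by the inverse of the single-guess success probability; the symbol `p`, the function names and the numbers are illustrative assumptions, not quantities taken from Berchenko (2024) or Buzaglo et al. (2024).

```python
import math

def expected_guesses(p):
    """Mean of a geometric random variable with success probability p."""
    return 1.0 / p

def pac_sample_bound(p, eps, delta):
    """Realizable finite-class PAC bound with |H| replaced by 1/p.

    m >= (ln|H| + ln(1/delta)) / eps, rounded up. Treating 1/p as the
    effective class size is the assumption illustrated here.
    """
    return math.ceil((math.log(1.0 / p) + math.log(1.0 / delta)) / eps)

# A target with prior mass 1e-3 needs about a thousand guesses on average,
# yet the sample bound grows only logarithmically in 1/p.
n_trials = expected_guesses(1e-3)
m_simple = pac_sample_bound(1e-3, eps=0.1, delta=0.05)
m_complex = pac_sample_bound(1e-9, eps=0.1, delta=0.05)
```

Lowering the prior mass by six orders of magnitude raises the expected number of guesses by the same factor, while the sample bound increases only additively.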
In this function-space framing, the so-called volume hypothesis (informally: “generalizing solutions occupy large volume”) is best viewed as a consequence of simplicity bias rather than a competing primitive. Moreover, as a foundational explanation volume is either tautological or ill-defined, with “generalizing well” not an intrinsic attribute of a training set alone (it depends on the out-of-sample distribution/evaluated target), whereas simplicity bias is a well-defined property of the learning setup through its induced probability space over hypotheses.
3 Background: the Wang-Landau algorithm
The Wang-Landau algorithm, introduced in (Wang & Landau, 2001), revolutionized Monte Carlo simulations by enabling efficient sampling of the entire energy spectrum of statistical mechanics systems, including rare high-energy states that conventional methods struggle to reach. In statistical mechanics, the density of states $g(E)$ counts how many microstates exist at each energy level $E$. This quantity is fundamental because all thermodynamic quantities can be derived from it (e.g. partition function, internal energy, heat capacity). However, calculating $g(E)$ directly is usually computationally intractable, because it requires enumerating all possible microstates. In our setting the macroscopic quantities of interest are the train and test accuracies, which we denote as $A_{\mathrm{tr}}$ and $A_{\mathrm{te}}$ respectively, for a fixed dataset. We seek to estimate $g(A_{\mathrm{tr}}, A_{\mathrm{te}})$, the volume in the space of network parameters (weights and biases) with given train and test accuracies.
Let us consider a neural network with $N$ binary parameters. The values of $(A_{\mathrm{tr}}, A_{\mathrm{te}})$ provide a partition of the $2^N$ possible networks, i.e., $\sum_{A_{\mathrm{tr}}, A_{\mathrm{te}}} g(A_{\mathrm{tr}}, A_{\mathrm{te}}) = 2^N$. Computing $g$ via direct counting, i.e. evaluating $A_{\mathrm{tr}}$ and $A_{\mathrm{te}}$ for all $2^N$ parameter combinations, is infeasible. Instead, the Wang-Landau algorithm typically yields an accurate estimate of $g$. Note that to verify the volume hypothesis we are interested in comparing the value of $A_{\mathrm{te}}$ which maximizes $g(A_{\mathrm{tr}} = 1, A_{\mathrm{te}})$ with the test accuracy obtained from SGD. But in general it might be of interest to estimate $g$ also for a range of $A_{\mathrm{tr}}$ values.
Assume a prior uniform distribution over the binary parameters of the network $\theta$. If we knew the value of $g(A_{\mathrm{tr}}, A_{\mathrm{te}})$, reweighting this distribution with a factor $1/g(A_{\mathrm{tr}}(\theta), A_{\mathrm{te}}(\theta))$ would yield a uniform distribution on the $(A_{\mathrm{tr}}, A_{\mathrm{te}})$ space. Of course $g$ is unknown. The key insight of the Wang-Landau algorithm is to perform a random walk in the $\theta$ space reweighted by $1/g$, while continuously updating the current estimate of $g$ until we reach a flat histogram in the $(A_{\mathrm{tr}}, A_{\mathrm{te}})$ space. The update to $g$ consists in simply multiplying $g(A_{\mathrm{tr}}, A_{\mathrm{te}})$ by a factor $f$, $f > 1$, to discourage future visits to the current state $(A_{\mathrm{tr}}, A_{\mathrm{te}})$.
At each step, the algorithm maintains (i) $S(A_{\mathrm{tr}}, A_{\mathrm{te}}) = \log g(A_{\mathrm{tr}}, A_{\mathrm{te}})$: the current estimate of the log-density of states, (ii) $f$: the modification factor (typically initialized at $f = e$), and (iii) $H(A_{\mathrm{tr}}, A_{\mathrm{te}})$: a histogram that counts visits to each pair of train/test accuracies, from which a criterion to occasionally diminish $f$ is obtained. After initializing $S = 0$ (up to an overall normalizing factor), $H = 0$, $f = e$, each step of the random walk consists of
1. Propose a move in the space of parameters (e.g., flip a random subset of binary parameters), $\theta \to \theta'$.
2. Calculate the accuracies $(A'_{\mathrm{tr}}, A'_{\mathrm{te}})$ associated with the proposed state.
3. Accept with probability
$$P(\theta \to \theta') = \min\left\{1, \exp\left[S(A_{\mathrm{tr}}, A_{\mathrm{te}}) - S(A'_{\mathrm{tr}}, A'_{\mathrm{te}})\right]\right\} \qquad (3.1)$$
If accepted, set $\theta \leftarrow \theta'$. Note that (3.1) coincides with the Metropolis acceptance rate for a target distribution proportional to $1/g(A_{\mathrm{tr}}(\theta), A_{\mathrm{te}}(\theta))$.
4. Update histogram: $H(A_{\mathrm{tr}}, A_{\mathrm{te}}) \leftarrow H(A_{\mathrm{tr}}, A_{\mathrm{te}}) + 1$. Update the log-density: $S(A_{\mathrm{tr}}, A_{\mathrm{te}}) \leftarrow S(A_{\mathrm{tr}}, A_{\mathrm{te}}) + \log f$.
Along the random walk, we periodically check whether the histogram is close to being “flat”, for example by checking the condition
$$\min_{A_{\mathrm{tr}}, A_{\mathrm{te}}} H(A_{\mathrm{tr}}, A_{\mathrm{te}}) \geq c \, \overline{H} \qquad (3.2)$$
where $\overline{H}$ is the mean of the histogram over all bins, with typical $c = 0.8$. When the flatness condition (3.2) is satisfied, the histogram is reset to $H = 0$, the modification factor is reduced as $f \leftarrow \sqrt{f}$ and the random walk continues. Convergence is reached when $\log f < \epsilon$ for some small $\epsilon$. Several parameters can be tuned for optimal speed, such as the size of the bins for $(A_{\mathrm{tr}}, A_{\mathrm{te}})$, the number of spins flipped in the proposal moves and other schedules for the factor $f$.
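The random walk described above can be sketched on a toy system where the answer is known. In this sketch the state is a string of bits and the macroscopic quantity is the number of ones, so the exact density of states is a binomial coefficient. The toy system, the function name `wang_landau_bitcount` and all default values (flatness threshold, stopping threshold, check interval, step budget) are assumptions for illustration and are not the settings used in the experiments.

```python
import math
import random

def wang_landau_bitcount(n_bits=8, flatness=0.8, log_f_final=1e-3,
                         check_every=1000, max_steps=2_000_000, seed=0):
    """Estimate log g(E), E = number of ones among n_bits bits."""
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n_bits)]
    energy = sum(state)
    log_g = [0.0] * (n_bits + 1)   # running estimate of the log-density
    hist = [0] * (n_bits + 1)      # visit histogram for the flatness check
    log_f = 1.0                    # log of the modification factor, f = e
    for step in range(1, max_steps + 1):
        i = rng.randrange(n_bits)                 # propose a single bit flip
        new_energy = energy + (1 - 2 * state[i])
        delta = log_g[energy] - log_g[new_energy]
        # Metropolis acceptance with probability min(1, g(E) / g(E')).
        if delta >= 0 or rng.random() < math.exp(delta):
            state[i] = 1 - state[i]
            energy = new_energy
        hist[energy] += 1          # update at the current state, even after
        log_g[energy] += log_f     # a rejected proposal
        if step % check_every == 0:
            if min(hist) >= flatness * sum(hist) / len(hist):
                hist = [0] * (n_bits + 1)
                log_f /= 2.0       # f <- sqrt(f)
                if log_f < log_f_final:
                    break
    # Fix the free additive constant so that the total count is 2**n_bits.
    top = max(log_g)
    log_total = top + math.log(sum(math.exp(v - top) for v in log_g))
    shift = n_bits * math.log(2.0) - log_total
    return [v + shift for v in log_g]

estimate = wang_landau_bitcount()
exact = [math.log(math.comb(8, k)) for k in range(9)]
```

Comparing `estimate` with `exact` gives a quick sanity check of an implementation before it is applied to a system whose density of states is unknown.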
3.1 Parallelization via Replica Exchanges
While the Wang-Landau algorithm is ergodic and provably converges (Jacob & Ryder, 2014), for models with large ranges of values it may require extremely long runs. An effective solution, inspired by the parallel tempering method (Geyer et al., 1991), is to split the accuracy space into several small overlapping regions, each running an independent Wang-Landau random walk (Vogel et al., 2013, 2014). Periodically, a proposal is made for overlapping regions to exchange configurations, and it is accepted with a standard Metropolis-Hastings probability for an interchange proposal. This approach is called Replica Exchange Wang-Landau (REWL), and we adopted it in our experiments.
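A minimal sketch of the exchange step follows, assuming the acceptance rule commonly used in replica-exchange Wang-Landau: the ratio of each walker's own density estimates evaluated at the two swapped configurations, with automatic rejection when a configuration falls outside the other walker's window. The function name and the dictionary representation of each window are hypothetical.

```python
import math

def exchange_accept_prob(log_g_i, log_g_j, bin_x, bin_y):
    """Probability of swapping configurations between walkers i and j.

    log_g_i, log_g_j: dicts mapping bin index -> current log-density
    estimate, with keys restricted to each walker's own window.
    bin_x: bin of walker i's configuration; bin_y: bin of walker j's.
    """
    # Each configuration must lie inside the window of the receiving walker.
    if bin_x not in log_g_j or bin_y not in log_g_i:
        return 0.0
    log_ratio = (log_g_i[bin_x] - log_g_i[bin_y]) + (log_g_j[bin_y] - log_g_j[bin_x])
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

# Two overlapping windows sharing bins 1 and 2 (illustrative values).
walker_i = {0: 0.0, 1: 1.0, 2: 2.0}
walker_j = {1: 0.0, 2: 3.0, 3: 4.0}
p_swap = exchange_accept_prob(walker_i, walker_j, bin_x=2, bin_y=1)
```

Working with log-densities avoids overflow, since the density of states can span many orders of magnitude.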
| Base Model | |||
|---|---|---|---|
| Layer | Input Output | Kernel/Stride /Padding | Params |
| Conv2D | 54 | ||
| ReLU | |||
| MaxPool | |||
| FC1 no bias | 75,264 | ||
| ReLU | |||
| FC2 no bias | 640 | ||
| Total | 75,958 |
| Deeper Model | |||
|---|---|---|---|
| Layer | Input Output | Kernel/Stride /Padding | Params |
| Conv2D | 54 | ||
| ReLU | |||
| MaxPool | |||
| Conv2D | 324 | ||
| ReLU | |||
| FC1 no bias | 75,264 | ||
| ReLU | |||
| FC2 no bias | 640 | ||
| Total | 76,282 |
| Wider Model | |||
|---|---|---|---|
| Layer | Input Output | Kernel/Stride /Padding | Params |
| Conv2D | 54 | ||
| ReLU | |||
| MaxPool | |||
| FC1 no bias | 150,528 | ||
| ReLU | |||
| FC2 no bias | 1280 | ||
| Total | 151,862 |
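The parameter counts in the three tables can be cross-checked with a short calculation. The layer shapes below are not stated in the tables as extracted; they are assumptions inferred from the counts: 28×28 single-channel inputs, 3×3 bias-free convolutions with 6 channels that preserve the spatial size, one 2×2 max-pool, a bias-free hidden layer of 64 (or 128) units and 10 output classes.

```python
def count_params(conv_channels, hidden, image_side=28, n_classes=10, kernel=3):
    """Parameter count for conv layers + one 2x2 max-pool + two FC layers.

    All shapes are assumptions inferred from the reported counts; every
    layer is taken to be bias-free.
    """
    total = 0
    in_ch = 1
    for out_ch in conv_channels:
        total += in_ch * out_ch * kernel * kernel   # bias-free convolution
        in_ch = out_ch
    pooled_side = image_side // 2                   # single 2x2 max-pool
    flat = in_ch * pooled_side * pooled_side
    total += flat * hidden                          # FC1, no bias
    total += hidden * n_classes                     # FC2, no bias
    return total

base = count_params(conv_channels=(6,), hidden=64)
deeper = count_params(conv_channels=(6, 6), hidden=64)
wider = count_params(conv_channels=(6,), hidden=128)
```

Under these assumed shapes the totals match the tables: 75,958 for the base model, 76,282 for the deeper model and 151,862 for the wider model.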
4 Experiments
We performed experiments using three different architectures detailed in subsection 3.1, on the MNIST and Fashion-MNIST datasets. For each dataset and architecture we considered three training set sizes: 30, 300 and 600 samples. In all cases, we considered balanced training sets with equal sizes for all ten categories.
We implemented the REWL algorithm with four and six random walkers running on workstations with four and six NVIDIA RTX 600 ADA GPUs, in each case assigning one random walker per GPU. The use of the latter is crucial for speed, since for each proposed state the accuracies over the full train and test sets must be evaluated. For the transition proposals , we tuned during initial runs the number of binary weights to be flipped to obtain acceptance rates (3.1) of around . This typically resulted in . We assumed convergence when the modification factor of all the random walkers reached .
Restricting the estimation range.
For computational tractability, we restricted the ranges of and over which we estimated . For the training accuracy we assumed maximum granularity, and computed over six values , where corresponds to all training points correctly predicted, to all except one, etc. For the test accuracy we aggregated the values into bins containing ten values of the accuracy count. Thus for data sizes the full test sets had samples and takes about values in its full range. The actual range of was restricted in order to capture a width of 10% accuracy while including the location of maximum density, as found by trial-and-error initial estimates. Proposals that lead to pairs outside the designated range are rejected (but and are still updated, as with any rejected proposal).
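The range restriction can be sketched as a mapping from raw counts to bins plus a window test. The helper names, the window bounds in the usage lines and the example counts are hypothetical; only the six training-accuracy values and the bins of ten test counts follow the description above.

```python
def to_bins(train_correct, test_correct, n_train, test_bin_width=10):
    """Map raw counts to (number of training errors, test-accuracy bin)."""
    train_errors = n_train - train_correct
    test_bin = test_correct // test_bin_width   # ten accuracy counts per bin
    return train_errors, test_bin

def in_window(train_errors, test_bin, test_bin_range, max_train_errors=5):
    """True if the pair lies inside the restricted estimation range.

    Six training values are kept: zero to five misclassified points.
    test_bin_range is an inclusive (low, high) pair of bin indices,
    chosen here arbitrarily for illustration.
    """
    low, high = test_bin_range
    return 0 <= train_errors <= max_train_errors and low <= test_bin <= high

pair = to_bins(train_correct=299, test_correct=6654, n_train=300)
inside = in_window(*pair, test_bin_range=(600, 700))
```

A proposal whose pair falls outside the window is rejected, and the histogram and log-density are then updated at the current state, as for any other rejected move.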
5 Results
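The estimated density curves for interpolating solutions, i.e. the density of states at zero training error as a function of the test accuracy, are presented in Figure 3. Note the rapid decay of the probabilities as the test accuracy moves away from the maximum. Thus training by random sampling would land with high probability on the test accuracy at which the density peaks.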
Diminishing advantage of gradient descent.
The generalization accuracies for all the cases are presented in Table 2, where they are compared with results from (stochastic) gradient descent learning. This table shows one of our central observations: in most cases, the advantage of gradient-based training diminishes as the data size grows, thus reconciling the observations of (Peleg & Hein, 2024) and (Yang et al., 2026) into a unifying data-size dependent framework. Figure 1 presents the generalization accuracy gap between random sampling and gradient learning in all the cases.
Sharpening of the density curves.
Another important result is illustrated in Figure 4: across all architectures and datasets, the curvature of the generalization accuracy density sharpens with increasing dataset size. This is consistent with the idea that the probability volume of well generalizing solutions shrinks as the data size grows, and with recent results on the data-size dependence of the volume of low training-loss basins (Fan et al., 2025).
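The text does not say how the curvature is measured. One plausible proxy, offered only as an assumption, is the second finite difference of the log-density at its peak, which becomes more negative as the curve sharpens. The function name and the synthetic curves below are illustrative.

```python
def peak_curvature(log_g):
    """Second finite difference of a binned log-density at its interior peak."""
    interior = range(1, len(log_g) - 1)
    k = max(interior, key=lambda i: log_g[i])
    return log_g[k - 1] - 2.0 * log_g[k] + log_g[k + 1]

# Two synthetic parabolic log-densities peaked at bin 5.
narrow = [-2.0 * (x - 5) ** 2 for x in range(11)]
wide = [-0.5 * (x - 5) ** 2 for x in range(11)]
curv_narrow = peak_curvature(narrow)
curv_wide = peak_curvature(wide)
```

With real estimates the value depends on the bin width, so curves should be compared only at equal binning.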
The role of width and depth.
We observe in Figure 3 a consistent increase of generalization accuracy for wider networks. This differs from the results in (Peleg & Hein, 2024), where network width improves SGD results but not those of Guess-and-Check. We also observe mixed effects of deeper networks, whereas in (Peleg & Hein, 2024) the effect of depth is reported to be negative overall. These discrepancies seem to be a function of the particular networks or datasets chosen and do not seem to be central to the analysis of the volume hypothesis.
Variability of estimates.
Figure 2 illustrates the typical variability of the REWL density estimates across different runs of the algorithm.
| Dataset: MNIST | ||||||
|---|---|---|---|---|---|---|
| Train Size | Model | Random | GD | Gap (GD − Random) | SGD | Gap (SGD − Random) |
| 30 | Base | 28.5 | 58.7 ± 3.5 | 30.2 | 55.9 ± 5.3 | 27.4 |
| 30 | Deeper | 29.5 | 55.2 ± 4.2 | 25.7 | 46.2 ± 10.1 | 16.7 |
| 30 | Wider | 30.9 | 59.5 ± 3.8 | 28.6 | 58.0 ± 7.8 | 27.1 |
| 300 | Base | 66.6 | 83.2 ± 0.8 | 16.6 | 78.3 ± 8.0 | 11.7 |
| 300 | Deeper | 70.5 | 78.8 ± 5.8 | 8.3 | 77.0 ± 5.1 | 6.5 |
| 300 | Wider | 68.7 | 84.1 ± 0.9 | 15.4 | 83.6 ± 2.9 | 14.9 |
| 600 | Base | 75.0 | 88.5 ± 0.5 | 13.5 | 81.7 ± 4.1 | 6.7 |
| 600 | Deeper | 72.9 | 85.5 ± 4.3 | 12.6 | 80.0 ± 5.3 | 7.1 |
| 600 | Wider | 77.1 | 89.0 ± 0.5 | 11.9 | 87.3 ± 3.9 | 10.2 |
| Dataset: Fashion-MNIST | ||||||
|---|---|---|---|---|---|---|
| Train Size | Model | Random | GD | Gap (GD − Random) | SGD | Gap (SGD − Random) |
| 30 | Base | 31.7 | 58.8 ± 3.1 | 27.1 | 56.3 ± 2.7 | 24.6 |
| 30 | Deeper | 32.4 | 53.6 ± 4.2 | 21.2 | 49.9 ± 3.9 | 17.5 |
| 30 | Wider | 33.4 | 61.2 ± 1.2 | 27.8 | 59.0 ± 1.2 | 25.6 |
| 300 | Base | 62.1 | 76.5 ± 1.6 | 14.4 | 69.4 ± 5.6 | 7.3 |
| 300 | Deeper | 61.5 | 72.8 ± 1.1 | 11.3 | 67.6 ± 6.6 | 6.1 |
| 300 | Wider | 62.0 | 77.5 ± 0.9 | 15.5 | 76.9 ± 1.6 | 14.9 |
| 600 | Base | 67.3 | 76.4 ± 4.0 | 9.1 | 75.7 ± 2.5 | 8.4 |
| 600 | Deeper | 65.6 | 71.0 ± 5.2 | 5.4 | 72.4 ± 3.7 | 6.8 |
| 600 | Wider | 68.1 | 79.6 ± 0.4 | 11.5 | 79.8 ± 0.5 | 11.7 |
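The trend in the gap columns can be summarized by averaging over the three architectures for each training set size. The gap values below are copied from the two tables; the averaging itself is an illustrative summary and is not a statistic reported in the paper.

```python
from statistics import mean

# Gap values (gradient method minus random sampling) in the order
# Base, Deeper, Wider, keyed by training set size.
GAPS = {
    ("MNIST", "GD"): {30: [30.2, 25.7, 28.6], 300: [16.6, 8.3, 15.4], 600: [13.5, 12.6, 11.9]},
    ("MNIST", "SGD"): {30: [27.4, 16.7, 27.1], 300: [11.7, 6.5, 14.9], 600: [6.7, 7.1, 10.2]},
    ("Fashion-MNIST", "GD"): {30: [27.1, 21.2, 27.8], 300: [14.4, 11.3, 15.5], 600: [9.1, 5.4, 11.5]},
    ("Fashion-MNIST", "SGD"): {30: [24.6, 17.5, 25.6], 300: [7.3, 6.1, 14.9], 600: [8.4, 6.8, 11.7]},
}

def mean_gap_by_size(gaps):
    """Average the gap over architectures for each training set size."""
    return {size: round(mean(values), 2) for size, values in sorted(gaps.items())}

summary = {key: mean_gap_by_size(gaps) for key, gaps in GAPS.items()}
```

In all four dataset and optimizer combinations the architecture-averaged gap decreases from 30 to 300 to 600 samples, although individual rows are not always monotone (for example, the Deeper model on MNIST with GD).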
6 Discussion
Our results provide a unified explanation for recent contradictory results regarding the volume hypothesis and the role of optimization in generalization. By explicitly probing intermediate training set sizes, we show that the relationship between random sampling and gradient-based training is strongly data-dependent. When training data are scarce, interpolating solutions are abundant but highly heterogeneous in test performance. In this regime, SGD consistently reaches atypical regions of parameter space that generalize substantially better than typical random interpolating solutions. As the dataset grows, however, the density of interpolating states becomes increasingly concentrated around a narrow range of test accuracies. In this regime, the typical interpolating solution approaches the performance achieved by SGD, and the apparent advantage of optimization diminishes.
These findings clarify the interpretation of the volume hypothesis. Rather than a universal explanation independent of optimization, volume effects emerge progressively as data constraints increase. Optimization-induced bias is therefore essential in small-data regimes, while architectural bias increasingly shapes the geometry of solution space as more data are observed. From this perspective, architectural and optimization biases are complementary mechanisms whose relative importance depends on dataset size.
Our results are also consistent with recent work on simplicity bias in overparameterized models. The observed sharpening of density curves with increasing data can be interpreted as a concentration of probability mass onto a restricted subset of functions compatible with the training set. In this sense, volume effects reflect a macroscopic consequence of architectural simplicity bias under growing constraints, rather than an independent primitive.
Finally, we emphasize that our study focuses on binary-weight networks and moderate-scale architectures, which enable explicit density estimation but differ from typical continuous-weight models. Our goal is not to directly model modern large-scale training, but to isolate fundamental geometric effects that are otherwise difficult to observe. Extending density-of-states methods to broader settings remains an important direction for future work.
Acknowledgements
A.P. is supported by the Israel Science Foundation (grant No. 1138/23) and by the Israel Ministry of Innovation, Science and Technology (Israel-France collaboration 2025-2028).
Impact Statement
This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel need to be specifically highlighted here.
References
- Advani et al. (2020) Advani, M. S., Saxe, A. M., and Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.
- Alexander et al. (2025) Alexander, Y., Slutzky, Y., Ran-Milo, Y., and Cohen, N. Do neural networks need gradient descent to generalize? a theoretical study. arXiv preprint arXiv:2506.03931, 2025.
- Andriushchenko et al. (2023) Andriushchenko, M., Varre, A. V., Pillaud-Vivien, L., and Flammarion, N. SGD with large step sizes learns sparse features. In ICML, 2023.
- Arora et al. (2019) Arora, S., Cohen, N., Hu, W., and Luo, Y. Implicit regularization in deep matrix factorization. Advances in neural information processing systems, 32, 2019.
- Arpit et al. (2017) Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., and Lacoste-Julien, S. A closer look at memorization in deep networks. In ICML, 2017.
- Bartlett et al. (2020) Bartlett, P., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences of the United States of America, 117(48):30063–30070, 2020.
- Bartlett & Mendelson (2001) Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. International Conference on Computational Learning Theory, 2111:224–240, 2001.
- Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences of the United States of America, 116(32):15849–15854, 2019.
- Berchenko (2024) Berchenko, Y. Simplicity bias in overparameterized machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 11052–11060, 2024.
- Buzaglo et al. (2024) Buzaglo, G., Harel, I., Nacson, M. S., Brutzkus, A., Srebro, N., and Soudry, D. How uniform random weights induce non-uniform bias: typical interpolating neural networks generalize with narrow teachers. In Proceedings of the 41st International Conference on Machine Learning, pp. 5035–5081, 2024.
- Chiang et al. (2022) Chiang, P.-y., Ni, R., Miller, D. Y., Bansal, A., Geiping, J., Goldblum, M., and Goldstein, T. Loss landscapes are all you need: Neural network generalization can be explained without the implicit bias of gradient descent. In The Eleventh International Conference on Learning Representations, 2022.
- Courbariaux et al. (2015) Courbariaux, M., Bengio, Y., and David, J.-P. Binaryconnect: Training deep neural networks with binary weights during propagations. Advances in neural information processing systems, 28, 2015.
- Dziugaite & Roy (2017) Dziugaite, G. K. and Roy, D. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI, 2017.
- Fan et al. (2025) Fan, R., Sandlund, B., and Ko, L. M. Sharp minima can generalize: A loss landscape perspective on data. arXiv preprint arXiv:2511.04808, 2025.
- Foret et al. (2021) Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021.
- Galanti & Poggio (2022) Galanti, T. and Poggio, T. SGD noise and implicit low-rank bias in deep neural networks. Technical Report, CBMM, 2022.
- Geiping et al. (2022) Geiping, J., Goldblum, M., Pope, P., Moeller, M., and Goldstein, T. Stochastic training is not necessary for generalization. In ICLR, 2022.
- Geyer et al. (1991) Geyer, C. J. et al. Markov chain monte carlo maximum likelihood. In Computing science and statistics: Proceedings of the 23rd Symposium on the Interface. New York, 1991.
- Gunasekar et al. (2018) Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pp. 1832–1841. PMLR, 2018.
- Haeffele & Vidal (2017) Haeffele, B. D. and Vidal, R. Global optimality in neural network training. In CVPR, 2017.
- Hanin & Zlokapa (2023) Hanin, B. and Zlokapa, A. Bayesian interpolation with deep linear networks. Proceedings of the National Academy of Sciences, 120(23):e2301345120, 2023.
- Harel et al. (2024) Harel, I., Hoza, W., Vardi, G., Evron, I., Srebro, N., and Soudry, D. Provable tempered overfitting of minimal nets and typical nets. Advances in Neural Information Processing Systems, 37:53458–53524, 2024.
- Hastie et al. (2001) Hastie, T., Tibshirani, R., and Friedman, J. The Elements of statistical learning : data mining, inference, and prediction. New York, NY : Springer, 2001.
- Hestness et al. (2017) Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
- Hoffer et al. (2017) Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. Advances in Neural Information Processing Systems, pp. 1729–1739, 2017.
- Huang et al. (2020) Huang, W. R., Emam, Z., Goldblum, M., Fowl, L., Terry, J. K., Huang, F., and Goldstein, T. Understanding generalization through visualizations. In ICBINB 2020 Spotlight, 2020.
- Izmailov et al. (2018) Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. In UAI, 2018.
- Jacob & Ryder (2014) Jacob, P. E. and Ryder, R. J. The Wang-Landau algorithm reaches the flat histogram criterion in finite time. Annals of Applied Probability, 24(1):34–53, 2014.
- Jakubovitz et al. (2019) Jakubovitz, D., Giryes, R., and Rodrigues, M. R. D. Generalization error in deep learning. In Compressed Sensing and Its Applications, pp. 153–193. Springer International Publishing, 2019.
- Jiang et al. (2020) Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. Fantastic generalization measures and where to find them. In ICLR, 2020.
- Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
- Lee et al. (2020) Lee, J., Xiao, L., Schoenholz, S. S., et al. Wide neural networks of any depth evolve as linear models under gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2020.
- Liu et al. (2020) Liu, J., Jiang, G., Bai, Y., Chen, T., and Wang, H. Understanding why neural networks generalize well through GSNR of parameters. In ICLR, 2020.
- Liu et al. (2023) Liu, W., You, Y.-Z., Li, Y. W., and Shang, J. Gradient-based Wang-Landau algorithm: A novel sampler for output distribution of neural networks over the input space. In International Conference on Machine Learning, pp. 22338–22351. PMLR, 2023.
- Mele et al. (2025) Mele, M., Menichetti, R., Ingrosso, A., and Potestio, R. Density of states in neural networks: an in-depth exploration of learning in parameter space. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=BLDtWlFKhn.
- Mingard et al. (2021) Mingard, C., Valle-Pérez, G., Skalse, J., and Louis, A. A. Is SGD a Bayesian sampler? Well, almost. Journal of Machine Learning Research, 22, 2021.
- Mingard et al. (2025) Mingard, C., Rees, H., Valle-Pérez, G., and Louis, A. A. Deep neural networks have an inbuilt Occam’s razor. Nature Communications, 16(1):220, 2025.
- Neal et al. (2019) Neal, B., Mittal, S., Baratin, A., Tantia, V., Scicluna, M., Lacoste-Julien, S., and Mitliagkas, I. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint, 2019.
- Neyshabur et al. (2015) Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR, 2015.
- Nguyen & Hein (2018) Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep CNNs. In ICML, 2018.
- Peleg & Hein (2024) Peleg, A. and Hein, M. Bias of stochastic gradient descent or the architecture: disentangling the effects of overparameterization of neural networks. In Proceedings of the 41st International Conference on Machine Learning, pp. 40154–40184, 2024.
- Pérez et al. (2019) Pérez, G. V., Louis, A. A., and Camargo, C. Q. Deep learning generalizes because the parameter-function map is biased towards simple functions. In 7th International Conference on Learning Representations, ICLR 2019, 2019.
- Scherlis & Belrose (2025) Scherlis, A. and Belrose, N. Estimating the probability of sampling a trained neural network at random. arXiv preprint arXiv:2501.18812, 2025.
- Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- Soudry et al. (2018) Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.
- Teney et al. (2024) Teney, D., Nicolicioiu, A. M., Hartmann, V., and Abbasnejad, E. Neural redshift: Random networks are not random functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4786–4796, 2024.
- Vapnik & Chervonenkis (1971) Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280, 1971.
- Vardi (2023) Vardi, G. On the implicit bias in deep-learning algorithms. Communications of the ACM, 66(6):86–93, 2023.
- Vogel et al. (2013) Vogel, T., Li, Y. W., Wüst, T., and Landau, D. P. Generic, hierarchical framework for massively parallel Wang-Landau sampling. Physical Review Letters, 110(21):210603, 2013.
- Vogel et al. (2014) Vogel, T., Li, Y. W., Wüst, T., and Landau, D. P. Scalable replica-exchange framework for Wang-Landau sampling. Physical Review E, 90(2):023302, 2014.
- Wang & Landau (2001) Wang, F. and Landau, D. P. Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters, 86(10):2050, 2001.
- Wilson & Izmailov (2020) Wilson, A. G. and Izmailov, P. Bayesian deep learning and a probabilistic perspective of generalization. In NeurIPS, 2020.
- Wolpert (1995) Wolpert, D. H. The mathematics of generalization. In Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, 1995.
- Yang et al. (2026) Yang, E., Zhang, X., Shang, Y., and Zhang, G. High-entropy advantage in neural networks’ generalizability. npj Artificial Intelligence, 2(44), 2026.
- Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
- Zhang et al. (2025) Zhang, X., Shang, Y., Yang, E., and Zhang, G. Is grokking a computational glass relaxation? Advances in Neural Information Processing Systems, 2025.
Appendix A More details on the Replica Exchange Wang–Landau algorithm
A.1 Running times
In our implementation, each random walker in the REWL algorithm executed about 1.5 million iterations per hour. Table 3 reports the number of iterations per walker and the wall clock time until convergence for the more demanding models. Models with more parameters or more training data take longer to converge, in some cases more than three weeks (up to 645 hours, about 27 days).
| Model | Iterations per Walker | Wall clock time (hours) |
|---|---|---|
| MNIST 300 Base model | 412,240,000 | 274 |
| MNIST 300 Deeper model | 461,880,000 | 307 |
| MNIST 300 Wider model | 599,880,000 | 399 |
| Fashion-MNIST 300 Base model | 466,520,000 | 311 |
| Fashion-MNIST 300 Deeper model | 624,880,000 | 416 |
| Fashion-MNIST 300 Wider model | 647,560,000 | 431 |
| MNIST 600 Base model | 456,560,000 | 304 |
| MNIST 600 Deeper model | 753,880,000 | 502 |
| MNIST 600 Wider model | 968,600,000 | 645 |
| Fashion-MNIST 600 Base model | 549,080,000 | 366 |
| Fashion-MNIST 600 Deeper model | 681,120,000 | 454 |
| Fashion-MNIST 600 Wider model | 910,280,000 | 606 |
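As a consistency check on Table 3, dividing the iterations per walker by the wall clock hours gives the per-walker throughput of each run. The short Python sketch below does this arithmetic; the dictionary name `TABLE_3` and the helper `iterations_per_hour` are illustrative names, not part of the authors' code.

```python
# Rows copied from Table 3: (iterations per walker, wall clock hours).
TABLE_3 = {
    "MNIST 300 Base model": (412_240_000, 274),
    "MNIST 300 Deeper model": (461_880_000, 307),
    "MNIST 300 Wider model": (599_880_000, 399),
    "Fashion-MNIST 300 Base model": (466_520_000, 311),
    "Fashion-MNIST 300 Deeper model": (624_880_000, 416),
    "Fashion-MNIST 300 Wider model": (647_560_000, 431),
    "MNIST 600 Base model": (456_560_000, 304),
    "MNIST 600 Deeper model": (753_880_000, 502),
    "MNIST 600 Wider model": (968_600_000, 645),
    "Fashion-MNIST 600 Base model": (549_080_000, 366),
    "Fashion-MNIST 600 Deeper model": (681_120_000, 454),
    "Fashion-MNIST 600 Wider model": (910_280_000, 606),
}


def iterations_per_hour(table):
    """Return the per-walker throughput (iterations / hour) for each run."""
    return {name: iters / hours for name, (iters, hours) in table.items()}


rates = iterations_per_hour(TABLE_3)
for name, rate in rates.items():
    print(f"{name:32s} {rate / 1e6:.3f} million iterations/hour")

# Longest wall clock time in the table, also expressed in days.
longest_hours = max(hours for _, hours in TABLE_3.values())
print(f"longest run: {longest_hours} h = {longest_hours / 24:.1f} days")
```

The throughput is nearly identical across rows, so the differences in wall clock time reflect the number of iterations needed for convergence rather than a change in per-iteration cost.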
A.2 Random-walker aggregation
To aggregate the log densities estimated by different random walkers, we added a different constant to each walker's log density so as to minimize the squared differences between pairs of estimated log densities in overlapping regions. This freedom follows from the fact that the density of states is defined only up to an overall normalization constant, so each log density is defined only up to an additive constant. The resulting combined curve is illustrated in Figure 5. The final log-density curve is obtained by taking the mean of the shifted log densities in overlapping regions.
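The aggregation step can be read as a small linear least-squares problem over the per-walker constants. The following NumPy sketch is one possible way to implement it, not the authors' code. It assumes that all walkers report their log densities on a common grid of bins, that unvisited bins are marked with NaN, and that the first walker's constant is fixed to zero to remove the remaining global freedom. The function name `align_and_merge` and the toy data are hypothetical.

```python
import numpy as np


def align_and_merge(log_g):
    """Align per-walker log densities by additive offsets and average them.

    log_g: array of shape (n_walkers, n_bins); NaN marks bins a walker
           did not estimate. Returns (offsets, merged_log_density).
    """
    log_g = np.asarray(log_g, dtype=float)
    n_walkers = log_g.shape[0]
    valid = ~np.isnan(log_g)

    # One equation per (walker pair, shared bin):
    #   c_i - c_j = log_g[j, b] - log_g[i, b]
    rows, rhs = [], []
    for i in range(n_walkers):
        for j in range(i + 1, n_walkers):
            for b in np.flatnonzero(valid[i] & valid[j]):
                row = np.zeros(n_walkers)
                row[i], row[j] = 1.0, -1.0
                rows.append(row)
                rhs.append(log_g[j, b] - log_g[i, b])
    if not rows:
        raise ValueError("no overlapping bins between any pair of walkers")

    # Assumption: fix c_0 = 0 (drop its column) and solve for the rest.
    A = np.array(rows)[:, 1:]
    c_rest, *_ = np.linalg.lstsq(A, np.array(rhs), rcond=None)
    offsets = np.concatenate(([0.0], c_rest))

    # Mean of the shifted curves over the walkers that cover each bin.
    shifted = log_g + offsets[:, None]
    count = valid.sum(axis=0)
    total = np.where(valid, shifted, 0.0).sum(axis=0)
    merged = np.where(count > 0, total / np.maximum(count, 1), np.nan)
    return offsets, merged


# Toy example: three walkers see the same curve with different constants.
true_curve = np.linspace(0.0, 5.0, 10) ** 2
walkers = np.full((3, 10), np.nan)
walkers[0, 0:6] = true_curve[0:6]          # reference walker
walkers[1, 3:10] = true_curve[3:10] + 7.0  # shifted up by 7
walkers[2, 6:10] = true_curve[6:10] - 2.0  # shifted down by 2

offsets, merged = align_and_merge(walkers)
print("offsets:", offsets)
print("max deviation from true curve:", np.abs(merged - true_curve).max())
```

If the overlap structure of the walkers is disconnected, the least-squares problem is underdetermined and `np.linalg.lstsq` returns the minimum-norm solution, so in practice the energy windows should be chosen so that neighbouring walkers overlap.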