This question is somewhat specific to Rust's Criterion, but I have kept it general so that anybody with knowledge about benchmarking can help.
In my Rust codebase, I have a struct Model that is very complex.
I want to benchmark running a simulation with said model for 1000 time steps.
However, I am having a hard time finding documentation on the conditions under which the statistical analyses Criterion performs are meaningful.
Specifically, I want to know if an "iteration" needs to be exactly the same code running every time, or just a statistically equivalent piece of code.
For example, I currently have this closure in my benchmark:
// For each sample:
|b: &mut Bencher, parameters| {
    // For each iteration:
    b.iter_batched_ref(
        // Setup function
        || {
            let mut model = Model::new(parameters.clone());
            model.setup().unwrap();
            model.run(100);
            model
        },
        |model| model.run(black_box(1000)),
        BatchSize::SmallInput,
    )
}
As you can tell, this builds a new Model for every iteration, which is costly and drives down the total number of iterations performed.
To try to fix this, I could change my code to:
// For each sample:
|b, parameters| {
    // Setup model at the beginning of a sample
    let mut model = Model::new(parameters.clone());
    model.setup().unwrap();
    // Brings the simulation to statistical equilibrium
    model.run(100);
    // For each iteration (sequentially, without resetting the model):
    b.iter(|| model.run(black_box(1000)))
}
This makes each sample in the benchmark exactly the same, but not each iteration, since between iterations we are modifying model in place.
Despite that, once the model has been brought to equilibrium, each individual time step is statistically equivalent and should take roughly the same amount of time to execute (say the slowest step takes maybe 20% longer than the fastest, a difference that largely averages out when summed across the 1000 steps).
And in principle, comparisons across samples are still meaningful, because the samples themselves are identical.
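To put a rough number on the averaging claim above (a back-of-the-envelope sketch, under the loose assumption that per-step times are roughly independent with mean μ and standard deviation σ): the relative spread of the summed times is

sd(sum of 1000 steps) / mean(sum of 1000 steps) = (sqrt(1000) · σ) / (1000 · μ) = σ / (μ · sqrt(1000)) ≈ 0.03 · (σ/μ),

so even a per-step standard deviation of several percent of the mean shrinks to well under 1% relative variation in each 1000-step iteration.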
Will this break Criterion's statistical analyses by violating the concept of what an "iteration" is, or is it OK?
Another alternative would be to forgo the 1000-time-step run entirely and instead benchmark a single time step, which brings down the cost of each iteration. However, this makes me trust the benchmark a bit less, because the specific time step being measured could be one of those that take slightly longer to run, which would skew my estimate of how long it takes to run the simulation for 100 million time steps.
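For reference, the single-time-step variant I have in mind would look roughly like this (just a sketch; it assumes the same Model, setup, and run methods as above, keeps the per-sample warm-up, and again advances the model in place between iterations):

// For each sample:
|b, parameters| {
    // Same per-sample setup and warm-up as before
    let mut model = Model::new(parameters.clone());
    model.setup().unwrap();
    model.run(100);
    // Each iteration now measures a single time step,
    // so iterations are cheap but individually noisier
    b.iter(|| model.run(black_box(1)))
}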