This question is somewhat specific to Rust's Criterion, but I have kept it general so that anybody with knowledge about benchmarking can help.
In my Rust codebase, I have a struct Model that is very complex.
I want to benchmark running a simulation with said model for 1000 time steps.
However, I am having a hard time finding documentation on the conditions under which the statistical analyses Criterion performs are meaningful.
Specifically, I want to know if an "iteration" needs to be exactly the same code running every time, or just a statistically equivalent piece of code.
For example, I currently have this closure in my benchmark:
// For each sample:
|b: &mut Bencher, parameters| {
    // For each iteration:
    b.iter_batched_ref(
        // Setup function
        || {
            let mut model = Model::new(parameters.clone());
            model.setup().unwrap();
            model.run(100);
            model
        },
        |model| model.run(black_box(1000)),
        BatchSize::SmallInput,
    )
}
As you can tell, this builds a new Model for every iteration, which is costly and drives down the total number of iterations performed.
To try to fix this, I could change my code to:
// For each sample:
|b, parameters| {
    // Setup model at the beginning of a sample
    let mut model = Model::new(parameters.clone());
    model.setup().unwrap();
    // Brings the simulation to statistical equilibrium
    model.run(100);
    // For each iteration (sequentially, without resetting the model):
    b.iter(|| model.run(black_box(1000)))
}
This makes each sample in the benchmark exactly the same, but not each iteration, since between iterations we are modifying model in place.
Despite that, once the model has been brought to equilibrium, each individual time step is statistically equivalent and should take roughly the same amount of time to execute (say the slowest step takes maybe 20% longer than the fastest, a difference that largely averages out when summed across the 1000 steps).
And in principle, comparisons across samples are still meaningful, because the samples themselves are identical.
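To put a rough number on the averaging claim above (a back-of-the-envelope sketch, under the loose assumption that per-step times are roughly independent with mean μ and standard deviation σ): the relative spread of the summed times is

sd(sum of 1000 steps) / mean(sum of 1000 steps) = (sqrt(1000) · σ) / (1000 · μ) = σ / (μ · sqrt(1000)) ≈ 0.03 · (σ/μ),

so even a per-step standard deviation of several percent of the mean shrinks to well under 1% relative variation in each 1000-step iteration.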
Will this break Criterion's statistical analyses by violating the concept of what an "iteration" is, or is it OK?
Another alternative would be to forgo the 1000-time-step run entirely and instead benchmark a single time step, which brings down the cost of each iteration. However, this makes me trust the benchmark a bit less, because the specific time step being measured could be one of those that take slightly longer to run, which would skew my estimate of how long it takes to run the simulation for 100 million time steps.
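For reference, the single-time-step variant I have in mind would look roughly like this (just a sketch; it assumes the same Model, setup, and run methods as above, keeps the per-sample warm-up, and again advances the model in place between iterations):

// For each sample:
|b, parameters| {
    // Same per-sample setup and warm-up as before
    let mut model = Model::new(parameters.clone());
    model.setup().unwrap();
    model.run(100);
    // Each iteration now measures a single time step,
    // so iterations are cheap but individually noisier
    b.iter(|| model.run(black_box(1)))
}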