Benchmark builder

Compare benchmark suites (mean + variance)

All computation runs locally in your browser

Last updated: February 9, 2026
Frank Zhao - Creator

Introduction / overview

The Benchmark Builder is a lightweight tool for comparing multiple sets of timing measurements (called “suites”). You paste or type the numbers you measured (for example, execution time in milliseconds), and it instantly shows a ranked table with each suite’s mean and variance.

What problem does it solve?

It helps you answer questions like: “Which implementation is faster on average?” and “Which one is more stable?” without manually calculating statistics.

Developers

Compare two versions of a function, query, or API call.

Teams

Share a standardized summary in a PR discussion.

Learners

Build intuition for mean vs variance using real numbers.

If you often need to share structured results, you may also like our List Converter for quickly formatting copied data.

How to use / quick start

  1. Name each suite (e.g., “Baseline”, “New implementation”).
  2. Enter your measurements under Suite values (press Enter to add another row).
  3. Optionally set the unit (e.g., ms, s).
  4. Read the table: lower mean ranks first; variance shows stability.
  5. Export with “Copy as markdown table” or “Copy as bullet list”.

Worked example (two measurements)

Suppose Suite A has two runs: 5 ms and 10 ms.

\mu = \frac{5 + 10}{2} = 7.5\ \mathrm{ms}
\sigma^2 = \frac{(5 - 7.5)^2 + (10 - 7.5)^2}{2} = 6.25\ \mathrm{ms}^2

The mean tells you the typical runtime. The variance tells you how much the runs fluctuate. If two suites have similar means, prefer the one with smaller variance (more stable performance).
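The worked example above can be reproduced with a few lines of code. This is a minimal sketch (not the tool's actual source); the `mean` and `populationVariance` helper names are assumptions for illustration.

```typescript
// Mean: sum of all measurements divided by the count.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Population variance: average squared deviation from the mean.
function populationVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / xs.length;
}

const suiteA = [5, 10]; // Suite A's two runs, in ms
console.log(mean(suiteA));               // 7.5
console.log(populationVariance(suiteA)); // 6.25
```

Running this prints 7.5 and 6.25, matching the hand calculation above.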

Real-world examples / use cases

Comparing two implementations

Background: You rewrote a function and want a quick sanity-check before merging.

Inputs: Suite “Old”: 8, 9, 10 ms; Suite “New”: 7, 7, 8 ms

Result: New has a lower mean (≈7.33 ms) and smaller variance.

How it helps: Use the exported Markdown table directly in the PR description.

Tracking a regression

Background: A change made the API slower and you want to quantify the impact.

Inputs: Suite “Before”: 120, 118, 125 ms; Suite “After”: 160, 155, 170 ms

Result: The mean increases (positive delta), and the ratio (≈×1.3) highlights the magnitude.

How it helps: Paste the bullet list into a ticket so anyone can read it quickly.
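The delta and ratio for this regression can be checked by hand. A small sketch, using a hypothetical `meanOf` helper rather than the tool's own code:

```typescript
// Arithmetic mean of a list of measurements.
const meanOf = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

const before = meanOf([120, 118, 125]); // 121 ms
const after = meanOf([160, 155, 170]);  // ≈161.67 ms

const delta = after - before; // ≈40.67 ms slower
const ratio = after / before; // ≈1.34, displayed as ×1.3
```

The positive delta (~41 ms) and the ratio (~1.3×) are exactly what the tool's comparison annotation reports.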

Comparing multiple strategies

Background: You have 3–5 approaches (cache, batching, different queries).

Inputs: One suite per approach, 10–30 samples each

Result: The table ranks suites by mean (best first) while variance reveals consistency.

How it helps: Pick the best mean unless high variance would hurt user experience.

If you want to quickly normalize or reformat your raw measurements before pasting them here, try Text to ASCII Binary for quick transformations and copy-friendly output.

Common scenarios / when to use

PR performance notes

Summarize benchmark runs in a consistent table.

Micro-optimizations

Check if a change actually moved the needle.

Choosing a default strategy

Compare multiple approaches and pick the best mean.

Stability checks

Spot noisy measurements via high variance.

Sharing results

Use Share + Favorites to keep common configs.

Quick exports

Copy Markdown for docs or bullet list for tickets.

When it may not be a good fit

If your benchmark methodology is inconsistent (different machines, warmup not accounted for, background tasks running), the numbers can be misleading. Use the tool to summarize data — but make sure your experiment design is sound.

Tips & best practices

  • Use enough samples: Two samples are fine for a quick check, but 10–30 runs usually give a more reliable mean.

  • Keep units consistent: If one suite is in ms and another in µs, your comparison will be meaningless.

  • Watch variance, not just mean: High variance often means noise, GC pauses, or unstable environment.

  • Export the exact numbers: Copy as Markdown for PRs; bullet list for issues and tickets.

Calculation method / formula explanation

For each suite, the tool computes the mean and the population variance. If you have measurements x_1, x_2, \dots, x_n, then:

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2

The results table ranks suites by mean (lowest mean first). For any suite that is not the best, the tool also shows a comparison against the best mean: the absolute delta and the multiplicative ratio.

\Delta = \mu_{suite} - \mu_{best}, \qquad r = \frac{\mu_{suite}}{\mu_{best}}
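Putting the formulas together, the whole summarize-and-rank step can be sketched as follows. This is an illustrative reimplementation under the definitions above, not the tool's source; the `summarize` function and `SuiteSummary` shape are assumptions.

```typescript
interface SuiteSummary {
  name: string;
  mean: number;
  variance: number;
  delta: number; // mean difference vs. the best suite
  ratio: number; // mean ratio vs. the best suite
}

function summarize(suites: Record<string, number[]>): SuiteSummary[] {
  // Compute mean and population variance per suite.
  const stats = Object.entries(suites).map(([name, xs]) => {
    const m = xs.reduce((a, b) => a + b, 0) / xs.length;
    const v = xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / xs.length;
    return { name, mean: m, variance: v };
  });
  // Rank by mean, lowest (fastest) first.
  stats.sort((a, b) => a.mean - b.mean);
  const best = stats[0].mean;
  // Annotate each suite with its delta and ratio against the best mean.
  return stats.map((s) => ({ ...s, delta: s.mean - best, ratio: s.mean / best }));
}

const ranked = summarize({ Old: [8, 9, 10], New: [7, 7, 8] });
// "New" ranks first (mean ≈7.33 ms); "Old" shows delta ≈1.67 ms, ratio ≈1.23
```

The best suite always has delta 0 and ratio 1, which is why the tool only shows the comparison annotation for the other suites.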

Related concepts / background info

Mean summarizes typical performance. Variance summarizes stability. In benchmarking, both matter — users feel slow spikes even if the average looks good.

If you want to share results, use the Share button to generate a link and Favorite to save a configuration. When you’re done, Reset clears everything.

Frequently asked questions (FAQs)

Why is the “best” suite the one with the lowest mean?

For timing benchmarks, lower usually means faster. The tool sorts suites by mean ascending. If your metric is “higher is better”, interpret the ranking accordingly.

Is this sample variance or population variance?

It uses population variance: \sigma^2 = \frac{1}{n}\sum (x_i - \mu)^2. This matches many quick benchmark summaries, where your measured runs are treated as the full dataset.
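The difference between the two is only the divisor: population variance divides by n, sample variance by n − 1. A small sketch for comparison (a hypothetical `variance` helper, not the tool's code):

```typescript
// sample = false → population variance (divide by n), as the tool reports.
// sample = true  → sample variance (divide by n - 1), Bessel's correction.
function variance(xs: number[], sample = false): number {
  const m = xs.reduce((a, b) => a + b, 0) / xs.length;
  const ss = xs.reduce((acc, x) => acc + (x - m) ** 2, 0);
  return ss / (xs.length - (sample ? 1 : 0));
}

console.log(variance([5, 10]));       // 6.25 (population)
console.log(variance([5, 10], true)); // 12.5 (sample)
```

With small n the two values differ noticeably, so be explicit about which one you report when comparing against other tools.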

What does the “(+Δ ; ×r)” annotation mean?

It compares each suite to the best (lowest) mean: Δ is the absolute difference, and r is the ratio relative to the best.

Can I export the results?

Yes. Use “Copy as markdown table” for docs and PRs, or “Copy as bullet list” for tickets and chat.

Does the calculator send my numbers to a server?

No — it runs locally in your browser. Share links only include what you choose to embed in the URL.

Limitations / disclaimers

This tool summarizes the numbers you enter; it cannot fix benchmarking methodology. For reliable comparisons, keep environment and test conditions consistent (hardware, load, warmups, and cache effects).

The results are informational and should not be treated as a guarantee of real-world performance.

External references / sources

For a deeper dive into descriptive statistics: