To understand what’s behind the mean value reported by our status quo measurement process, we’ll examine single benchmark with the newly increased sampling frequency. All eleven series of samples in the chart below represent ~1s (0.7–0.8s in this case) of timing the benchmark Calculator, with varying number of iterations. This yields progressively less samples (n in the table) as the number of iterations averaged in the reported time increases (denoted by i# in the series’ name). Remember that our current measurement process reports only the mean value ( column) from 1 second of execution for each of the series.

Notice the differences between mean and median ( vs. Med), which are statistical estimates of a typical value. The the spread, or variability of the sample is characterized by standard deviation (s) and interquartile range (IQR). They are followed in the series table by relative measures of this dispersion: the coefficient of variation (CV = s / Mean); the equivalent for IQR is computed here as IQR / Median. Take note of the differences between these values for series with low iteration counts.1

The range shown on Y axis is about 25% of the minimum value, as can be seen in R and % columns of the i4 series. You can explore the individual sample series by selecting it from the table. It will be highlighted in the chart, with filled region showing the mean value of the series.

In general, as less iterations are averaged in each reported sample, the variability of sample population increases. For example, there is quite extreme maximum value of 22586 in the i1 series. This is probably the reason for originally introducing the averaging over multiple iterations in Benchmark_X in the first place. At least that’s how I understand Andrew Trick’s recollection here.

But the increase in variability comes from measurement errors caused by preemptive multitasking. When the operating system’s scheduler interrupts our process in the middle of timing a benchmark, the reported sample time also includes the time it took to switch context to another process, its execution and switching back. The frequency and magnitude of these errors varies depending on the overall system load and is outside of our direct control. The higher iteration counts are averaging over longer stretches of time, therefore including more of the accumulated error. The worst case being our status quo, where the reported 1 sample/second also includes all interrupts to the measured process during that time.

What happens to the average of the mean with higher system load? The value currently used to detect improvements and regressions in our benchmark is the minimum of these averaged values, so we’ll examine that too. These are the values measured with 10 Series strategy:

Calculator a b c d e
Mean of x̅ 368 374 401 615 891
Min of x̅ 368 372 385 576 779

Current measurement system with auto-scaling always reports the time with cumulative error of all the interrupts that have occurred during the ~1s of measurement. This is the root cause of instability in the reported benchmark improvements and regressions. There are no unstable benchmarks. Our measurement process is fragile — easily susceptible to the whims of varying system load. Currently we only attempt to counteract it with brute force, looking for the lowest average sample gathered in about 20 seconds. We are also looking for an outlier there: a minimum of the means, not a typical value! With having just 20 samples overall, we have little indication of their quality. Smoke benchmark with 3 samples has hardly any statistical evidence for the reported improvements and regressions.

Due to the systemic measurement error introduced from preemptive multitasking, which has very long tail and is therefore nowhere near normally distributed, the mean and standard deviation are poor estimates of a location and spread for raw benchmark measurements when the system is even under moderate load. Median and interquartile range are a more robust measures in the presence of outliers.

The base workload of performance tests in the Swift Benchmark suite is in all cases scaled so that even with optimizations enabled, the reported runtime is at minimum in tens of microseconds — these are microbenchmarks. We are not measuring timing of individual methods or CPU operations, so we do not need to deal with timer precision and dynamically determine N in order to properly measure on nanosecond scale. That’s the domain where the averaging of runtime over multiple iterations comes into play.

I think it is safe to conclude that averaging multiple samples together, i.e. measuring with num-iters > 1, serves no useful purpose in Swift Benchmark Suite, only precludes the separation of signal from the noise.

Previous: Introduction
Next: Exclude Outliers