Exclude Setup Overhead

Some benchmarks need to perform additional setup work before their main workload. Historically, this was dealt with by sizing the main workload so that it dwarfs the setup, making it negligible. Setup is performed before the main work loop, and its impact is further lessened by amortizing it over the N measured iterations.

Since we no longer measure with N>1, the effect of setup becomes more pronounced. One of the most extreme cases, the ReversedArray benchmark, clearly demonstrates how the setup overhead gets amortized as num-iters increases.7

The setup overhead is a systematic measurement error that can be detected and corrected for when measuring with different num-iters. Given two measurements performed with i and j iterations that reported the corresponding runtimes ti and tj, the setup overhead can be computed as follows:

setup = (i * j * (ti - tj)) / (j - i)
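
The formula assumes the reported runtime ti is the per-iteration value, i.e. ti = setup/i + work; the difference of two such measurements then isolates the setup term. A minimal sketch of this calculation in Swift (the function name setupOverhead is illustrative, not part of the benchmark harness):

```swift
/// Computes the setup overhead from two measurements of the same benchmark
/// performed with `i` and `j` iterations that reported per-iteration
/// runtimes `ti` and `tj`. Since ti = setup/i + work and tj = setup/j + work,
/// ti - tj = setup * (j - i) / (i * j), which rearranges to the formula below.
func setupOverhead(i: Double, ti: Double, j: Double, tj: Double) -> Double {
    return (i * j * (ti - tj)) / (j - i)
}
```

With i = 1 and j = 2 the formula reduces to setup = 2 * (t1 - t2), which is the special case used below.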

We can detect the setup overhead by picking the smallest minimum from the series with the same num-iters and using the above formula with i=1, j=2. For the a10R series from ReversedArray, this gives 134µs of setup overhead (or 41.4% of the minimal value).7

We can normalize the series measured with different num-iters by subtracting the corresponding fraction of the setup from each sample. After we exclude the setup overhead, the median value is 190µs, which exactly matches the baseline from the i0 series.8
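
As an illustration, here is a minimal sketch of this normalization in Swift. The sample values are hypothetical, chosen only to be consistent with the figures quoted above; they are not the actual ReversedArray measurements:

```swift
let setup = 134.0  // µs of setup overhead, detected with the formula above

// Hypothetical per-iteration samples (µs), keyed by the num-iters they were
// measured with.
let samples: [Int: [Double]] = [
    1: [324.0, 330.0, 341.0],
    2: [257.0, 260.0, 266.0],
]

var corrected: [Int: [Double]] = [:]
for (numIters, runtimes) in samples {
    // A run with num-iters = k amortizes the setup over k iterations,
    // so each of its samples carries setup / k of overhead.
    corrected[numIters] = runtimes.map { $0 - setup / Double(numIters) }
}
```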

However, our ability to detect the overhead with this technique depends on its size relative to the sample variance. Some small overheads are apparent only in the a series and get lost in the noisier series. Another issue is that for larger overheads, the corrected sample ends up with higher relative variance than other benchmarks with similar runtimes. This is because the sample dispersion is always judged relative to the runtime: subtracting the constant overhead yields a lower runtime but the same dispersion (i.e. the IQR and standard deviation are unchanged).
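
To make this concrete with illustrative numbers (a 324µs raw minimum split into the 134µs setup and 190µs workload, and an assumed 5µs standard deviation):

```swift
// Subtracting a constant overhead lowers the runtime but not the absolute
// dispersion, so the relative spread of the corrected sample grows.
let standardDeviation = 5.0                              // µs, unchanged by the correction
let rawRelativeSpread = standardDeviation / 324.0        // ≈ 1.5% of the raw runtime
let correctedRelativeSpread = standardDeviation / 190.0  // ≈ 2.6% of the corrected runtime
```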

Benchmarks from the Swift Benchmark Suite with setup overhead in % relative to the runtime. The % values link to chart.html; hover over the links for absolute values in µs:

Benchmark a10 a10R a12
ClassArrayGetter 100% 100% 99%
ReversedDictionary 99% 99% 100%
ArrayOfGenericRef 49% 50% 50%
ArrayOfPOD 50% 50% 50%
ArrayOfGenericPOD 50% 50% 50%
Chars 50% 50% 50%
ArrayOfRef 49% 50% 49%
ReversedArray 41% 41% 41%
DictionaryGroupOfObjects 36% 34% 36%
ArrayAppendStrings 24% 26% 25%
SetIsSubsetOf_OfObjects 20% 20% 20%
SortSortedStrings 15% 15% 15%
IterateData 10% 13% 11%
SuffixArray 10% 7% 10%
PolymorphicCalls 10% 10% 10%
Phonebook 9% 9% 9%
SetIntersect_OfObjects 8% 8% 9%
SetIsSubsetOf 9% 9% 9%
DropLastArray 9% 9% 9%
Dictionary 7% 8% 8%
SetIntersect 8% 8% 8%
DictionaryOfObjects 8% 7% 7%
MapReduceShort 7% 6% 5%
DropLastArrayLazy 7% 7% 7%
StaticArray 7% 7% 7%
SuffixArrayLazy 6% 6% 6%
MapReduceClass 6% 4% 4%
ArrayInClass 4% 5% 5%
SubstringFromLongString 4% 4% 4%
DropFirstArray 4% 4% 4%
PrefixArray 3% 3% 3%
SuffixAnyCollection 3% 3% 3%
DropLastAnyCollection 3% 3% 3%
SetUnion_OfObjects 3% 2% 3%
PrefixArrayLazy 3% 3% 3%
UTF8Decode 2% 2% 2%
PrefixWhileArrayLazy 2% 2% 2%
DropFirstArrayLazy 2% 2% 2%
SetExclusiveOr_OfObjects 2% 2% 2%
DropFirstSequenceLazy 2% 2% 2%
DropWhileArrayLazy 1% 1% 1%
DropFirstAnyCollection 1% 1% 1%
PrefixAnyCollection 1% 1% 1%
MapReduceLazySequence 1% 1% 1%
ArrayAppendToFromGeneric 1% 1% 0%
PrefixWhileAnyCollectionLazy 1% 1% 1%

The first two, ClassArrayGetter and ReversedDictionary, are clearly cases of incorrectly written benchmarks where the compiler’s optimizations eliminated the main workload, so we are only measuring the setup overhead. Many others are cases of testing a certain feature of Array or an Array-backed type, where the initial creation of the array ends up being significant compared to the other fast methods the Array provides.

PR 12404 added the ability to perform setup and teardown outside of the measured performance test; so far it is used by only one benchmark. Rather than automatically correcting for the setup overhead, I believe it is best to manually audit the benchmarks from the above table and reassess what should be measured and what should be moved to the setup function outside the main workload.

Previous: Exclude Outliers
Next: Detecting Changes