Exclude Setup Overhead

Some benchmarks need to perform additional setup work before their main workload. Historically, this was dealt with by sizing the main workload so that it dwarfs the setup, making it negligible. Setup is performed before the main work loop, and its impact is further lessened by amortizing it over the N measured iterations.

Since we no longer measure with N>1, the effect of setup becomes more pronounced. One of the most extreme cases, the ReversedArray benchmark, clearly demonstrates how the setup overhead gets amortized as num-iters increases.7

The setup overhead is a systematic measurement error that can be detected and corrected for when measuring with different num-iters. Given two measurements performed with i and j iterations that reported the corresponding runtimes ti and tj, the setup overhead can be computed as follows:

setup = (i * j * (ti - tj)) / (j - i)
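
The formula assumes the reported runtime ti is the per-iteration value, i.e. ti = setup/i + work; the difference of two such measurements then isolates the setup term. A minimal sketch of this calculation in Swift (the function name setupOverhead is illustrative, not part of the benchmark harness):

```swift
/// Computes the setup overhead from two measurements of the same benchmark
/// performed with `i` and `j` iterations that reported per-iteration
/// runtimes `ti` and `tj`. Since ti = setup/i + work and tj = setup/j + work,
/// ti - tj = setup * (j - i) / (i * j), which rearranges to the formula below.
func setupOverhead(i: Double, ti: Double, j: Double, tj: Double) -> Double {
    return (i * j * (ti - tj)) / (j - i)
}
```

With i = 1 and j = 2 the formula reduces to setup = 2 * (t1 - t2), which is the special case used below.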

We can detect the setup overhead by picking the smallest minimum from the series with the same num-iters and using the above formula with i=1, j=2. For the a10R series from ReversedArray, this gives 134µs of setup overhead (or 41.4% of the minimal value).7

We can normalize the series measured with different num-iters by subtracting the corresponding fraction of the setup from each sample. After we exclude the setup overhead, the median value is 190µs, which exactly matches the baseline from the i0 series.8
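
As an illustration, here is a minimal sketch of this normalization in Swift. The sample values are hypothetical, chosen only to be consistent with the figures quoted above; they are not the actual ReversedArray measurements:

```swift
let setup = 134.0  // µs of setup overhead, detected with the formula above

// Hypothetical per-iteration samples (µs), keyed by the num-iters they were
// measured with.
let samples: [Int: [Double]] = [
    1: [324.0, 330.0, 341.0],
    2: [257.0, 260.0, 266.0],
]

var corrected: [Int: [Double]] = [:]
for (numIters, runtimes) in samples {
    // A run with num-iters = k amortizes the setup over k iterations,
    // so each of its samples carries setup / k of overhead.
    corrected[numIters] = runtimes.map { $0 - setup / Double(numIters) }
}
```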

However, our ability to detect the overhead with this technique depends on its size relative to the sample variance. Some small overheads are apparent only in the a series and get lost in the noisier series. Another issue is that for larger overheads, the corrected sample ends up with higher relative variance than other benchmarks with similar runtimes. This is because the sample dispersion is always judged relative to the runtime: subtracting the constant overhead yields a lower runtime but the same dispersion (i.e. the IQR and standard deviation are unchanged).
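
To make this concrete with illustrative numbers (a 324µs raw minimum split into the 134µs setup and 190µs workload, and an assumed 5µs standard deviation):

```swift
// Subtracting a constant overhead lowers the runtime but not the absolute
// dispersion, so the relative spread of the corrected sample grows.
let standardDeviation = 5.0                              // µs, unchanged by the correction
let rawRelativeSpread = standardDeviation / 324.0        // ≈ 1.5% of the raw runtime
let correctedRelativeSpread = standardDeviation / 190.0  // ≈ 2.6% of the corrected runtime
```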

Benchmarks from the Swift Benchmark Suite with setup overhead in % relative to the runtime. The % values link to chart.html; hover over the links for absolute values in µs:

Benchmark a10 a10R a12
ClassArrayGetter 100% 100% 99%
ReversedDictionary 99% 99% 100%
ArrayOfGenericRef 49% 50% 50%
ArrayOfPOD 50% 50% 50%
ArrayOfGenericPOD 50% 50% 50%
Chars 50% 50% 50%
ArrayOfRef 49% 50% 49%
ReversedArray 41% 41% 41%
DictionaryGroupOfObjects 36% 34% 36%
ArrayAppendStrings 24% 26% 25%
SetIsSubsetOf_OfObjects 20% 20% 20%
SortSortedStrings 15% 15% 15%
IterateData 10% 13% 11%
SuffixArray 10% 7% 10%
PolymorphicCalls 10% 10% 10%
Phonebook 9% 9% 9%
SetIntersect_OfObjects 8% 8% 9%
SetIsSubsetOf 9% 9% 9%
DropLastArray 9% 9% 9%
Dictionary 7% 8% 8%
SetIntersect 8% 8% 8%
DictionaryOfObjects 8% 7% 7%
MapReduceShort 7% 6% 5%
DropLastArrayLazy 7% 7% 7%
StaticArray 7% 7% 7%
SuffixArrayLazy 6% 6% 6%
MapReduceClass 6% 4% 4%
ArrayInClass 4% 5% 5%
SubstringFromLongString 4% 4% 4%
DropFirstArray 4% 4% 4%
PrefixArray 3% 3% 3%
SuffixAnyCollection 3% 3% 3%
DropLastAnyCollection 3% 3% 3%
SetUnion_OfObjects 3% 2% 3%
PrefixArrayLazy 3% 3% 3%
UTF8Decode 2% 2% 2%
PrefixWhileArrayLazy 2% 2% 2%
DropFirstArrayLazy 2% 2% 2%
SetExclusiveOr_OfObjects 2% 2% 2%
DropFirstSequenceLazy 2% 2% 2%
DropWhileArrayLazy 1% 1% 1%
DropFirstAnyCollection 1% 1% 1%
PrefixAnyCollection 1% 1% 1%
MapReduceLazySequence 1% 1% 1%
ArrayAppendToFromGeneric 1% 1% 0%
PrefixWhileAnyCollectionLazy 1% 1% 1%

The first two, ClassArrayGetter and ReversedDictionary, are clearly cases of incorrectly written benchmarks where the compiler’s optimizations eliminated the main workload, so we are only measuring the setup overhead. Many others are cases of testing a certain feature of Array or an Array-backed type, where the initial creation of the array ends up being significant compared to the other fast methods the Array provides.

PR 12404 added the ability to perform setup and teardown outside of the measured performance test; so far it is used by only one benchmark. Rather than automatically correcting for the setup overhead, I believe it is best to manually audit the benchmarks from the above table and reassess what should be measured and what should be moved to the setup function outside the main workload.

Previous: Exclude Outliers
Next: Detecting Changes