[BUG] benchmarks of very fast things produce misleading "geomean" results #1482

matta · 2022-09-08T12:29:32Z

Describe the bug

I've got a benchmark comparison where one side has a mean CPU time of 1.8ns and the other 2.2ns. compare.py reports their "OVERALL GEOMEAN" difference 1.0.

compare.py truncates to integer nanosecond resolution when computing geomean, so it will be inaccurate for many "fast" benchmarks.

System

Linux, but doesn't matter.

To reproduce

Steps to reproduce the behavior:

Run two benchmarks with low single digit timings, ideally with sub-nanosecond differences. The OVERALL_GEOMEAN will tend to be 0.0, 1.0, or -1.0.

Expected behavior

OVERALL_GEOMEAN is a correct result.

Additional context

The issue is here:

benchmark/tools/gbench/report.py

Lines 155 to 162 in db55c89

    
           def get_timedelta_field_as_seconds(benchmark, field_name): 
        
               """ 
        
               Get value of field_name field of benchmark, which is time with time unit 
        
               time_unit, as time in seconds. 
        
               """ 
        
               time_unit = benchmark['time_unit'] if 'time_unit' in benchmark else 's' 
        
               dt = Timedelta(benchmark[field_name], time_unit) 
        
               return dt / Timedelta(1, 's')

The pandas.Timedelta type stores integral (int64) values at nanosecond resolution. See https://github.com/pandas-dev/pandas/blob/28331e0cc7aa0729968565bbd1278b3b6239c823/pandas/_libs/tslibs/timedeltas.pyi#L84

I think the OVERALL_GEOMEAN has always had this bug. See #1289.

While pandas.Timedelta supports an impressive list of units I don't think compare.py needs that generality. For what Pandas supports see: https://github.com/pandas-dev/pandas/blob/28331e0cc7aa0729968565bbd1278b3b6239c823/pandas/_libs/tslibs/timedeltas.pyi#L19-L61

I think compare.py only needs to support the units in benchmark JSON. Implementing the C++ code here again in Python to achieve the same effect would be easy:

benchmark/include/benchmark/benchmark.h

Lines 1817 to 1843 in 4366d66

    
           inline const char* GetTimeUnitString(TimeUnit unit) { 
        
             switch (unit) { 
        
               case kSecond: 
        
                 return "s"; 
        
               case kMillisecond: 
        
                 return "ms"; 
        
               case kMicrosecond: 
        
                 return "us"; 
        
               case kNanosecond: 
        
                 return "ns"; 
        
             } 
        
             BENCHMARK_UNREACHABLE(); 
        
           } 
        
           inline double GetTimeUnitMultiplier(TimeUnit unit) { 
        
             switch (unit) { 
        
               case kSecond: 
        
                 return 1; 
        
               case kMillisecond: 
        
                 return 1e3; 
        
               case kMicrosecond: 
        
                 return 1e6; 
        
               case kNanosecond: 
        
                 return 1e9; 
        
             } 
        
             BENCHMARK_UNREACHABLE(); 
        
           }

From there, rewriting get_timedelta_field_as_seconds to preserve resolution is also easy.

Happy to work on this, as I've been looking at these "strange" geomean values in compare.py output the past few days and wondering about it. :)

Screenshots

Here is an example showing the issue:

+ taskset -c 0 ./build_gcc/_deps/googlebenchmark-src/tools/compare.py benchmarks ./build_clang/Release/lower_bound ./build_gcc/Release/lower_bound --benchmark_repetitions=20 '--benchmark_filter=Random_Random/10$'
RUNNING: ./build_clang/Release/lower_bound --benchmark_repetitions=20 --benchmark_filter=Random_Random/10$ --benchmark_out=/tmp/tmp6lsy84m8
CPU ID assignments in /proc/cpuinfo seem messed up. This is usually caused by a bad BIOS.
2022-09-08T04:25:50-07:00
Running ./build_clang/Release/lower_bound
Run on (23 X 3545.34 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x23)
  L1 Instruction 32 KiB (x23)
  L2 Unified 512 KiB (x23)
  L3 Unified 32768 KiB (x2)
Load Average: 0.12, 0.18, 0.23
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_LowerBound/Random_Random/10              1.79 ns         1.78 ns    391748437
BM_LowerBound/Random_Random/10              1.80 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.80 ns         1.80 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.80 ns         1.80 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10_mean         1.79 ns         1.79 ns           20
BM_LowerBound/Random_Random/10_median       1.79 ns         1.79 ns           20
BM_LowerBound/Random_Random/10_stddev      0.005 ns        0.005 ns           20
BM_LowerBound/Random_Random/10_cv           0.26 %          0.25 %            20
RUNNING: ./build_gcc/Release/lower_bound --benchmark_repetitions=20 --benchmark_filter=Random_Random/10$ --benchmark_out=/tmp/tmp2bfphzw0
CPU ID assignments in /proc/cpuinfo seem messed up. This is usually caused by a bad BIOS.
2022-09-08T04:26:04-07:00
Running ./build_gcc/Release/lower_bound
Run on (23 X 3400.42 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x23)
  L1 Instruction 32 KiB (x23)
  L2 Unified 512 KiB (x23)
  L3 Unified 32768 KiB (x2)
Load Average: 0.32, 0.22, 0.24
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_LowerBound/Random_Random/10              2.25 ns         2.25 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.20 ns         2.20 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10              2.24 ns         2.24 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10_mean         2.22 ns         2.22 ns           20
BM_LowerBound/Random_Random/10_median       2.22 ns         2.22 ns           20
BM_LowerBound/Random_Random/10_stddev      0.012 ns        0.012 ns           20
BM_LowerBound/Random_Random/10_cv           0.55 %          0.55 %            20
Comparing ./build_clang/Release/lower_bound to ./build_gcc/Release/lower_bound
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
BM_LowerBound/Random_Random/10                       +0.2615         +0.2616             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2361         +0.2378             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2448         +0.2448             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2412         +0.2413             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2425         +0.2425             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2411         +0.2411             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2342         +0.2342             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2416         +0.2416             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2443         +0.2443             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2300         +0.2300             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2405         +0.2405             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2231         +0.2231             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2475         +0.2475             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2338         +0.2339             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2452         +0.2452             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2408         +0.2408             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2330         +0.2330             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2518         +0.2518             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2493         +0.2493             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2379         +0.2379             2             2             2             2
BM_LowerBound/Random_Random/10_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
BM_LowerBound/Random_Random/10_mean                  +0.2410         +0.2411             2             2             2             2
BM_LowerBound/Random_Random/10_median                +0.2421         +0.2421             2             2             2             2
BM_LowerBound/Random_Random/10_stddev                +1.6502         +1.7121             0             0             0             0
BM_LowerBound/Random_Random/10_cv                    +1.1355         +1.1852             0             0             0             0
OVERALL_GEOMEAN                                      +1.0000         +1.0000             0             0             0             0

The text was updated successfully, but these errors were encountered:

LebedevRI · 2022-09-08T13:36:00Z

Oops, that is quite unfortunate :)
I guess it would be fine to hand-roll our own timeunit-to-seconds function then.
Please feel free to look into it if you want.

The pandas.Timedelta class truncates to integral nanoseconds, which throws away sub-nanosecond precision present in benchmark JSON. Switch to floating point multiplication, which preserves it. Fixes #1482 Tentatively fixes #1477.

LebedevRI added the help wanted label Sep 8, 2022

matta mentioned this issue Sep 8, 2022

[BUG] compare.py: OVERALL_GEOMEAN does not work in filters and benchmarksfiltered modes #1477

Closed

LebedevRI closed this as completed in becf80f Sep 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] benchmarks of very fast things produce misleading "geomean" results #1482

[BUG] benchmarks of very fast things produce misleading "geomean" results #1482

matta commented Sep 8, 2022

LebedevRI commented Sep 8, 2022

[BUG] benchmarks of very fast things produce misleading "geomean" results #1482

[BUG] benchmarks of very fast things produce misleading "geomean" results #1482

Comments

matta commented Sep 8, 2022

LebedevRI commented Sep 8, 2022