Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] benchmarks of very fast things produce misleading "geomean" results #1482

Closed
matta opened this issue Sep 8, 2022 · 1 comment
Closed

Comments

@matta
Copy link
Contributor

matta commented Sep 8, 2022

Describe the bug

I've got a benchmark comparison where one side has a mean CPU time of 1.8ns and the other 2.2ns. compare.py reports their "OVERALL GEOMEAN" difference 1.0.

compare.py truncates to integer nanosecond resolution when computing geomean, so it will be inaccurate for many "fast" benchmarks.

System

Linux, but doesn't matter.

To reproduce

Steps to reproduce the behavior:

Run two benchmarks with low single digit timings, ideally with sub-nanosecond differences. The OVERALL_GEOMEAN will tend to be 0.0, 1.0, or -1.0.

Expected behavior

OVERALL_GEOMEAN is a correct result.

Additional context

The issue is here:

def get_timedelta_field_as_seconds(benchmark, field_name):
"""
Get value of field_name field of benchmark, which is time with time unit
time_unit, as time in seconds.
"""
time_unit = benchmark['time_unit'] if 'time_unit' in benchmark else 's'
dt = Timedelta(benchmark[field_name], time_unit)
return dt / Timedelta(1, 's')

The pandas.Timedelta type stores integral (int64) values at nanosecond resolution. See https://github.com/pandas-dev/pandas/blob/28331e0cc7aa0729968565bbd1278b3b6239c823/pandas/_libs/tslibs/timedeltas.pyi#L84

I think the OVERALL_GEOMEAN has always had this bug. See #1289.

While pandas.Timedelta supports an impressive list of units I don't think compare.py needs that generality. For what Pandas supports see: https://github.com/pandas-dev/pandas/blob/28331e0cc7aa0729968565bbd1278b3b6239c823/pandas/_libs/tslibs/timedeltas.pyi#L19-L61

I think compare.py only needs to support the units in benchmark JSON. Implementing the C++ code here again in Python to achieve the same effect would be easy:

inline const char* GetTimeUnitString(TimeUnit unit) {
switch (unit) {
case kSecond:
return "s";
case kMillisecond:
return "ms";
case kMicrosecond:
return "us";
case kNanosecond:
return "ns";
}
BENCHMARK_UNREACHABLE();
}
inline double GetTimeUnitMultiplier(TimeUnit unit) {
switch (unit) {
case kSecond:
return 1;
case kMillisecond:
return 1e3;
case kMicrosecond:
return 1e6;
case kNanosecond:
return 1e9;
}
BENCHMARK_UNREACHABLE();
}

From there, rewriting get_timedelta_field_as_seconds to preserve resolution is also easy.

Happy to work on this, as I've been looking at these "strange" geomean values in compare.py output the past few days and wondering about it. :)

Screenshots

Here is an example showing the issue:

+ taskset -c 0 ./build_gcc/_deps/googlebenchmark-src/tools/compare.py benchmarks ./build_clang/Release/lower_bound ./build_gcc/Release/lower_bound --benchmark_repetitions=20 '--benchmark_filter=Random_Random/10$'
RUNNING: ./build_clang/Release/lower_bound --benchmark_repetitions=20 --benchmark_filter=Random_Random/10$ --benchmark_out=/tmp/tmp6lsy84m8
CPU ID assignments in /proc/cpuinfo seem messed up. This is usually caused by a bad BIOS.
2022-09-08T04:25:50-07:00
Running ./build_clang/Release/lower_bound
Run on (23 X 3545.34 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x23)
  L1 Instruction 32 KiB (x23)
  L2 Unified 512 KiB (x23)
  L3 Unified 32768 KiB (x2)
Load Average: 0.12, 0.18, 0.23
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_LowerBound/Random_Random/10              1.79 ns         1.78 ns    391748437
BM_LowerBound/Random_Random/10              1.80 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.80 ns         1.80 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.80 ns         1.80 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10              1.79 ns         1.79 ns    391748437
BM_LowerBound/Random_Random/10_mean         1.79 ns         1.79 ns           20
BM_LowerBound/Random_Random/10_median       1.79 ns         1.79 ns           20
BM_LowerBound/Random_Random/10_stddev      0.005 ns        0.005 ns           20
BM_LowerBound/Random_Random/10_cv           0.26 %          0.25 %            20
RUNNING: ./build_gcc/Release/lower_bound --benchmark_repetitions=20 --benchmark_filter=Random_Random/10$ --benchmark_out=/tmp/tmp2bfphzw0
CPU ID assignments in /proc/cpuinfo seem messed up. This is usually caused by a bad BIOS.
2022-09-08T04:26:04-07:00
Running ./build_gcc/Release/lower_bound
Run on (23 X 3400.42 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x23)
  L1 Instruction 32 KiB (x23)
  L2 Unified 512 KiB (x23)
  L3 Unified 32768 KiB (x2)
Load Average: 0.32, 0.22, 0.24
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_LowerBound/Random_Random/10              2.25 ns         2.25 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.20 ns         2.20 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10              2.22 ns         2.22 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10              2.24 ns         2.24 ns    312822737
BM_LowerBound/Random_Random/10              2.23 ns         2.23 ns    312822737
BM_LowerBound/Random_Random/10              2.21 ns         2.21 ns    312822737
BM_LowerBound/Random_Random/10_mean         2.22 ns         2.22 ns           20
BM_LowerBound/Random_Random/10_median       2.22 ns         2.22 ns           20
BM_LowerBound/Random_Random/10_stddev      0.012 ns        0.012 ns           20
BM_LowerBound/Random_Random/10_cv           0.55 %          0.55 %            20
Comparing ./build_clang/Release/lower_bound to ./build_gcc/Release/lower_bound
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
BM_LowerBound/Random_Random/10                       +0.2615         +0.2616             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2361         +0.2378             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2448         +0.2448             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2412         +0.2413             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2425         +0.2425             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2411         +0.2411             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2342         +0.2342             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2416         +0.2416             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2443         +0.2443             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2300         +0.2300             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2405         +0.2405             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2231         +0.2231             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2475         +0.2475             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2338         +0.2339             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2452         +0.2452             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2408         +0.2408             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2330         +0.2330             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2518         +0.2518             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2493         +0.2493             2             2             2             2
BM_LowerBound/Random_Random/10                       +0.2379         +0.2379             2             2             2             2
BM_LowerBound/Random_Random/10_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
BM_LowerBound/Random_Random/10_mean                  +0.2410         +0.2411             2             2             2             2
BM_LowerBound/Random_Random/10_median                +0.2421         +0.2421             2             2             2             2
BM_LowerBound/Random_Random/10_stddev                +1.6502         +1.7121             0             0             0             0
BM_LowerBound/Random_Random/10_cv                    +1.1355         +1.1852             0             0             0             0
OVERALL_GEOMEAN                                      +1.0000         +1.0000             0             0             0             0
@LebedevRI
Copy link
Collaborator

Oops, that is quite unfortunate :)
I guess it would be fine to hand-roll our own timeunit-to-seconds function then.
Please feel free to look into it if you want.

dmah42 pushed a commit that referenced this issue Sep 14, 2022
The pandas.Timedelta class truncates to integral nanoseconds, which throws
away sub-nanosecond precision present in benchmark JSON.  Switch to
floating point multiplication, which preserves it.

Fixes #1482
Tentatively fixes #1477.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants