performance regression in clang-19 when using computed goto #106846
-mllvm -tail-dup-pred-size=30 helps the testcase, but not Ajla. If you compile Ajla with clang-19 -O2 -mllvm -tail-dup-pred-size=30 and do |
What about |
Yes - there are 2284 instructions in the Ajla interpreter. So I set tail-dup-pred-size=3000, and I get the same output as with clang-18: 2466 indirect jumps in ipret.o.
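For reference, an indirect-jump count like the one above can be obtained with an objdump/grep pipeline along these lines (a sketch; the `count_indirect_jumps` helper and the sample disassembly lines are hypothetical, and the pattern assumes x86 AT&T syntax):

```shell
# Count "jmp *..." indirect jumps in a disassembly.  With a real build,
# the measurement would look something like:
#   objdump -d ipret.o | count_indirect_jumps
count_indirect_jumps() { grep -cE 'jmp[[:space:]]*\*'; }

# Demo on a few lines of made-up disassembly (two indirect, one direct):
printf '%s\n' \
  '  401000: ff e0  jmp *%rax' \
  '  401002: eb fc  jmp 401000' \
  '  401004: ff e1  jmp *%rcx' |
  count_indirect_jumps    # prints 2
```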
Note that GCC (RTL) has a specific pass which duplicates the computed gotos (as GCC merges all computed gotos into one BB for performance reasons); it was added in GCC 4.0 and then improved in GCC 7.1.0 (to also un-duplicate some related instructions). This pass happens late in the pipeline, after register allocation. This is just explaining GCC's solution to the problem and nothing more.
The |
@mikulas-patocka Can you provide an example of Ajla? I may need to investigate this with its help. BTW, whether using INDIRECTBR directly or coming from SWITCH, we introduce a lot of PHIs.
The Ajla compiler is written in Ajla; you can have a look at the files in newlib/compiler/. I usually use self-compilation as a benchmark. Run './configure && make' with CFLAGS='-O2 -DDEBUG_ENV'. The DEBUG_ENV macro enables environment variables that are useful for debugging. Set the environment variable 'export CG=none' - this disables the machine code generator and uses only the interpreter. Run the script ./scripts/update.sh - this compiles the Ajla compiler itself and re-generates the file builtin.pcd. If you want a simple benchmark, copy this piece of code to a file loop.ajla
and run it with 'CG=none ./ajla --nosave loop.ajla 1000000000'
By comparing the results from gcc, I can observe the following:
Next, I'll look into some gcc code to see if I can find anything useful. Perhaps gcc is simply differentiating between
A quick note:

/* Return true if INSN is an indirect jump (aka computed jump).
   Tablejumps and casesi insns are not considered indirect jumps;
   we can recognize them by a (use (label_ref)). */

I'll go with this approach, but from the machine instructions: https://llvm.godbolt.org/z/jcje1xPaW, I haven't found any fundamental difference between computed goto and table jumps. I plan to investigate further and then file this PR.
It looks like this issue is also impacting CPython: python/cpython#129987 (comment) I measured a ~4% performance improvement on x86_64 by using an |
I haven't tested, but I do notice that #116072 cites CPython as its motivation, so I suspect we're looking at the same issue. Is there somewhere I can download a nightly build, or can I otherwise test without doing a source build myself?
I'll test it. LLVM 20 might not be available in your distribution.
I've done some more benchmarking here. I'd previously estimated the cost as ~4% for CPython; my current data puts the number closer to 10% (!). I have also benchmarked #114990 and confirmed that it fixes the performance regression, as well as the tail-duplication logic (as tested by @DianQK above).
As much as I would like the tail-call interpreter to beat the computed-goto one, fixing the computed-goto interpreter is more important, as it affects more systems :). Thanks Nelson for your investigation of this, and DianQK for the patch!
FWIW this also affects Luau interpreter (https://github.com/luau-lang/luau); on a Zen 4 system, running the Luau benchmark suite (which reports geomean deltas) with
luau-19 is built with clang-19; luau-19-fix is built with additional settings. The degradation with clang-20 is less significant but still severe; perhaps the change to handling blocks without PHIs helps here, but not enough:
I apologize for the disruption this regression has caused. Thanks to services like https://llvm-compile-time-tracker.com and https://github.com/dtcxzyw/llvm-opt-benchmark, LLVM has been able to catch many issues early. I'm exploring ways to catch issues similar to this one earlier.
/cherry-pick dd21aac |
/pull-request #130585 |
Hi
I noticed that the Ajla programming language ( https://www.ajla-lang.cz/ ) runs slower when compiled with clang-19 than with clang-18 or gcc-14. I looked at the assembler output, and it turns out that the bytecode interpreter (the file "ipret.c") is not compiled as efficiently as it could be. In particular, clang-19 joins all the "goto *next_label" statements in a function into just one "jmp *" instruction. That reduces code size, but it also makes branch prediction inefficient, because the CPU cannot learn that a single instruction jumps to multiple targets.
I created this example that shows the issue:
http://www.jikos.cz/~mikulas/testcases/clang/computed-goto.c
(use: clang-19 -O2 computed-goto.c && time ./a.out)
The results (in seconds) are here:
http://www.jikos.cz/~mikulas/testcases/clang/computed-goto.txt
We can see that the worst slowdown happens on Sandy Bridge. Zen 4 shows no slowdown, so it seems that it has a smart indirect branch predictor that can track multiple jump targets for a single instruction.