PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049
Conversation
…omplex/redundant pointer operation
How do you handle the QNN graph build-execute-free cycle during inference? As we are also integrating QNN into our framework, graph building is time-consuming and memory usage increases hugely when finalizing the QNN graph. It seems that QNN has no easy way to free a graph during execution.
Hi @oreomaker, nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.
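For readers following this thread, here is a minimal, hypothetical sketch of the graph-caching idea that mitigates the finalize cost being discussed: build and finalize a QNN graph once per unique op structure, cache it, and only re-bind tensor data on later calls. The `qnn_graph` wrapper and its methods are placeholders, not code from this PR or from the fork mentioned above.

```cpp
// Hypothetical sketch (not code from this PR or the fork above): cache
// finalized QNN graphs so the expensive build + finalize step runs only once
// per unique op structure, and later calls only re-bind tensor data.
#include <map>
#include <string>

#include "ggml.h"

struct qnn_graph {
    // In a real implementation this would wrap a finalized QNN graph handle
    // and its registered input/output tensors; omitted in this sketch.
    void bind_and_execute(const ggml_tensor * /*op*/) {
        // placeholder: update the tensor buffers, then run the finalized graph
    }
};

// Cache key = op type + operand shapes; graphs with identical structure can be
// reused across tokens because only the data pointers change.
static std::string make_graph_key(const ggml_tensor * op) {
    std::string key = ggml_op_name(op->op);
    for (int i = 0; i < GGML_MAX_SRC && op->src[i] != nullptr; ++i) {
        for (int d = 0; d < GGML_MAX_DIMS; ++d) {
            key += "_" + std::to_string(op->src[i]->ne[d]);
        }
    }
    return key;
}

static std::map<std::string, qnn_graph> g_graph_cache;

static void qnn_compute_op(const ggml_tensor * op) {
    const std::string key = make_graph_key(op);
    auto it = g_graph_cache.find(key);
    if (it == g_graph_cache.end()) {
        // slow path: create the QNN graph, add nodes, finalize (the costly step)
        it = g_graph_cache.emplace(key, qnn_graph{}).first;
    }
    // fast path: reuse the finalized graph, avoiding re-finalization entirely
    it->second.bind_and_execute(op);
}
```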
I don't want to continue debating this here. The reason for restricting your access to this repo was your inappropriate comments on unrelated PRs (here, here2 and here3), and the repo's owner gave a clear reason for it. I'd suggest focusing on improving your codebase in an objective manner without making assumptions about or judging others' work. If my comments made you uncomfortable, I apologize. I'm happy to step back from this discussion. I can also create my own PR where anyone interested can discuss the design approach more constructively.
This is a good question, and your concern is valid:
[updated on 02/26/2025] My previous answer might be wrong, because the first technical approach can work very well (quantized mulmat was not yet implemented when I wrote the simple tech doc); there are 7x-10x performance improvements in my local dev envs with the QNN backend. BTW, you can refer to my personal understanding of ggml-qnn and other ggml backends in that simple tech doc. Pros: this approach might be equivalent to the principle shown in the quoted code above, and I guess that's the secret of how to utilize the Hexagon NPU maximally in a QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl.
Hi @zhouwg. I want to clarify that the comments made by @chraac in your previous PR had no influence whatsoever in the decision to block you from participating in this repository. Technical feedback and code reviews are always welcome and even encouraged. However, you were blocked due to a consistent pattern of comments that incited personal conflict, often in response to legitimate technical feedback. The comments linked by @chraac (now removed) are an example of this behavior. We value your technical contributions and encourage you to continue participating in discussions, but please focus on the technical aspects and the code itself. If you receive feedback on your code, try not to take it personally, state your point of view and move on. If you feel personally attacked or treated unfairly, please reach out to the maintainers privately, and we will do our best to address the situation. Engaging in personal conflict in public comments is not productive for anyone. Additionally, please keep in mind that while this is an open source project that welcomes contributions, it does not mean that contributions are guaranteed to be accepted. You may disagree with the maintainers' decisions, but refusing to make the changes requested by the maintainers will likely result in your contributions being rejected. If you disagree with these decisions, you always have the option to fork the project and maintain your own version.
@slaren, thanks for your valuable and very helpful suggestions and guidance. I admit that I made the same stupid mistake because of an intentional challenge from the same CN programmer in my third PR, and I will try to adjust my mindset and behavior accordingly. At the same time, I think everyone in this community can now see what happened in my first, second, and third PRs, especially in this third PR:
Yes, your opinion is definitely correct, I see. I came to this great tech community to learn real hard-core AI tech and to try to make some contributions. I understand that my PR might not be accepted, and that is a normal and acceptable outcome, but I really don't want to be involved in meaningless conflict with others, especially some CN programmers from mainland China.
…for benchmark more conveniently
Hi everyone,
Thanks for your comment. Your question is a really good one.
@null-define, thanks for your good question. It helps me brainstorm and dig deeper into related technical issues and assemble them into a clearer understanding, although my understanding might be incorrect.
Closing this PR on 03/01/2025 because:
@zhouwg
Thanks for your comment.
@zhouwg
It looks like the developers of executorch are working on a version with improved performance.
Thanks for your comment; this is really helpful information for the ggml-qnn backend! Yes, there are various technical approaches/software stacks for NPU inference on Qualcomm's (mobile/desktop) SoCs. They are all provided by Qualcomm, and the core parts are closed-source (to my understanding), which is a normal situation in a big IT company; please refer to #12049 (comment). That's one of the reasons why I continue working on this concise implementation of ggml-qnn: the code is simple, so everyone can understand the domain-specific technical details and the code easily and quickly, which might be useful for the community.
Self-reported review complexity:
* [ ] Low
* [x] Medium
* [ ] High
PR Description
This PR is a continuation of my original PR #6869 from 04/2024.
Thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature and of test-backend-ops),
this implementation puts the main logic in one single source file (ggml-qnn.cpp), because that helps other experienced programmers get involved in dev activity, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, what Intel did at the very beginning of ggml-sycl.cpp, or what Qualcomm did in ggml-opencl.cpp.
The other reason for this coding style is that I think it will make developers' workflow easier:
Features
Special clarification in this section:
All the original technology comes from Qualcomm: Qualcomm provides the fundamental mechanism, and we programmers use it, regardless of C/C++ style or technical approach.
The efforts in this PR might be useful for users of Qualcomm's QNN SDK or for other similar PRs. I personally think that getting more people to use Qualcomm chips (or selling more Qualcomm chips) may be the key point, rather than a complicated C++ encapsulation of the well-designed QNN SDK.
The core ideas, technical difficulties, and performance issues should be exactly the same as in this implementation, even with a complicated and elegant C++ encapsulation.
Performance of ggml-qnn backend
All fp32 and quantized-type mulmat operations are already offloaded to the QNN backend:




How to build ggml-qnn source code for Android and verify the ggml-qnn backend on a Snapdragon based phone
Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions might also be OK). The dev activity in this PR can be done purely on the command line without any IDE, so setting up the dev environment on Linux is simple:
Download and install the Qualcomm QNN SDK for Linux from https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk
Use my script build-run-android.sh to download the Android NDK automatically (please see the section below).
You will need an adb-connected Android smartphone running one of the Qualcomm SoCs below:
SM8450 (Snapdragon 8 Gen 1)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750 (Snapdragon 8 Elite)
SM8750-AB (Snapdragon 8 Elite)
We can verify that this backend works as expected from the log output of "adb logcat | grep ggml-qnn". For programmers, "adb logcat | grep ggml-qnn" is also helpful for troubleshooting.
How to build ggml-qnn source code for a Snapdragon based WoA (Windows on ARM) device (verified)
Similar to the dev env on Linux, I build the ggml-qnn source code for Snapdragon-based WoA purely on the command line on Windows 10 without any IDE. Details can be found in another PR of mine: #12215
A WoA (Windows on ARM) device equipped with a Snapdragon desktop SoC is required to verify the build result or for further WoA dev activity.
Thoughts & summary about performance
(1) Load/performance loss from data transfer between the AP (Arm CPU) and the NPU (DSP), i.e. the performance loss caused by transferring data between the main CPU and the NPU. This part requires redesigning the data structures in the ggml-qnn implementation, placing all tensor data entirely in the DSP's device memory to minimize data copying or, ideally, achieve zero-copy.
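As a rough illustration of this zero-copy direction, the sketch below assumes tensor buffers can be allocated from rpcmem shared memory via libcdsprpc.so (a Hexagon SDK facility) and then registered with the QNN context; the heap-id/flag constants and the qnn_register_shared_buffer() call are placeholders rather than verified SDK usage.

```cpp
// Hypothetical sketch of the zero-copy direction: allocate tensor buffers
// from rpcmem shared memory so the Hexagon NPU can access them directly,
// instead of copying between CPU and NPU memory around every op.
// rpcmem_alloc/rpcmem_free come from libcdsprpc.so (Hexagon SDK); the heap
// id, the flags and qnn_register_shared_buffer() are placeholders here.
#include <dlfcn.h>

#include <cstddef>

typedef void * (*rpcmem_alloc_t)(int heapid, unsigned int flags, int size);
typedef void   (*rpcmem_free_t)(void * po);

static rpcmem_alloc_t rpcmem_alloc_fn = nullptr;
static rpcmem_free_t  rpcmem_free_fn  = nullptr;

static bool load_rpcmem() {
    void * h = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
    if (h == nullptr) {
        return false;
    }
    rpcmem_alloc_fn = (rpcmem_alloc_t) dlsym(h, "rpcmem_alloc");
    rpcmem_free_fn  = (rpcmem_free_t)  dlsym(h, "rpcmem_free");
    return rpcmem_alloc_fn != nullptr && rpcmem_free_fn != nullptr;
}

static bool rpcmem_ready() {
    static const bool ok = load_rpcmem();
    return ok;
}

// Allocate a tensor buffer in shared memory; a real implementation would then
// register it with the QNN context so graph execution can read/write it
// without extra copies (the registration call is left as a placeholder).
static void * alloc_shared_tensor_buffer(size_t size) {
    if (!rpcmem_ready()) {
        return nullptr;  // fall back to normal host memory + memcpy path
    }
    const int          heap_id = 25;  // assumed system heap id, device dependent
    const unsigned int flags   = 1;   // assumed default flags
    void * buf = rpcmem_alloc_fn(heap_id, flags, (int) size);
    // qnn_register_shared_buffer(ctx, buf, size);  // hypothetical QNN step
    return buf;
}
```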
(2) Tricks specific to Qualcomm's QNN SDK. I find the RPC design a bit puzzling; its usage differs quite a bit from Intel's SYCL or Huawei's CANN. The AI operator acceleration provided by Qualcomm's QNN SDK, if not handled with particularly clever optimizations, ends up performing worse than ggml's highly optimized default CPU backend. Honestly, it's baffling: doesn't Qualcomm have a better implementation approach for hardware-acceleration SDKs on this kind of heterogeneous multi-core architecture? Years ago, I used a Linux SDK from another American chip company that was close to the driver layer (i.e., without much encapsulation) for video-decoding acceleration, and it was incredibly user-friendly; once I grasped the tricks of its software design and usage, integrating it into FFmpeg was a breeze. Their hardware-decoding acceleration chip also used a DSP architecture. Of course, I'm not entirely clear at the moment on the differences between hardware acceleration for AI operators and for video decoding.
(3) Some ops are generally critical to inference performance in GGML, especially for Transformer-based models, which are a common use case. These ops often dominate computation time or memory usage during inference. Key Ops Significant to Inference Performance:
General Observations
For LLMs (e.g., LLaMA): MUL_MAT typically dominates inference time due to its role in attention and feed-forward layers. Optimizing this (via quantization or NPU offloading) yields the biggest gains.
Specialized Ops: ROPE, FLASH_ATTN_EXT, and RMS_NORM are significant for specific model architectures (e.g., LLaMA-style LLMs) and can be optimized to unlock major performance gains.
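One practical way to back up these observations is to walk the ggml compute graph and tally a rough FLOP estimate per op type before deciding what to offload first. The sketch below uses the standard 2 x (output elements) x (shared dimension) estimate for MUL_MAT and a crude element count for everything else; it is illustrative, not code from this PR.

```cpp
// Rough sketch: walk a ggml compute graph and estimate where the FLOPs go,
// to confirm which ops are worth offloading to the NPU first. The MUL_MAT
// estimate (2 * output elements * shared dim) is standard; the element-count
// estimate for the remaining ops is deliberately crude.
#include <cstdio>
#include <map>
#include <string>

#include "ggml.h"

static void print_op_breakdown(struct ggml_cgraph * cgraph) {
    std::map<std::string, double> flops_per_op;
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        const ggml_tensor * node = ggml_graph_node(cgraph, i);
        double flops = 0.0;
        if (node->op == GGML_OP_MUL_MAT) {
            // each output element needs src0->ne[0] multiply-accumulates
            flops = 2.0 * (double) ggml_nelements(node) * (double) node->src[0]->ne[0];
        } else {
            flops = (double) ggml_nelements(node);
        }
        flops_per_op[ggml_op_name(node->op)] += flops;
    }
    for (const auto & kv : flops_per_op) {
        printf("%-16s %8.3f GFLOP\n", kv.first.c_str(), kv.second / 1e9);
    }
}
```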
Big picture of ggml-qnn backend
Please refer to a simple tech doc below: mapping the ggml compute graph to a QNN compute graph.
The first technical approach can be seen in this PR. Accordingly, the second technical approach can be built on top of this PR with a similar coding style, without complicated C++ encapsulation. In fact, there is already a PoC in my local dev envs, but it is not a functional implementation yet, for the reason I explained in the simple tech doc.
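To make the two approaches concrete, here is a heavily simplified, hypothetical sketch of the difference between them; `qnn_graph` and its methods are placeholder wrappers, not the actual code of this PR or of the PoC.

```cpp
// Heavily simplified, hypothetical contrast of the two mapping strategies
// discussed in the tech doc; qnn_graph and its methods are placeholder
// wrappers, not the actual code of this PR or of the PoC.
#include "ggml.h"

struct qnn_graph {
    void add_ggml_node(const ggml_tensor * /*node*/) { /* map one ggml op */ }
    void finalize() { /* the QNN "compilation" step (expensive)           */ }
    void execute()  { /* run via the chosen QNN backend library           */ }
};

// First approach (roughly, this PR): one small QNN graph per ggml op. Simple
// to follow, but QNN cannot optimize across op boundaries and the finalize
// cost must be amortized per op structure (e.g. by caching).
static void compute_per_op(struct ggml_cgraph * cgraph) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        qnn_graph g;
        g.add_ggml_node(ggml_graph_node(cgraph, i));
        g.finalize();
        g.execute();
    }
}

// Second approach (the direction described in the tech doc): map the whole
// ggml compute graph to a single QNN graph, finalize once, then execute it
// repeatedly. This lets QNN optimize the entire graph at "compile" time.
static void compute_whole_graph(struct ggml_cgraph * cgraph) {
    qnn_graph g;
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        g.add_ggml_node(ggml_graph_node(cgraph, i));
    }
    g.finalize();  // expensive, but done once per graph structure
    g.execute();
}
```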

Acknowledgement
https://opensource.microsoft.com/codeofconduct/
Very Important
PaddlePaddle/Paddle-Lite#10539
This clue might help me (and us) understand some interesting things about the ggml-qnn backend.
Conclusion
This is a concise ggml-qnn implementation:
At the moment:
After spending a lot of effort on the ggml-qnn backend, I personally think: