
PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049

Closed
wants to merge 37 commits

Conversation

zhouwg (Contributor) commented Feb 24, 2025

PR Description

this PR is a continuation of my original PR #6869 from 04/2024.

Thanks to the major changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature and of test-backend-ops):

  • the data path of the ggml-qnn backend works as expected;
  • the official command-line tools test-backend-ops and llama-cli have been verified on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone;
  • ASR inference via whisper.cpp and LLM inference via llama.cpp work well with a standard (self-made) Android app on the same Snapdragon 8 Gen 3 phone.

this implementation puts the main logic in a single source file (ggml-qnn.cpp) because that makes it easier for other experienced programmers to get involved in development, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, what Intel did at the very beginning of ggml-sycl.cpp, and what Qualcomm did in ggml-opencl.cpp.

Another reason for this coding style is that I think it simplifies the developers' workflow:

  • it is a self-contained source file
  • first work through all the relevant technical issues/difficulties with one specified op, GGML_OP_ADD or GGML_OP_MUL_MAT
  • then expand to other ggml ops in ggml-qnn-ops.cpp accordingly, with team-work from AI experts in the upstream llama.cpp community

Features

  • the data path between the QNN SDK and ggml/llama.cpp works well; it was reverse-engineered from executorch (the QNN backend implementation in executorch comes from Qualcomm) in my first PR on 04/2024
  • a simple and effective graph-cache mechanism, already implemented in project KanTV on 04/2024 (see the sketch after this list)
  • plain STL containers are used to manage QNN resources in this PR rather than complex C++ encapsulation, because the well-designed QNN SDK already manages its internal hardware and software resources very carefully
  • a simple skeleton in function ggml_qnn_general_node: offloads GGML_OP_ADD and GGML_OP_MUL to the QNN backend; the function is a very concise implementation rather than a complex C++ encapsulation that hides many technical details
  • a more complex skeleton in function ggml_qnn_mulmat: offloads GGML_OP_MUL_MAT to the QNN backend; this skeleton can also be used to illustrate the second technical approach to "how to utilize the Hexagon NPU maximally", and it is likewise concise rather than hidden behind complex C++ encapsulation
  • the QNN NPU RPC feature, already implemented in project KanTV on 04/2024 (UT passed, but some unknown bugs remain to be fixed; they should also be present in all hard-forked ggml-qnn projects, so this is not an intentional bug)
  • a big picture and different technical approaches for ggml-qnn are provided in my forked llama.cpp and in this PR; the second technical approach of "mapping the entire ggml computational graph to a QNN graph" was already identified in project KanTV on 04/2024
  • the necessary technical difficulties are overcome in this PR
  • quantized data types with 2D mul_mat and a very significant performance improvement for LLM inference with the ggml-qnn backend (added on 02/26/2025, 12:40; passes UT and CT through test-backend-ops and llama-cli)
  • the code is simple, so everyone can understand it easily and quickly, without complex encapsulation; layered abstraction and loose coupling can make code tracking and troubleshooting harder
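the graph-cache idea in the list above can be sketched roughly as follows. This is a hedged illustration only, not the actual ggml-qnn.cpp code: QNNGraphHandle, make_graph_key and build_and_finalize_qnn_graph are hypothetical stand-ins for the real wrapper. The point is simply to finalize a QNN graph once per unique op + tensor-shape signature and reuse it afterwards.

  // Hedged sketch of the per-op graph-cache idea; not the real ggml-qnn.cpp code.
  // QNNGraphHandle and build_and_finalize_qnn_graph() are hypothetical stand-ins.
  #include <cstdint>
  #include <cstdio>
  #include <map>
  #include <string>

  struct QNNGraphHandle { std::string key; };   // stand-in for a finalized QNN graph

  static QNNGraphHandle * build_and_finalize_qnn_graph(const std::string & key) {
      // real backend: create the graph, add tensors/nodes, then run the QNN finalize step
      return new QNNGraphHandle{key};
  }

  // cache key = op name + operand shapes, e.g. "MUL_MAT_4096x4096_4096x1":
  // the same op with the same shapes can reuse the already-finalized graph
  static std::string make_graph_key(const char * op, int64_t m, int64_t k, int64_t n) {
      char buf[128];
      snprintf(buf, sizeof(buf), "%s_%lldx%lld_%lldx%lld",
               op, (long long) m, (long long) k, (long long) k, (long long) n);
      return buf;
  }

  static std::map<std::string, QNNGraphHandle *> g_graph_cache;

  static QNNGraphHandle * get_or_create_graph(const std::string & key) {
      auto it = g_graph_cache.find(key);
      if (it != g_graph_cache.end()) {
          return it->second;                        // cache hit: skip compose + finalize
      }
      QNNGraphHandle * graph = build_and_finalize_qnn_graph(key);
      g_graph_cache[key] = graph;                   // cache miss: finalize once, then keep it
      return graph;
  }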

special clarification for this section:

  • all of the original technology comes from Qualcomm; Qualcomm provides the fundamental mechanism and we programmers use it, regardless of C/C++ style or technical approach.

  • the effort in this PR might be useful for users of Qualcomm's QNN SDK or for other similar PRs. I personally think that having more people use Qualcomm chips (or selling more Qualcomm chips) matters more than a complex/complicated C++ encapsulation of the already well-designed QNN SDK.

  • the core ideas, the technical difficulties, and the performance issues would be exactly the same as in this implementation even with a complicated and elegant C++ encapsulation.

Performance of ggml-qnn backend

all fp32 and quantized-type mul_mat operations are already offloaded to the QNN backend.

How to build ggml-qnn source code for Android and verify the ggml-qnn backend on a Snapdragon-based phone

Ubuntu 20.04 or 22.04 is validated and recommended as the host machine (other Linux distributions might also work). The dev activity in this PR can be done entirely from the command line without any IDE, so setting up the dev environment on Linux is simple:

  • download and install the Qualcomm QNN SDK for Linux from https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk

  • use my script build-run-android.sh to download the Android NDK automatically (see the section below)

  • you will need an adb-connected Android smartphone running one of the Qualcomm SoCs below:

    SM8450 (Snapdragon 8 Gen 1+)
    SM8550 (Snapdragon 8 Gen 2)
    SM8650 (Snapdragon 8 Gen 3)
    SM8750 (Snapdragon 8 Gen 4)
    SM8750-AB (Snapdragon 8 Elite)

  git clone https://github.com/kantv-ai/ggml-qnn
  cd ggml-qnn
  git checkout build_fix
  ./scripts/build-run-android.sh build          (it'll setup local build envs automatically and build the entire project)
  ./scripts/build-run-android.sh updateqnnlib   (upload Qualcomm's QNN binary runtime libs to Android phone)
  ./scripts/build-run-android.sh run_llamacli   (running llama-cli on Android phone)
  ./scripts/build-run-android.sh run_testop     (running test-backend-ops on Android phone)

From the output of "adb logcat | grep ggml-qnn" we can see that this backend works as expected; programmers can also use "adb logcat | grep ggml-qnn" to help with troubleshooting.

How to build ggml-qnn source code for a Snapdragon-based WoA (Windows on ARM) device (verified)

Similar to the dev environment on Linux, I build the ggml-qnn source code for Snapdragon-based WoA entirely from the command line on Windows 10 without any IDE; details can be found in my other PR: #12215.
A WoA device equipped with a Snapdragon desktop SoC is required to verify the build result or for further WoA development.

Thoughts & summary about performance

(1) load/performance loss from data transfer between the AP (Arm CPU) and the NPU (DSP): this is the performance loss caused by transferring data between the main CPU and the NPU. Addressing it requires redesigning the data structures in the ggml-qnn implementation and placing all tensor data entirely in the DSP's device memory, to minimize data copying or ideally achieve zero-copy (a hedged sketch of the shared-memory idea follows below).
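a minimal sketch of that shared-memory idea, assuming the usual libcdsprpc.so entry points shipped on Snapdragon devices (rpcmem_alloc / rpcmem_to_fd / rpcmem_free). The signatures and the heap-id/flag constants below follow the public Hexagon SDK headers but should be treated as assumptions, and the registration of the returned fd with QNN's memory API is left out.

  // Hedged sketch: allocate a DMA-BUF/ION backed buffer that both the CPU and the
  // Hexagon DSP can see, so tensor data does not have to be copied across the boundary.
  // Not the real ggml-qnn code; signatures/constants are assumptions from the Hexagon SDK.
  #include <dlfcn.h>
  #include <cstdint>
  #include <cstdio>

  typedef void * (*rpcmem_alloc_t)(int heapid, uint32_t flags, int size);
  typedef void   (*rpcmem_free_t) (void * po);
  typedef int    (*rpcmem_to_fd_t)(void * po);

  int main() {
      void * lib = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
      if (!lib) { fprintf(stderr, "libcdsprpc.so not found\n"); return 1; }

      auto p_alloc = (rpcmem_alloc_t) dlsym(lib, "rpcmem_alloc");
      auto p_free  = (rpcmem_free_t)  dlsym(lib, "rpcmem_free");
      auto p_to_fd = (rpcmem_to_fd_t) dlsym(lib, "rpcmem_to_fd");
      if (!p_alloc || !p_free || !p_to_fd) { fprintf(stderr, "rpcmem symbols missing\n"); return 1; }

      const int      heap_id = 25;  // RPCMEM_HEAP_ID_SYSTEM (assumed value)
      const uint32_t flags   = 1;   // RPCMEM_DEFAULT_FLAGS  (assumed value)
      void * buf = p_alloc(heap_id, flags, 4096);  // tensor data would be written here
      if (buf) {
          int fd = p_to_fd(buf);    // this fd is what would be registered with QNN's memory API
          printf("shared buffer fd = %d\n", fd);
          p_free(buf);
      }
      dlclose(lib);
      return 0;
  }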

(2) tricks specific to Qualcomm's QNN SDK: I find the RPC design a bit puzzling; its usage differs quite a bit from Intel's SYCL or Huawei's CANN. The AI operator acceleration provided by Qualcomm's QNN SDK, if not handled with particularly clever optimizations, ends up performing worse than the highly optimized default CPU backend in ggml. Honestly, it's baffling: doesn't Qualcomm have a better approach for hardware-acceleration SDKs on this kind of heterogeneous multi-core architecture? Years ago, I used a Linux SDK from another American chip company that was close to the driver layer (i.e., without much encapsulation) for video-decoding acceleration, and it was incredibly user-friendly; once I grasped the tricks of its software design and usage, integrating it into FFmpeg was a breeze. Their hardware decoding chip also used a DSP architecture. Of course, I'm not entirely clear at the moment on the differences between hardware acceleration for AI operators and for video decoding.

(3) some ops are generally critical to inference performance in GGML, especially for Transformer-based models, which are a common use case. These ops often dominate computation time or memory usage during inference. Key ops significant to inference performance:

MUL_MAT (Matrix Multiplication)
    Why it’s significant: Matrix multiplication is the backbone of neural network inference, especially in Transformers, where it’s used in attention mechanisms (e.g., query-key-value computations) and feed-forward layers. In GGML, MUL_MAT is heavily optimized for various hardware (e.g., CPU with SIMD, GPU with CUDA, or Metal on Apple Silicon).
    Performance impact: This op often dominates the compute time in inference. Its efficiency depends on quantization (e.g., Q4, Q8), hardware acceleration, and memory alignment. For example, in llama.cpp, optimized MUL_MAT kernels can significantly speed up token generation.
    Context: Most critical for large language models (LLMs) during the forward pass.
SOFT_MAX (Softmax)
    Why it’s significant: Softmax is used in the attention mechanism of Transformers to normalize attention scores. It’s computationally expensive because it involves exponentiation and summation across potentially large vectors.
    Performance impact: While not as compute-heavy as MUL_MAT, SOFT_MAX can become a bottleneck for long sequences due to its sequential nature and memory access patterns. Optimizations in GGML (e.g., vectorization) help mitigate this.
    Context: Critical in attention-based models like LLMs or vision transformers.
ROPE (Rotary Position Embeddings)
    Why it’s significant: ROPE is a specialized op for positional encodings in some Transformer models (e.g., LLaMA). It applies rotary transformations to embeddings, which is a key part of the attention mechanism in these models.
    Performance impact: This op is executed for every token in every layer, so its efficiency directly affects inference latency. GGML optimizes it for low overhead, but it’s still a frequent operation in modern LLMs.
    Context: Highly relevant for models like LLaMA or its derivatives.
ADD / ADD_REL_POS (Element-wise Addition)
    Why it’s significant: Addition ops are used throughout neural networks—for example, in residual connections (common in Transformers) or when combining positional encodings with token embeddings.
    Performance impact: While individually lightweight, these ops are executed frequently across layers and tokens, so their cumulative impact is notable. Efficient memory access and vectorization are key to keeping them fast.
    Context: Ubiquitous in Transformer inference.
FLASH_ATTN_EXT (Flash Attention Extension)
    Why it’s significant: This is an optimized implementation of attention, inspired by techniques like FlashAttention, which reduces memory usage and improves compute efficiency by fusing operations and minimizing memory reads/writes.
    Performance impact: For models supporting this op, it can drastically improve inference speed and memory efficiency, especially for long sequences. It’s a game-changer on GPU hardware.
    Context: Relevant for cutting-edge LLMs with long context lengths.
RMS_NORM (Root Mean Square Normalization)
    Why it’s significant: RMSNorm is a lightweight alternative to LayerNorm, used in models like LLaMA. It normalizes activations across layers, which is essential for stable inference.
    Performance impact: It’s executed for every layer and token, so its efficiency matters. GGML optimizes it for speed, but it still contributes to the overall latency.
    Context: Common in modern LLMs.
CONV_TRANSPOSE_1D / CONV_TRANSPOSE_2D (Convolution Operations)
    Why it’s significant: These ops are critical for models with convolutional components, such as Whisper (speech processing) or vision transformers. They involve sliding window computations over input data.
    Performance impact: Convolutions are computationally intensive and memory-bound, especially for high-dimensional inputs (e.g., audio spectrograms or images). Optimizations like IM2COL (image-to-column transformation) help, but they remain costly.
    Context: Key for non-LLM models like Whisper.
POOL_2D (Pooling)
    Why it’s significant: Pooling reduces spatial dimensions in convolutional models, often used in audio or vision tasks (e.g., Whisper’s encoder).
    Performance impact: It’s less compute-intensive than convolutions but can bottleneck memory bandwidth if not optimized.
    Context: Relevant for feature extraction in non-text models.

General Observations
For LLMs (e.g., LLaMA): MUL_MAT typically dominates inference time due to its role in attention and feed-forward layers. Optimizing it (via quantization or NPU offloading) yields the biggest gains.
Specialized ops: ROPE, FLASH_ATTN_EXT, and RMS_NORM are significant for specific model architectures (e.g., LLaMA-style LLMs) and can be optimized to unlock major performance gains. Scalar reference versions of SOFT_MAX and RMS_NORM are sketched below.
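to make two of the ops above concrete, here are plain scalar reference versions of SOFT_MAX (max-subtracted for numerical stability) and RMS_NORM. These are generic textbook implementations for illustration only, not the vectorized, quantization-aware kernels used by ggml or by any QNN offload path.

  // Scalar reference implementations, only to show what the ops compute.
  #include <algorithm>
  #include <cmath>

  // softmax(x)_i = exp(x_i - max(x)) / sum_j exp(x_j - max(x))
  void soft_max_ref(const float * x, float * y, int n) {
      float max_val = x[0];
      for (int i = 1; i < n; ++i) max_val = std::max(max_val, x[i]);
      float sum = 0.0f;
      for (int i = 0; i < n; ++i) { y[i] = std::exp(x[i] - max_val); sum += y[i]; }
      for (int i = 0; i < n; ++i) y[i] /= sum;
  }

  // rms_norm(x)_i = w_i * x_i / sqrt(mean(x^2) + eps)   (no mean subtraction, unlike LayerNorm)
  void rms_norm_ref(const float * x, const float * w, float * y, int n, float eps = 1e-6f) {
      double ss = 0.0;
      for (int i = 0; i < n; ++i) ss += (double) x[i] * x[i];
      const float scale = 1.0f / std::sqrt((float) (ss / n) + eps);
      for (int i = 0; i < n; ++i) y[i] = w[i] * (x[i] * scale);
  }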

Big picture of ggml-qnn backend

pls refer to a simple tech doc below: mapping the ggml compute graph to a QNN compute graph.

The first technical approach can be seen in this PR. The second technical approach can be extended on top of this PR in the same coding style, without complicated C++ encapsulation (a hedged sketch of that idea follows below). In fact, there is already a PoC in my local dev environment, but it is not a functional implementation yet, for the reason explained in that tech doc.
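a hedged sketch of what the second technical approach could look like. The ggml side (struct ggml_cgraph, ggml_tensor, GGML_OP_NONE) is the real ggml API; ggml_graph_n_nodes / ggml_graph_node are the public graph accessors in recent ggml (on older versions, iterate cgraph->n_nodes / cgraph->nodes[i] directly, as other backends do in their graph_compute callbacks). The qnn_graph_* helpers are hypothetical stand-ins for the QNN calls (graph creation, node/tensor registration, finalize), stubbed out here so the sketch compiles on its own.

  // Hedged sketch: map one whole ggml compute graph to a single QNN graph,
  // so QNN can fuse/optimize across ops and finalize runs once per model shape.
  #include "ggml.h"

  struct qnn_graph { int n_nodes = 0; };   // would wrap a real QNN graph handle + its tensors

  static qnn_graph * qnn_graph_begin(const char * /*name*/) { return new qnn_graph(); }
  static bool qnn_graph_add_node(qnn_graph * g, const struct ggml_tensor * node) {
      // real code: translate node->op and node->src[] into a QNN op config + QNN tensors
      (void) node;
      g->n_nodes++;
      return true;
  }
  static bool qnn_graph_finalize(qnn_graph * /*g*/) { return true; }  // one compile for all ops

  static qnn_graph * build_qnn_graph_from_cgraph(struct ggml_cgraph * cgraph) {
      qnn_graph * g = qnn_graph_begin("ggml-cgraph");
      for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
          struct ggml_tensor * node = ggml_graph_node(cgraph, i);
          if (node->op == GGML_OP_NONE) continue;   // skip no-ops / leafs
          if (!qnn_graph_add_node(g, node)) {
              delete g;                             // unsupported op: fall back to the CPU backend
              return nullptr;
          }
      }
      return qnn_graph_finalize(g) ? g : nullptr;
  }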

Acknowledgement

  1. this implementation of ggml-qnn is mainly a port / reverse-engineering effort from executorch (the QNN backend implementation in executorch comes from Qualcomm); all of the original technology for this topic comes from Qualcomm.
  2. I got breakthrough help from chiwwang@Qualcomm Technologies Inc on 04/2024.
  3. I also got meaningful help from XiaoMi-StableDiffusionOnDevice on 05/2024.
  4. thanks are due because I borrowed 5-7 functions from an implementation by the team of CN programmer @chraac. A special notice for his team: please follow some basic rules: code review of another PR is a serious matter, so verify and validate a PR before dropping code-review or technical comments, and follow the coding style in this PR; otherwise, please do not disturb this PR again. Thanks so much!
     https://opensource.microsoft.com/codeofconduct/
  5. recently I tried AI-assisted programming for the ggml-qnn backend with help from the powerful Grok 3; it really helped me a lot in this PR.

Very Important

PaddlePaddle/Paddle-Lite#10539

this clue will/might help me and others understand some interesting things about the ggml-qnn backend.

Conclusion

this is a concise ggml-qnn implementation:

  • it follows the principle of "simple is beautiful", which comes from the great Unix tradition in the US: the code is simple and can be understood easily and quickly, without complex encapsulation. Btw, I personally think the principle of "simple is beautiful" is one of the key reasons why ggml/whisper.cpp/llama.cpp are so popular and so successful; there are already TensorFlow Lite from Google, Executorch from Meta, MNN from Alibaba, MACE from Xiaomi, and so on, and they are all good, but many programmers, IT giants, IT startups, and research institutions prefer ggml for on-device AI scenarios.
  • it follows the principle of "make it run, then make it right, then make it fast".

at the moment:

  • it is a real, functional PR (it passes test-backend-ops and can do real LLM inference with the QNN backend on a Snapdragon 8 Gen 3 phone)
  • other programmers and AI experts can get involved in further development accordingly

after spending a great deal of effort on the ggml-qnn backend, I personally think:

  • there is a lot of QNN API calling and assembly work in the remaining parts of ggml-qnn, regardless of C/C++ style or technical approach
  • AI experts must be involved in the remaining parts of ggml-qnn, regardless of C/C++ style or technical approach
  • a fully open-source implementation of the ggml-qnn backend will probably be team-work between experienced programmers and AI experts, possibly with professional technical help from Qualcomm
  • the technical approach in this PR should be a P0 team-work task (this follows the general approach and steps of the Intel SYCL / Qualcomm OpenCL backends); the other technical approach should be a P1 team-work task

zhouwg closed this Feb 24, 2025
github-actions bot added the script (Script related) and testing (Everything test related) labels on Feb 24, 2025
oreomaker commented Feb 25, 2025

How do you handle the QNN graph build-execute-free cycle during inference? As we are also integrating QNN into our framework, graph building is time-consuming and memory increases hugely when finalizing the QNN graph. It seems that QNN has no easy way to free a graph during execution.
Maybe a better execution pipeline is needed for it.

chraac commented Feb 25, 2025

How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and the memory increases hugely when finalizing the QNN graph. It seems that the QNN has no easy way to free a graph during execution. Maybe a better execution pipeline is needed for it.

Hi @oreomaker ,

Nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution.
To reduce compilation time, this PR utilizes a mechanism called "graph cache" to store each operation graph with its specific tensor configuration:

In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.


chraac commented Feb 25, 2025

> How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and the memory increases hugely when finalizing the QNN graph. It seems that the QNN has no easy way to free a graph during execution. Maybe a better execution pipeline is needed for it.

> Hi @oreomaker ,
> Nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. To reduce compilation time, this PR utilizes a mechanism called "graph cache" to store each operation graph with its specific tensor configuration.
> In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.

> I'm a little curious whether you are a regular employee of Qualcomm's Shenzhen branch. As I have said many times before, you can submit your own standalone PR, and I would personally like to see your success in this community, but please don't bring non-technical comments into my PR again and again:
>
> • what you did in my first PR got me blocked from this community
> • what you did in my second PR resulted in that PR being closed by the maintainers, because two Chinese programmers dropped too many pointless arguments in this community, and I think this is another joke from China.
> thanks so much!

I don't want to continue debating this here. The reason for restricting your access to this repo was your inappropriate comments on unrelated PRs (here, here2 and here3). And repo's owner gave a clear reason about it.

I'd suggest focusing on improving your codebase in an objective manner without making assumptions about or judging others' work.

If my comments made you uncomfortable, I apologize. I'm happy to step back from this discussion. I can also create my own PR where anyone interested can discuss the design approach more constructively.

zhouwg (Contributor, Author) commented Feb 25, 2025

How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and the memory increases hugely when finalizing the QNN graph. It seems that the QNN has no easy way to free a graph during execution. Maybe a better execution pipeline is needed for it.

this is a good question, and your concern is correct:

  • the existing execution pipeline in this PR follows ggml's default CPU backend and the other hardware-accelerated backends such as ggml-sycl, ggml-cann, and ggml-opencl
  • I personally think the well-designed QNN SDK manages its internal hardware and software resources very carefully, which is why it does not provide cleanup APIs such as releaseGraph/releaseTensor/...; this design philosophy differs from common SDK design principles, and it is one of the key reasons why I disagree with the alternative implementation based on complex C++ encapsulation: the QNN SDK already does that work. You will find that I just use plain STL containers to manage QNN resources in this PR.
  • the graph-cache mechanism was already used in my PoC of ggml-qnn and in my first PR on 04/2024: https://github.com/kantv-ai/kantv/blob/ggml-qnn-quantize/core/ggml/llamacpp/ggml-qnn.cpp#L2091
  • a better execution pipeline is needed because of performance concerns; please see my simple tech doc: mapping ggml compute graph to QNN compute graph

[updated on 02/26/2025] my previous answer might be wrong, because the first technical approach can work very well (quantized data with mul_mat had not been implemented when I wrote that tech doc); there are 7x-10x performance improvements in my local dev environment with the QNN backend.

btw, you can refer to my personal understanding of ggml-qnn and other ggml backends in that tech doc:

pros: this approach might be equivalent to the principle shown in the quoted code above, and I guess that is the secret of how to utilize the Hexagon NPU maximally in a QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl.

slaren (Member) commented Feb 27, 2025

Hi @zhouwg. I want to clarify that the comments made by @chraac in your previous PR had no influence whatsoever in the decision to block you from participating in this repository. Technical feedback and code reviews are always welcome and even encouraged. However, you were blocked due to a consistent pattern of comments that incited personal conflict, often in response to legitimate technical feedback. The comments linked by @chraac (now removed) are an example of this behavior.

We value your technical contributions and encourage you to continue participating in discussions, but please focus on the technical aspects and the code itself. If you receive feedback on your code, try not to take it personally, state your point of view and move on. If you feel personally attacked or treated unfairly, please reach out to the maintainers privately, and we will do our best to address the situation. Engaging in personal conflict in public comments is not productive for anyone.

Additionally, please keep in mind that while this is an open source project that welcomes contributions, it does not mean that contributions are guaranteed to be accepted. You may disagree with the maintainers' decisions, but refusing to make the changes requested by the maintainers will likely result in your contributions being rejected. If you disagree with these decisions, you always have the option to fork the project and maintain your own version.

zhouwg (Contributor, Author) commented Feb 28, 2025

Hi @zhouwg. I want to clarify that the comments made by @chraac in your previous PR had no influence whatsoever in the decision to block you from participating in this repository. Technical feedback and code reviews are always welcome and even encouraged. However, you were blocked due to a consistent pattern of comments that incited personal conflict, often in response to legitimate technical feedback. The comments linked by @chraac (now removed) are an example of this behavior.

We value your technical contributions and encourage you to continue participating in discussions, but please focus on the technical aspects and the code itself. If you receive feedback on your code, try not to take it personally, state your point of view and move on. If you feel personally attacked or treated unfairly, please reach out to the maintainers privately, and we will do our best to address the situation. Engaging in personal conflict in public comments is not productive for anyone.

@slaren, thanks for your valuable and very helpful suggestions and guidance. I admit that I made the same stupid mistake because of an intentional challenge from the same CN programmer in my third PR, and I will try to adjust my mindset and behavior accordingly. At the same time, I think everyone in this community can now see what happened in my first, second, and third PRs.

especially in this third PR:

  • I really don't understand why this CN programmer @chraac spent effort digging up my inappropriate and stupid comments in this community. I already admitted my mistake and hold no grudge about it: I think the cost I paid was what I deserved, and I learnt something from it. I also don't understand why he intentionally quoted those comments here; I think he wanted to use your hands to punish me again. I know this kind of behavior very well, although I hope this is my misunderstanding.
  • I find it strange that this CN programmer @chraac does not go to the Intel SYCL backend or the Huawei CANN backend to leave comments, but comes to my PRs again and again to make various challenging comments, and was the first to ignite the conflict in my third PR.
  • I personally think code review in such an important tech community is not someone's private garden where comments can be dropped arbitrarily on other people's PRs. Why did this CN programmer @chraac challenge me again and again in my first, second, and third PRs? I personally think the reason is that (1) I'm an independent programmer and (2) I'm a Chinese programmer.
  • he offered an unacceptable PR/"help" in my first PR and in my forked llama.cpp project, then claimed my effort on ggml-qnn was a duplicated effort and that I should wait for their PR to be approved and continue my effort on top of it; he brought trouble into my first and second PRs again, then ignited the conflict in my third PR (even trying to use your team's hands to punish me again), then claimed they wanted to step back with so-called constructive dialogue, while at the same time creating the impression that they were being suppressed rather than reflecting on their own thoughts/behaviors/actions. I know this pattern very well; in China we generally call this a "PUA master". Such behavior is commonly seen in China, and I personally understand it because 1.4 billion people compete there for limited resources and survival, BUT I personally think it is unacceptable in a pure tech community outside mainland China.
  • I made several cooperation invitations to this CN programmer @chraac but got no response from him or his team. I think he or his team can contribute ideas or code to this PR in the same coding style, as they see fit, or we can each walk our own way without intentionally disturbing each other. I'd like to see their success in this great tech community outside mainland China.
  • we can all see that the two code-review comments from this CN programmer @chraac in this PR are meaningless: he would not have dropped such comments if he had taken a little time to verify and validate this PR, so I think his so-called code-review comments were an intentional challenge to me because I'm an independent Chinese programmer. At the same time, I personally think his two pointless code-review comments are not important, and I already gave a professional response, because (1) I'm an open-minded and kind programmer and (2) I still want to cooperate with him or his team.
  • I think I know/understand China and Chinese culture very well after spending many years there. This world is really not as beautiful or perfect as one might imagine while living in the US or EU or visiting China as a tourist. I try my best to avoid getting involved in pointless conflict with them, because I know them and I can't change the reality, but I try not to behave like that myself.
  • I understand some really unpleasant behaviors (which I never do, or try never to do) from this CN programmer, because I was once a young programmer myself, and I can see that he also put a lot of effort into this ggml-qnn backend. This young CN programmer might understand something in the future: all winners or losers from mainland China are ultimately meaningless (this is my personal point of view and might not be correct), e.g. they must access this tech community via a dedicated VPN or proxy. This is one of the key reasons I repeatedly emphasized that I have no intention of getting involved in a meaningless competition with this CN programmer in this non-CN tech community after he refused my cooperation invitation. I do this for fun, to learn, and to try to make some contributions to this great tech community, not to fight or beat others. I will be very happy if this PR gets approved, and I lose nothing if it's rejected; I'm tired of fighting or competing with these CN programmers.
  • finally, I'm definitely sure that most Chinese people and most CN programmers are kind, honest people of integrity, although I have met a few unpleasant people in China and even in this great tech community outside mainland China.

Additionally, please keep in mind that while this is an open source project that welcomes contributions, it does not mean that contributions are guaranteed to be accepted. You may disagree with the maintainers' decisions, but refusing to make the changes requested by the maintainers will likely result in your contributions being rejected. If you disagree with these decisions, you always have the option to fork the project and maintain your own version.

yes, your opinion is definitely correct, I see. I came to this great tech community to learn real hard-core AI tech and to try to make some contributions. I understand that my PR might not be accepted, and that is a normal, acceptable outcome, but I really don't want to get involved in meaningless conflict with others, especially some CN programmers from mainland China.

null-define commented
Hi everyone,
I've been eagerly looking forward to deploying a model on Android Qualcomm QNN devices using llama.cpp, and I've been closely following the QNN developments since the initial PR. While I understand Zhouwg's concerns, I’m not in a position to judge who is right or wrong.
That said, I’m curious: is there any possibility of merging QNN support (either from Zhouwg’s branch or Chraac’s) in the near future? Your efforts are greatly appreciated!
Thank you!

zhouwg (Contributor, Author) commented Feb 28, 2025

Hi everyone, I've been eagerly looking forward to deploying a model on Android Qualcomm QNN devices using llama.cpp, and I've been closely following the QNN developments since the initial PR. While I understand Zhouwg's concerns, I’m not in a position to judge who is right or wrong. That said, I’m curious: is there any possibility of merging QNN support (either from Zhouwg’s branch or Chraac’s) in the near future? Your efforts are greatly appreciated! Thank you!

thanks for your comment. Your question is a really good one.

  1. I strongly agree with your point of view: from a purely technical perspective there is no right or wrong.

  2. I can see that chraac really did put a lot of effort into the ggml-qnn backend in his own way, based on my initial PR, and made good progress on the Windows port, 4D mul_mat, and complex C++ encapsulation, although most of the core ideas come from the initial PR and the technical difficulties are exactly the same as in this PR. Btw, I must clarify one thing here: all of the original technology in this ggml-qnn backend comes from Qualcomm, because Qualcomm provides the fundamental mechanism.

  3. unfortunately, I personally think:

  • my C-style implementation is not compatible with his C++-style implementation
  • chraac seems to be a proud C++ programmer; I know/have met many similar programmers in China, and this is another non-technical problem.

  4. at the same time, I think:

  • the Windows port in this PR could be done by a skilled Windows programmer in less than 1-5 days, although I know nothing about Windows programming (Qualcomm provides very simple reference code without complex C++ encapsulation in the latest QNN SDK, and I have ported it from the QNN SDK to this PR, but I don't know how to build it on the Windows platform)
  • he or his team could contribute the 4D mul_mat to this PR, because Qualcomm provides the fundamental mechanism and we programmers use it, regardless of C style or C++ style
  • the second technical approach described in the simple tech doc, or the "standout feature" in his PPT, can also be implemented in this PR with some additional effort, and I already have a PoC in my local dev environment, but I think that might not be a P0 task.

  5. to avoid misunderstanding: I never thought or claimed that his/his team's effort on the ggml-qnn backend is a duplicated effort, although he publicly claimed that my continued effort on ggml-qnn is a duplicated effort in this tech community. I'm an open-minded programmer and I strongly agree that this is also the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible.

  6. furthermore, I personally guess that the second technical approach, which I discovered and mentioned in this community on 04/2024, may not be implementable without the technical help of very skilled QNN experts and AI experts, or help from Qualcomm, because it seems Qualcomm already provides other similar dedicated tech stacks or toolchains in QNN:

     https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/tools.html

     https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/tutorials.html#qnn-genaitransformer-backend-workflow

  7. to further understand what I mentioned here, please refer to the following 20000+ LoC source file, generated by Qualcomm's dedicated tool during my study/research of this topic on 04/2024:
     https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp
     especially the following line, which I think/guess is something chraac or his team wants to do through many C++ classes and much C++ encapsulation (because his team's technical path was already chosen when he decided to hard-fork that PR with C++ encapsulation, rather than following a similar tech path last year or cooperating with me this year via the first technical approach, i.e. the general approach in ggml; this is also one of the key reasons why I considered his PR/"help" in my first PR / forked llama.cpp unacceptable, although he challenged me again and again in my first PR, which might be because he is a C++ programmer):
     https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp#L20634

  8. to avoid misunderstanding by chraac or his team: the above guess may be incorrect, and I sincerely wish his team success. Btw, I personally don't think chraac or his small dev team comes from Qualcomm's Shenzhen branch, because chraac's behavior in my first, second, and third PRs was completely unlike the behavior of a regular employee of a top international company.

  9. finally, as described in this PR's description, I personally think a fully open-source implementation of ggml-qnn through the first technical approach would be team-work between experienced programmers (Android system software programmers, Windows system software programmers, QNN experts) and AI experts, and this should be a P0 team-work task.

  10. I hope my understanding is correct, but corrections from experts are greatly appreciated.

@null-define, thanks for your good question; it helped me brainstorm, dive deeper into the related technical issues, and combine/assemble them into a clearer understanding, although it might be incorrect.

zhouwg (Contributor, Author) commented Mar 1, 2025

I am closing this PR on 03/01/2025 because:

  1. I have had too many unpleasant experiences with a CN programmer, chraac, whom I don't know personally, and who brought many troubles into my first and second PRs and especially into this third PR. I never thought he would suddenly show up in my third PR again and cross my bottom line here. At the same time, he dropped some intentionally meaningless so-called code-review comments again in this PR. I hope this CN programmer will remove his third and fourth comments in this PR (with which he wants to use the maintainers' hands to punish me again) so this PR can stay clean, although I clearly know I have to face the reality of this xxx-style behavior.
  2. my first PR, our second PR (submitted by my friend), and my second PR have all been broken/polluted by the CN programmer chraac intentionally; this PR is now meaningless because it has already been polluted by this CN programmer.
  3. I have no intention of getting involved in a meaningless competition with this CN programmer in this non-CN tech community, because:

  • his hard-forked PR can be considered an exact reconstruction of this PR, or equivalent to it (in Chinese "殊途同归", "different paths, same destination"), wrapped in complex/elegant C++ encapsulation or a beautiful dress (in Chinese "华丽包装", "fancy packaging"), which I have already carefully checked and confirmed. The core ideas, technical difficulties, and performance issues are exactly the same as in this implementation, even with complicated and elegant C++ encapsulation.

  • there is a great deal of QNN API calling and assembly work in the remaining parts of the ggml-qnn backend, although this API calling and assembly work is really not easy

  • his hard-forked ggml-qnn project caused a split in the community's limited development resources, just for his personal purpose ("garners more attention"), and now he has got what he wanted. The beautiful words he used in his PR (so-called collaboration) and his actions/behaviors are total opposites; this is a typical xxx-style behavior (taking advantage of the goodwill of maintainers; say A, do B, anger you with C), and it is NOT the ggml way: try crazy ideas, build wild demos, and push the edge of what's possible. The difference in values between this CN programmer and me is significant, and this is also the key reason why I dropped inappropriate comments in his PR.

myan-o commented Mar 10, 2025

@zhouwg
Thank you for the great work. I tried it with a Snapdragon 888 and it didn't work. Do you support older NPUs?

zhouwg (Contributor, Author) commented Mar 10, 2025

@zhouwg Thank you for the great work. I tried it with snapdragon888 and it didn't work. Do you support older NPUs?

thanks for your comment.

  1. I have verified it only with my Snapdragon 8 Gen 3 phone so far; I tried it with another phone equipped with a low-end Snapdragon chip last year: https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/llamacpp/ggml-qnn.cpp#L2331

  2. per the clarification in https://github.com/pytorch/executorch/tree/main/backends/qualcomm
     Supported Chipset

     Snapdragon 8 Gen 1
     Snapdragon 8 Gen 1+
     Snapdragon 8 Gen 2
     Snapdragon 8 Gen 3
     Snapdragon 8 Elite

  3. the performance of the NPU is not good currently because there are many tricks in the QNN SDK; I personally think this (how to use the QNN SDK correctly) is meaningless.

myan-o commented Mar 10, 2025

@zhouwg
Thank you for your reply.

myan-o commented Mar 10, 2025

@zhouwg Thank you for the great work. I tried it with snapdragon888 and it didn't work. Do you support older NPUs?

thanks for your comment.

  1. I verified it with my Snapdragon 8Gen3 phone currently

  2. following the clarification in the https://github.com/pytorch/executorch/tree/main/backends/qualcomm
    Supported Chipset

    Snapdragon 8 Gen 1
    Snapdragon 8 Gen 1+
    Snapdragon 8 Gen 2
    Snapdragon 8 Gen 3
    Snapdragon 8 Elite

  3. the performance of NPU is not good currently, there many tricks in QNN SDK, I personally think this(how to use QNN SDK correctly) is meaningless.

It looks like the developers of executorch are working on a version with improved performance.

pytorch/executorch#8194

zhouwg (Contributor, Author) commented Mar 10, 2025

thanks for your comment; this is really helpful information for the ggml-qnn backend!

yes, there are various technical approaches / software stacks for NPU inference on Qualcomm's (mobile/desktop) SoCs, all provided by Qualcomm; the core parts are closed-source (as far as I understand), which is a normal situation in a big IT company. Please refer to: #12049 (comment)

that's one of the reasons why I continue the effort on this concise implementation of ggml-qnn: the code is simple, so everyone can understand the domain-specific technical details and the code easily and quickly, and this might be useful for the community.
unfortunately, I'm not an AI expert and I don't know how to implement the other ggml ops through the QNN API.
