LLM on-device demo

We provide a simple demo for running quantized LLMs on device. Currently, the app supports running W4A8 and W8A8 LLaMA models on an Android phone with a Snapdragon 8 Gen 3 NPU.

The demo was initially developed by Lukasz Dudziak for Stable Diffusion. Shell Xu Hu then adapted the code for prompt encoding, and Fuwen Tan finalized it by implementing the KV cache and auto-regressive generation.

Pre-compiled app and models

You're welcome to try out the precompiled app and models directly on an Android phone with a Snapdragon 8 Gen 3 HTP, e.g., a Samsung Galaxy S24 or a Xiaomi 14.
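
To double-check that the phone has the right SoC before pushing anything, you can query a standard Android system property (SM8650 is the Snapdragon 8 Gen 3 part number):

# should print SM8650 on a Snapdragon 8 Gen 3 device
adb shell getprop ro.soc.model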

  • get the quantized on-device models
# W8A8
git lfs install
git clone https://huggingface.co/fwtan/llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3

# W4A8
git lfs install
git clone https://huggingface.co/fwtan/llama-1.1b-mobilequant-w4a8-s1024-e60-sym-8gen3
  • push the precompiled app to the phone
git clone https://huggingface.co/fwtan/llm_8gen3_demo
adb push llm_8gen3_demo /data/local/tmp/
  • push the quantized models to the phone
# W8A8
adb push llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3 /data/local/tmp/llm_8gen3_demo

# W4A8
adb push llama-1.1b-mobilequant-w4a8-s1024-e60-sym-8gen3 /data/local/tmp/llm_8gen3_demo
  • run the demo
# W8A8
adb shell "cd /data/local/tmp/llm_8gen3_demo && LD_LIBRARY_PATH=. ./simple_app llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3"

# W4A8
adb shell "cd /data/local/tmp/llm_8gen3_demo && LD_LIBRARY_PATH=. ./simple_app llama-1.1b-mobilequant-w4a8-s1024-e60-sym-8gen3"

The W8A8 demo will look like:

llama-1.1b-w8a8-8gen3-htp.mp4

The W4A8 demo will look like:

llama-1.1b-w4a8-8gen3-htp.mp4

Build the demo yourself

🐼 Installation

The code requires clang-16, QNN, and the Android NDK. To install clang-16 using apt, first add the LLVM repository key:

wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -

Then add this line to /etc/apt/sources.list:

deb http://apt.llvm.org/focal/ llvm-toolchain-focal-16 main

and run

sudo apt update
sudo apt install clang-16 lldb-16 lld-16
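
To verify the toolchain installed correctly:

clang-16 --version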

The code also depends on QNN 2.22 and the Android NDK. Make sure you have these two environment variables defined:

export QNN_SDK_ROOT=/path/to/qnn-2.22
export ANDROID_NDK_ROOT=/path/to/android_sdk/ndk-bundle
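
A quick sanity check for both paths, assuming the standard SDK layouts (QNN ships its Android libraries under lib/aarch64-android; the NDK keeps its toolchains under toolchains/llvm/prebuilt):

# both commands should succeed if the variables point at valid SDKs
ls "$QNN_SDK_ROOT/lib/aarch64-android/libQnnHtp.so"
ls "$ANDROID_NDK_ROOT/toolchains/llvm/prebuilt"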

🔨 Compilation

make aarch64-android

This should populate the bin/aarch64-android folder with the files required by the demo:

libc++_shared.so*  libllmod.so*  libQnnHtp.so*  libQnnHtpV75Skel.so*  libQnnHtpV75Stub.so*  libQnnSystem.so*  simple_app*
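
From here, a build of your own can be pushed and run the same way as the precompiled app. The directory name llm_demo_build below is an arbitrary placeholder, and you still need one of the quantized model directories from the previous section:

# push the freshly built binaries (llm_demo_build is an arbitrary name)
adb push bin/aarch64-android /data/local/tmp/llm_demo_build
# push a quantized model next to them
adb push llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3 /data/local/tmp/llm_demo_build
# run, mirroring the precompiled-app invocation
adb shell "cd /data/local/tmp/llm_demo_build && LD_LIBRARY_PATH=. ./simple_app llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3"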

🏃 Preparing meta.bin, tokenizer.bin, and qnn_model.bin

We need to prepare three extra files: meta.bin, which stores the embedding layer; tokenizer.bin, which stores the LLaMA tokenizer; and qnn_model.bin, which stores the model graph.

  • prepare meta.bin
python scripts/export_bin.py meta.bin --hf /path/to/TinyLlama-1.1B-Chat-v1.0
  • prepare tokenizer.bin
python scripts/tokenizer.py -t /path/to/TinyLlama-1.1B-Chat-v1.0/tokenizer.model -c /path/to/TinyLlama-1.1B-Chat-v1.0/tokenizer_config.json
  • prepare qnn_model.bin

    Please check out the profiling section (device/README.md).