We provide a simple demo for running the quantized LLMs on-device. Currently, the app supports running W4A8 and W8A8 LLaMA models on an Android phone with a Snapdragon 8 Gen 3 NPU.
The demo was initially developed by Lukasz Dudziak for Stable Diffusion. Shell Xu Hu further adapted the code for prompt encoding. Fuwen Tan finalized the code by implementing the KV cache and autoregressive generation.
You're welcome to try out the precompiled app and models directly on an Android phone with a Snapdragon 8 Gen 3 HTP, e.g., a Samsung Galaxy S24 or Xiaomi 14.
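If you are unsure which SoC your phone has, you can query it over adb first (a quick optional check; SM8650 is Qualcomm's part number for the Snapdragon 8 Gen 3):

```sh
# Query the SoC model; Snapdragon 8 Gen 3 devices typically report SM8650.
adb shell getprop ro.soc.model
```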
- get the quantized on-device models

```sh
# W8A8
git lfs install
git clone https://huggingface.co/fwtan/llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3
# W4A8
git lfs install
git clone https://huggingface.co/fwtan/llama-1.1b-mobilequant-w4a8-s1024-e60-sym-8gen3
```
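If the clone finishes suspiciously fast, git-lfs may have fetched only pointer files instead of the actual weights. A quick sanity check (shown for the W8A8 model; same idea for W4A8):

```sh
cd llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3
git lfs ls-files   # lists the large files tracked by LFS
du -sh .           # should be on the order of the model size, not a few KB
cd ..
```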
- push the precompiled app to the phone

```sh
git clone https://huggingface.co/fwtan/llm_8gen3_demo
adb push llm_8gen3_demo /data/local/tmp/
```
- push the quantized models to the phone

```sh
# W8A8
adb push llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3 /data/local/tmp/llm_8gen3_demo
# W4A8
adb push llama-1.1b-mobilequant-w4a8-s1024-e60-sym-8gen3 /data/local/tmp/llm_8gen3_demo
```
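Before launching, you can optionally confirm that both the app files and the model folder landed on the device:

```sh
# Should list simple_app, the QNN libraries, and the pushed model folder(s).
adb shell ls /data/local/tmp/llm_8gen3_demo
```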
- run the demo

```sh
# W8A8
adb shell "cd /data/local/tmp/llm_8gen3_demo && LD_LIBRARY_PATH=. ./simple_app llama-1.1b-mobilequant-w8a8-s1024-e60-8gen3"
# W4A8
adb shell "cd /data/local/tmp/llm_8gen3_demo && LD_LIBRARY_PATH=. ./simple_app llama-1.1b-mobilequant-w4a8-s1024-e60-sym-8gen3"
```
The W8A8 demo will look like: [video: llama-1.1b-w8a8-8gen3-htp.mp4]

The W4A8 demo will look like: [video: llama-1.1b-w4a8-8gen3-htp.mp4]
The code requires clang-16, QNN, and the Android NDK. To install clang-16 using apt:
```sh
wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
```

Add this line to `/etc/apt/sources.list`:

```
deb http://apt.llvm.org/focal/ llvm-toolchain-focal-16 main
```

and run:

```sh
sudo apt update
sudo apt install clang-16 lldb-16 lld-16
```
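You can verify the toolchain afterwards:

```sh
clang-16 --version   # should report clang version 16.x
```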
The code also depends on QNN 2.22 and the Android NDK. Make sure you have these two environment variables defined:

```sh
export QNN_SDK_ROOT=/path/to/qnn-2.22
export ANDROID_NDK_ROOT=/path/to/android_sdk/ndk-bundle
```
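A quick sanity check that both roots exist before building (the exact subdirectory layout varies between QNN releases, so this only checks the top-level paths):

```sh
# Fail early if either SDK root is missing.
for d in "$QNN_SDK_ROOT" "$ANDROID_NDK_ROOT"; do
    [ -d "$d" ] || echo "missing: $d"
done
```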
Then build:

```sh
make aarch64-android
```

This should populate the `bin/aarch64-android` folder with the files required by the demo:

```
libc++_shared.so*  libllmod.so*  libQnnHtp.so*  libQnnHtpV75Skel.so*  libQnnHtpV75Stub.so*  libQnnSystem.so*  simple_app*
```
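If you want to run your own build instead of the precompiled app, you can push these files into the same on-device folder used above (the directory name here is just the one from the precompiled demo; adjust as needed):

```sh
adb shell mkdir -p /data/local/tmp/llm_8gen3_demo
# Push the build outputs one by one so they land directly in the demo folder.
for f in bin/aarch64-android/*; do
    adb push "$f" /data/local/tmp/llm_8gen3_demo/
done
```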
We need to prepare three extra files: `meta.bin`, which stores the embedding layer; `tokenizer.bin`, which stores the LLaMA tokenizer; and `qnn_model.bin`, which stores the model graph.
- prepare `meta.bin`

```sh
python scripts/export_bin.py meta.bin --hf /path/to/TinyLlama-1.1B-Chat-v1.0
```
- prepare `tokenizer.bin`

```sh
python scripts/tokenizer.py -t /path/to/TinyLlama-1.1B-Chat-v1.0/tokenizer.model -c /path/to/TinyLlama-1.1B-Chat-v1.0/tokenizer_config.json
```
- prepare `qnn_model.bin`

  Please check out the profiling section (device/README.md).
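As a rough sketch of how the pieces fit together: once all three files exist, you can bundle them into a model folder laid out like the prebuilt ones and point `simple_app` at it. The folder name below is hypothetical, and the expected layout is inferred from the precompiled models above:

```sh
# Assumed layout: one folder holding meta.bin, tokenizer.bin, and qnn_model.bin.
mkdir -p my-llama-model
cp meta.bin tokenizer.bin qnn_model.bin my-llama-model/
adb push my-llama-model /data/local/tmp/llm_8gen3_demo
adb shell "cd /data/local/tmp/llm_8gen3_demo && LD_LIBRARY_PATH=. ./simple_app my-llama-model"
```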