llama : add option to override model tensor buffers #11397

Draft: wants to merge 6 commits into master

Conversation

@slaren (Member) commented Jan 24, 2025

Adds a command-line parameter --override-tensor (-ot) that allows changing the buffer type where a model tensor is allocated. This gives the user fine-grained control over which tensors are offloaded to each device.

How is this useful? For example, to force the experts in MoE models to stay on the CPU while offloading the rest to the GPU, you could use -ngl 99 -ot exps=CPU. This may allow more efficient offloading schemes.

The syntax is <tensor name pattern>=<buffer type>. Currently the pattern is just a string search (edit: this is no longer the case; it is now a C++ regex search), i.e. any tensor whose name contains <tensor name pattern> will be matched and loaded into the given buffer type. Multiple overrides can be given by separating them with commas, or by passing the -ot option multiple times. To see which tensors are being matched, enable debugging output with -v.
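
For example (model path and patterns illustrative), these two invocations should be equivalent:

$ llama-server -m model.gguf -ngl 99 -ot "ffn_down_exps=CPU,ffn_up_exps=CPU" -v
$ llama-server -m model.gguf -ngl 99 -ot ffn_down_exps=CPU -ot ffn_up_exps=CPU -v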

At this point it is just a demo, feel free to experiment and report if you find any interesting uses.

Edit: added regex support; for example, to keep the experts of layers 20-99 on the CPU you could use -ot "[2-9][0-9]\.ffn_.*_exps\.=CPU"

TODO:

  • Fix pipeline parallelism check
  • Support overriding KV cache allocation

@slaren added the demo label (Demonstrate some concept or idea, not intended to be merged) on Jan 24, 2025
@slaren changed the title from "llama : add option to override tensor buffers" to "llama : add option to override model tensor buffers" on Jan 24, 2025
@slaren added the need feedback label (Testing and feedback with results are needed) on Jan 24, 2025
@bmtwl (Contributor) commented Jan 26, 2025

Is there a chance that the direction you're taking these changes might allow for scheduling specific threads to work on specific tensors? With R1 coming out, I'm very interested in reviving my work on trying to improve memory locality to increase CPU inference speeds.

@slaren (Member, Author) commented Jan 26, 2025

No, that's something that would need to be handled at a lower level in the CPU backend.

@bmtwl (Contributor) commented Jan 26, 2025

No, that's something that would need to be handled at a lower level in the CPU backend.

Thanks for the reply @slaren. I figured it wouldn't directly help, but that maybe you'd be adding useful metadata to tensor objects that could help coordinate affinity in the future. I'll start a fresh branch and see how far I get.

At this point it is just a demo, feel free to experiment and report if you find any interesting uses.

I'll also try to pull this branch and test it to see what the speedup and sysmem savings look like.

@bmtwl (Contributor) commented Jan 27, 2025

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s
-ngl 10 = 5.15t/s
-ngl 20 = 5.64t/s
-ngl 30 = 6.10t/s
-ngl 40 = 6.95t/s

So there is definitely major speedup potential in this patch. I can't offload all 62 layers for this model because I only have 24 GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.

@saood06 commented Jan 27, 2025

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

So there is definitely major speedup potential in this patch. I can't offload all 62 layers for this model because I only have 24 GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.

@bmtwl
Do you mind testing performance with -nkvo?

@jukofyork (Contributor)

What are the shared expert tensors called in llama.cpp - is there a pattern that catches the routed experts (that only activate 1/32 of the time), but doesn't catch the shared experts?

@slaren (Member, Author) commented Jan 28, 2025

I believe the pattern exps will not match the shared experts, since they are called ffn_xxx_shexp.weight. You can use the gguf preview feature in huggingface to see the names of the tensors. Also remember that you can use multiple patterns, it doesn't have to be a single one.
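
As a sketch (patterns per the naming above; device name illustrative), this would keep the routed experts on the CPU while being explicit that the shared experts go to the first CUDA device:

$ llama-cli -m model.gguf -ngl 99 -ot exps=CPU -ot shexp=CUDA0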

@jukofyork (Contributor)

I believe the pattern exps will not match the shared experts, since they are called ffn_xxx_shexp.weight. You can use the gguf preview feature in huggingface to see the names of the tensors. Also remember that you can use multiple patterns, it doesn't have to be a single one.

Thanks - I'll give this a try later in the week.

This PR together with this Reddit post opens up an interesting possibility:

https://old.reddit.com/r/LocalLLaMA/comments/1ibbloy/158bit_deepseek_r1_131gb_dynamic_gguf/

of quantising up/gate projections to q2_k and down projections to q4_k (or something similar), then keeping everything else as q8_0.

Sadly I need to move some stuff about to get space to upscale the fp8 download to bf16 before I can try it, but will report back when I do.

@jukofyork (Contributor)

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

So there is definitely major speedup potential in this patch. I can't offload all 62 layers for this model because I only have 24 GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.

It might be worth trying q4_0, as it should let you offload almost all the layers, and IIRC it should be slightly faster to dequantise than the K-quants?

@jukofyork (Contributor)

Is there a chance that the direction you're taking these changes might allow for scheduling specific threads to work on specific tensors? With R1 coming out, I'm very interested in reviving my work on trying to improve memory locality to increase CPU inference speeds.

Just being able to split the experts between NUMA nodes would make a big difference, but I'm not sure how easy that would be, as IIRC the experts' tensors are all in one huge tensor now?

@BarfingLemurs (Contributor)

During normal operation, when I fit a model between RAM and VRAM, does the offloading follow a set layer sequence? (Layer 0 is chosen first to be offloaded to the GPU, then layer 1, etc.)

Between GPU offloading and RAM, which takes priority?

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

So there is definitely major speedup potential in this patch. I can't offload all 62 layers for this model because I only have 24 GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.

Do you remember how much of a speedup? No need for extensive benchmarks, just the rough % estimate.

@saood06 commented Feb 2, 2025

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:

-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

I can't seem to offload more than 29 layers of R1 (unsloth's UD-IQ2_XXS) via RPC. 29 layers and below work fine, but 30 just crashes my rpc_server, with no error output. It is not a VRAM issue: even with the context set very low, so that usage is nowhere near my GPU's limits, it still crashes.

@jukofyork (Contributor)

Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU:
-ngl 0 = 4.65t/s -ngl 10 = 5.15t/s -ngl 20 = 5.64t/s -ngl 30 = 6.10t/s -ngl 40 = 6.95t/s

I can't seem to offload more than 29 layers of R1 (unsloth's UD-IQ2_XXS) via RPC. 29 layers and below work fine, but 30 just crashes my rpc_server, with no error output. It is not a VRAM issue: even with the context set very low, so that usage is nowhere near my GPU's limits, it still crashes.

I had a similar problem: if I used a single GPU (via CUDA_VISIBLE_DEVICES=0) it ran fine, and if I used both GPUs with the --no-kv-offload option it also ran fine (but much slower).

If I didn't use either of these it tried to allocate this 1.4TB monster buffer:

llama_init_from_model: pipeline parallelism enabled (n_copies=4)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1407257.91 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1475616865280
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 351268.28 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 368331484928
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 353465.98 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 370635939584

After some searching I found this issue:

#7217

and recompiled using -DGGML_SCHED_MAX_COPIES=1 and now it's working fine.

(It's likely nothing to do with this PR, but thought it might help!)
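
For anyone wanting to reproduce the workaround, a build along these lines should do it (assuming a CUDA build; see #7217 for background):

$ cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
$ cmake --build build --config Release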

@jukofyork (Contributor)

@saood06

I figured it out: you have to reorder the devices so the local CUDA devices are last:

#11606
#11424

and mainly these:

#11435

You don't need to run RPC servers for local devices.

#9296
#11424

For those that don't get it (like me initially), you first need to check the device names using the --list-devices option (example below):

 $ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX XXXX, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce GTX YYYY, compute capability 7.5, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX XXXX (A MiB, B MiB free)
  CUDA1: NVIDIA GeForce GTX YYYY (A MiB, B MiB free)
  RPC[IP1:PORT1]: RPC[IP1:PORT1] (A MiB, B MiB free)
  RPC[IP2:PORT2]: RPC[IP2:PORT2] (A MiB, B MiB free)

The device names are listed under Available devices. Next time you launch llama-server, use the --device option with the order you want for your devices. An example:

$ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> \
--device RPC[IP1:PORT1],CUDA0,CUDA1,RPC[IP2:PORT2] \
-ngl 33 --tensor_split 3/20/10/0 --device-draft CUDA1,RPC[IP2:PORT2] -ngld 99 [...]

This way, you can set up the order however you want. In the complicated example above, the main model is offloaded to the first RPC device (using IP1:PORT1 address), mostly on the CUDA0 device, and partially to the CUDA1 device, while the draft model is offloaded to the CUDA1 device and the second RPC device (using IP2:PORT2 address).

This means the following works:

--device "RPC[IP1:PORT1],RPC[IP1:PORT2],RPC[IP1:PORT1],RPC[IP2:PORT2],CUDA0,CUDA1"

But if I don't do this, I get OOM errors with plenty of VRAM left, like you had.

@saood06 commented Feb 5, 2025

I'm testing this with and without #11446. Without it, on unsloth's UD-IQ2_XXS, I was only able to offload 29 layers; with it, I was able to allocate only 28 (on a Q4_K_S quant). This is not a VRAM issue: there would be plenty of spare VRAM, and it would even get past allocation and reach warmup, where the rpc-server would then just crash.

The other issue is performance: the more layers I allocate, the worse performance gets, while bmtwl shows a performance increase with more layers offloaded using non-RPC offloading.

@ro99 commented Feb 5, 2025

I am able to load the model with llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf --threads 28 --host 0.0.0.0 --port 5001 -c 8192 -ngl 99 -ot exps=CPU:

PID DEV TYPE GPU MEM HOST MEM Command
16431 0 Compute 13294MiB 54% 215686MiB /opt/llama.cpp/build/bin/llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-000
16431 2 Compute 12088MiB 49% 215686MiB /opt/llama.cpp/build/bin/llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-000
16431 3 Compute 11616MiB 47% 215686MiB /opt/llama.cpp/build/bin/llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-000
16431 1 Compute 11488MiB 47% 215686MiB /opt/llama.cpp/build/bin/llama-server -m /mnt/models/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-000

But as soon as I send the prompt I receive:

/opt/llama.cpp/ggml/src/ggml-alloc.c:182: not enough space in the buffer
ggml_dyn_tallocr_alloc: not enough space in the buffer to allocate 18446744073709550624 bytes, largest block available 9223372036854775807 bytes
[New LWP 16444]
[New LWP 16445]
[New LWP 16446]
[New LWP 16447]
...
[New LWP 16533]
[New LWP 16534]
[New LWP 16535]
[New LWP 16536]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f1e950d0bd7 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007f1e950d0bd7 in wait4 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1e95527fc1 in ggml_abort () from /opt/llama.cpp/build/bin/libggml-base.so
#2  0x00007f1e9553619c in ggml_gallocr_allocate_node () from /opt/llama.cpp/build/bin/libggml-base.so
#3  0x00007f1e955369d0 in ggml_gallocr_reserve_n () from /opt/llama.cpp/build/bin/libggml-base.so
#4  0x00007f1e9553c244 in ggml_backend_sched_alloc_graph () from /opt/llama.cpp/build/bin/libggml-base.so
#5  0x00007f1e95646030 in llama_decode_impl(llama_context&, llama_batch) () from /opt/llama.cpp/build/bin/libllama.so
#6  0x00007f1e95646f57 in llama_decode () from /opt/llama.cpp/build/bin/libllama.so
#7  0x000055f47d6647c9 in server_context::update_slots() ()
#8  0x000055f47d64f4d1 in server_queue::start_loop() ()
#9  0x000055f47d5fd067 in main ()
[Inferior 1 (process 16431) detached]
Aborted (core dumped)

Without --override-tensor, offloading 20 layers to the GPU works fine.

Testing with 4x RTX 3090 and 320GiB RAM. Built with cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1.

@jukofyork (Contributor)

Without --override-tensor, offloading 20 layers to the GPU works fine.

Testing with 4x RTX 3090 and 320GiB RAM. Built with cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1.

Maybe try -ngl 61 to keep the output layer on the CPU too (that oddly worked for me earlier when I was having trouble with the RPC stuff).

@ro99 commented Feb 5, 2025

Maybe try -ngl 61 to keep the output layer on the CPU too (that oddly worked for me earlier when I was having trouble with the RPC stuff).

No luck, still the same issue.

Oddly enough, the issue only happens when sending more than 450 tokens.

@slaren (Member, Author) commented Feb 5, 2025

ggml_dyn_tallocr_alloc: not enough space in the buffer to allocate 18446744073709550624 bytes

It's trying to allocate a tensor of size 2^64, which suggests there is an integer overflow somewhere. If you set the environment variable GGML_SCHED_DEBUG=2, it will print the graph before allocating it, which may give some indication of which tensor is causing this. Or just change the error message in ggml_dyn_tallocr_alloc to include the tensor name.
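
For example, something along these lines (paths illustrative; the graph dump is verbose, so capturing the output to a file helps):

$ GGML_SCHED_DEBUG=2 ./build/bin/llama-server -m model.gguf -ngl 99 -ot exps=CPU 2>&1 | tee sched.log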

@ro99 commented Feb 6, 2025

It's trying to allocate a tensor of size 2^64, which suggests there is an integer overflow somewhere. If you set the environment variable GGML_SCHED_DEBUG=2, it will print the graph before allocating it, which may give some indication of which tensor is causing this. Or just change the error message in ggml_dyn_tallocr_alloc to include the tensor name.

It is the CPU#ffn_moe_topk-60#0 tensor.

Is it possible to try to force this particular one to be allocated into the GPU buffer?

@slaren (Member, Author) commented Feb 6, 2025

This is most likely a bug; we need to understand why it is happening and fix it. Since you mentioned that it only happens with large prompts, I suspect that this is caused by a zero-sized tensor. When evaluating a batch where no logits are required (which happens when evaluating a prompt that needs to be split into multiple ubatches), zero-size tensors are created to skip the calculation of the logits.
I cannot run this model, so I would need your help to figure out why this is happening. Can you print more details about the tensor? Something like this should do it:

diff --git a/ggml/src/ggml-alloc.c b/ggml/src/ggml-alloc.c
index 9a3bf9f29..470ef13e6 100644
--- a/ggml/src/ggml-alloc.c
+++ b/ggml/src/ggml-alloc.c
@@ -179,6 +179,9 @@ static size_t ggml_dyn_tallocr_alloc(struct ggml_dyn_tallocr * alloc, size_t siz
             // this should never happen
             GGML_LOG_ERROR("%s: not enough space in the buffer to allocate %zu bytes, largest block available %zu bytes\n",
                     __func__, size, max_avail);
+            GGML_LOG_ERROR("%s: tensor: %s, shape: %ld %ld %ld %ld, size: %zu\n",
+                __func__, tensor->name, tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3],
+                ggml_nbytes(tensor));
             GGML_ABORT("not enough space in the buffer");
         }
     }

@slaren (Member, Author) commented Feb 6, 2025

Ok nvm, I think I see the problem. I will push a possible fix soon.

@slaren (Member, Author) commented Feb 19, 2025

So running with the following:

The names of these tensors do not match the names of the tensors in llama.cpp. I suggest running with -v to see which tensors are being affected by the filter, and using the gguf preview in HF to see the list of tensors in a model (for example, try: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S?show_file_info=DeepSeek-R1-UD-IQ1_S%2FDeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf).
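
For example, a quick (illustrative) way to see only the override matches; the message text below is the one this PR prints:

$ ./build/bin/llama-cli -m model.gguf -ngl 99 -ot exps=CPU -v 2>&1 | grep "buffer type overriden"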

Performance is lower given that GPU memory does not appear to be fully utilised:

You would need to increase the value of -ngl to take advantage of the GPU memory freed by the tensor overrides, or use a different set of overrides.

@jukofyork (Contributor)

--cache-type-k q4_0 likely hurts performance a lot too.

@lingster

@slaren: thanks for pointing out that I was using the incorrect tensor names (in fact, ktransformers was using the tensor names from the safetensors-format files, not GGUF). So now I have rerun some tests and can see improved GPU usage, increasing to 50%:

./build/bin/llama-cli -ub 512 --no-mmap  --tensor-split 20,19   --model /data/gguf/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf      --threads 16     --prio 2     --temp 0.6     --ctx-size 8192     --seed 3407     --n-gpu-layers 99   -ctk iq4_nl  -no-cnv     --prompt "<|User|>You are an expert python developer. Write a factorial function using python.<|Assistant|>" -ot 'ffn_up_exps=CPU'  -ot 'ffn_down_exps=CPU' -ot 'attn_kv_b=CUDA0' -ot 'ffn_up=CUDA0' -ot 'ffn_norm=CUDA1' -ot 'attn=CUDA1'

[screenshot]

llama_perf_sampler_print:    sampling time =      27.61 ms /   366 runs   (    0.08 ms per token, 13256.55 tokens per second)
llama_perf_context_print:        load time =   60802.72 ms
llama_perf_context_print: prompt eval time =    2907.77 ms /    17 tokens (  171.05 ms per token,     5.85 tokens per second)
llama_perf_context_print:        eval time =  107544.58 ms /   348 runs   (  309.04 ms per token,     3.24 tokens per second)
llama_perf_context_print:       total time =  110841.42 ms /   365 tokens

However, using the -ot option it seems impossible to utilise the full memory on the GPUs; the ffn_gate/ffn_up/ffn_down layers are simply too large to be loaded into 48 GB of VRAM. This results in ~3.2 tok/s.

The best combination appears to be -ngl 36 --tensor-split 19,20, where I can get over 4.2 tok/s.

It seems that the bottleneck is CPU memory. With -ot we get more GPU utilisation, but this doesn't seem to make up for the time lost to having some of the layers in slower CPU memory.

@jukofyork: I'm using --ctk q4_0 as per the unsloth blog (https://unsloth.ai/blog/deepseekr1-dynamic). If I remove it and use the default, I get CUDA OOM. I have tried the different ctk values, but there don't appear to be any noticeable performance improvements.

@Reactantvr commented Feb 19, 2025

So, sadly, this is not compatible with 5090s, but the current llama.cpp is. I guess I will have to wait to test this with a 5090. This is the same build I used with my 3090, so the only variable here is the GPU being swapped.

[screenshot]

@lingster commented Feb 20, 2025

So, sadly, this is not compatible with 5090s, but the current llama.cpp is. I guess I will have to wait to test this with a 5090. This is the same build I used with my 3090, so the only variable here is the GPU being swapped.

[screenshot]

Have you tried recompiling with the right CUDA architecture flag set? #4215

Looking at the NVIDIA docs, sm_100 is what you need:
https://docs.nvidia.com/cuda/blackwell-compatibility-guide/index.html#application-compatibility-on-blackwell-architecture

@Reactantvr commented Feb 21, 2025

So, sadly, this is not compatible with 5090s, but the current llama.cpp is. I guess I will have to wait to test this with a 5090. This is the same build I used with my 3090, so the only variable here is the GPU being swapped.
[screenshot]

Have you tried recompiling with the right CUDA architecture flag set? #4215

Looking at the NVIDIA docs, sm_100 is what you need: https://docs.nvidia.com/cuda/blackwell-compatibility-guide/index.html#application-compatibility-on-blackwell-architecture

I looked into that and it seems to have done the trick.
For compiling it, I just changed the first command.

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="100"

This changed it to sm_100 while it compiled.

I still need to mess with settings to get the best speed, but here is the very first run.

llama-server --model DeepSeek-R1-Q4_K_M-00001-of-00011.gguf --flash-attn --threads 36 --temp 0.6 --min-p 0.05 --ctx-size 2048 --no-mmap -ngl 36 -ot exps=CPU

I am getting about 28% higher t/s for eval_time. For prompt eval_time, around a 50% improvement. (6.2 t/s / 14.1 t/s). This one leaves a lot of room for context as it only uses 17 GB of GPU memory.

llama-server --model DeepSeek-R1-Q4_K_M-00001-of-00011.gguf --flash-attn --threads 36 --temp 0.6 --min-p 0.05 --ctx-size 2048 --no-mmap -ngl 62 -ot exps=CPU

This command uses 26 GB of GPU memory, so still 6 GB for extra context over 2k context (I tested this and it uses 31.1 GB at 4096 context). This gets me around eval_time / prompt eval_time (7.8 t/s / 20.5 t/s).

Overall, the changes you made led to a 66% performance increase on eval time and around a 100% performance increase on prompt eval time vs CPU-only on a Threadripper 7965WX, 512 GB memory, and a 5090. You are an absolute genius.

If you have some proper benches you want me to run, let me know.

@Reactantvr commented Feb 21, 2025

Another update.

llama-server --model DeepSeek-R1-Q4_K_M-00001-of-00011.gguf --flash-attn --threads 40 --temp 0.6 --min-p 0.05 --ctx-size 4096 --no-mmap -ngl 62 -ot exps=CPU

This uses up all my threads completely and I get a small performance bump.

[screenshot]

82% performance increase now on eval time. Makes me really want a 64-core Threadripper now. Also, a second 5090 for more context. Using 31 GB of GPU memory right now at 4k. I am also curious whether doubling the system memory bandwidth will make a difference after the 64-core Threadripper upgrade. Maybe I can get up to 10-15 t/s.

Another thing I noticed is that it no longer drops off a cliff in inference speed as I continue a story. After 1k context generated, then another new 2k context, the new t/s was still 8.01 t/s. If this was CPU, it would have dropped by 25% by then.

The only real limiting factor is that 3.5k context seems like the absolute upper limit. I was having trouble with 4k context. I really need more context.

Another issue is that prompt eval time is actually all over the place. Sometimes it is fast, sometimes it does this:

[screenshot]

Another update:

I found that --flash-attn makes no difference. Also, I changed --no-mmap to --mlock and I get consistent prompt eval now, around 12 t/s. Still pretty amazing for running Q4 of R1 on CPU with one consumer-grade GPU.

[screenshot]

Yet another update. This time using Unsloth DeepSeek-R1-UD-IQ2_XXS-00001-of-00004.gguf. This model is still really good and uses only ~200 GB system memory and 27.5 GB GPU memory at 3k context.

[screenshot]

Was able to get 3600 context max with this unsloth model.

The only real limiting factor with this setup is context. Any chance the planned KV cache allocation override will resolve this issue?

@slaren (Member, Author) commented Feb 22, 2025

Thanks for all the testing, I will try to get this ready for merging over the next few days.

@ubergarm commented Feb 23, 2025

@Reactantvr

found that --flash-attn makes no difference.

Yeah, flash-attn is not supported yet in llama.cpp for DeepSeek-R1, pretty sure; check out #11557

I changed --no-mmap to --mlock and I get consistent prompt eval now

This is likely because without those args, llama.cpp defaults to normal mmap(), which may not immediately cache all the weights from disk into the memory page cache, causing some variation in performance, I think. Using those args forces it all to be pre-allocated and in RAM, ready to go.
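
As a sketch (model path illustrative; flags as used in the commands above), either form avoids the lazy page-cache behaviour:

$ llama-server -m model.gguf --no-mmap [...]  # read all weights up front instead of mmap
$ llama-server -m model.gguf --mlock [...]    # keep mmap, but lock the pages in RAM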

Thanks for the benchmarks. I'm revisiting this exciting branch after playing with ktransformers and trying to figure out how they get almost 2x inference speeds on R1. I noticed that when I disabled CUDA graphs on ktransformers, it performs almost the same as llama.cpp again... however, CUDA graphs only work when not offloading any experts into VRAM, hmm...

Anyway, enjoying the quest for more tok/sec! Cheers!

@saood06 commented Feb 23, 2025

The only real limiting factor with this setup is context. Any chance the planned KV cache allocation override will resolve this issue?

You can try it with the PR the comment is from, plus the modification shown at the bottom of the comment: #11446 (comment). This further comment showed it worked: #11446 (comment)

@lingster

@Reactantvr Thanks for sharing your test results. Just curious, what is the rating of the DIMM memory you are using in your setup? If you run nvtop, do you see your GPU running at max compute? In my testing, CPU memory seems to be the limiting factor/bottleneck.

@Reactantvr commented Feb 23, 2025

@Reactantvr Thanks for sharing your test results. Just curious, what is the rating of the DIMM memory you are using in your setup? If you run nvtop, do you see your GPU running at max compute? In my testing, CPU memory seems to be the limiting factor/bottleneck.

My memory is 8x64 GB V-Color DDR5-6000 running at 4800. I didn't bother overclocking it yet because I am on 4 CCDs, which should limit me to around 230 GB/s. I assume I would not get more bandwidth until I upgrade to a 64-core Threadripper. Waiting on Shimada Peak for that. I'll probably run it at 6400 once I get that CPU.

I've never used nvtop. Plus, I am doing everything in Windows 10, so not sure if I can use it. I can give you stats from GPU-Z. Looks like GPU load is around 18-19%. This was using DeepSeek-R1-UD-IQ2_XXS.

[screenshot]

@Readon commented Feb 24, 2025

Works perfectly for me with dual E5 v2 + 2080 Ti, running DeepSeek-R1-UD-Q2_K_XL. It boosts the token generation speed from 1.8 t/s to 3.3 t/s. With one NUMA node disabled, it can increase to 3.8 t/s.

@Rotatingxenomorph

trying to figure out how they get almost 2x inference speeds on R1. I noticed that when I disabled CUDA graphs on ktransformers, it performs almost the same as llama.cpp again... however, CUDA graphs only work when not offloading any experts into VRAM, hmm...

Not sure if this is related, but I get slightly worse generation speed (0.4 t/s slower) offloading any Q2_K_XL layers to VRAM (24 GB across 2x 3060s) than using -ngl 0 with only quad-channel DDR4-2133 system RAM. This is using the main branch, and an old version.

@lingster

Is there a way to disable the KV cache and just recompute values as required? The freed cache memory could be used for loading additional model layers. In my tests I've not seen my GPU max out, so maybe there is a sweet spot between caching and recalculating?

@slaren (Member, Author) commented Feb 25, 2025

Is there a way to disable the KV cache and just recompute values as required?

No, but you can keep it in system memory with -nkvo.
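
For example (illustrative invocation), combined with the expert override from this PR:

$ llama-server -m model.gguf -ngl 99 -ot exps=CPU -nkvo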

@ejrydhfs commented Feb 25, 2025

I changed --no-mmap to --mlock and I get consistent prompt eval now

This is likely because without those args, llama.cpp defaults to normal mmap(), which may not immediately cache all the weights from disk into the memory page cache, causing some variation in performance, I think. Using those args forces it all to be pre-allocated and in RAM, ready to go.

Would it be possible to have llama.cpp load only some experts from disk to RAM or VRAM, or from RAM to VRAM, on demand? It would come at the cost of latency after the prompt is sent to the model.

@ejrydhfs commented Feb 25, 2025

I am not sure if this is similar, but would it also be possible to keep several instances of experts, or of the most-used tensors, on each compute device to increase inference speed for common queries? And to separate each expert into commonly used and rarely used neurons (hot and cold neurons, respectively), like PowerInfer and PowerInfer-2 do?

Would it also be possible to shard the model to achieve tensor parallelism between different types of devices, like CPUs with GPUs, using the hot and cold neurons approach, on any kind of AI model?

@ubergarm commented Mar 2, 2025

The good news

Using the -ot / --override-tensor flag seems to work properly in my testing. Running with -ngl 62 -ot exps=CPU is the fastest way to run R1 671B UD-Q2_K_XL (212 GiB of weights) on 256 GB RAM plus a single CUDA GPU on my 24-core ThreadRipper Pro test rig with llama.cpp.

It is counter-intuitive to me that offloading fewer layers onto the GPU makes it go faster, and I presume this has something to do with CUDA graphs not working as well with even a single expert also in VRAM, but I'm really just speculating wildly.

This method is still not quite as fast as ktransformers, but it is faster than running ktransformers --no-use_cuda_graph.

The technically unrelated news

I had hoped to use this fine-grained offload method to distribute experts across 6 NUMA nodes on a big dual-socket Intel Xeon 6980P. While it does technically work and run, it is much slower than just running normally with no NUMA optimizations at all. I even tried making a patch to the rpc-server example to allow specifying the number of threads and forcing the CPU backend.

--override-tensor works well with RPC devices, and I appreciate how specifying the flag multiple times stacks the way I would expect. However, as others have mentioned above, the current synchronous RPC send() implementation seems to bottleneck attempts to distribute computation, and it is not a true async tensor-parallel optimized solution. (vLLM seems to implement some of this, and I hope to test it to find out how well the CPU backend works on it.)

Example

I tried a few configurations, including 5x rpc-server backends and a single llama-server frontend, each in a different NUMA node. I also tried a simpler version with 1x rpc-server on a single NUMA node on the opposite CPU socket from the llama-server frontend. Even communicating over the loopback device, the performance was much worse.

I'll leave the commands and some info for anyone interested inside the fold below. There is also a whole discussion on the challenges of running llama.cpp on more than a single NUMA node over here.

Cheers!

EDIT: Tried one last time with -nkvo / --no-kv-offload (disable KV offload), but it didn't make a significant difference: still very slow and not saturating CPU cores, probably waiting around for send() calls...

Example selective RPC backend offloading experiments

System Info

# $ numactl -H --cpu-compress
available: 6 nodes (0-5)
node 0 cpus: 0-42, 256-298 (86)
node 0 size: 257688 MB
node 1 cpus: 43-85, 299-341 (86)
node 1 size: 258018 MB
node 2 cpus: 86-127, 342-383 (84)
node 2 size: 258019 MB
node 3 cpus: 128-170, 384-426 (86)
node 3 size: 258018 MB
node 4 cpus: 171-213, 427-469 (86)
node 4 size: 258018 MB
node 5 cpus: 214-255, 470-511 (84)
node 5 size: 257949 MB

Backend RPC server(s)

Bash script to distribute rpc-servers across NUMA nodes.

#!/usr/bin/env bash

RPC_SERVER="./build_amx/bin/rpc-server"

# Define cleanup function
cleanup() {
    echo "Exiting all rpc-server processes..."
    killall rpc-server
    exit 0
}

# Trap SIGINT signal and call cleanup function
trap cleanup SIGINT

echo "Starting llama.cpp rpc-server backend for each NUMA node except for node 0."
# NOTE : for this specific test I only wanted a single server on node 3 (second CPU socket)
# NOTE2: I don't think the --mem flag does anything; just put 1 and it still loads whatever you send it
#for node in {1..5}
for node in {3..3}
do
    CMD="numactl -N $node -m $node \
        $RPC_SERVER \
        --mem 23000 \
        --threads 42 \
        --host 127.0.0.1 \
        --port 5005$node"
    echo $CMD
    $CMD &
    sleep 0.25
done

# Wait indefinitely until SIGINT is received
echo "==="
echo "Done..."
echo "Press <cntrl>+c to exit and kill all rpc-servers."
read -r -d '' _ </dev/tty

Frontend Client

I noticed llama-server starts something like 555 threads for some reason. I tested llama-cli, which starts the correct requested number of threads. Both seem to generate at the same very poor speeds.

## Start frontend in node 0
CMD="numactl -N 0 -m 0 \
    ./build_amx/bin/llama-server \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --threads 42 \
    --numa numactl \
    --ctx-size 2048 \
    --rpc 127.0.0.1:50053 \
    --device RPC[127.0.0.1:50053] \
    --n-gpu-layers 62 \
    --override-tensor exps=RPC[127.0.0.1:50053] \
    --override-tensor \.*=CPU \
    --host 127.0.0.1 \
    --port 8080 -v"

numastat confirming allocations in correct nodes

$ watch numastat -p $(pidof llama-server)
Per-node process memory usage (in MBs) for PID 3493834 (llama-server)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                        39.93            0.00            0.00            0.00            0.00            0.00           39.93
Stack                        0.07            0.00            0.00            0.00            0.00            0.00            0.07
Private                 210980.80            0.04            0.00            1.49            0.00            0.00       210982.33
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                   211020.81            0.04            0.00            1.49            0.00            0.00       211022.34



$ watch numastat -m -v z
Per-node system memory usage (in MBs):
                          Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                 --------------- --------------- --------------- --------------- --------------- --------------- ---------------
MemTotal               257688.18       258018.79       258019.46       258018.79       258018.79       257949.26      1547713.25
MemFree                 38192.29       256419.21       256928.47        40355.96       255835.55       256676.09      1104407.58
MemUsed                219495.88         1599.57         1090.99       217662.82         2183.24         1273.16       443305.67
SwapCached                  1.16            0.05            0.00            0.76            0.46            0.02            2.45
Active                    136.95           50.34            3.38       215533.59          120.05           30.82       215875.14
Inactive               215620.14            3.86            0.36            5.04           19.48            0.00       215648.88
Active(anon)              125.72            7.79            0.78       215517.16           85.89           28.62       215765.98
Inactive(anon)              6.27            0.47            0.00            0.00            0.28            0.00            7.02
Active(file)               11.23           42.55            2.59           16.43           34.16            2.20          109.16
Inactive(file)         215613.86            3.39            0.36            5.04           19.21            0.00       215641.86
Unevictable                33.43            1.52            0.00            0.00            0.41            0.00           35.36
Mlocked                    24.64            1.52            0.00            0.00            0.41            0.00           26.57
Dirty                       0.01            0.25            0.00            0.00            0.00            0.00            0.27
FilePages              215641.79           47.52            3.71           22.75           57.19            2.84       215775.79
Mapped                 210910.81           36.04            2.95           17.79           31.80            2.21       211001.59
AnonPages                 148.90            8.21            0.02       215515.99           82.81           15.12       215771.06
Shmem                       8.86            0.01            0.75            0.52            2.95            0.62           13.72
KernelStack                22.39           12.57           12.44           12.93           13.10           12.08           85.50
PageTables                419.59            0.23            0.02          422.52            0.97            0.05          843.37
Slab                     2064.86          488.38          210.09          332.59          920.96          355.46         4372.35
SReclaimable              561.28           32.97           21.71           40.41           59.71           34.47          750.55
SUnreclaim               1503.58          455.41          188.39          292.18          861.25          320.99         3621.80
AnonHugePages              68.00            4.00            0.00       215502.00           66.00            0.00       215640.00
KReclaimable              561.28           32.97           21.71           40.41           59.71           34.47          750.55

btop

Confirm CPU cores are running on the NUMA nodes with memory allocation.

[screenshot: btop during RPC testing]

Layer Mappings

I explicitly wildcard all non-exps tensors to CPU to get it to print out all the layers with -v. This is the inverse of -ot exps=CPU, as the "GPU" in this case is a remote RPC CPU backend.

tensor token_embd.weight buffer type overriden to CPU
tensor output_norm.weight buffer type overriden to CPU
tensor output.weight buffer type overriden to CPU
tensor blk.0.attn_norm.weight buffer type overriden to CPU
tensor blk.0.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.0.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.0.attn_q_a.weight buffer type overriden to CPU
tensor blk.0.attn_q_b.weight buffer type overriden to CPU
tensor blk.0.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.0.attn_kv_b.weight buffer type overriden to CPU
tensor blk.0.attn_output.weight buffer type overriden to CPU
tensor blk.0.ffn_norm.weight buffer type overriden to CPU
tensor blk.0.ffn_gate.weight buffer type overriden to CPU
tensor blk.0.ffn_down.weight buffer type overriden to CPU
tensor blk.0.ffn_up.weight buffer type overriden to CPU
tensor blk.1.attn_norm.weight buffer type overriden to CPU
tensor blk.1.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.1.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.1.attn_q_a.weight buffer type overriden to CPU
tensor blk.1.attn_q_b.weight buffer type overriden to CPU
tensor blk.1.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.1.attn_kv_b.weight buffer type overriden to CPU
tensor blk.1.attn_output.weight buffer type overriden to CPU
tensor blk.1.ffn_norm.weight buffer type overriden to CPU
tensor blk.1.ffn_gate.weight buffer type overriden to CPU
tensor blk.1.ffn_down.weight buffer type overriden to CPU
tensor blk.1.ffn_up.weight buffer type overriden to CPU
tensor blk.2.attn_norm.weight buffer type overriden to CPU
tensor blk.2.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.2.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.2.attn_q_a.weight buffer type overriden to CPU
tensor blk.2.attn_q_b.weight buffer type overriden to CPU
tensor blk.2.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.2.attn_kv_b.weight buffer type overriden to CPU
tensor blk.2.attn_output.weight buffer type overriden to CPU
tensor blk.2.ffn_norm.weight buffer type overriden to CPU
tensor blk.2.ffn_gate.weight buffer type overriden to CPU
tensor blk.2.ffn_down.weight buffer type overriden to CPU
tensor blk.2.ffn_up.weight buffer type overriden to CPU
tensor blk.3.attn_norm.weight buffer type overriden to CPU
tensor blk.3.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.3.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.3.attn_q_a.weight buffer type overriden to CPU
tensor blk.3.attn_q_b.weight buffer type overriden to CPU
tensor blk.3.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.3.attn_kv_b.weight buffer type overriden to CPU
tensor blk.3.attn_output.weight buffer type overriden to CPU
tensor blk.3.ffn_norm.weight buffer type overriden to CPU
tensor blk.3.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.3.exp_probs_b.bias buffer type overriden to CPU
tensor blk.3.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.3.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.3.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.3.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.3.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.4.attn_norm.weight buffer type overriden to CPU
tensor blk.4.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.4.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.4.attn_q_a.weight buffer type overriden to CPU
tensor blk.4.attn_q_b.weight buffer type overriden to CPU
tensor blk.4.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.4.attn_kv_b.weight buffer type overriden to CPU
tensor blk.4.attn_output.weight buffer type overriden to CPU
tensor blk.4.ffn_norm.weight buffer type overriden to CPU
tensor blk.4.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.4.exp_probs_b.bias buffer type overriden to CPU
tensor blk.4.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.4.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.4.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.4.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.4.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.5.attn_norm.weight buffer type overriden to CPU
tensor blk.5.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.5.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.5.attn_q_a.weight buffer type overriden to CPU
tensor blk.5.attn_q_b.weight buffer type overriden to CPU
tensor blk.5.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.5.attn_kv_b.weight buffer type overriden to CPU
tensor blk.5.attn_output.weight buffer type overriden to CPU
tensor blk.5.ffn_norm.weight buffer type overriden to CPU
tensor blk.5.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.5.exp_probs_b.bias buffer type overriden to CPU
tensor blk.5.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.5.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.5.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.5.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.5.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.6.attn_norm.weight buffer type overriden to CPU
tensor blk.6.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.6.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.6.attn_q_a.weight buffer type overriden to CPU
tensor blk.6.attn_q_b.weight buffer type overriden to CPU
tensor blk.6.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.6.attn_kv_b.weight buffer type overriden to CPU
tensor blk.6.attn_output.weight buffer type overriden to CPU
tensor blk.6.ffn_norm.weight buffer type overriden to CPU
tensor blk.6.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.6.exp_probs_b.bias buffer type overriden to CPU
tensor blk.6.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.6.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.6.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.6.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.6.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.7.attn_norm.weight buffer type overriden to CPU
tensor blk.7.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.7.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.7.attn_q_a.weight buffer type overriden to CPU
tensor blk.7.attn_q_b.weight buffer type overriden to CPU
tensor blk.7.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.7.attn_kv_b.weight buffer type overriden to CPU
tensor blk.7.attn_output.weight buffer type overriden to CPU
tensor blk.7.ffn_norm.weight buffer type overriden to CPU
tensor blk.7.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.7.exp_probs_b.bias buffer type overriden to CPU
tensor blk.7.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.7.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.7.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.7.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.7.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.7.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.8.attn_norm.weight buffer type overriden to CPU
tensor blk.8.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.8.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.8.attn_q_a.weight buffer type overriden to CPU
tensor blk.8.attn_q_b.weight buffer type overriden to CPU
tensor blk.8.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.8.attn_kv_b.weight buffer type overriden to CPU
tensor blk.8.attn_output.weight buffer type overriden to CPU
tensor blk.8.ffn_norm.weight buffer type overriden to CPU
tensor blk.8.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.8.exp_probs_b.bias buffer type overriden to CPU
tensor blk.8.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.8.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.8.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.8.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.8.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.8.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.9.attn_norm.weight buffer type overriden to CPU
tensor blk.9.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.9.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.9.attn_q_a.weight buffer type overriden to CPU
tensor blk.9.attn_q_b.weight buffer type overriden to CPU
tensor blk.9.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.9.attn_kv_b.weight buffer type overriden to CPU
tensor blk.9.attn_output.weight buffer type overriden to CPU
tensor blk.9.ffn_norm.weight buffer type overriden to CPU
tensor blk.9.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.9.exp_probs_b.bias buffer type overriden to CPU
tensor blk.9.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.9.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.9.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.9.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.9.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.9.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.10.attn_norm.weight buffer type overriden to CPU
tensor blk.10.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.10.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.10.attn_q_a.weight buffer type overriden to CPU
tensor blk.10.attn_q_b.weight buffer type overriden to CPU
tensor blk.10.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.10.attn_kv_b.weight buffer type overriden to CPU
tensor blk.10.attn_output.weight buffer type overriden to CPU
tensor blk.10.ffn_norm.weight buffer type overriden to CPU
tensor blk.10.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.10.exp_probs_b.bias buffer type overriden to CPU
tensor blk.10.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.10.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.10.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.10.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.10.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.10.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.11.attn_norm.weight buffer type overriden to CPU
tensor blk.11.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.11.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.11.attn_q_a.weight buffer type overriden to CPU
tensor blk.11.attn_q_b.weight buffer type overriden to CPU
tensor blk.11.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.11.attn_kv_b.weight buffer type overriden to CPU
tensor blk.11.attn_output.weight buffer type overriden to CPU
tensor blk.11.ffn_norm.weight buffer type overriden to CPU
tensor blk.11.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.11.exp_probs_b.bias buffer type overriden to CPU
tensor blk.11.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.11.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.11.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.11.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.11.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.11.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.12.attn_norm.weight buffer type overriden to CPU
tensor blk.12.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.12.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.12.attn_q_a.weight buffer type overriden to CPU
tensor blk.12.attn_q_b.weight buffer type overriden to CPU
tensor blk.12.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.12.attn_kv_b.weight buffer type overriden to CPU
tensor blk.12.attn_output.weight buffer type overriden to CPU
tensor blk.12.ffn_norm.weight buffer type overriden to CPU
tensor blk.12.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.12.exp_probs_b.bias buffer type overriden to CPU
tensor blk.12.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.12.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.12.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.12.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.12.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.12.ffn_up_shexp.weight buffer type overriden to CPU
tensor blk.13.attn_norm.weight buffer type overriden to CPU
tensor blk.13.attn_q_a_norm.weight buffer type overriden to CPU
tensor blk.13.attn_kv_a_norm.weight buffer type overriden to CPU
tensor blk.13.attn_q_a.weight buffer type overriden to CPU
tensor blk.13.attn_q_b.weight buffer type overriden to CPU
tensor blk.13.attn_kv_a_mqa.weight buffer type overriden to CPU
tensor blk.13.attn_kv_b.weight buffer type overriden to CPU
tensor blk.13.attn_output.weight buffer type overriden to CPU
tensor blk.13.ffn_norm.weight buffer type overriden to CPU
tensor blk.13.ffn_gate_inp.weight buffer type overriden to CPU
tensor blk.13.exp_probs_b.bias buffer type overriden to CPU
tensor blk.13.ffn_gate_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.13.ffn_down_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.13.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50053]
tensor blk.13.ffn_gate_shexp.weight buffer type overriden to CPU
tensor blk.13.ffn_down_shexp.weight buffer type overriden to CPU
tensor blk.13.ffn_up_shexp.weight buffer type overriden to CPU
[... the same per-layer overrides repeat for blk.14 through blk.60: in every layer the attention tensors, norms, ffn_gate_inp, exp_probs_b, and the ffn_*_shexp shared-expert tensors are overridden to CPU, while ffn_gate_exps, ffn_down_exps, and ffn_up_exps are overridden to RPC[127.0.0.1:50053] ...]
load_tensors: tensor 'token_embd.weight' (q4_K) (and 850 others) cannot be used with preferred buffer type AMX, using CPU instead
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: RPC[127.0.0.1:50053] model buffer size = 205716.00 MiB
load_tensors:   CPU_Mapped model buffer size = 47485.39 MiB
load_tensors:   CPU_Mapped model buffer size = 46505.52 MiB
load_tensors:   CPU_Mapped model buffer size = 46505.52 MiB
load_tensors:   CPU_Mapped model buffer size = 46505.52 MiB
load_tensors:   CPU_Mapped model buffer size = 24393.12 MiB
....................................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: layer 0: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: [... layers 1-60 identical: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384 ...]
llama_kv_cache_init: RPC[127.0.0.1:50053] KV buffer size =  9760.00 MiB
llama_init_from_model: KV self size  = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.49 MiB
llama_init_from_model: RPC[127.0.0.1:50053] compute buffer size =   707.12 MiB
llama_init_from_model:        CPU compute buffer size =   280.50 MiB
llama_init_from_model: graph nodes  = 5025
llama_init_from_model: graph splits = 727
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
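
For reference, the override pattern in the log above (routed experts on the RPC backend, everything else under blk.* forced to CPU) corresponds to a command of roughly this shape. This is a hedged reconstruction, not the exact invocation from the run: the buffer-type name RPC[127.0.0.1:50053] is taken from the log, the model path is a placeholder, and it assumes the first -ot rule to match a tensor wins, which is why the expert rule is listed before the catch-all:

./llama-server -m DeepSeek-R1-Q6_K.gguf --rpc 127.0.0.1:50053 -ngl 99 -c 2048 \
  -ot "exps=RPC[127.0.0.1:50053]" \
  -ot "blk\..*=CPU"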

Again, but with -nkvo (--no-kv-offload) to disable KV offload:

Bummer. I had hoped that keeping the KV cache, the attention tensors, and the shared experts on a single NUMA node, with only the routed experts (exps) on the other NUMA node, would reduce the overhead of send(), but it is still really slow either way.
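
For anyone wanting to try the same experiment, the NUMA placement itself can be done externally with numactl. A sketch under assumed node numbering (check numactl -H for the real topology; the rpc-server flag names are assumed from the RPC server example):

numactl --cpunodebind=1 --membind=1 ./rpc-server -H 127.0.0.1 -p 50053 &
numactl --cpunodebind=0 --membind=0 ./llama-server -m model.gguf --rpc 127.0.0.1:50053 ...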

llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:        CPU KV buffer size = 19520.00 MiB
llama_init_from_model: KV self size  = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.49 MiB
llama_init_from_model: RPC[127.0.0.1:50053] compute buffer size =   186.00 MiB
llama_init_from_model:        CPU compute buffer size =  1161.01 MiB
llama_init_from_model: graph nodes  = 5025
llama_init_from_model: graph splits = 605

@zts9989

zts9989 commented Mar 4, 2025

Thank you for this feature; it allows me to be more efficient in hybrid CPU/GPU inference with the DeepSeek R1 671B model, achieving approximately 13.x t/s.

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: CUDA0 model buffer size = 9460.52 MiB
load_tensors: CPU_Mapped model buffer size = 45358.94 MiB
load_tensors: CPU_Mapped model buffer size = 47124.64 MiB
load_tensors: CPU_Mapped model buffer size = 47124.64 MiB
load_tensors: CPU_Mapped model buffer size = 45927.96 MiB
load_tensors: CPU_Mapped model buffer size = 47124.64 MiB
load_tensors: CPU_Mapped model buffer size = 46024.76 MiB
load_tensors: CPU_Mapped model buffer size = 47124.64 MiB
load_tensors: CPU_Mapped model buffer size = 44656.01 MiB
load_tensors: CPU_Mapped model buffer size = 14105.06 MiB
....................................................................................................
llama_init_from_model: flash_attn requires n_embd_head_k == n_embd_head_v - forcing off
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 163840
llama_init_from_model: n_ctx_per_seq = 163840
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 0.025
llama_kv_cache_init: kv_size = 163840, offload = 0, type_k = 'q8_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CPU KV buffer size = 561200.00 MiB
llama_init_from_model: KV self size = 561200.00 MiB, K (q8_0): 248880.00 MiB, V (f16): 312320.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.49 MiB
llama_init_from_model: CUDA0 compute buffer size = 5009.50 MiB
llama_init_from_model: CUDA_Host compute buffer size = 41408.01 MiB
llama_init_from_model: graph nodes = 4793
llama_init_from_model: graph splits = 298 (with bs=512), 240 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 163840
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 48
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 48 (n_threads_batch = 48) / 192 | CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

main: interactive mode on.
sampler seed: 3047
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 163840
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 163840, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • To return control to the AI, end your input with '\'.
  • To return control without starting a new line, end your input with '/'.

You are a helpful assistant.

Hello

Okay, the user said "Hello". That's a friendly greeting. I should respond in a warm and welcoming manner. Let me make sure to keep it open-ended so they feel comfortable to ask anything. Maybe add a smiley to keep it friendly. Alright, something like, "Hello! How can I assist you today? 😊" That should work.

Hello! How can I assist you today? 😊

llama_perf_sampler_print: sampling time = 5.85 ms / 96 runs ( 0.06 ms per token, 16413.06 tokens per second)
llama_perf_context_print: load time = 105911.73 ms
llama_perf_context_print: prompt eval time = 9642.73 ms / 12 tokens ( 803.56 ms per token, 1.24 tokens per second)
llama_perf_context_print: eval time = 6497.03 ms / 90 runs ( 72.19 ms per token, 13.85 tokens per second)
llama_perf_context_print: total time = 17529.38 ms / 102 tokens
Interrupted by user

@Readon

Readon commented Mar 5, 2025

Thank you for this feature; it allows me to be more efficient in hybrid CPU/GPU inference with the DeepSeek R1 671B model, achieving approximately 13.x t/s.

Could you provide more information on your machine and the commands that get 13.x t/s?

@zts9989

zts9989 commented Mar 5, 2025

Certainly.

My setup is: AMD EPYC 9654 * 2, 64 GB DDR5-4800 * 24, 4070 Ti Super 16 GB GPU * 1, Debian 12.

The model used is: DeepSeek R1 671B Q4_K_M.

Cmake command: cmake -B build -DGGML_CUDA=ON -DGGML_BUILD_NUMBER=3 -DGGML_OPENMP=OFF -DGGML_SCHED_MAX_COPIES=1

Run command: CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-cli -m /data/deepseekR1/DeepSeek-R1-Q4_K_M-000000.gguf -cnv -p "You are a helpful assistant." -fa -c 65536 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -t 48 -ngl 160 -nkvo -c 163840 -ctk q8_0 -ot exps=CPU

@zts9989

zts9989 commented Mar 5, 2025

To ubergarm:
Ktransformers is a good experimental field that can help validate some conjectures about optimizing performance. However, it is not a well-engineered project. For example, when deploying large models, a long context is a must-have. I could also place a small context, say 256 tokens, on the GPU; with such a short context the existing configuration reaches 15.x t/s, which is very fast but impractical, because it cannot handle large amounts of text.

Therefore, I used the -nkvo option, which allows for large-scale text processing.

The performance optimization approach of ktransformers has two key points:
1. CPU inference: in a multi-NUMA environment, a separate copy of the model data is kept on each NUMA node to avoid bottlenecks in data communication between the nodes.
2. Placing the KV cache and some layers on the GPU to leverage the GPU's high memory bandwidth.

However, from an engineering perspective, while keeping a separate data copy on each NUMA node can unlock more memory bandwidth, synchronizing across NUMA nodes is a major issue. Profiling the inference program with the perf tool shows that approximately 50% of CPU usage is spent on thread synchronization, which is a significant area for optimization. While KT achieves some acceleration by running inference across NUMA nodes, it still faces high synchronization overhead when merging data back to the main thread.

The second optimization point is exactly what this PR implements: placing the sparse expert weights in CPU memory instead of GPU memory, which saves GPU memory. CUDA graphs can effectively reduce the cost of the interaction between CUDA and CPU computations.

Placing the KV cache in GPU memory, as mentioned earlier, only has an advantage in benchmark scores and is practically useless in real-world applications. For example, with a context length of 163,840, the quantized KV cache requires around 540 GB of storage; I do not have that much GPU memory, and the cost would be far too high.
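For reference, the 561,200 MiB in the log above can be reproduced with a direct estimate. This is a sketch assuming the per-layer KV widths printed earlier in the thread (n_embd_k_gqa = 24576, n_embd_v_gqa = 16384) and q8_0's 34 bytes per block of 32 elements:

$$
\begin{aligned}
K\ (\text{q8\_0}) &= 163840 \times 61 \times 24576 \times \tfrac{34}{32}\ \text{bytes} = 248880\ \text{MiB} \\
V\ (\text{f16}) &= 163840 \times 61 \times 16384 \times 2\ \text{bytes} = 312320\ \text{MiB} \\
\text{total} &= 561200\ \text{MiB} \approx 548\ \text{GiB}
\end{aligned}
$$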

Of course, MLA can alleviate the performance degradation when dealing with long contexts, and combining it with -ot provides the best single-machine deployment experience for R1 inference.

VLLM and SGLang have advantages in enterprise-scale deployments, but for single-machine deployments, I believe we may still need to rely on llama.cpp (my personal opinion).

@jukofyork
Contributor

To ubergarm: Ktransformers is a good experimental field that can help validate some conjectures about optimizing performance.

Yeah, I think if you care about the quality of the generation then llama.cpp is definitely the best choice too. The MLA attention tensors don't seem to quantize well at all and they are using 4-bit for these; plus, last time I checked, they were only using 6 experts instead of 8.

I've got a custom llama.cpp quant with BF16 for all the _a and _b low-rank MLA attention tensors, Q6_K / Q5_K for all non-shared expert down_proj and up_proj/gate_proj respectively, and Q8_0 for everything else. Its story-generation ability is on par with the official DeepSeek-served models (and a lot better than many of the unofficial versions being served on OpenRouter!).

Just changing the _b tensors to Q8_0 (and keeping everything else the same as above) starts to have really obvious negative effects on story generation, and using Q4_K or Q4_0 is severely degraded in comparison. I haven't retested this yet with the modified version of the MLA PR where I converted all the 3D batched matrix multiplies to 2D, though (that seemed to be a cause of some numerical problems too, and might be the same reason for this).

@ubergarm

ubergarm commented Mar 5, 2025

(sorry for spamming this PR thread)

@zts9989

Thanks for the discussion and I agree with many of your points. One more detail you left out:

My setup is: AMD EPYC 9654 * 2

My guess is you have the BIOS set to NPS0? That is the only way I am aware of to get over 10 tok/sec on CPU only while avoiding the multi-NUMA issue. Please confirm.

@jukofyork

I've got a custom llama.cpp quant ...

Could you share the convert_hf_to_gguf.py CLI command or what you used to create your quant, if possible?

@zts9989

zts9989 commented Mar 6, 2025

(sorry for spamming this PR thread too)

@ubergarm NPS0 allows for the maximum memory bandwidth with minimal software overhead, but it introduces additional latency because the remote NUMA node's memory controller is involved: in this mode, data reaches the CPU's L3 cache only after being fetched through both the local (say, latency 10) and the remote (say, latency 50) memory controllers.

NPS1 provides high memory bandwidth while keeping latency low, but it requires the program itself to handle memory placement. This mode offers the maximum usable bandwidth, which aligns with KT's optimization approach: by storing a full copy of the model data on each of the two NUMA nodes, local threads can access local model data, unlocking the system's maximum memory bandwidth.

I adopted a similar approach: I modified struct ggml_tensor to hold a data copy for each NUMA node (doubling memory consumption), and I bound the threads in the thread pool to NUMA nodes (CPU nodes). During the ggml_compute_forward_mul_mat computation, each thread then accesses the local data copy for its own NUMA node ID, achieving the same NUMA optimization as KT. (This is why I require GGML_OPENMP=OFF: I need to control thread-to-CPU binding manually, distributing the threads evenly across the AMD CPU CCDs.)

Through this method, faster inference can be achieved in an NPS1 system configuration.
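A minimal sketch of this per-node replication idea, assuming Linux with libnuma (link with -lnuma); numa_tensor, numa_tensor_replicate, bind_thread_to_cpu, and numa_tensor_local are hypothetical names for illustration, not ggml or llama.cpp APIs:

```cpp
// Sketch only: per-NUMA-node weight replicas selected by worker threads.
// Assumes Linux + libnuma; error handling and numa_free() are omitted.
#include <numa.h>     // numa_alloc_onnode, numa_node_of_cpu, numa_num_configured_nodes
#include <pthread.h>  // pthread_setaffinity_np
#include <sched.h>    // sched_getcpu, cpu_set_t
#include <cstring>    // memcpy

constexpr int MAX_NUMA_NODES = 8;

// Hypothetical tensor extension: one copy of the weight data per NUMA node.
struct numa_tensor {
    void * replica[MAX_NUMA_NODES] = {}; // replica[n] is allocated on node n
    size_t nbytes = 0;
};

// Duplicate the master weights onto every configured node
// (this doubles memory consumption on a two-node system).
void numa_tensor_replicate(numa_tensor & t, const void * master, size_t nbytes) {
    t.nbytes = nbytes;
    const int n_nodes = numa_num_configured_nodes();
    for (int n = 0; n < n_nodes && n < MAX_NUMA_NODES; n++) {
        t.replica[n] = numa_alloc_onnode(nbytes, n);
        std::memcpy(t.replica[n], master, nbytes);
    }
}

// Pin a worker thread to a fixed CPU so that its NUMA node never changes
// (the manual binding that GGML_OPENMP=OFF makes possible).
void bind_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Inside the matmul worker: read from the replica local to this thread's node.
const void * numa_tensor_local(const numa_tensor & t) {
    const int node = numa_node_of_cpu(sched_getcpu());
    return t.replica[(node >= 0 && node < MAX_NUMA_NODES) ? node : 0];
}
```

In a real integration the local pointer would be picked up inside ggml_compute_forward_mul_mat, and a numa_available() check would guard the whole path.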

@jukofyork Thank you for sharing your quantization experience. I tested scenarios with 4, 6, 8, and 12 experts and ultimately settled on the 6-expert configuration, which strikes the best balance between inference quality and speed.
