Error malloc(): unaligned tcache chunk detected always occurs after the TensorRT server handles a certain number of requests #587
Comments
Same bug observed. Exact same behavior on 8xH100 with Llama models.
Let me provide more detail on my side. I am benchmarking vLLM together with TensorRT-LLM and encountered the same issue when running the benchmark on 8xH100 (the same benchmark runs normally on 8xA100 on my side).
This code is from the vLLM performance benchmark.
We are facing the same issue. Based on this, we also believe it is likely caused by a buggy cleanup process.
Hardware setup: DGX H100
Checkpoint Conversion:
Build Step:
Ran into the same problem. Any updates?
Hit the same issue with the ONNX backend.
System Info
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
git clone https://github.com/NVIDIA/TensorRT-LLM.git
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
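The original benchmark/client commands are not included above. As a rough stand-in, a loop like the one below can drive enough traffic to reach the failure point. This is only a sketch: the Triton HTTP generate endpoint on the default port 8000, the model name ensemble, and the payload fields are assumptions for illustration, not taken from the report.

```bash
# Hypothetical load generator (not the reporter's benchmark): send many
# generate requests to the running Triton server. The endpoint, model name
# ("ensemble"), and request fields are placeholders.
for i in $(seq 1 5000); do
  curl -s http://localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 64}' \
    > /dev/null
done
```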
Expected behavior
All requests are processed successfully and no errors occur.
Actual behavior
When the server performs many inferences, for example 5000, it raises the error:
malloc(): unaligned tcache chunk detected
Signal (6) received.
Both continuous inference and intermittent inference (for example, spread over one day) cause this error.
When I run 8000 inferences in one test, it raises the error:
pinned_memory_manager.cc:170] "failed to allocate pinned system memory, falling back to non-pinned system memory
Finally, I set cuda-memory-pool-byte-size to 512 MB and pinned-memory-pool-byte-size to 512 MB, which solved this problem, but these two parameters are not exposed in the script scripts/launch_triton_server.py. So I want to ask why this problem occurs and whether there is another way to solve it.
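For reference, a minimal sketch of how those two pool sizes can be passed to tritonserver directly (or appended to the command that scripts/launch_triton_server.py builds). The model repository path and GPU device id 0 are placeholders, not values from the report.

```bash
# Sketch: enlarge the pinned and CUDA memory pools to 512 MiB each.
# /path/to/triton_model_repo and device id 0 are placeholders.
tritonserver \
  --model-repository=/path/to/triton_model_repo \
  --pinned-memory-pool-byte-size=536870912 \
  --cuda-memory-pool-byte-size=0:536870912
```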
When I call the server with high concurrency, it raises the error:
malloc_consolidate(): unaligned fastbin chunk detected
Signal (6) received.
Hope you can help me solve these problems, thanks very much!
Additional notes
This seems to happen because the server does not completely clean up memory after each inference completes.