Error malloc(): unaligned tcache chunk detected always occurs after the TensorRT server handles a certain number of requests #587
Comments
Same bug observed. Exact same behavior on 8xH100 with Llama models.
Let me provide more detail on my side. I am benchmarking vLLM together with TensorRT-LLM and encountered the same issue when running the benchmark on 8xH100 (the same benchmark runs normally on 8xA100 on my side).
This code is from the vLLM performance benchmark.
We are facing the same issue. Based on this, we also believe it is likely caused by a buggy cleanup process.
Hardware setup: DGX H100
Checkpoint Conversion:
Build Step:
Ran into the same problem. Any updates?
Hit the same issue with the ONNX backend.
System Info
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
git clone https://github.com/NVIDIA/TensorRT-LLM.git
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
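The original benchmark/client commands are not included above. As a rough stand-in, a loop like the one below can drive enough traffic to reach the failure point. This is only a sketch: the Triton HTTP generate endpoint on the default port 8000, the model name ensemble, and the payload fields are assumptions for illustration, not taken from the report.

```bash
# Hypothetical load generator (not the reporter's benchmark): send many
# generate requests to the running Triton server. The endpoint, model name
# ("ensemble"), and request fields are placeholders.
for i in $(seq 1 5000); do
  curl -s http://localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 64}' \
    > /dev/null
done
```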
Expected behavior
All requests are processed successfully and no errors occur.
Actual behavior
When the server performs many inferences, for example 5000, it raises the error:
malloc(): unaligned tcache chunk detected
Signal (6) received.
Both continuous inference and intermittent inference (for example, spread over one day) cause this error.
When I run 8000 inferences in one test, it raises the error:
pinned_memory_manager.cc:170] "failed to allocate pinned system memory, falling back to non-pinned system memory
Finally, I set cuda-memory-pool-byte-size to 512 MB and pinned-memory-pool-byte-size to 512 MB, which solved this problem, but these two parameters are not exposed in the script scripts/launch_triton_server.py. So I want to ask why this problem occurs and whether there is another way to solve it.
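For reference, a minimal sketch of how those two pool sizes can be passed to tritonserver directly (or appended to the command that scripts/launch_triton_server.py builds). The model repository path and GPU device id 0 are placeholders, not values from the report.

```bash
# Sketch: enlarge the pinned and CUDA memory pools to 512 MiB each.
# /path/to/triton_model_repo and device id 0 are placeholders.
tritonserver \
  --model-repository=/path/to/triton_model_repo \
  --pinned-memory-pool-byte-size=536870912 \
  --cuda-memory-pool-byte-size=0:536870912
```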
When I call the server with high concurrency, it raises the error:
malloc_consolidate(): unaligned fastbin chunk detected
Signal (6) received.
Hope you can help me solve these problems, thanks very much!
Additional notes
This seems to happen because the server does not completely clean up memory after each inference completes.