feat: Support tool calling for non-streaming chat completion in remote vLLM provider #1034

Merged: 4 commits into meta-llama:main on Feb 12, 2025

Conversation

@terrytangyuan (Collaborator) commented on Feb 10, 2025

What does this PR do?

This PR adds support for tool calling for non-streaming chat completion. Prior to this, tool calls were not passed in chat completion requests, and the tools object needed to be restructured to be compatible with the vLLM provider.
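
As an illustration of that restructuring, here is a minimal sketch that turns a Llama Stack tool definition into the OpenAI-style `tools` entry that vLLM accepts; the field names (`tool_name`, `parameters`, `param_type`, `required`) are assumptions about the tool-definition data model, not necessarily the PR's exact code:

```python
# Hedged sketch (not the PR's exact code): restructure a Llama Stack tool
# definition into the OpenAI-style "tools" entry that vLLM accepts. The
# field names used on `tool` are assumptions about the data model.
def convert_tooldef_for_vllm(tool) -> dict:
    properties = {}
    required = []
    for param_name, param in (tool.parameters or {}).items():
        properties[param_name] = {
            "type": param.param_type,
            "description": param.description,
        }
        if param.required:
            required.append(param_name)
    return {
        "type": "function",
        "function": {
            "name": tool.tool_name,
            "description": tool.description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }
```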

Test Plan

```
LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/inference/test_text_inference.py
================================================================= test session starts =================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/yutang/.conda/envs/distribution-myenv/bin/python3.10
cachedir: .pytest_cache
rootdir: /home/yutang/repos/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0
collected 12 items                                                                                                                                    

tests/client-sdk/inference/test_text_inference.py::test_text_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED                  [  8%]
tests/client-sdk/inference/test_text_inference.py::test_text_completion_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED                      [ 16%]
tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_non_streaming[meta-llama/Llama-3.1-8B-Instruct] XFAIL (remote:...) [ 25%]
tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_streaming[meta-llama/Llama-3.1-8B-Instruct] XFAIL (remote::vll...) [ 33%]
tests/client-sdk/inference/test_text_inference.py::test_text_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct] PASSED              [ 41%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-Which planet do humans live on?-Earth] PASSED [ 50%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-Which planet has rings around it with a name starting with letter S?-Saturn] PASSED [ 58%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What's the name of the Sun in latin?-Sol] PASSED [ 66%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What is the name of the US captial?-Washington] PASSED [ 75%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 83%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[meta-llama/Llama-3.1-8B-Instruct] FAILED [ 91%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct] PASSED         [100%]
```

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Feb 10, 2025
@terrytangyuan (Collaborator, Author) commented on Feb 10, 2025

Will need to parse the response for compatibility, since there seems to be an issue parsing a completion result that includes tool calls:

```
Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-b1343807797b4b6aa972b186faa09620', function=Function(arguments='{"location": "San Francisco, CA"}', name='get_weather'), type='function')], reasoning_content=None), stop_reason=128008)
```

Error:

```
Traceback (most recent call last):
  File "/home/yutang/repos/llama-stack/llama_stack/distribution/server/server.py", line 182, in endpoint
    return await maybe_await(value)
  File "/home/yutang/repos/llama-stack/llama_stack/distribution/server/server.py", line 148, in maybe_await
    return await value
  File "/home/yutang/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 91, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/home/yutang/repos/llama-stack/llama_stack/distribution/routers/routers.py", line 169, in chat_completion
    return await provider.chat_completion(**params)
  File "/home/yutang/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 91, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/home/yutang/repos/llama-stack/llama_stack/providers/remote/inference/vllm/vllm.py", line 138, in chat_completion
    return await self._nonstream_chat_completion(request, self.client)
  File "/home/yutang/repos/llama-stack/llama_stack/providers/remote/inference/vllm/vllm.py", line 145, in _nonstream_chat_completion
    return process_chat_completion_response(r, self.formatter)
  File "/home/yutang/repos/llama-stack/llama_stack/providers/utils/inference/openai_compat.py", line 178, in process_chat_completion_response
    raw_message = formatter.decode_assistant_message_from_content(
  File "/home/yutang/.conda/envs/distribution-myenv/lib/python3.10/site-packages/llama_models/llama3/api/chat_format.py", line 170, in decode_assistant_message_from_content
    content = content.strip(" ")
AttributeError: 'NoneType' object has no attribute 'strip'
```
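
The root cause is visible in the repr above: when vLLM returns tool calls, `message.content` is `None`, and the formatter's text-decoding path calls `.strip()` on it. One way around this, sketched below with illustrative names (not the PR's exact code), is to build the tool calls directly from the structured `tool_calls` field rather than decoding them from message text:

```python
import json

# Hedged sketch (illustrative names, not the PR's exact code): when the
# OpenAI-compatible response carries structured tool_calls and content is
# None, construct the tool calls from the structured field instead of
# asking the formatter to decode them from raw text.
def extract_tool_calls(choice) -> list[dict]:
    message = choice.message
    if not message.tool_calls:
        return []
    return [
        {
            "call_id": tool_call.id,
            "tool_name": tool_call.function.name,
            "arguments": json.loads(tool_call.function.arguments),
        }
        for tool_call in message.tool_calls
    ]
```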

@terrytangyuan changed the title from "fix: Handle tool calling in remote vLLM provider" to "feat: Support tool calling for non-streaming chat completion in remote vLLM provider" on Feb 11, 2025
@terrytangyuan (Collaborator, Author) commented on Feb 11, 2025

Verified that all tests pass, including test_text_chat_completion_with_tool_calling_and_non_streaming. test_text_chat_completion_with_tool_calling_and_streaming does not pass yet; it will need to be worked on separately (created #1046 to track it, in case anyone else is interested in working on it).

@yanxi0830 (Contributor) commented

> Will need to parse the response for compatibility, since there seems to be an issue parsing a completion result that includes tool calls: […]

@terrytangyuan It seems like `content=None` is the culprit. What happens if you set `content=""`?
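
For reference, that suggestion amounts to a one-line guard before the formatter call; a sketch assuming the call site in process_chat_completion_response, with the second argument to decode_assistant_message_from_content inferred from the surrounding code:

```python
# Sketch of the suggested workaround (assumed call site, not exact code):
# coerce a None content to "" so the formatter's .strip() does not fail.
def decode_message_safely(formatter, choice, stop_reason):
    content = choice.message.content or ""  # guard the .strip() on None
    return formatter.decode_assistant_message_from_content(content, stop_reason)
```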

@thoraxe commented on Feb 11, 2025

I tested this PR from a container build that @terrytangyuan provided and can confirm that the tools portion of the payload is now passed to vLLM, where it was not previously:

```
{
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a helpful assistant with access to the following\nfunction calls. Your task is to produce a list of function calls\nnecessary to generate response to the user utterance. Use the following\nfunction calls as required."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What pods are in the namespace openshift-lightspeed?"
                }
            ]
        }
    ],
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "max_tokens": 4096,
    "stream": true,
    "temperature": 0.0,
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_object_namespace_list",
                "description": "Get the list of all objects in a namespace",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "kind": {
                            "type": "str",
                            "description": "the type of object"
                        },
                        "namespace": {
                            "type": "str",
                            "description": "the name of the namespace"
                        }
                    },
                    "required": [
                        "kind",
                        "namespace"
                    ]
                }
            }
        }
    ]
}
```
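
To reproduce a request like this outside the stack, here is a minimal sketch using the `openai` client against vLLM's OpenAI-compatible endpoint; the base URL is an assumption for a locally running server, and streaming is disabled since this PR targets the non-streaming path:

```python
from openai import OpenAI

# Assumed local vLLM endpoint; adjust base_url for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    max_tokens=4096,
    temperature=0.0,
    stream=False,  # non-streaming path, the case this PR fixes
    messages=[
        {
            "role": "user",
            "content": "What pods are in the namespace openshift-lightspeed?",
        },
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_object_namespace_list",
                "description": "Get the list of all objects in a namespace",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "kind": {"type": "str", "description": "the type of object"},
                        "namespace": {"type": "str", "description": "the name of the namespace"},
                    },
                    "required": ["kind", "namespace"],
                },
            },
        }
    ],
)
print(response.choices[0].message.tool_calls)
```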

@terrytangyuan (Collaborator, Author) commented on Feb 11, 2025

> @terrytangyuan It seems like `content=None` is the culprit. What happens if you set `content=""`?

I tried that but ran into another rabbit hole. Since the current fix works, should we merge it for now and investigate the remaining issues separately?

@hardikjshah (Contributor) left a review

Looks good, let's get this in.

@terrytangyuan merged commit dd37e58 into meta-llama:main on Feb 12, 2025
3 checks passed
@terrytangyuan deleted the fix-tool-calling-vllm branch on February 12, 2025 at 02:08
srikanthbachala20 pushed a commit to srikanthbachala20/llama-stack that referenced this pull request on Feb 27, 2025: feat: Support tool calling for non-streaming chat completion in remote vLLM provider (meta-llama#1034)