Describe the issue
I'm trying to share base model weights between ORT inference sessions to accelerate inference for adapter models. Based on the API documentation and the previous conversation in #15301, I:
manually export the base model and adapter weights
load the base model weights into tensors
add them to a SessionOptions object using the add_initializer API
create an inference session using the SessionOptions object from step 3
for different adapter weights, repeat steps 3-4
I expect subsequent calls that create an inference session to be much faster than the first one, since we reuse the base model weights, but I'm seeing the same latency for every session creation. Did I miss anything here?
To reproduce
Below is the sample Python code snippet I used for testing:
import time

import onnxruntime as ort


def create_inference_session(base_model, adapter_tensors, base_model_tensors):
    # Add base model weights to session options
    start = time.time()
    opts = ort.SessionOptions()
    for name, data in base_model_tensors.items():
        opts.add_initializer(name, data)
    end = time.time()
    print(f'Base model adding time: {end - start}')

    # Add adapter weights to session options
    start = time.time()
    for tensor in adapter_tensors:
        opts.add_initializer(tensor[0], tensor[1])
    end = time.time()
    print(f'Adapter adding time: {end - start}')

    # Create a new inference session
    start = time.time()
    session = ort.InferenceSession(base_model.SerializeToString(), sess_options=opts,
                                   providers=['CPUExecutionProvider'])
    end = time.time()
    print(f'Session creation time: {end - start}')
    return session
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04.4 LTS
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.15.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No
The steps for sharing weights between models are as follows:
Create a session with session options set so that the optimized model is serialized to disk and the weights are externalized.
From then on, use the serialized model: create a session with it after adding the external weights to the session options via the AddInitializer API and turning off all optimizations (ORT_DISABLE_ALL).
Let me know if you've measured the session creation time after following these steps; session creation is guaranteed to be much faster than in step 1. A sketch of both steps is below. This has little to do with sharing weights and more to do with the fact that you aren't running the optimizers at all, because you already serialized the model after running them in step 1.
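For reference, here is a minimal Python sketch of both steps, under some assumptions: the file names are placeholders, the session config keys used to externalize the optimized model's initializers (session.optimized_model_external_initializers_file_name and its min-size companion) are taken to behave as described, and the weights are fed back in as OrtValues. Treat it as a starting point rather than a verified recipe.

import onnx
import onnxruntime as ort
from onnx import numpy_helper

# Step 1: build a session once so the optimized model is written to disk with its
# large initializers externalized into a side file (config keys assumed, see above).
opts = ort.SessionOptions()
opts.optimized_model_filepath = 'optimized_model.onnx'
opts.add_session_config_entry('session.optimized_model_external_initializers_file_name', 'optimized_model.data')
opts.add_session_config_entry('session.optimized_model_external_initializers_min_size_in_bytes', '1024')
ort.InferenceSession('base_model.onnx', sess_options=opts, providers=['CPUExecutionProvider'])

# Step 2: from now on, build sessions from the serialized model with optimizations
# disabled, feeding the externalized weights back in through add_initializer.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
optimized = onnx.load('optimized_model.onnx')  # also pulls in the external data file
weights = {i.name: ort.OrtValue.ortvalue_from_numpy(numpy_helper.to_array(i))
           for i in optimized.graph.initializer}
for name, value in weights.items():
    opts.add_initializer(name, value)
session = ort.InferenceSession('optimized_model.onnx', sess_options=opts,
                               providers=['CPUExecutionProvider'])

The weights dict is kept alive on purpose: the buffers handed to add_initializer need to outlive any session that uses them.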
Thanks for the quick reply! I just followed the steps you mentioned and lowered the session creation time from 7 s to <1 s. Pre-running the optimization step is a big saving. But since our goal is to reduce model loading/switching time as much as possible, we have a few follow-up questions:
Just to make sure we understand you 100%: since you said we need to save some of the tensors as external, does this mean all of the tensors in the model, or just the ones that will change between inference sessions, i.e. the adapter weights?
Since we will repeatedly be building large inference sessions that share most of their weights, are there additional time/memory savings we could get by caching the initializers in memory? Is that what the PrepackedWeights parameter is for, and would that work with CUDA?
Regarding "And this has little to do with sharing weights...": right now we only keep one inference session alive, so I take it sharing weights matters more when maintaining multiple inference sessions that share weights. If we want to do this in the future, how can we reuse the base weights to reduce the memory footprint and make initialization of subsequent sessions even faster?
Some tensors will be overridden using add_initializer. Where and how are we supposed to read these tensors from? Can they be read from the original, unoptimized model file, or do they have to come from the optimized file? And is there a way to do that in C#, or does it require the ONNX Python API?
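Purely to make that last question concrete, here is a rough Python-side sketch of the flow being asked about, with placeholder file names and the (unverified) assumption that the adapter weights are available as their own ONNX file: only the adapter initializers are overridden via add_initializer, and which file they should be read from is exactly the open question.

import onnx
import onnxruntime as ort
from onnx import numpy_helper

# Hypothetical: the adapter weights live in their own ONNX file; read them and
# convert each initializer to an OrtValue for add_initializer.
adapter = onnx.load('adapter_weights.onnx')
adapter_weights = {i.name: ort.OrtValue.ortvalue_from_numpy(numpy_helper.to_array(i))
                   for i in adapter.graph.initializer}

# Override just those tensors; the remaining initializers are read from the
# serialized optimized model as usual.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
for name, value in adapter_weights.items():
    opts.add_initializer(name, value)
session = ort.InferenceSession('optimized_model.onnx', sess_options=opts,
                               providers=['CPUExecutionProvider'])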