
[Performance] Share weights between sessions to accelerate inference #20172

Open
xiong-qiao opened this issue Apr 1, 2024 · 2 comments

@xiong-qiao

Describe the issue

I'm trying to share base model weights between ORT inference sessions to accelerate inference for adapter models. Based on the API documentation and the previous conversation in #15301, I:

  1. manually export the base model and adapter weights
  2. load the base model weights into tensors
  3. add them to the session_options object using the add_initializer API
  4. create an inference session using the session_options object from step 3
  5. for each additional set of adapter weights, repeat steps 3-4

I expect the subsequent session creations to be much faster than the first one, since the base model weights are reused, but I'm seeing the same latency for every session creation. Did I miss anything here?

To reproduce

Below is the sample Python code snippet I used for testing:

import time

import onnxruntime as ort


def create_inference_session(base_model, adapter_tensors, base_model_tensors):
    # Add base model weights to session options
    start = time.time()
    opts = ort.SessionOptions()
    for name, data in base_model_tensors.items():
        opts.add_initializer(name, data)
    end = time.time()
    print(f'Base model adding time: {end - start}')

    # Add adapter weights to session options
    start = time.time()
    for tensor in adapter_tensors:
        opts.add_initializer(tensor[0], tensor[1])
    end = time.time()
    print(f'Adapter adding time: {end - start}')

    # Create a new inference session
    start = time.time()
    session = ort.InferenceSession(base_model.SerializeToString(), sess_options=opts,
                                   providers=['CPUExecutionProvider'])
    end = time.time()
    print(f'Session creation time: {end - start}')
    return session
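
For context, base_model_tensors and adapter_tensors above are assumed to hold OrtValue objects keyed (or paired) with initializer names. A hypothetical construction, with made-up tensor names and shapes, assuming add_initializer accepts OrtValue objects:

import numpy as np
import onnxruntime as ort

# Hypothetical weight collections matching the snippet above: name -> OrtValue,
# and (name, OrtValue) pairs. Names and shapes are placeholders.
base_model_tensors = {
    'base.layer0.weight': ort.OrtValue.ortvalue_from_numpy(
        np.random.rand(768, 768).astype(np.float32)),
}
adapter_tensors = [
    ('adapter.lora_A', ort.OrtValue.ortvalue_from_numpy(
        np.random.rand(768, 16).astype(np.float32))),
]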

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04.4 LTS

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

@pranavsharma
Contributor

pranavsharma commented Apr 2, 2024

The steps for sharing weights between models are as follows:

  1. Create a session with session options such that the optimized model is serialized to disk and the weights are externalized.
  2. Henceforth, we'll use the serialized model. Create a session with this model after adding the external weights to the session options via the AddInitializer API and turning off all optimizations (ORT_DISABLE_ALL).

Let me know if you've measured the session creation time after following these steps. Session creation is guaranteed to be much faster than in step 1. And this has little to do with sharing weights and more to do with the fact that you're not causing the optimizers to run at all, because you already serialized the model after running them in step 1.
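
As an illustration only, here is a minimal sketch of those two steps, assuming the session config keys session.optimized_model_external_initializers_file_name and session.optimized_model_external_initializers_min_size_in_bytes are available in your ORT version, and reading the externalized weights back with the onnx package; all file names are placeholders:

import onnx
import onnxruntime as ort
from onnx import numpy_helper

# Step 1 (one-time): serialize the optimized model to disk with large initializers
# externalized. The two config keys below are assumptions about the ORT release in use.
opts = ort.SessionOptions()
opts.optimized_model_filepath = 'base_optimized.onnx'
opts.add_session_config_entry(
    'session.optimized_model_external_initializers_file_name', 'base_optimized.data')
opts.add_session_config_entry(
    'session.optimized_model_external_initializers_min_size_in_bytes', '1024')
ort.InferenceSession('base_model.onnx', sess_options=opts,
                     providers=['CPUExecutionProvider'])

# Step 2 (per adapter): reuse the serialized model, pass the weights in through
# add_initializer, and disable all graph optimizations.
optimized = onnx.load('base_optimized.onnx')  # also resolves base_optimized.data
base_weights = {init.name: ort.OrtValue.ortvalue_from_numpy(numpy_helper.to_array(init))
                for init in optimized.graph.initializer}

opts2 = ort.SessionOptions()
opts2.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
for name, value in base_weights.items():
    opts2.add_initializer(name, value)
session = ort.InferenceSession('base_optimized.onnx', sess_options=opts2,
                               providers=['CPUExecutionProvider'])

With the graph already optimized on disk and optimizations disabled, the later session creations skip the optimizer passes entirely.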

@pranavsharma self-assigned this on Apr 2, 2024
@pranavsharma added the core runtime label (issues related to core runtime) on Apr 2, 2024
@xiong-qiao
Author

Thanks for the quick reply! I just followed the steps you mentioned and lowered the session creation time from 7s to <1s. That is a big saving from doing the optimization step ahead of time. But since our goal is to reduce model loading/switching time as much as we can, we have a few follow-up questions:

  1. Just to make sure we understand you 100%: you said that we need to save some of the tensors as external. Does this mean all of the tensors in the model, or just the ones that are going to change between inference sessions, i.e. the adapter weights?
  2. Since we will repeatedly be building large inference sessions that share most of their weights, are there also time/memory savings we could get by caching the initializers in memory? Is that what the PrepackedWeights parameter is for, and would that work on CUDA?
  3. "And this has little to do with sharing weights...": right now we only have one inference session alive, so I guess you mean that sharing weights is more relevant if we maintain multiple inference sessions that share weights between them. If we want to do this in the future, how can we reuse the base weights to reduce the memory footprint and make initialization of subsequent sessions even faster? (See the sketch after this list.)
  4. Some tensors are going to be overridden using add_initializer. Where and how are we supposed to read these tensors from? Could they be read from the original, unoptimized model file, or do they have to be read from the optimized file? And is there some way to do that in C#, or does it require the ONNX Python API?
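
For question 3, a rough sketch, building on the base_weights dictionary from the sketch above, of how the same in-memory OrtValue objects could back several concurrent sessions; adapter_a_tensors and adapter_b_tensors are hypothetical dicts of adapter name -> OrtValue:

def make_adapter_session(adapter_tensors):
    # Every session receives the same OrtValue objects, so the base weights live
    # once in host memory; only the adapter weights differ between sessions.
    opts = ort.SessionOptions()
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
    for name, value in base_weights.items():
        opts.add_initializer(name, value)
    for name, value in adapter_tensors.items():
        opts.add_initializer(name, value)
    return ort.InferenceSession('base_optimized.onnx', sess_options=opts,
                                providers=['CPUExecutionProvider'])

# Two live sessions sharing the base weights but using different adapters.
session_a = make_adapter_session(adapter_a_tensors)
session_b = make_adapter_session(adapter_b_tensors)

This relies on the documented add_initializer semantics: the caller keeps ownership of the buffers it passes in, which is what allows multiple sessions to reference the same memory.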
