Describe the issue
I'm trying to share base model weights between ORT inference sessions to accelerate inference for adapter models. Based on the API documentation and the previous conversation in #15301, I:
manually export the base model and adapter weights
load the base model weights into tensors
add them to a SessionOptions object using the add_initializer API
create an inference session using the SessionOptions object from step 3
for different adapter weights, repeat steps 3-4
I expect subsequent calls that create an inference session to be much faster than the first one, since we reuse the base model weights, but I'm seeing the same latency for every session creation. Did I miss anything here?
To reproduce
Below is the sample Python code snippet I used for testing:
import time

import onnxruntime as ort


def create_inference_session(base_model, adapter_tensors, base_model_tensors):
    # Add base model weights to session options
    start = time.time()
    opts = ort.SessionOptions()
    for name, data in base_model_tensors.items():
        opts.add_initializer(name, data)
    end = time.time()
    print(f'Base model adding time: {end - start}')

    # Add adapter weights to session options
    start = time.time()
    for tensor in adapter_tensors:
        opts.add_initializer(tensor[0], tensor[1])
    end = time.time()
    print(f'Adapter adding time: {end - start}')

    # Create a new inference session
    start = time.time()
    session = ort.InferenceSession(base_model.SerializeToString(), sess_options=opts,
                                   providers=['CPUExecutionProvider'])
    end = time.time()
    print(f'Session creation time: {end - start}')
    return session
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04.4 LTS
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.15.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No
The steps for sharing weights between models are as follows:
Create a session with session options set so that the optimized model is serialized to disk and the weights are externalized.
From then on, use the serialized model: create a session with it after adding the external weights to the session options via the AddInitializer API and turning off all optimizations (ORT_DISABLE_ALL).
Let me know if you've measured the session creation time after following these steps; session creation is guaranteed to be much faster than in step 1. A sketch of both steps is below. This has little to do with sharing weights and more to do with the fact that you aren't running the optimizers at all, because you already serialized the model after running them in step 1.
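For reference, here is a minimal Python sketch of both steps, under some assumptions: the file names are placeholders, the session config keys used to externalize the optimized model's initializers (session.optimized_model_external_initializers_file_name and its min-size companion) are taken to behave as described, and the weights are fed back in as OrtValues. Treat it as a starting point rather than a verified recipe.

import onnx
import onnxruntime as ort
from onnx import numpy_helper

# Step 1: build a session once so the optimized model is written to disk with its
# large initializers externalized into a side file (config keys assumed, see above).
opts = ort.SessionOptions()
opts.optimized_model_filepath = 'optimized_model.onnx'
opts.add_session_config_entry('session.optimized_model_external_initializers_file_name', 'optimized_model.data')
opts.add_session_config_entry('session.optimized_model_external_initializers_min_size_in_bytes', '1024')
ort.InferenceSession('base_model.onnx', sess_options=opts, providers=['CPUExecutionProvider'])

# Step 2: from now on, build sessions from the serialized model with optimizations
# disabled, feeding the externalized weights back in through add_initializer.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
optimized = onnx.load('optimized_model.onnx')  # also pulls in the external data file
weights = {i.name: ort.OrtValue.ortvalue_from_numpy(numpy_helper.to_array(i))
           for i in optimized.graph.initializer}
for name, value in weights.items():
    opts.add_initializer(name, value)
session = ort.InferenceSession('optimized_model.onnx', sess_options=opts,
                               providers=['CPUExecutionProvider'])

The weights dict is kept alive on purpose: the buffers handed to add_initializer need to outlive any session that uses them.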
Thanks for the quick reply! I just followed the steps you mentioned and lowered the session creation time from 7 s to <1 s. Pre-running the optimization step is a big saving. But since our goal is to reduce model loading/switching time as much as possible, we have a few follow-up questions:
Just to make sure we understand you 100%: since you said we need to save some of the tensors as external, does this mean all of the tensors in the model, or just the ones that will change between inference sessions, i.e. the adapter weights?
Since we will repeatedly be building large inference sessions that share most of their weights, are there additional time/memory savings we could get by caching the initializers in memory? Is that what the PrepackedWeights parameter is for, and would that work with CUDA?
Regarding "And this has little to do with sharing weights...": right now we only keep one inference session alive, so I take it sharing weights matters more when maintaining multiple inference sessions that share weights. If we want to do this in the future, how can we reuse the base weights to reduce the memory footprint and make initialization of subsequent sessions even faster?
Some tensors will be overridden using add_initializer. Where and how are we supposed to read these tensors from? Can they be read from the original, unoptimized model file, or do they have to come from the optimized file? And is there a way to do that in C#, or does it require the ONNX Python API?
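Purely to make that last question concrete, here is a rough Python-side sketch of the flow being asked about, with placeholder file names and the (unverified) assumption that the adapter weights are available as their own ONNX file: only the adapter initializers are overridden via add_initializer, and which file they should be read from is exactly the open question.

import onnx
import onnxruntime as ort
from onnx import numpy_helper

# Hypothetical: the adapter weights live in their own ONNX file; read them and
# convert each initializer to an OrtValue for add_initializer.
adapter = onnx.load('adapter_weights.onnx')
adapter_weights = {i.name: ort.OrtValue.ortvalue_from_numpy(numpy_helper.to_array(i))
                   for i in adapter.graph.initializer}

# Override just those tensors; the remaining initializers are read from the
# serialized optimized model as usual.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
for name, value in adapter_weights.items():
    opts.add_initializer(name, value)
session = ort.InferenceSession('optimized_model.onnx', sess_options=opts,
                               providers=['CPUExecutionProvider'])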