Improve cache lookup overheads in FusionDefinitionWrapper #1843
Conversation
One thing I’d like to compare is how fast this is relative to the code we’re using in nvFuser to cache sizes/strides internally. There’s unfortunately no nvtx range on this, so you can’t simply check without rebuilding nvFuser. We use this to generate a cache id and set it in the fusion executor cache: https://github.com/NVIDIA/Fuser/blob/e4a93a5613b9b220d6e79c81257a8d585d581fb6/csrc/runtime/fusion_executor_cache.cpp#L111 If caching takes a while, we could actually generate this once and set it directly into the FEC. The only challenge is that I don’t think this scales well to when contiguity is changing, as I don’t think contiguity changes are part of our concretization pass. CC @jacobhinkle to check my knowledge here.
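The caching scheme described above can be sketched in Python. This is an illustrative approximation, not nvFuser's actual implementation; the names `TensorMeta` and `compute_cache_id` are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical per-input metadata used to form a cache key."""
    shape: tuple
    strides: tuple
    dtype: str

# Memoization table mapping a tuple of input metadata to a small integer id,
# in the spirit of nvFuser generating a cache id once and reusing it.
_id_cache: dict = {}

def compute_cache_id(metas: tuple) -> int:
    """Return a stable integer id for this combination of input metadata."""
    if metas not in _id_cache:
        _id_cache[metas] = len(_id_cache)
    return _id_cache[metas]
```

The point of the sketch is that identical size/stride/dtype combinations hash to the same id, so the expensive key construction can be skipped on repeat calls; contiguity changes would still produce a different key, matching the concern above.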
That's right. Contiguity is currently encoded as part of the fusion definition, and we do not modify it when we get new inputs. Instead, there will be a separate FEC when you have inputs with a different contiguity/stride order.
A related question, then, is whether it's worthwhile to select the fusion definition in C++, so that in Python we wouldn't have to query nvFuser for keys and then look up the appropriate fusion definition. nvFuser could instead return another layer of indirection that would do that work.
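The extra layer of indirection proposed here could look roughly like the following sketch. `FusionSelector`, `key_for`, and `definitions` are all hypothetical names; the real interface would live in nvFuser's C++ bindings:

```python
class FusionSelector:
    """Hypothetical object returned once from C++ that both computes the
    cache key and resolves the fusion definition in a single call."""

    def __init__(self, key_for, definitions):
        self._key_for = key_for          # callable: inputs -> cache key
        self._definitions = definitions  # mapping: cache key -> definition

    def __call__(self, inputs):
        # One call replaces two Python<->C++ round trips: key query,
        # then dictionary lookup on the Python side.
        return self._definitions[self._key_for(inputs)]
```

The design question is whether folding both steps into one C++ call saves enough per-invocation overhead to justify the added interface surface.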
Inspired by the changes from #1096, the next step is to use only the necessary information from runtime tensors to construct contiguity and stride-order information. The shape and dtype are now always derived from ProxyTensors, so this information is available at trace-construction time.
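For reference, deriving stride order and contiguity from a runtime tensor's shape and strides can be sketched as below. This is an illustrative sketch only (it ignores broadcast/size-1 dimensions), not the implementation used in this PR:

```python
def stride_order(strides: tuple) -> tuple:
    """For each dimension, its rank by stride: n-1 for the outermost
    (largest stride) down to 0 for the innermost."""
    idx = sorted(range(len(strides)), key=lambda i: strides[i], reverse=True)
    order = [0] * len(strides)
    for rank, i in enumerate(idx):
        order[i] = len(strides) - 1 - rank
    return tuple(order)

def contiguity(shape: tuple, strides: tuple) -> tuple:
    """A dimension is contiguous if its stride equals the product of the
    sizes of all faster-varying dimensions."""
    flags = []
    expected = 1
    # Walk dimensions from innermost (smallest stride) outward.
    for size, stride in sorted(zip(shape, strides), key=lambda p: p[1]):
        flags.append(stride == expected)
        expected *= size
    return tuple(reversed(flags))
```

For a row-major `(2, 3)` tensor with strides `(3, 1)`, this yields stride order `(1, 0)` and contiguity `(True, True)`; padding the inner dimension (strides `(6, 1)`) marks the outer dimension non-contiguous.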
Here are the results of measuring CPU time for the `to_descriptors` function, which is used to build the cache key:

Here's the breakdown for different components of the call method for the current PR:

"Other" is everything that is not `to_descriptors`, `get_fd`, or `fd.execute` — for example, the version check, which is being removed in #1840.

cc @tfogal