Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve cache lookup overheads in FusionDefinitionWrapper #1843

Merged
merged 3 commits into from
Mar 6, 2025

Conversation

IvanYashchuk
Copy link
Collaborator

@IvanYashchuk IvanYashchuk commented Mar 6, 2025

Inspired by the changes from #1096, the next step is to utilize only the necessary information to construct contiguity and stride order information from runtime tensors. Therefore, the shape and dtype are now always derived from ProxyTensors and this information is available at the trace construction time.

Here are the results of measuring CPU time for the to_descriptors function which is used to build the cache key:
image

Here's the breakdown for different components of the call method for the current PR:
image

"Other" is something that is not to_descriptors, get_fd, fd.execute. For example, version check, which is being removed in #1840.

cc @tfogal

@csarofeen
Copy link
Collaborator

One thing I’d like to compare is how fast this is relative to:
https://github.com/NVIDIA/Fuser/blob/e4a93a5613b9b220d6e79c81257a8d585d581fb6/csrc/runtime/fusion_cache_utils.cpp#L279-L326

This code is what we’re using in nvFuser to cache sizes/strides internally in nvFuser. There’s unfortunately not an nvtx range on this so you can’t simply check without rebuilding nvFuser.

We use this to generate a cache id and set it in fusion executor cache: https://github.com/NVIDIA/Fuser/blob/e4a93a5613b9b220d6e79c81257a8d585d581fb6/csrc/runtime/fusion_executor_cache.cpp#L111

If caching takes a while we could actually generate this once and set it directly into FEC.

The only challenge is I don’t think this scales well to when contiguity is changing. As I don’t think contiguity changes are part of our concretization pass CC @jacobhinkle to check my knowledge here.

Copy link
Collaborator

@mruberry mruberry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

@csarofeen -- would you create an issue for follow-up?

@mruberry mruberry enabled auto-merge (squash) March 6, 2025 16:10
@mruberry mruberry merged commit 56b922a into main Mar 6, 2025
50 checks passed
@mruberry mruberry deleted the nvfuserex-to-descriptors branch March 6, 2025 16:10
@jacobhinkle
Copy link
Collaborator

The only challenge is I don’t think this scales well to when contiguity is changing. As I don’t think contiguity changes are part of our concretization pass CC @jacobhinkle to check my knowledge here.

That's right. Contiguity is currently encoded as part of the fusion definition and we do not modify it when we get new inputs. Instead there will be a separate FEC when you have inputs with different contiguity/stride order.

@mruberry
Copy link
Collaborator

mruberry commented Mar 6, 2025

The only challenge is I don’t think this scales well to when contiguity is changing. As I don’t think contiguity changes are part of our concretization pass CC @jacobhinkle to check my knowledge here.

That's right. Contiguity is currently encoded as part of the fusion definition and we do not modify it when we get new inputs. Instead there will be a separate FEC when you have inputs with different contiguity/stride order.

A related question then, is if it's worthwhile to select a fusion definition in C++, and so in Python we wouldn't have to query for nvFuser for keys and then lookup the appropriate fusion definition. nvFuser could instead return another layer of indirection that would do that work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants