Improve cache lookup overheads in FusionDefinitionWrapper #1843
Conversation
One thing I’d like to compare is how fast this is relative to the code we’re using in nvFuser to cache sizes/strides internally. There’s unfortunately no nvtx range on this, so you can’t simply check without rebuilding nvFuser. We use this to generate a cache id and set it in the fusion executor cache: https://github.com/NVIDIA/Fuser/blob/e4a93a5613b9b220d6e79c81257a8d585d581fb6/csrc/runtime/fusion_executor_cache.cpp#L111 If caching takes a while, we could actually generate this once and set it directly into the FEC. The only challenge is that I don’t think this scales well to when contiguity is changing, as I don’t think contiguity changes are part of our concretization pass. CC @jacobhinkle to check my knowledge here.
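The caching scheme described above can be sketched in Python. This is an illustrative approximation, not nvFuser's actual implementation; the names `TensorMeta` and `compute_cache_id` are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical per-input metadata used to form a cache key."""
    shape: tuple
    strides: tuple
    dtype: str

# Memoization table mapping a tuple of input metadata to a small integer id,
# in the spirit of nvFuser generating a cache id once and reusing it.
_id_cache: dict = {}

def compute_cache_id(metas: tuple) -> int:
    """Return a stable integer id for this combination of input metadata."""
    if metas not in _id_cache:
        _id_cache[metas] = len(_id_cache)
    return _id_cache[metas]
```

The point of the sketch is that identical size/stride/dtype combinations hash to the same id, so the expensive key construction can be skipped on repeat calls; contiguity changes would still produce a different key, matching the concern above.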
That's right. Contiguity is currently encoded as part of the fusion definition, and we do not modify it when we get new inputs. Instead, there will be a separate FEC when you have inputs with a different contiguity/stride order.
A related question, then, is whether it's worthwhile to select the fusion definition in C++, so that in Python we wouldn't have to query nvFuser for keys and then look up the appropriate fusion definition. nvFuser could instead return another layer of indirection that would do that work.
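The extra layer of indirection proposed here could look roughly like the following sketch. `FusionSelector`, `key_for`, and `definitions` are all hypothetical names; the real interface would live in nvFuser's C++ bindings:

```python
class FusionSelector:
    """Hypothetical object returned once from C++ that both computes the
    cache key and resolves the fusion definition in a single call."""

    def __init__(self, key_for, definitions):
        self._key_for = key_for          # callable: inputs -> cache key
        self._definitions = definitions  # mapping: cache key -> definition

    def __call__(self, inputs):
        # One call replaces two Python<->C++ round trips: key query,
        # then dictionary lookup on the Python side.
        return self._definitions[self._key_for(inputs)]
```

The design question is whether folding both steps into one C++ call saves enough per-invocation overhead to justify the added interface surface.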
Inspired by the changes from #1096, the next step is to use only the necessary information from runtime tensors to construct contiguity and stride-order information. The shape and dtype are now always derived from ProxyTensors, so this information is available at trace-construction time.
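For reference, deriving stride order and contiguity from a runtime tensor's shape and strides can be sketched as below. This is an illustrative sketch only (it ignores broadcast/size-1 dimensions), not the implementation used in this PR:

```python
def stride_order(strides: tuple) -> tuple:
    """For each dimension, its rank by stride: n-1 for the outermost
    (largest stride) down to 0 for the innermost."""
    idx = sorted(range(len(strides)), key=lambda i: strides[i], reverse=True)
    order = [0] * len(strides)
    for rank, i in enumerate(idx):
        order[i] = len(strides) - 1 - rank
    return tuple(order)

def contiguity(shape: tuple, strides: tuple) -> tuple:
    """A dimension is contiguous if its stride equals the product of the
    sizes of all faster-varying dimensions."""
    flags = []
    expected = 1
    # Walk dimensions from innermost (smallest stride) outward.
    for size, stride in sorted(zip(shape, strides), key=lambda p: p[1]):
        flags.append(stride == expected)
        expected *= size
    return tuple(reversed(flags))
```

For a row-major `(2, 3)` tensor with strides `(3, 1)`, this yields stride order `(1, 0)` and contiguity `(True, True)`; padding the inner dimension (strides `(6, 1)`) marks the outer dimension non-contiguous.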
Here are the results of measuring CPU time for the `to_descriptors` function, which is used to build the cache key:

Here's the breakdown for different components of the call method for the current PR:

"Other" is everything that is not `to_descriptors`, `get_fd`, or `fd.execute` — for example, the version check, which is being removed in #1840.

cc @tfogal