My understanding is that the consecutive program ids are mapped to 2D program ids such that:
- the segments of the B matrix along the K dimension that are non-contiguous in the global GPU memory are more likely to be kept and reused in the L2 cache, and
- the segments of the A matrix along the K dimension that are contiguous in the global GPU memory are more likely to be repeatedly fetched into the L2 cache.
A more efficient use of the L2 cache line would result in fewer misses and higher TFLOPS.
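
To make my reading of the mapping concrete, here is a minimal pure-Python sketch of the grouped ordering as I understand it, next to a plain row-major ordering. The function names, the 9x9 grid of output blocks, and the group size of 3 are my own illustrative assumptions. The sketch only counts how many distinct A and B tiles a window of consecutive program ids touches; it does not model the L2 cache or the actual scheduling order.

```python
def row_major_pid(pid, num_pid_n):
    """Plain row-major mapping from a 1D program id to (pid_m, pid_n)."""
    return pid // num_pid_n, pid % num_pid_n


def grouped_pid(pid, num_pid_m, num_pid_n, group_size_m):
    """Grouped mapping: programs walk down a band of `group_size_m` block
    rows before moving to the next block column (hypothetical helper)."""
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last group may contain fewer rows than group_size_m.
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % rows_in_group
    pid_n = (pid % num_pid_in_group) // rows_in_group
    return pid_m, pid_n


def tiles_touched(block_ids, num_k_tiles):
    """Distinct A and B tiles needed to compute the given output blocks.

    Each distinct block row needs `num_k_tiles` tiles of A, and each
    distinct block column needs `num_k_tiles` tiles of B.
    """
    rows = {m for m, _ in block_ids}
    cols = {n for _, n in block_ids}
    return (len(rows) + len(cols)) * num_k_tiles


NUM_PID_M = NUM_PID_N = NUM_K_TILES = 9  # assumed illustrative sizes
GROUP_SIZE_M = 3                         # assumed illustrative group size

first_nine = range(9)
row_major_blocks = [row_major_pid(p, NUM_PID_N) for p in first_nine]
grouped_blocks = [grouped_pid(p, NUM_PID_M, NUM_PID_N, GROUP_SIZE_M)
                  for p in first_nine]

print(tiles_touched(row_major_blocks, NUM_K_TILES))  # 90
print(tiles_touched(grouped_blocks, NUM_K_TILES))    # 54
```

Under these assumptions, the first nine program ids touch 90 tiles with row-major ordering and 54 with grouped ordering. That reduction only turns into L2 hits if programs with nearby ids actually run close together in time, which is what my questions below are about.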
It is also stated that the programs are “launch[ed] in an order that promotes data reuse”. However, the execution order of thread blocks is undefined, unless co-scheduling in thread block clusters is used on an NVIDIA GPU.
Is the statement regarding the more than 10% improvement in TFLOPS based on experiment(s) where the improvement was specifically attributable to L2 cache and the launch and execution order?
How are programs “launch[ed and executed] in an order that promotes data reuse” in terms of thread blocks?
Thank you.