L2 cache optimization in matrix multiplication #5993

alfin3 · 2025-02-22T19:54:51Z


My understanding is that the consecutive program ids are mapped to 2D program ids such that:

- the segments of the B matrix along the K dimension that are non-contiguous in the global GPU memory are more likely to be kept and reused in the L2 cache, and
- the segments of the A matrix along the K dimension that are contiguous in the global GPU memory are more likely to be repeatedly fetched into the L2 cache.

More efficient use of L2 cache lines would result in fewer misses and higher TFLOPS.
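To make the mapping concrete, here is a minimal sketch in plain Python (not Triton) of the grouped ordering as I read it from the tutorial. The function names, the 9x9 grid of output tiles, and the group size of 3 are my own illustrative assumptions. The count also assumes that the first nine consecutive program ids are resident at the same time, which the hardware does not guarantee.

```python
def row_major_pid(pid, num_pid_m, num_pid_n):
    """Map a 1D program id to a (row, col) output tile in row-major order."""
    return pid // num_pid_n, pid % num_pid_n


def grouped_pid(pid, num_pid_m, num_pid_n, group_size_m):
    """Map a 1D program id to a (row, col) output tile in grouped order.

    Tiles are visited column by column inside a band of `group_size_m` rows
    before moving on to the next band (hypothetical helper for illustration).
    """
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last band may contain fewer than group_size_m rows.
    rows = min(num_pid_m - first_pid_m, group_size_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % rows
    pid_n = (pid % num_pid_in_group) // rows
    return pid_m, pid_n


def distinct_panels(tiles):
    """Count distinct row panels of A and column panels of B that a set of
    output tiles reads."""
    a_rows = {m for m, _ in tiles}
    b_cols = {n for _, n in tiles}
    return len(a_rows), len(b_cols)


num_pid_m = num_pid_n = 9   # assumed 9x9 grid of output tiles
group_size_m = 3            # assumed group size
first_wave = range(9)       # assumed: the first 9 programs run together

row_major = [row_major_pid(p, num_pid_m, num_pid_n) for p in first_wave]
grouped = [grouped_pid(p, num_pid_m, num_pid_n, group_size_m) for p in first_wave]

print(distinct_panels(row_major))  # (A row panels, B column panels)
print(distinct_panels(grouped))
```

Under these assumptions, the row-major order touches 1 row panel of A and 9 column panels of B, while the grouped order touches 3 of each. With 9 blocks along K per panel, that is 90 versus 54 block loads for the same nine output tiles, so the working set that has to stay in L2 is smaller in the grouped case.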

It is also stated that the programs are “launch[ed] in an order that promotes data reuse”. However, the execution order of thread blocks is undefined unless co-scheduling in thread block clusters is used on an NVIDIA GPU.

Is the statement regarding the more than 10% improvement in TFLOPS based on experiment(s) where the improvement was specifically attributable to L2 cache and the launch and execution order?

How are programs “launch[ed and executed] in an order that promotes data reuse” in terms of thread blocks?

Thank you.

Replies: 0 comments

Select a reply

L2 cache optimization in matrix multiplication #5993

alfin3 Feb 22, 2025

Replies: 0 comments

alfin3
Feb 22, 2025