TFactor: bulkerify GEMV (both MC and GPU) #1214

albestro · 2024-11-15T17:50:06Z

In this PR we try to increase the parallelism of the GEMV step of TFactor, which is serial by construction (all results go into the same block tile), by using workspaces to store partial results and then reducing them before the final TRMV step.

Algorithmically, the main change is that stepGEMV loop has been replaced with a single stepGEMVAll, and the concept has been applied in a similar way for both backends, MC and GPU:

MC: pika::bulk splits the input tiles over multiple tasks; each task stores its partial results in its own workspace, and a single task performs the reduction at the end.
GPU: similarly to MC, the work is forked over different pika tasks, each one computing a partial result on a different GPU stream. These tasks then join into a single task which performs the reduction.
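
For readers unfamiliar with the pattern, here is a minimal, dependency-free sketch of "partial results in per-worker workspaces, then one reduction". It is not the DLA-Future code: the function name `gemv_t_with_workspaces`, the row-major layout, and the `double` element type are assumptions made for illustration. The worker loop is written serially to keep the sketch free of pika; each iteration touches only its own workspace, which is what allows the iterations to run as independent tasks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the tiled GEMV step: computes y = A^T * x for a
// row-major (rows x cols) matrix A. Rows are split into contiguous chunks,
// one chunk per worker, and each worker accumulates into its own workspace.
std::vector<double> gemv_t_with_workspaces(const std::vector<double>& a, std::size_t rows,
                                           std::size_t cols, const std::vector<double>& x,
                                           std::size_t nworkers) {
  assert(a.size() == rows * cols);
  assert(x.size() == rows);
  assert(nworkers > 0);

  // One private, zero-initialised workspace per worker.
  std::vector<std::vector<double>> ws(nworkers, std::vector<double>(cols, 0.0));

  const std::size_t chunk = (rows + nworkers - 1) / nworkers;

  // Each iteration is independent of the others (it writes only ws[w]),
  // so in a task-based runtime every iteration could be a separate task.
  for (std::size_t w = 0; w < nworkers; ++w) {
    const std::size_t begin = std::min(rows, w * chunk);
    const std::size_t end = std::min(rows, begin + chunk);
    for (std::size_t i = begin; i < end; ++i)
      for (std::size_t j = 0; j < cols; ++j)
        ws[w][j] += a[i * cols + j] * x[i];
  }

  // Reduction: done once, after all partial results are available.
  std::vector<double> y(cols, 0.0);
  for (const auto& partial : ws)
    for (std::size_t j = 0; j < cols; ++j)
      y[j] += partial[j];
  return y;
}
```

The trade-off is the one described above: extra memory for the workspaces and one extra reduction step, in exchange for removing the serial dependency on a single output tile.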

In order to implement this solution, workspaces for intermediate results have been added (there is another option which requires neither additional workspaces nor a reduction, and which we might explore for MC in another PR).

TODO:

This work is based on TFactor: simplification and cleanup #1213
check with @msimberg if we can do something different instead of the "fork" (e.g. a bulk for the GPU but each pika thread should get a different cuda stream)
define better how many workspaces are needed (currently a random choice)
profile and test performance of this solution
Verify if there is anything to be "ported" from computeTFactor parallelise computation with bulk (MC Local and Distributed) #798
workers default value should be backend dependent
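
To make the last TODO items concrete, here is a hedged sketch of what a backend-dependent default for the number of workers could look like. Everything in it is hypothetical: the `Backend` enum, the function name `choose_nworkers`, and the default values 4 (MC) and 2 (GPU) are placeholders for illustration, not the values or the API used in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical backend tag, used only in this sketch.
enum class Backend { MC, GPU };

// Picks how many workers (and therefore workspaces) to use for one step.
// Assumptions, for illustration only:
//  - MC prefers 4 workers but never more than the available threads;
//  - GPU prefers 2 workers (e.g. one per stream), independent of CPU threads;
//  - there is no point in having more workers than input tiles;
//  - at least one worker is always returned.
std::size_t choose_nworkers(Backend backend, std::size_t ntiles, std::size_t nthreads) {
  assert(nthreads > 0);
  const std::size_t preferred = (backend == Backend::MC) ? 4 : 2;
  const std::size_t cap = (backend == Backend::MC) ? std::min(preferred, nthreads) : preferred;
  return std::max<std::size_t>(1, std::min(cap, ntiles));
}
```

Since each worker needs its own workspace, the same number would also bound the workspace count, which is why the two TODO items are related.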

Close #798.

EDIT: since this is conceptually similar to #798, which is going to be closed as soon as this gets merged, I migrated the doc fixes made there to this PR.

albestro · 2024-12-02T16:56:41Z

cscs-ci run

just to check if there are any major problems

rasolca

LGTM.
Please name new functions in snake case, and rename existing internal functions if possible.

rasolca · 2024-12-04T10:58:46Z

include/dlaf/factorization/qr/t_factor_impl.h

+    matrix::Matrix<T, Device::GPU> ws_T({nworkspaces * nrefls_step, nrefls_step},
+                                        {nrefls_step, nrefls_step});


All Ts are allocated at scheduling time. Better to reuse them.
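
One way to read this suggestion is to allocate the workspace once and reuse it across steps instead of creating a new matrix each time. The sketch below illustrates that idea only; `WorkspacePool` is a hypothetical class, not a DLA-Future type, and it uses a plain `std::vector<double>` in place of `matrix::Matrix`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workspace pool: one contiguous buffer holding `nworkspaces`
// square tiles of size nrefls x nrefls, allocated once and reused per step.
class WorkspacePool {
public:
  WorkspacePool(std::size_t nworkspaces, std::size_t nrefls)
      : nworkspaces_(nworkspaces),
        tile_size_(nrefls * nrefls),
        data_(nworkspaces * nrefls * nrefls, 0.0) {}

  // Pointer to the first element of workspace `w`.
  double* tile(std::size_t w) {
    assert(w < nworkspaces_);
    return data_.data() + w * tile_size_;
  }

  // Zero all workspaces so they can be reused for the next step
  // without a new allocation.
  void reset() { std::fill(data_.begin(), data_.end(), 0.0); }

  // Total number of elements held by the pool.
  std::size_t allocated_elements() const { return data_.size(); }

private:
  std::size_t nworkspaces_;
  std::size_t tile_size_;
  std::vector<double> data_;
};
```

A caller would construct the pool once outside the loop over steps and call `reset()` at the start of each step.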

msimberg

Only a few questions, nothing blocking.

include/dlaf/eigensolver/bt_reduction_to_band/impl.h

include/dlaf/factorization/qr/api.h

include/dlaf/factorization/qr/t_factor_impl.h

include/dlaf/factorization/qr/internal/get_tfactor_nworkers.h

include/dlaf/tune.h

albestro · 2024-12-10T14:37:50Z

cscs-ci run

albestro · 2024-12-11T16:16:27Z

Converted to draft in order to prevent merging, since I'm still doing some checks.

include/dlaf/tune.h

include/dlaf/factorization/qr/internal/get_tfactor_nworkers.h

src/init.cpp

albestro · 2025-02-05T13:32:46Z

include/dlaf/eigensolver/reduction_to_band/impl.h

@@ -1063,7 +1075,14 @@ Matrix<T, Device::CPU> ReductionToBand<B, D, T>::call(Matrix<T, D>& mat_a, const
    // TODO probably the first one in any panel is ok?


Random thought: just saw this comment and wondering if we should have a look and see if we can "merge" this workspace into the other one? @rasolca

I will leave it for another PR 😉

albestro · 2025-02-11T10:03:25Z

cscs-ci run

msimberg

Only a minor include-nit, otherwise this looks good to me.

@albestro make sure to merge latest master since it's got the new CI changes now.

include/dlaf/factorization/qr/api.h

albestro · 2025-02-12T09:23:42Z

cscs-ci run

albestro · 2025-02-12T12:26:53Z

cscs-ci run

albestro added the Type:Optimization label Nov 15, 2024

albestro added this to the Optimizations milestone Nov 15, 2024

albestro self-assigned this Nov 15, 2024

albestro force-pushed the alby/tfactor-optim/bulk branch from 6558780 to bf17fcc Compare November 18, 2024 13:16

Base automatically changed from alby/tfactor-optim/no-gemv-divergence to master November 22, 2024 11:28

albestro force-pushed the alby/tfactor-optim/bulk branch 4 times, most recently from 37ea75b to 2845339 Compare December 2, 2024 16:48

albestro requested review from msimberg, rasolca and RMeli December 3, 2024 16:08

albestro marked this pull request as ready for review December 3, 2024 16:08

rasolca requested changes Dec 4, 2024

albestro requested a review from rasolca December 5, 2024 11:22

msimberg approved these changes Dec 5, 2024

include/dlaf/eigensolver/bt_reduction_to_band/impl.h (outdated)

include/dlaf/factorization/qr/api.h (outdated)

include/dlaf/factorization/qr/t_factor_impl.h (outdated)

albestro force-pushed the alby/tfactor-optim/bulk branch from 05dc48d to f83f317 Compare December 5, 2024 15:00

rasolca reviewed Dec 6, 2024

include/dlaf/factorization/qr/internal/get_tfactor_nworkers.h (outdated)

msimberg reviewed Dec 9, 2024

include/dlaf/tune.h (outdated)

albestro force-pushed the alby/tfactor-optim/bulk branch from 0397c9e to d370894 Compare December 9, 2024 09:47

albestro requested a review from rasolca December 9, 2024 09:47

rasolca approved these changes Dec 10, 2024

msimberg approved these changes Dec 10, 2024

albestro marked this pull request as draft December 11, 2024 16:15

albestro commented Dec 11, 2024

include/dlaf/tune.h (outdated)

albestro added 2 commits December 13, 2024 14:21

factor out gemvLoop for MC

b1197be

mc-local: add stepGEMVAll with bulk using workspaces

9647ec8

msimberg reviewed Feb 4, 2025

include/dlaf/factorization/qr/internal/get_tfactor_nworkers.h (outdated)

src/init.cpp (outdated)

albestro force-pushed the alby/tfactor-optim/bulk branch 2 times, most recently from 3749abc to ffba9a4 Compare February 5, 2025 09:28

albestro commented Feb 5, 2025

albestro added 5 commits February 10, 2025 15:11

introduce separate parameter for GPU

b3e1db3

bug fix: reduce on gpu happens just on upper tile

42b120e

change from splitTile to "inline" subTileReference

d58b9f7

switch to panel

e95da08

bug fix: dead-end task

2c07f27

albestro force-pushed the alby/tfactor-optim/bulk branch 4 times, most recently from aaba87f to 0d2def8 Compare February 10, 2025 17:13

albestro added 2 commits February 10, 2025 18:19

bug fix: missing quick return

ab2d289

minor changes

c306078

albestro force-pushed the alby/tfactor-optim/bulk branch from 0d2def8 to c306078 Compare February 10, 2025 17:19

albestro requested review from msimberg and rasolca February 11, 2025 10:02

msimberg requested changes Feb 12, 2025

include/dlaf/factorization/qr/api.h (outdated)

rasolca approved these changes Feb 12, 2025

albestro added 2 commits February 12, 2025 10:21

remove unused headers

d66433b

Merge branch 'master' into alby/tfactor-optim/bulk

0c50fe8

msimberg approved these changes Feb 12, 2025

fix continues_on

591a070

rasolca merged commit c5c5280 into master Feb 12, 2025
5 checks passed

rasolca deleted the alby/tfactor-optim/bulk branch February 12, 2025 14:52

github-actions bot pushed a commit that referenced this pull request Feb 12, 2025

Doc: TFactor: bulkerify GEMV (both MC and GPU) (#1214)

38e2008
