Hi TI,
I'm using the MMA through the intrinsics.
I think I have a good understanding on how to use it now, however I still don't succeed in loading the C matrix properly.
Let's say I have 64x64 32b values (all ones for this example) that I want to load into the C matrix.
I would then setup the MMA as below (Only showing the relevant parameters) :
As the LDC instruction only handles 512b vectors, loading an entire C matrix row takes 4 calls of LDC.
Therefore, the swap and reset counters should be set to 4 x 64 (256) to fill the entire C matrix.
However when I get the C matrix rows back using the XFER then RCV intrinsics, I get the following result :
As we can (barely) see on the picture above, only the first 16 elements of the first row turned out to be set to 1.
This result is coherent though, as the CLSWPER and CLRSTPER counters are 8b registers.
Setting their values to 256 is like setting them to 0, leading to this behavior.
I have tried setting the load counters to 255, but as we might expect, I got zeros on the last 16 elements :
I have tried changing the LDDST register, but it did not solve my issue : I can't load the entire C matrix with 32b elements.
As I read in the documentation that loading the whole C matrix using the LDDST_X1 option was possible,
I am pretty sure that I miss something, but I can't find out what exactly...
Best regards,
Axel