TDA4VM: Unexpected negative results from MMA

Ross O'Connor

Part Number: TDA4VM

I have written some code for the MMA to simultaneously calculate the per-channel sum and sum of squares of an input image, as an exercise in learning the MMA.

Everything looks like it is working until the input data is negative, at which point I unexpectedly get a negative result in the sum of squares output - which should be impossible.

Can you help me understand what's going wrong?

Here is my initialisation code:

void mmaSumSumSqInit(const TIDL_bufParams3D_t *srcAddr, t_mmaSumSqConfig *config)
{
    const uint32_t matrixColumns = __MMA_B_COLS(sizeof(int8_t));
    const uint32_t matrixRowsPerRow = DIV_ROUND_UP(srcAddr->dim_x, matrixColumns);
    const uint32_t transfersPerRow = sizeof(int32_t) / sizeof(int8_t);

    config->numChannels = (uint32_t)srcAddr->dim_z;
    config->matrixRowsPerChannel = (uint32_t)srcAddr->dim_y * matrixRowsPerRow;

    config->seTemplate = __gen_SE_TEMPLATE_v1();
    config->seTemplate.ELETYPE = __SE_ELETYPE_8BIT;
    config->seTemplate.VECLEN  = c7x::se_veclen_from_traits<matrixColumns>::value;
    config->seTemplate.DIMFMT  = __SE_DIMFMT_3D;
    config->seTemplate.ICNT0   = srcAddr->dim_x;
    config->seTemplate.ICNT1   = srcAddr->dim_y;
    config->seTemplate.ICNT2   = srcAddr->dim_z;
    config->seTemplate.DIM1    = srcAddr->stride_y / sizeof(int8_t);
    config->seTemplate.DIM2    = srcAddr->stride_z / sizeof(int8_t);
 
    config->mmaConfig = __gen_HWA_CONFIG_REG_v1();
    config->mmaConfig.A_ATYPE = __MMA_A_CONFIG_ATYPE_INT8;
    config->mmaConfig.B_BTYPE = __MMA_B_CONFIG_SIZE8;
    config->mmaConfig.B_ORDER = __MMA_B_CONFIG_COL;                          // Load B down columns instead of across rows
    config->mmaConfig.B_BSWPER = config->matrixRowsPerChannel;               // Switch B to load between channels
    config->mmaConfig.B_BRSTPER = 1;                                         // Always load first column of B only
    config->mmaConfig.C_ATYPE = __MMA_C_CONFIG_ATYPE_SA;
    config->mmaConfig.C_BTYPE = __MMA_C_CONFIG_BTYPE_INT8;
    config->mmaConfig.C_OPERATION0 = __MMA_C_CONFIG_MUL;                     // C = AxB
    config->mmaConfig.C_OP0PER = 2;                                          //   Do this once per channel for sum and sumSq
    config->mmaConfig.C_OPERATION1 = __MMA_C_CONFIG_MULPLUS;                 // C += AxB
    config->mmaConfig.C_OP1PER = 2 * (config->matrixRowsPerChannel - 1);     //   Do this for the remaining ops per channel
    config->mmaConfig.C_BSWPER = 2 * config->matrixRowsPerChannel;           // Switch B to read between channels
    config->mmaConfig.C_CRSWPER = 2 * config->matrixRowsPerChannel;          // Switch C to read between channels
    config->mmaConfig.C_CWSWPER = 2 * config->matrixRowsPerChannel;          // Switch C to write between channels
    config->mmaConfig.C_CRRSTPER = 2;                                        // Alternate C read row between sum and sumSq
    config->mmaConfig.C_CWRSTPER = 2;                                        // Alternate C write row between sum and sumSq
    config->mmaConfig.X_XTYPE = __MMA_X_CONFIG_XTYPE_INT32;
    config->mmaConfig.X_CTYPE = __MMA_X_CONFIG_CTYPE_INT32;
    config->mmaConfig.X_CSWPER = 2 * transfersPerRow;                        // Switch C to transfer between channels
    config->mmaConfig.X_CRRSTPER = 2 * transfersPerRow;                      // Alternate C transfer row between sum and sumSq

    config->mmaOffset = __gen_HWA_OFFSET_REG();
}

And here is my execution code:

void mmaSumSumSqExec(const t_mmaSumSqConfig *config, const int8_t *data, int32_t *sums, int32_t *sumSquares)
{
    __SE0_OPEN((void *)data, config->seTemplate);
    __HWAOPEN(config->mmaConfig, config->mmaOffset, __MMA_OPEN_FSM_RESET);

    for(uint32_t ch = 0; ch < config->numChannels; ch++)
    {
        for(uint32_t i = 0; i < config->matrixRowsPerChannel; i++)
        {
            // A (row) = [1,1,...,1,1], B (col) = [in0,in1,...,in62,in63]
            __HWALDAB(1, __SE0(uchar64));

            // C row 0 col 0 = Sum(1 x in0 + ... + 1 x in63)
            __HWAOP(__MMA_A_LDA);

            // A (row) = [in0,in1,...,in62,in63]
            __HWALDA(__SE0ADV(uchar64));

            // C row 1 col 0 = Sum(in0 x in0 + ... + in63 x in63)
            __HWAOP(__MMA_A_LDA);
        }

        // Transfer C row 0 cols 0-15 to X
        __HWAXFER(__MMA_XFER_SRC_C);
        int16 sum = __as_int16(__HWARCV(0));

        // TODO: There might be a way to use offset config to avoid having
        //  to transfer remainder of row and discard it, but if we don't for now
        //  the subsequent channel (on this C bank) will start at the wrong column

        // Discard C row 0 cols 16-63
        __HWAXFER(__MMA_XFER_SRC_C);
        (void)__HWARCV(0);
        __HWAXFER(__MMA_XFER_SRC_C);
        (void)__HWARCV(0);
        __HWAXFER(__MMA_XFER_SRC_C);
        (void)__HWARCV(0);

        // Transfer C row 1 cols 0-15 to X
        __HWAXFER(__MMA_XFER_SRC_C);
        int16 sumSq = __as_int16(__HWARCV(0));

        // Discard C row 1 cols 16-63
        __HWAXFER(__MMA_XFER_SRC_C);
        (void)__HWARCV(0);
        __HWAXFER(__MMA_XFER_SRC_C);
        (void)__HWARCV(0);
        __HWAXFER(__MMA_XFER_SRC_C);
        (void)__HWARCV(0);

        sums[ch] = sum.s[0];
        sumSquares[ch] = sumSq.s[0];
    }

    __HWACLOSE(0);
    __SE0_CLOSE();
}
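For comparison with the MMA output, a scalar reference along the following lines could produce the "Ref" values shown in the logs. This is only a sketch: the actual reference code is not shown in this post, and `ref_buf3d_t` and `refSumSumSq` are hypothetical names that mirror the dim_x/dim_y/dim_z and byte-stride fields used above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 3D buffer description (strides in bytes). */
typedef struct {
    uint32_t dim_x;
    uint32_t dim_y;
    uint32_t dim_z;
    int32_t  stride_y;
    int32_t  stride_z;
} ref_buf3d_t;

/* Scalar per-channel sum and sum of squares over signed 8-bit data. */
void refSumSumSq(const ref_buf3d_t *buf, const int8_t *data,
                 int64_t *sums, int64_t *sumSquares)
{
    for (uint32_t z = 0; z < buf->dim_z; z++) {
        int64_t sum = 0;
        int64_t sumSq = 0;

        for (uint32_t y = 0; y < buf->dim_y; y++) {
            /* int8_t elements are one byte, so byte strides index directly. */
            const int8_t *row = data + (ptrdiff_t)z * buf->stride_z
                                     + (ptrdiff_t)y * buf->stride_y;

            for (uint32_t x = 0; x < buf->dim_x; x++) {
                const int64_t v = row[x];   /* sign-extends the sample */
                sum   += v;
                sumSq += v * v;             /* never negative */
            }
        }

        sums[z] = sum;
        sumSquares[z] = sumSq;
    }
}
```

Because each term of the sum of squares is a product of a value with itself, a negative result from this reference is impossible, which is what makes the negative MMA output suspicious.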

With positive input values, everything works correctly:

[   29.829513] logs[933]: [C7x_1 ]     43.745416 s: Input row: 90,90,90,90,90,90,90,90,90,89,89,89,89,89,89,89,89,89,89,89,89,88,89,89,89,89,90,89,89,90,89,89,89,88,88,89,89,89,89,89,89,89,89,88,88,88,8,
[   29.829723] logs[933]: [C7x_1 ]     43.745518 s: Status: 0x00010000000100000200000002000000020000000200000000020000000200000000000000000800000008000000000000000000000000000000000000000000
[   29.829830] logs[933]: [C7x_1 ]     43.745606 s: Status: 0x00010000000104000200000002000000020000000200000000020000000200000000000000000800000008000000000000000000000000000000000000000000
[   29.829898] logs[933]: [C7x_1 ]     43.745694 s: Status: 0x00010000000104000100000001000000010000000100000000010000000100000000000101000800000008000000000000000000000000000000000000000000
[   29.829988] logs[933]: [C7x_1 ]     43.745782 s: Status: 0x00010000000104000100000001000000010000000100000000010000000100000000000101000800000008000000000000000000000000000000000000000000
[   29.830062] logs[933]: [C7x_1 ]     43.745869 s: Status: 0x00010000000104000100000002000000020000000200000000020000000200000000380000000800000008000000000000000000000000000000000000000000
[   29.830137] logs[933]: [C7x_1 ]     43.745957 s: Status: 0x00010000000104000100000002000000020000000200000000020000000200000000380000000700000007000000000100000000000000000000000000000000
[   29.830201] logs[933]: [C7x_1 ]     43.746049 s: Status: 0x00010000000104000100000002000000020000000200000000020000000200000000380000000300000003000000000500000000000000000000000000000000
[   29.830268] logs[933]: [C7x_1 ]     43.746086 s: MMA Ch0 S/SumSq: 5682 (1632)/504494 (7b2ae)
[   29.830331] logs[933]: [C7x_1 ]     43.746107 s: Ref Ch0 S/SumSq: 5682 (1632)/504494 (7b2ae)

But with negative input values, we get a negative sum of squares(!):

[   29.799393] logs[933]: [C7x_1 ]     43.712012 s: Input row: 0,-128,0,127,-128,0,-128,0,0,-128,-128,-128,127,-128,0,127,-128,0,-128,-128,0,-128,-128,0,-128,-128,-128,0,-128,0,-128,-128,-128,-128,0,0,-,
[   29.799576] logs[933]: [C7x_1 ]     43.712120 s: Status: 0x00010000000100000200000002000000020000000200000000020000000200000000000000000800000008000000000000000000000000000000000000000000
[   29.799694] logs[933]: [C7x_1 ]     43.712210 s: Status: 0x00010000000104000200000002000000020000000200000000020000000200000000000000000800000008000000000000000000000000000000000000000000
[   29.799765] logs[933]: [C7x_1 ]     43.712300 s: Status: 0x00010000000104000100000001000000010000000100000000010000000100000000000101000800000008000000000000000000000000000000000000000000
[   29.799833] logs[933]: [C7x_1 ]     43.712388 s: Status: 0x00010000000104000100000001000000010000000100000000010000000100000000000101000800000008000000000000000000000000000000000000000000
[   29.799900] logs[933]: [C7x_1 ]     43.712476 s: Status: 0x00010000000104000100000002000000020000000200000000020000000200000000380000000800000008000000000000000000000000000000000000000000
[   29.799980] logs[933]: [C7x_1 ]     43.712564 s: Status: 0x00010000000104000100000002000000020000000200000000020000000200000000380000000700000007000000000100000000000000000000000000000000
[   29.800083] logs[933]: [C7x_1 ]     43.712656 s: Status: 0x00010000000104000100000002000000020000000200000000020000000200000000380000000300000003000000000500000000000000000000000000000000
[   29.800149] logs[933]: [C7x_1 ]     43.712695 s: MMA Ch0 S/SumSq: -3591 (fffff1f9)/-460537 (fff8f907)
[   29.800213] logs[933]: [C7x_1 ]     43.712718 s: Ref Ch0 S/SumSq: -3591 (fffffffffffff1f9)/686343 (a7907)

Note that the HWASTATUS output is identical in both cases.

Furthermore, on host emulation, the MMA gives the correct result for the same input:

Status: 0x00010000000100000200000002000000020000000200000000020000000200000000000000000800000008000000000000000000000000000000000000000000
Status: 0x00010000000104000200000002000000020000000200000000020000000200000000000000000800000008000000000000000000000000000000000000000000
Status: 0x00010000000104000100000001000000010000000100000000010000000100000000000101000800000008000000000000000000000000000000000000000000
Status: 0x00010000000104000100000001000000010000000100000000010000000100000000000101000800000008000000000000000000000000000000000000000000
Status: 0x00010000000104000000000002000000020000000200000000020000000200000000380000000800000008000000000000000000000000000000000000000000
Status: 0x00010000000104000000000002000000020000000200000000020000000200000000380000000700000007000000000100000000000000000000000000000000
Status: 0x00010000000104000000000002000000020000000200000000020000000200000000380000000300000003000000000500000000000000000000000000000000
MMA Ch 0 S/SumSq: -5369 (ffffeb07)/686343 (a7907)
Ref Ch0 S/SumSq: -5369 (ffffffffffffeb07)/686343 (a7907)

Can you please help explain what's going on here?

Many Thanks,

Ross

over 2 years ago

0 Ross O'Connor over 2 years ago

Prodigy 210 points

To add: the sample data above is a 64x1x1 input. My first thought was that too much data was overflowing the int32_t accumulator, so I shrank the input size to rule that out.
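The overflow concern can be checked with a quick bound. The sketch below (helper names are illustrative, not taken from the code above) computes the largest possible sum of squares for signed 8-bit input and the number of worst-case samples an int32_t accumulator can absorb.

```c
#include <stdint.h>

/* Largest squared magnitude of an int8_t sample: (-128)^2 = 16384. */
#define MAX_SQ_INT8 ((int64_t)128 * 128)

/* Worst-case sum of squares for n signed 8-bit samples. */
int64_t worstCaseSumSq(uint32_t n)
{
    return (int64_t)n * MAX_SQ_INT8;
}

/* Largest sample count whose worst-case sum of squares fits in int32_t. */
uint32_t maxSamplesWithoutOverflow(void)
{
    return (uint32_t)(INT32_MAX / MAX_SQ_INT8);
}
```

For a 64-sample input the worst case is 1,048,576, far below INT32_MAX, and it would take more than 131,071 samples of -128 to overflow, so accumulator overflow cannot explain the result for this input.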

0 Ross O'Connor over 2 years ago

Prodigy 210 points

Hi,

Can someone take a look at this please? We're stuck with the MMA until this is resolved.

Thanks,

Ross

0 William Leven over 2 years ago in reply to Ross O'Connor

TI__Intellectual 2410 points

Hi Ross,

My apologies that it has taken us this long to reply. I've looked over your code and have one thing for you to try - I'm not sure this is the issue, but it's easy enough to do that it's worth a quick try.

The MMA has output processing that is usually used to reduce the output bit width back down to the input bit width. However, you are trying to access the results at the accumulator (C panel) bit width, so you don't actually need this processing. __gen_HWA_CONFIG_REG_v1() has all output processing disabled, so that all lines up. However, there may be a requirement in the HW that some form of output processing is selected, and the behavior is undefined if it is not (emphasizing that I'm guessing here, but it's faster to just try than to look for a definite answer in the design).

To try this, all you need to do is add the following to your MMA init code:

config->mmaConfig.X_SAT = 0x1;

Since you have set

config->mmaConfig.X_XTYPE = __MMA_X_CONFIG_XTYPE_INT32;
config->mmaConfig.X_CTYPE = __MMA_X_CONFIG_CTYPE_INT32;

this should have no mathematical impact; setting X_SAT simply means that the output is limited to the most positive or most negative 32-bit integer, which was already true mathematically. However, it may resolve an undefined situation in the hardware. Let me know if that helps or not and then we can look at next steps if necessary.
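As a rough model of why saturation should be a no-op here, consider the plain C sketch below. It is not the MMA's actual output path, and `saturateToInt32` is a hypothetical helper: it only shows that clamping a wider value to the 32-bit range leaves any value that already fits unchanged.

```c
#include <stdint.h>

/* Clamp a wider value to the int32_t range (models the saturation idea only). */
int32_t saturateToInt32(int64_t value)
{
    if (value > INT32_MAX) {
        return INT32_MAX;
    }
    if (value < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)value;   /* already in range: returned as-is */
}
```

Since a 32-bit accumulator value is in range by construction, the clamp never triggers, so any change in behavior from setting X_SAT would point to a hardware configuration effect rather than to the arithmetic.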

Best,

Will

0 Ross O'Connor over 2 years ago in reply to William Leven

Prodigy 210 points

Hi Will,

Thanks for the suggestion.

I've just tried it with X_SAT set as recommended, but sadly it doesn't change the result. I still get the correct answer when pixels are positive, but an invalid negative result when pixels are negative (though with the correct lower 15-16 bits).

Thanks,

Ross

0 William Leven over 2 years ago in reply to Ross O'Connor

TI__Intellectual 2410 points

Hi Ross,

Two ways to proceed:

1. I sent a "friend" request so you can send a direct/individual message containing your .out file, if that's something you are willing to do.

2. Another way to approach this task would be to use the SE to up-convert the input stream to 32-bit data and then process the data as 32-bits in and 32-bits out. The SE can do this by setting

// for signed data
config->seTemplate.PROMOTE = __SE_PROMOTE_4X_SIGNEXT;

// for unsigned data
config->seTemplate.PROMOTE = __SE_PROMOTE_4X_ZEROEXT;

This will feed the MMA your 8-bit data in a 32-bit container. You'll need to update the A and B data types in the MMA. This should work (and is close to code I've written in other applications). The only downside is that it has limited ability to scale up if you want the input data to be 16 or 32 bits.
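The difference between the two promotion modes can be modelled in plain C as follows. This is only an arithmetic illustration with hypothetical helper names; it does not call the SE or reproduce its data path.

```c
#include <stdint.h>

/* Sign-extend a raw byte to 32 bits (models promotion of signed data). */
int32_t promoteSignExt(uint8_t raw)
{
    /* Bytes with the top bit set represent negative values in two's complement. */
    return (raw & 0x80u) ? (int32_t)raw - 256 : (int32_t)raw;
}

/* Zero-extend a raw byte to 32 bits (models promotion of unsigned data). */
int32_t promoteZeroExt(uint8_t raw)
{
    return (int32_t)raw;
}
```

The same byte 0x80 becomes -128 under sign extension and 128 under zero extension, so picking the mode that matches the data's signedness matters as soon as any input is negative.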

0 Ross O'Connor over 2 years ago in reply to William Leven

Prodigy 210 points

Hi Will,

Thanks - I was in the process of trying those and I discovered an issue in the code which was causing A_ATYPE to get reset to UINT8 instead of INT8.

It was only present in the target code, which is why the host emulation worked as expected.

So it looks like a bug on my end after all - sorry for the faff! At least I've learned a thing or two about it.

Not sure why it'd have this exact effect (including preserving the LS bits of the square sum), but resolving it seems to give the expected results finally (without need for saturation or promotion).
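One plausible explanation for that exact effect, sketched in plain C: if A is read as unsigned while B stays signed, each negative input x contributes (x + 256) * x = x*x + 256*x instead of x*x. The error is therefore always a multiple of 256, which leaves the low bits of the sum of squares intact, and the plain sum is unaffected because its A operand is all ones. The row used here is an assumption: 35 values of -128, 7 of 127 and 22 zeros, chosen because that mix matches the logged sum (-3591) and reference sum of squares (686343); the real input row is truncated in the log. Function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Intended behavior: both operands read as signed 8-bit (A_ATYPE = INT8). */
int32_t sumSqSigned(const int8_t *in, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)in[i] * (int32_t)in[i];
    }
    return acc;
}

/* Model of the bug: A reinterpreted as unsigned 8-bit, B still signed. */
int32_t sumSqMixed(const int8_t *in, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        const int32_t a = (int32_t)(uint8_t)in[i];  /* -128 is read as 128 */
        const int32_t b = (int32_t)in[i];
        acc += a * b;
    }
    return acc;
}

/* Assumed row: 35 x -128, 7 x 127, 22 x 0 (consistent with the logged totals). */
void fillExampleRow(int8_t row[64])
{
    size_t i = 0;
    for (; i < 35; i++) { row[i] = -128; }
    for (; i < 42; i++) { row[i] = 127; }
    for (; i < 64; i++) { row[i] = 0; }
}
```

With this row, the signed version gives 686343 and the mixed version gives -460537, the same values as the target log. Their difference is 256 * 35 * 128 = 1146880, a multiple of 32768, which would account for the low 15 bits matching.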

Thanks for your assistance!

Ross

+1 William Leven over 2 years ago in reply to Ross O'Connor

TI__Intellectual 2410 points

Hi Ross,

Thanks for the update and great to hear it is resolved!

Best,

Will
