
L1D stall using floating-point complex multiplication

Hi, I have a performance problem with floating-point complex multiplication (I'm working with the C6670 DSP).

Using your cycle-approximate simulator I found the following:

The "CPU cycles" are very low thanks to the parallelism. (ii=4 for every cycle)

"Total cycles" are 10 times higher than "CPU cycles", because of "CPU.stall.mem.L1D"

Here is the code:

pS and pH hold __float2_t elements; r0..r3 are __float2_t

{ loop 1200 times, every loop pH matrix is changing }

    _amemd8(&pS[0]) = _complex_mpysp(_amemd8(&pH[0][0][0]), r0);
    _amemd8(&pS[1]) = _complex_mpysp(_amemd8(&pH[0][1][1]), r1);
    _amemd8(&pS[2]) = _complex_mpysp(_amemd8(&pH[0][2][2]), r2);
    _amemd8(&pS[3]) = _complex_mpysp(_amemd8(&pH[0][3][3]), r3);

{ end loop }


The whole problem seems related to the pH matrix; how can I optimise the "total cycles"?

It doesn't look like a cache bank conflict, because the accessed pH elements are spaced 40 bytes apart.

Of course maximum caching is enabled. pS and pH are mapped in L2 memory and aligned with DATA_ALIGN(8).
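
For reference, a rough sketch of how the data is declared and placed (the section name ".l2data" is just a placeholder; the linker command file maps it into L2SRAM):

    #include <c6x.h>                      /* __float2_t and the C66x intrinsics (TI compiler) */

    #pragma DATA_SECTION(pH, ".l2data")   /* placed in L2 memory via the linker command file */
    #pragma DATA_ALIGN(pH, 8)             /* 8-byte alignment, needed for _amemd8() */
    __float2_t pH[1200][4][4];            /* 1200 4x4 complex matrices */

    #pragma DATA_SECTION(pS, ".l2data")
    #pragma DATA_ALIGN(pS, 8)
    __float2_t pS[4];                     /* result vector */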

  • Hi,
    I have assigned this thread to our experts for appropriate response.

    Note:

    We recommend that you use the EVM for testing, as simulator support is de-focused. Thank you.

  • Hi, of course I'm also using a board; I found this performance issue by monitoring the clock cycles between the start and end of this part of the algorithm.
    The simulator revealed the cause: L1D memory stalls.
  • The first question I asked myself when I looked at your posting was: "Is it really a memory issue?"

    If you have not done it already, enable the option to keep the generated assembly code and count how many instructions are in the loop. I wonder whether the compiler is smart enough to use a pointer, or whether it calculates the location in the three-dimensional array every time you call the _complex_mpysp(_amemd8(&pH[0][0][0]), ...) function.

    It is always a good idea to define a local pointer and use that pointer in the loop.
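
    Something like this (just a sketch: I am reusing the names, the 4x4 layout and the 1200-iteration count from your posting; whether pS also advances each iteration is up to your algorithm):

    /* Sketch: walk the matrices with a local pointer instead of letting the
       compiler recompute the 3-D array address for every access. */
    __float2_t *restrict h = &pH[0][0][0];
    __float2_t *restrict s = &pS[0];
    int i;

    for (i = 0; i < 1200; i++) {
        _amemd8(&s[0]) = _complex_mpysp(_amemd8(&h[0]),  r0);   /* element [0][0] */
        _amemd8(&s[1]) = _complex_mpysp(_amemd8(&h[5]),  r1);   /* element [1][1] */
        _amemd8(&s[2]) = _complex_mpysp(_amemd8(&h[10]), r2);   /* element [2][2] */
        _amemd8(&s[3]) = _complex_mpysp(_amemd8(&h[15]), r3);   /* element [3][3] */
        h += 16;                                                 /* next 4x4 matrix */
    }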

    If this is what you (or the compiler) already do, then here is my advice:

    Put everything in L1D (disable the L1D cache and use smaller matrices) and then try again. Calculate how many values the code reads or writes for each instruction and see if the timing makes sense. If not, move one of the matrices to a different bank.

    Report your observations back here.

    Ran

  • Hi Ran, thanks for your answer.
    Of course I've analysed the assembly code; the software pipeline schedule has ii=4, and in fact with 1200 iterations the CPU cycles are 1200*4 ~= 4800 cycles. The compiler wants to LDDW several doublewords in a few cycles and then CMPYSP them.

    I've tried using a pointer instead of array indexing, but nothing changes.
    I cannot "put everything" into L1D memory, because L1D is only 32 KB and I have 1200 4x4 floating-point matrices; they must reside in L2 memory...
  • OK. More suggestions:

    1. Make sure that the data is aligned on an 8-byte boundary and that you tell the compiler that the data is aligned.
    2. Make sure that the L1D cache is not thrashed constantly. You have three arrays there (two read, one write), so if the three vectors have the same alignment in L1D (and L1D is 2-way set associative), and if the result vector is already in the L1D cache (L1D is write-through; it does not allocate a cache line on writes), the code will constantly thrash the data in L1D.
    3. Try to change the bank configuration. You can play with the DATA_MEM_BANK pragma and move one of the matrices around; see the sketch after this list.
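
    For suggestions 1 and 3, the declarations could look roughly like this (a sketch only; the bank numbers are just an example, and the chosen bank determines whether the 8-byte alignment required by _amemd8 is preserved):

    #pragma DATA_ALIGN(pS, 8)
    #pragma DATA_MEM_BANK(pS, 0)      /* start pS in bank 0 */
    __float2_t pS[4];

    #pragma DATA_ALIGN(pH, 8)
    #pragma DATA_MEM_BANK(pH, 4)      /* start pH in a different bank */
    __float2_t pH[1200][4][4];

    void compute(void)
    {
        _nassert((int)pS % 8 == 0);   /* tell the compiler the data is aligned */
        _nassert((int)pH % 8 == 0);
        /* ... the loop from the original posting ... */
    }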


    Ran
  • Hi Ran,

    1) Data is aligned on 8 bytes with DATA_ALIGN, and I'm using _nassert to pass this information to the compiler.

    2) The L1D cache doesn't seem to be thrashed; the simulator shows zero L1D.miss.conflict and zero L1D.stall.write_buf_full. I can see a lot of L1D.miss.read and thousands of CPU.stall.mem.L1D.

    3) Unfortunately, when I move the bank configuration I cannot use _amem8 anymore and the final results are worse than the previous ones.


    New info:

    1) Nothing changes if I modify the algorithm to use pointers instead of direct matrix access (pH+5 instead of pH[0][1][1], for example).

    2) I get far better results if I don't change the pH matrix every loop iteration.

    For my application I need to advance pH by 16 to use the next 4x4 matrix, so pH+16 every iteration (and that is where I get the bad results).

    If I change the algorithm to pH+4 every iteration, the results are of course incorrect, but "total cycles" drops by almost a factor of 3 (with exactly the same "CPU cycles").

    Do you have any ideas? It's a pity to have such great assembly code and waste it on memory stalls.

  • I fully agree with your last statement and I am running out of ideas.

    I do wonder whether you always get data from L2 rather than L1D. Please reconsider putting everything in L1D (do it with a few matrices) and see how fast you run. How many times does the algorithm use each value? If only once, then the L1D cache does not help, right? When you increase pH by 4 you actually reuse the same values, so you get a speed-up from the cache. Is it possible that you always read from L2 because you use each value only once?

    Please try to put some matrices in L1D, repeat the experiment, and then get back to this posting; a possible placement sketch is below. This is a really interesting issue.
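
    Something along these lines (sketch only; it assumes part of L1D is configured as RAM and that ".l1dram", a placeholder name, is mapped to that range in the linker command file):

    /* Sketch: place a small working set directly in L1D SRAM for the experiment. */
    #define NTEST 64                        /* a few matrices, small enough for L1D */

    #pragma DATA_SECTION(pH_test, ".l1dram")
    #pragma DATA_ALIGN(pH_test, 8)
    __float2_t pH_test[NTEST][4][4];

    #pragma DATA_SECTION(pS_test, ".l1dram")
    #pragma DATA_ALIGN(pS_test, 8)
    __float2_t pS_test[4];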

    Best regards

    Ran
  • Hi Ran, putting all the input data into L1D RAM the performance is great: no L1D memory stalls, and "CPU cycles" is almost identical to "Total cycles".

    > How many times does the algorithm use each value?
    Only once, for all inputs. The L1D cache should still help, because when I load one doubleword I expect the following doublewords to be brought in from L2 at the same time (as part of the cache line), am I right?

    Another piece of information: since I use only the diagonal values of those matrices, I've created a new buffer containing only those values.

    So the algorithm became:

    { loop 1200 times, every loop pH matrix is changing }

    _amemd8(&pS[0]) = _complex_mpysp(_amemd8(&pH[0][0]), r0);
    _amemd8(&pS[1]) = _complex_mpysp(_amemd8(&pH[0][1]), r1);
    _amemd8(&pS[2]) = _complex_mpysp(_amemd8(&pH[0][2]), r2);
    _amemd8(&pS[3]) = _complex_mpysp(_amemd8(&pH[0][3]), r3);

    { end loop }

    In memory all the values are contiguous, and the results are much better than before.
    Of course, in order to use this solution I have to create another buffer and waste a lot of memory (and CPU time to fill it)...
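
    For reference, the gather step looks roughly like this (names are approximate):

    /* Sketch: copy only the diagonal of every 4x4 matrix into a contiguous
       buffer, so each cache line fetched from L2 contains only data that is
       actually used. __float2_t comes from c6x.h (TI compiler). */
    void gather_diagonals(const __float2_t pH[1200][4][4], __float2_t diag[1200][4])
    {
        int m, k;
        for (m = 0; m < 1200; m++)
            for (k = 0; k < 4; k++)
                diag[m][k] = pH[m][k][k];
    }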
  • So now we actually understand what is going on.

    The slowdown is because data moves from L2 to L1, so the stall is the CPU waiting for cache lines to be moved from L2 to L1.
    And because of the structure of the matrix, even though the code needs only the diagonal values, the cache can only move whole cache lines: when you do not gather the diagonals into a separate buffer, the hardware moves a complete cache line for every complex value that is used. This slows the execution.
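
    To put rough numbers on it (assuming the 64-byte L1D line size of the C66x CorePac): one 4x4 matrix of __float2_t is 16 * 8 = 128 bytes, i.e. two cache lines, but only the four diagonal values (4 * 8 = 32 bytes) are used, so roughly three quarters of every line fetched from L2 is wasted. With the packed diagonal buffer every fetched line contains only diagonals, so essentially all of it is used.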

    If you agree please close the thread

    Ran
  • Yes, I agree; it looks like a CPU stall waiting for data to move from L2 to L1.
    Is there any way to "force" the DSP to load a certain amount of data into the L1 cache?
    It would be great to load some data before starting to execute the algorithm...

    I've found something in the "DSP Cache User's Guide" about the assembly routine "touch", which can be used to allocate length bytes of a memory buffer *array into L1D... do you think it could help?
  • Touch enables you to pre-load data into the cache, but of course it takes cycles to load the data. The advantage shows up, for example, when you have a real-time system that is waiting for an event to start processing. If the system is not busy before the event occurs and then has to process very fast, it makes sense to pre-load all the needed data using touch; see the sketch below.
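
    Used roughly like this (a sketch; I am assuming the touch() prototype as given in the Cache User's Guide):

    /* Sketch: pre-load the next chunk of matrices into L1D ahead of their use,
       while the CPU would otherwise be idle waiting for the event. */
    extern void touch(const void *array, int length);   /* from the Cache User's Guide */

    #define CHUNK 64                        /* number of matrices that comfortably fit in L1D */

    void preload_chunk(const __float2_t pH[1200][4][4], int first)
    {
        touch(&pH[first], CHUNK * sizeof(pH[0]));        /* bring CHUNK matrices into the cache */
    }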

    Is this your case?

    Ran
  • No this is not my case, unfortunately.

    I suggest that everybody with the same issues read the "DSP Cache User's Guide" carefully; it has tons of details about these kinds of problems.

    The solution to my performance issue could be:
    1) Create a new buffer containing just the matrix diagonals (the cache loads work better).
    2) Once the data has been loaded, use it for the complex operations before moving on to the next load, because loading data can take longer than performing complex operations such as multiplications or sums...

    Thanks for your help, bye