
ASM problem in C6474

Other Parts Discussed in Thread: TMS320C6474

Hello,

I've coded an assembly function on the C64x+ of the TMS320C6474 EVM board, but the execution time is not as expected: counting the cycles my asm function takes, the theoretical execution time should be 19 ms, yet on the board I measure 37 ms. The only explanation I have found for the difference is memory stalls, especially since my asm function uses data in DDR and L2. I'm wondering whether it is normal to get these results, given that the cache is enabled. Are memory stalls the only factor that could cause such a difference?

Note:

- data is laid out carefully so that the algorithm processes adjacent elements (no long strides in DDR or L2)

 

  • Mounir,

    It is normal to get some memory stalls when using cache.  But I would suspect something is not right if it's causing a function to take twice as long as expected.

    1- How many times are you running this code to get your numbers?  If you only run it once, cache will help a bit, but you'll get a miss on every new cache line simply because this is the first time those values have been accessed.  If you run the code in a loop and take the average number of cycles, that will give you a more representative number.
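    The averaging idea can be sketched in plain C. The workload function, iteration count, and use of the standard clock() are all illustrative assumptions; on the C64x+ itself you would read the TSCL cycle counter around the calls instead.

```c
#include <time.h>

/* Stand-in workload; in practice you would call your .asm routine here.
   (Name and body are illustrative.) */
static volatile int sink;
static void workload(void)
{
    for (int i = 0; i < 1000; i++)
        sink += i;
}

/* Run the routine many times and average, so the compulsory cache
   misses of the very first pass are amortized over the measurement.
   On the DSP, replace clock() with reads of the TSCL cycle counter. */
double average_cost(int runs)
{
    clock_t t0 = clock();
    for (int i = 0; i < runs; i++)
        workload();
    return (double)(clock() - t0) / runs;
}
```

    Comparing a single cold run against the average over, say, 100 runs separates the one-time compulsory misses from steady-state stalls.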

    2- How are you measuring the number of cycles?  Are you starting a counter when you enter your .asm function, and then stopping it when you exit?  If so, could there be any interrupts occurring while in your function?

    3 - How big is your cache configured to be?  Does increasing its size make a difference?

    4- How big is your .asm function?  Is it a tight loop?  How many times does it get executed in a single call of your function?  Any branches? Each time you branch, no matter how far, you incur a delay where the pipeline needs to be flushed. 

    5 - Sorry, I'm obligated to ask this question... why are you implementing this function in assembly? It's _very_ difficult to hand-optimize assembly code so that it executes as fast as C6x-compiler-optimized C code.  There are other types of stalls that could be occurring. Two that come to mind are cross path stalls and memory bank conflict stalls. These can occur in hand-written assembly, but the C compiler would avoid them for the most part.

    The pipeline is complicated and scheduling is tedious, without even getting into the pain of writing code in any assembly language.  I understand, in some case, there just aren't constructs to represent some stuff in C.  Which leads to my next question.

    Are you using Linear Assembly or standard C6x assembly?  Are you familiar with linear assembly format?  It's very similar to regular C6x assembly, except you allow the compiler/assembler to allocate the appropriate CPU registers and allow the optimizer to do the instruction scheduling. If you _have_ to write assembly code, Linear Assembly will let you use Assembly Language constructs, but allow the compiler to do the scheduling and register allocation so as to avoid various types of stalls, including the two that I mentioned.
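    To illustrate the format, a linear-assembly routine might look like the sketch below (a 16-bit dot product, not Mounir's kernel; symbol names and the trip count are illustrative). No functional units or physical registers are written, so the assembly optimizer schedules and allocates them, and software-pipelines the loop.

```
; Linear assembly sketch: symbolic operands only; the tools assign
; units/registers and do the scheduling.
        .global _dot16
_dot16: .cproc  a_ptr, b_ptr, cnt
        .reg    ai, bi, prod, sum
        ZERO    sum
loop:   .trip   8               ; loop runs at least 8 times (assumed)
        LDH     *a_ptr++, ai
        LDH     *b_ptr++, bi
        MPY     ai, bi, prod
        ADD     prod, sum, sum
        SUB     cnt, 1, cnt
  [cnt] B       loop
        .return sum
        .endproc
```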

    Regards,

    Dan

     

     

  • Thanks for your reply, Dan.

    The asm function I'm developing is a complex matrix multiply. I tried several C implementations beforehand, with all the hints from the optimization workshop, but they still don't meet my timing requirements (at best 56 ms to get the job done!). The first asm function I wrote turns out to be more efficient than the C-coded ones, despite reaching only 50% of theoretical efficiency. I'm not using Linear Assembly; I preferred standard C6x asm because I want full control over every detail at the core/register-bank level. I wrote the assembly code for the whole algorithm, so I'm not calling it in a loop; I executed it only once. The code is not interrupted, and the cache is set to its maximum sizes: 256 KB of L2 and 32 KB each of L1P/L1D. The asm file is 410 lines long and is called from C.

    The ideal performance is (m/2)np cycles (where m, n, p are defined constants), since I'm using CMPY and can issue 2 CMPYs per cycle on a single C64x+ core. I optimized the asm function until it should finish in (m/2+3)np cycles, which I've never seen in practice!
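    The cycle formula converts to wall-clock time as sketched below; the sample m, n, p values and the nominal 1-GHz clock in the test are assumptions for illustration, not Mounir's actual dimensions.

```c
/* Expected runtime from the thread's cycle budget: the hand-scheduled
   loop needs (m/2 + 3)*n*p cycles.  Divide by the core clock to get
   time; values plugged in elsewhere are illustrative. */
double expected_ms(int m, int n, int p, double cpu_hz)
{
    double cycles = (m / 2.0 + 3.0) * (double)n * (double)p;
    return 1e3 * cycles / cpu_hz;   /* cycles -> milliseconds */
}
```

    If the predicted time is 19 ms and the board reports 37 ms, roughly half of all cycles are going to stalls rather than issued instructions.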

    The innermost loop of the asm code contains:

    CMPY .M2X B27,A27,B29:B28
    || CMPY .M1X A26,B26,A29:A28
    || LDDW .D2 *B20++,B27:B26
    || LDDW .D1 *A22++,A27:A26
    || ADD .L1 A29,A25,A25
    || ADD .S1 A28,A24,A24
    || ADD .L2 B29,B25,B25
    || ADD .S2 B28,B24,B24

    This computes 2 complex multiplies, loads 4 complex numbers into the register banks, and accumulates the results.

    That single execute packet has to be looped. I don't see how to use the branch instruction "B" efficiently without losing additional cycles, and I didn't use SPLOOPD; so, since the inner loop of the matrix multiply is repeated m/2 = 32 times, which is not too many, I simply repeated the execute packet manually, as follows:

    ...

    CMPY .M2X B27,A27,B29:B28 ; 10
    || CMPY .M1X A26,B26,A29:A28
    || LDDW .D2 *B20++,B27:B26
    || LDDW .D1 *A22++,A27:A26
    || ADD .L1 A29,A25,A25
    || ADD .S1 A28,A24,A24
    || ADD .L2 B29,B25,B25
    || ADD .S2 B28,B24,B24

    CMPY .M2X B27,A27,B29:B28 ; 11
    || CMPY .M1X A26,B26,A29:A28
    || LDDW .D2 *B20++,B27:B26
    || LDDW .D1 *A22++,A27:A26
    || ADD .L1 A29,A25,A25
    || ADD .S1 A28,A24,A24
    || ADD .L2 B29,B25,B25
    || ADD .S2 B28,B24,B24

    ...
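    For reference, what each CMPY/ADD pair in the packet computes can be stated in plain C: a 16-bit complex multiply whose 32-bit real and imaginary products are accumulated separately. The struct layout below is an assumption (CMPY actually operates on packed 16-bit pairs in a 32-bit register).

```c
#include <stdint.h>

/* 16-bit complex value; field order is an illustrative assumption. */
typedef struct { int16_t re, im; } cplx16;

/* Complex multiply-accumulate over two vectors, with 32-bit real and
   imaginary accumulators, mirroring one CMPY + two ADDs per element. */
void cmac(const cplx16 *x, const cplx16 *y, int len,
          int32_t *acc_re, int32_t *acc_im)
{
    int32_t re = 0, im = 0;
    for (int i = 0; i < len; i++) {
        re += (int32_t)x[i].re * y[i].re - (int32_t)x[i].im * y[i].im;
        im += (int32_t)x[i].re * y[i].im + (int32_t)x[i].im * y[i].re;
    }
    *acc_re = re;
    *acc_im = im;
}
```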

    I was aware of cross path stalls, but I don't think that's the case in my code (the user guide says: "It should be noted that no stall is introduced if the register being read is the destination for data placed by an LDx instruction"). Also, I don't see how a memory bank conflict could happen.

    Perhaps the fact that on each cycle I have to load 128 bits via the two LDDWs (from L2 [*B20] and DDR [*A22]) could explain this huge additional latency of about 16 ms (I was convinced of this for a while before getting confused again). So I recently coded another algorithm in ASM: a tiled (blocked) matrix multiply, which reuses loaded data as much as possible and thus reduces memory accesses. In theory it cuts memory accesses by 75% (only 4 LDDW instructions per 8 cycles now), but it STILL takes about the same time to finish: roughly 36 ms!!
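    The tiling idea above can be sketched as follows: work on BLK x BLK sub-matrices so each tile loaded from slow memory is reused across a whole tile of output before eviction. BLK, the float element type, and row-major layout are illustrative assumptions (the real kernel uses 16-bit complex data).

```c
/* Tiled (blocked) matrix multiply sketch: C = A * B, with A of size
   m x n, B of size n x p, all row-major.  Each a[i][k] fetched from
   slow memory is reused for an entire j-tile before moving on. */
enum { BLK = 4 };   /* tile edge; tuned to cache size in practice */

void matmul_tiled(const float *a, const float *b, float *c,
                  int m, int n, int p)
{
    for (int i = 0; i < m * p; i++)
        c[i] = 0.0f;

    for (int ii = 0; ii < m; ii += BLK)
      for (int kk = 0; kk < n; kk += BLK)
        for (int jj = 0; jj < p; jj += BLK)
          for (int i = ii; i < ii + BLK && i < m; i++)
            for (int k = kk; k < kk + BLK && k < n; k++) {
                float aik = a[i * n + k];   /* reused for whole j-tile */
                for (int j = jj; j < jj + BLK && j < p; j++)
                    c[i * p + j] += aik * b[k * p + j];
            }
}
```

    Note that blocking reduces traffic from the next memory level only if the tiles actually fit in (and stay in) cache; if the working set still spills, the measured time will not improve, which may be what is happening here.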

    Any suggestions/explanations ?

     

    Thanks, Mounir

     

  • Mounir,

    I have a few questions/ideas/comments.

    1. Is this just a standard complex matrix multiply?  Or does it also do something special in the course of the algorithm?  The reason I ask is that the C64x+ DSP Algorithm Library (DSPLIB) contains hand-optimized assembly versions of a complex matrix multiply, in both single-precision floating point and 16-bit integer variants.  At the very least, this might serve as a reference point for you.  You can get the C64x+ DSP Library here.

    http://software-dl.ti.com/sdoemb/sdoemb_public_sw/dsplib/3_0_0_8/index_FDS.html

    Also, see http://www.ti.com/lit/an/spra666/spra666.pdf for tips on optimizing loops with the C compiler.  This goes into much more detail than the Optimization workshop does.

    2. Where is your code located (.text)?  Could there be delays in accessing it?  Some cycles could be lost because, without a branch or an SPLOOP, the CPU keeps having to fetch new packets of instructions.  This probably isn't a huge portion of the difference, but it's some.

    Remember, you can branch back to the top of the loop instead of unrolling it completely.

    3. There is hardware available on chip that will let you detect and count various events (cross path stalls, cache misses, etc.).  If you can figure out which of these events occurs most often, it will be much easier to pinpoint the cause.  There is a tutorial here: http://processors.wiki.ti.com/images/0/0d/Ubm-ext-02.pdf  Focus on the section about setting up a counter (pg. 31).  If you run your test a few times, counting different events each time, you should be able to identify which events occur most often.


    Regards,
    Dan

     

     

     

  • Hi Dan,

    DSPLIB doesn't seem to offer a C64x+-optimized matrix multiply with 32-bit precision at the output. However, TI's C implementation is useful, and it appears to use the tiled matrix multiply. Besides, as you mentioned, my matrix multiply is a bit different, as it includes a modified arrangement of the output in memory.

    I put all the sections in L2, including .text, except the input/output buffers, which live in DDR (they take 9 MB and 4.5 MB).
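    A section placement like that might look roughly like the fragment below in the linker command file; the memory region names (L2SRAM, DDR2) and the buffer section names are assumptions, not taken from the actual project.

```
/* Illustrative linker command file fragment (names assumed). */
SECTIONS
{
    .text        > L2SRAM   /* code kept in L2, as described  */
    .stack       > L2SRAM
    .bss         > L2SRAM
    .inputData   > DDR2     /* ~9 MB input, too big for L2    */
    .outputData  > DDR2     /* ~4.5 MB output                 */
}
```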

    Thanks for your third idea. I just tried measuring memory stalls and misses, and it seems L1D shows by far the highest stall count, which explains almost all of the additional latency; L1P and the other stall counts are comparatively very small. Perhaps I didn't optimize the L2 cache enough for the external memory.

    Best regards,

    Mounir