AM5728: C66x reducing pipeline stalls when using L1 configured as cache and accessing data in L2 SRAM

Part Number: AM5728

I am trying to implement a time-critical algorithm on the C66x that needs to process about 32 kB of data. The algorithm is relatively simple and clearly bound by the width of the data bus. Because other, more complex tasks are also running in the background in the target environment, L1 memory has to be configured as cache only. L2 is configured as 128 kB of cache, with the remaining part as SRAM. To better analyse the issue that I am observing, I am using the following assembly code:

   SPLOOP 2                   ; software-pipelined loop, initiation interval = 2 cycles
   LDDW *AYPTR++[2], AH       ; AYPTR/BYPTR stand for address registers,
|| LDDW *BYPTR++[2], BH       ; AH/BH for 64-bit register pairs

   LDDW *AYPTR++[2], AH       ; second execute packet of the 2-cycle kernel;
|| LDDW *BYPTR++[2], BH       ; each pointer post-increments by 16 bytes

   SPKERNEL
   NOP 9

Before running this code, the whole content of the L1 data cache is invalidated. The trace for the SPLOOP is below.

The data accessed in the loop is placed in L2 SRAM (BYPTR = 8 + AYPTR). It is easy to notice a lot of pipeline stalls occurring whenever the next L1 cache line is accessed. The stalls actually take more time than the actual processing. Is there a way to improve performance for this kind of algorithm? I know about the touch function, but it gives mixed results when not all data is evicted between consecutive runs. I also assume that to use IDMA I need to configure L1 as SRAM? What else can be done to reduce the number of cycles wasted on pipeline stalls?
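For reference, the measurement around the loop is set up roughly as in the sketch below (load_test_kernel is just a placeholder name for the assembly loop above, and the invalidation call assumes the TI CSL cacheAux API):

   #include <stdint.h>
   #include <c6x.h>                  /* TSCL control register declaration    */
   #include <ti/csl/csl_cacheAux.h>  /* CACHE_invAllL1d(), from the TI CSL   */

   /* placeholder for the SPLOOP kernel shown above */
   extern void load_test_kernel(const uint64_t *a, const uint64_t *b, uint32_t count);

   uint32_t measure_cycles(const uint64_t *a, const uint64_t *b, uint32_t count)
   {
       uint32_t t0, t1;

       TSCL = 0;                     /* first write starts the free-running counter */
       CACHE_invAllL1d(CACHE_WAIT);  /* start from a cold L1D, as in the experiment */

       t0 = TSCL;
       load_test_kernel(a, b, count);
       t1 = TSCL;

       return t1 - t0;               /* CPU cycles, including the stall cycles */
   }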

  • Tom,

    I just read your description and also discussed it with my team.

    The stalls in the trace data you provided are due to L1 cache line fills, and we can see they happen precisely at the 64-byte cache line boundaries. They are expected.

    I am curious to understand a bit more about your system: how are these 32 KB of data being consumed by the CPU? Your test code sequentially loads from L2 into the same set of registers, but in your real system, what is the pattern in which they get used? That is usually where the stall cycles get filled.

    If you definitely need to load all the data into the maximum number of available registers, there is a suggestion similar to the __touch approach:
    1. Partition the L1D so that 8 KB is SRAM.
    2. Do a manual "touch", where you use the DSP to access the first word of every 64 bytes (see the sketch after this list).
    As you may already see, this will only save cycles if both conditions are true:
    1. The 8 KB is used multiple times instead of just once; otherwise the cost is the same.
    2. Your background tasks can live with a smaller L1D cache.
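
    A minimal C sketch of the manual touch in step 2 is below (the function name is just a placeholder; reading one word per 64-byte line is enough to allocate the whole line in L1D):

    #include <stdint.h>

    /* placeholder name: walk the buffer and read the first word of every
     * 64-byte L1D line so the line is brought in before the processing loop */
    static void manual_touch(const void *buf, uint32_t size_bytes)
    {
        const volatile uint32_t *p = (const volatile uint32_t *)buf;
        uint32_t i, sink = 0;

        for (i = 0; i < size_bytes / 4; i += 16) {   /* 16 words = 64 bytes */
            sink += p[i];
        }
        (void)sink;
    }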

    Let me know if these notes help. I will close the ticket to satisfy our internal tracking, but it should re-open once you reply.

    regards
    Jian
  • Jian,

    jian35385 said:
    I am curious to understand a bit more about your system: how are these 32 KB of data being consumed by the CPU? Your test code sequentially loads from L2 into the same set of registers, but in your real system, what is the pattern in which they get used? That is usually where the stall cycles get filled.

    Of course the algorithm is more complex than that, but we can approximate it by a MAC operation. The important part is that when all the data is in L1, the performance is good. But when the data is not already cached in L1, the stalls take more time than the actual processing. The algorithm has already been optimized so that all data needs to be loaded only once, and the other processing units (.M, .L, .S) are used in parallel with .D. The next processing pass happens on the same data set. However, periodically the data sets are switched. This is the situation I am trying to model with the snippet in the opening post.
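
    To give a more concrete picture, the per-block work can be thought of as something like the loop below (purely illustrative; _dotp2, a cl6x built-in intrinsic, stands in for the real arithmetic):

    #include <stdint.h>

    /* illustrative stand-in for the real kernel: a running sum of 16-bit MACs
     * over two blocks that live in L2 SRAM (_dotp2 is a cl6x intrinsic)       */
    int32_t mac_kernel(const int32_t *restrict a, const int32_t *restrict b,
                       uint32_t nwords)
    {
        int32_t  sum = 0;
        uint32_t i;

        for (i = 0; i < nwords; i++) {
            sum += _dotp2(a[i], b[i]);   /* two 16x16 multiplies plus adds per word */
        }
        return sum;
    }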

    Can you show me an example of how to fill the stall cycles caused by cache misses inside an SPLOOP?

    Thanks,

    Tom

  • Tom,

    You mentioned the touch function, but I was not sure if you have run across the document below:
    www.ti.com/.../sprugy8.pdf
    There is a detailed example in Section 3.2.2, pp. 3-6, with an assembly routine.

    Note that it is not possible to completely eliminate the stalls. Instead, if you mask all background traffic, all 32 KB can be warmed into the L1D, and the parallel D1 and D2 loads save about half the stall cycles.

    There are additional optimizations in the assembly routine that you might have noticed.
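
    In C, the access pattern of that routine can be approximated roughly as below (a simplified sketch of the idea only; the routine in sprugy8 is hand-optimized assembly that pairs the two loads on .D1 and .D2 in the same cycle):

    #include <stdint.h>

    /* touch two 64-byte lines per iteration; because the two loads are
     * independent, the compiler can schedule them on .D1 and .D2 in parallel,
     * which is what roughly halves the stall time                            */
    static void touch_two_lines(const void *buf, uint32_t size_bytes)
    {
        const volatile uint8_t *p = (const volatile uint8_t *)buf;
        uint32_t i;

        for (i = 0; i < size_bytes; i += 128) {   /* assumes a multiple of 128 bytes */
            (void)p[i];          /* line n   */
            (void)p[i + 64];     /* line n+1 */
        }
    }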

    Jian
  • Jian,

    You are right - accessing data in the order used by the touch function in sprugy8 reduces the number of stalls. I guess this is the maximum performance we can get with L1 configured as cache.

    I am exploring one more option: I am trying to use L1P as SRAM, and I plan to use IDMA to transfer data into it from L2 SRAM. However, when the core is accessing the data in L1P, there are still 3-4 cycles taken by pipeline stalls. I am not observing this issue when L1D is used as SRAM. Is this caused by a larger number of wait states for L1P? What is the most common scenario that benefits from L1P configured as SRAM?
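
    For completeness, this is roughly how I trigger the transfer (register addresses taken from the C66x CorePac User Guide; please double-check them and the COUNT/STAT bit layout, this is only a sketch of my setup):

    #include <stdint.h>

    /* IDMA channel 1 registers, addresses per the C66x CorePac User Guide */
    #define IDMA1_STAT    (*(volatile uint32_t *)0x01820100)
    #define IDMA1_SOURCE  (*(volatile uint32_t *)0x01820108)
    #define IDMA1_DEST    (*(volatile uint32_t *)0x0182010C)
    #define IDMA1_COUNT   (*(volatile uint32_t *)0x01820110)

    /* copy a block from L2 SRAM into L1P SRAM; both addresses must be local
     * to the CorePac and the length a multiple of 4 bytes                    */
    static void idma1_copy(const void *src, void *dst, uint32_t bytes)
    {
        IDMA1_SOURCE = (uint32_t)src;
        IDMA1_DEST   = (uint32_t)dst;
        IDMA1_COUNT  = bytes & ~3u;   /* writing COUNT starts the transfer;
                                         PRI/INT bits left at zero here       */

        while (IDMA1_STAT != 0)       /* busy-wait until channel 1 is idle    */
            ;
    }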

    Thanks,

    Tom

  • Tom,
    Do you happen to have a trace similar to the one you captured before? We suspect the stalls were due to program execution out of L2.
    Jian
  • Jian,

    You were right. However, after moving the function from OCMC to L2, I am observing even more stalls:

    Maybe it is because L1P now needs to be accessed for both code and data? I also found it strange that the buf data is not zeroed. I am attaching my CCS test project to rule out any possible configuration errors that I could have made.

    bx15c66_l1p.zip

    Thanks,

    Tom

  • Actually, the code should be buffered in the SPLOOP buffer, so I have no idea why there are so many stalls. Jian, can you provide some insight?

  • Tom,
    I overlooked your post before the holidays. The only reason I can think of is L1P being dynamically powered down while the DSP executes out of the SPLOOP buffer. There is a note in the CorePac spec, Sec. 2.7.2, about this feature, and it seems it is not possible for us to disable the power-down.
    I can also talk to our compiler team tomorrow when they are available.
    regards
    Jian
  • Tom,

    I chatted with the compiler team, and you may try to use
    --disable:sploop
    to disable the SPLOOP buffer; the generated code will use branch instructions instead. That will let you verify whether the 11-cycle stalls still occur.
    As usual, I need to close the ticket for book-keeping, but feel free to reopen it.
    regards
    Jian