C67X Cycle true simulator vs. timer measurement

Bernhard Fuchs

Other Parts Discussed in Thread: TMS320C6701

Hello,

i hope somebody can help me understanding the results of the CCS4 Profiler.

After writing some code i wanted to know what the execution time of a code fragment is, so i started up the profiler. I found the "cycle.cpu" report and started to optimize my code.
Some optimizations later i tried the code inside the target hardware. Since profiling is not possible inside the target i used a timer to calculate the time.
I do know that the simulator can´t handle cache effects.
The result was frustrating.

Simulator - "cycle.cpu" - 31.500 Cycles
Target HW - 804.136 Cycles (CNT0 Register says 0x3114A -> 201034 * 4 = which means 804136. The timer has a clock-ratio 1:4)

This can not result by a cache effect since the DSP i am using has only a L2-Instr-Cache which is disabled since the code fits into the internal SRAM.

is the result of the profiler wrong or what else can be the reason?

should i post my question in the c6000-single core forum?

kind regards
Bernhard

over 14 years ago

0 Ki over 14 years ago

TI__Guru**** 472311 points

cycle.cpu does not include any memory access stalls and any other effects from the rest of the system. You are using a CPU simulator so in addition to cache not being simulated, no peripherals are being simulated either.

0 Bernhard Fuchs over 14 years ago in reply to Ki

Intellectual 320 points

thanks for your reply,

its clear for me that memory stalls can not be simulated. But in my case the Code is in the internal instr-mem and the data is inside the internal data-mem.
I did add some events to the profiler to check that no external memory access is done.

What is the usual delay for a internal data memory access resp. instruction memory access?

best regards
Bernhard

0 Nizamudheen A over 14 years ago in reply to Bernhard Fuchs

TI__Intellectual 1380 points

Hi,

The C67x simulator configuration, as clarified by Ki-Soo, does not model the memory-heirarchy. It models the ISA CPU and a fixed-latency beyond. It does not model cache latencies and latecy differences while accessing internal or external memorties.

So, this simulator will exhibit a property wherein the cycles consumed to access an internal memory Vs external memory vs cache memory will all be same. So, you cant use this simulator verstion to optimize your memory placement. However, you could use this simulator to optimize your computation.

And, coming to your specific problem... I will be able to help you better if i can look at the code. Can you kindly share your code?

Regards,

Nizam

0 Bernhard Fuchs over 14 years ago in reply to Nizamudheen A

Intellectual 320 points

Hello,

as i said before its clear to me that the simulator cannot model the memory-heirachy. Its also clear that it calculate with a fixed-latency.

But as i mentioned earlier i do have the complete code and the hole set of data inside the internal SRAM (which means on-chip). I dont wanna optimize the placement of my implementation. I want to evaluate if a given implementation is possible with the dsp.

Since this code is part of my thesis it would be unwise to publish it before the thesis was printed.

But consider the following fact.

1000 complex samples * 200 complex taps of a fir filter will take at minimum 800.000 multiplications since it is complex (1000*200*4). Because the dsp can multiply-add twice a clock cycle this would mean at min. 400.000 clock cycles (its clear to me that we need to add some overhead).

With the simulator i can see that it takes about 402.000 cycles which is fine for me (does the profiler output the value as decimal value or as hex?)

But what if the Code needs 10.000.000 cycles! in the target with only internal memory used for data and instructions. Then there might be an error in the implementation or some not defined behaviour of the memory.

How long does it take to fetch an operand / data from the internal SRAM. (i did only find information that the from instruction memory a fetch of a VLIW-Word (256bit) can happen at 1 cycle. And that 2 operands can be fetched at 1 cycle from the data-iram.

Do i have to declare a variable as "inside-the-internal-sram" to tell the compiler that there is no need to add 4 delay slots during a read/write?

king regards

Bernhard

0 Nizamudheen A over 14 years ago in reply to Bernhard Fuchs

TI__Intellectual 1380 points

>>> does the profiler output the value as decimal value or as hex?<<<<<

Ans: It is decimal.

>>>>>How long does it take to fetch an operand / data from the internal SRAM. (i did only find information that the from instruction memory a fetch of a VLIW-Word (256bit) can happen at 1 cycle. And that 2 operands can be fetched at 1 cycle from the data-iram.<<<

Ans: In C67x, an isolated load from the internal memory has a fixed delay of 4 CPU cycles. With back-to-back loads, the overall loads/cycle can be see as 1 loads every cycle.

And, two loads can be issued in a given cycle, in C67x.

>>>>>Do i have to declare a variable as "inside-the-internal-sram" to tell the compiler that there is no need to add 4 delay slots during a read/write?<<<<<

Ans: As indicated earlier, any memroy access (irrespective of the memory hierarchy the contents reside in) will always have a fixed 4 cycle delay-slot. And, so there is no ways to inform the compiler to insert lesser waits.

>>>>>But what if the Code needs 10.000.000 cycles! in the target with only internal memory used for data and instructions. Then there might be an error in the implementation or some not defined behaviour of the memory. <<<<<

Ans: Did you refer to harware, when refering to target? And, can you elaborate on the "error in the implementation ...." statement?

Regards,

Nizam

0 Bernhard Fuchs over 14 years ago in reply to Nizamudheen A

Intellectual 320 points

-> And, two loads can be issued in a given cycle, in C67x.

With the two loads, do you mean two loads from the data-sram at the same time?

-> Ans: As indicated earlier, any memroy access (irrespective of the memory hierarchy the contents reside in) will always have a fixed 4 cycle delay-slot. And, so there is no ways to inform the compiler to insert lesser waits.

The question arrised after reading the C-Optimization Handbook from TI. I read something about "near" and "far" keywords which were new to me since i come from a microcontroller-world.

-> Ans: Did you refer to harware, when refering to target? And, can you elaborate on the "error in the implementation ...." statement?

Yes, with target i mean running on the dsp. Connection is made via a XDS510 JTAG Emulator. The error statement is a bit shady. Lets say it in other words. Maybe my implementation is not optimal enougth. Can you give me a hint how to evaluate the error?

To sum up, does this mean that the "Cycle-True" simulator tells how many cycles are needed at minimum which means without any conclusion about memory even if internal?.

kind regards

Bernhard

0 Bernhard Fuchs over 14 years ago in reply to Bernhard Fuchs

Intellectual 320 points

in the meanwhile i tried something different.
i used the texas DSP library to evaluate the problem

Since texas offers this hand-optimized lib this is maybe a good start. I inserted a function call of DSPF_sp_FIR_GEN inside my testcode. As you might know, there is a calculation formula for clock cycles for every function implemented.
Additional to this i document, there is a second document with examples for this library (spra947a). On Page 9 of this document you can see a table with the calculated cycle cound and the observed. Those values differ only a bit.

Simulation result about 24.781 cycles
Running inside DSP about about 800.000 cycles

this is a factor of about 32!!!
I have to say that my DSP does not have any L1 cache (6701).

Do you have any idea how this result comes together?

regards

Bernhard

0 Bernhard Fuchs over 14 years ago in reply to Bernhard Fuchs

Intellectual 320 points

i made a mistake on the timervalue,

Simulation result about 24.781 cycles
Running inside DSP about 800.000 cycles * 4 since the timer operates at /4 clock

0 Bernhard Fuchs over 14 years ago in reply to Bernhard Fuchs

Intellectual 320 points

nobody has a clue?

0 Bernhard Fuchs over 14 years ago in reply to Bernhard Fuchs

Intellectual 320 points

someone has any idea whats happening here?

0 Nizamudheen A over 14 years ago in reply to Bernhard Fuchs

TI__Intellectual 1380 points

Hi,

Yes. The C674X simulator gives you the lower-bound cycle values. As you can appreciate, this varaint of hte simulator does not model the system-effects and hence cannot project the exact or upper bound. In summary, the C674x simulator will always gives you a best-case result (i.e, the application with no memory-access stalls).

And, regarding the SPRA947A

..... Do you see that the calculated cycle count (as per the formula in Page 9) matches the simulator results? Or, does it match the DSP result?

Regards,

Nizam

0 Bernhard Fuchs over 14 years ago in reply to Nizamudheen A

Intellectual 320 points

hi,

thanks for the response

the calculated count matches the simulator result. Its completely clear to me that the result of the simulator is just a lower-bound. But there has to be some plausible relation to the result when running to code inside the DSP.

Let me rephrase the original question:

Using the DSPLIB from TI. DSPF_sp_fir_cplx
If the code runs inside the simulator the cycle count is ok. It correlates with the calculation formula.
If the code runs inside the target (C6701-SP EVM) then its 302 times slower!

Same with the DSPF_sp_fir_gen but only 15 times slower

How is it possible that there is such a huge difference.
First i thought that this could be a problem because of the missing cache. Then i looked into the assembly code and found out that i cant be a cache issue because the code is written in a way to use it without any cache. It can`t also be a problem of memory-stalls since the mentioned functions of the DSPLIB are written in a way to eliminate such problems.

what is concerning me is that in the document spra947a on page 9 you can see an example of the cycle simulator and the DSP (different DSP 6713). The difference is very small!

i do not have the values of the DSPF_sp_fir_cplx filter on the table but i have them for the DSPF_sp_fir_gen:

DSPF_sp_fir_gen:
calculated cycles from formula:                         208.808
observed cycles L2-SRAM:                            3.164.688 (Timer measurement) - Majda Jovisic from TI told me to recompile the DSPLIB for the 674x because of some bugs
observed cycles L2-SRAM:                            2.428.000     <<---- we are here atm

still a factor 11!

i am trying to evaluate the maximum performance of the dsp for the use in a space-application. Unless you can show me any relation between the real world and the simulation the cycle-simulator looks quite useless for me.

do you understand my problem?

regards

Bernhard

0 Brad Griffis over 14 years ago in reply to Bernhard Fuchs

TI__Guru*** 125430 points

Bernhard,

If you put the code into L1 SRAM and the data into the internal RAM then you should see precisely the same performance as the simulator.

Can you post the map file from your application? I want to see the exact placement of both your code and data.

Bernhard Fuchs said:
Then i looked into the assembly code and found out that i cant be a cache issue because the code is written in a way to use it without any cache.

Please elaborate. Is cache enabled or disabled? Where is the code placed? If you place code in L2 with the cache dsiabled ("without any cache") then you will incur a cache miss penalty on each and every single CPU cycle.

Bernhard Fuchs said:
It can`t also be a problem of memory-stalls since the mentioned functions of the DSPLIB are written in a way to eliminate such problems.

It's impossible to eliminate memory stalls by simply adjusting your code. Memory stalls are 100% a function of data placement. If data is in external memory then you will incur a stall. You cannot possibly avoid it. The only way to "avoid" this issue is to use DMA to move the data into internal memory and then do your processing in internal memory.

0 Brad Griffis over 14 years ago in reply to Bernhard Fuchs

TI__Guru*** 125430 points

Bernhard Fuchs said:

DSPF_sp_fir_gen:
calculated cycles from formula:                         208.808
observed cycles L2-SRAM:                            3.164.688 (Timer measurement) - Majda Jovisic from TI told me to recompile the DSPLIB for the 674x because of some bugs
observed cycles L2-SRAM:                            2.428.000     <<---- we are here atm

If you simply took C code from the 674x dsplib and recompiled it, then I do NOT expect you to obtain the cycle counts listed in the formula. Those cycle counts apply only to the hand written assembly code.

There is a bug with DSPF_sp_fir_gen but it can be easily worked around by disabling interrupts before calling the function and then restoring interrupts afterward. I recommend going that route instead of recompiling other code. The performance of the hand written assembly will be superior.

0 Bernhard Fuchs over 14 years ago in reply to Brad Griffis

Intellectual 320 points

Hi Brad,

>>If you simply took C code from the 674x dsplib and recompiled it, then I do NOT expect you to obtain the cycle counts listed in the formula. Those cycle counts apply only to the hand written assembly code.

i did not recompile c code. It was assembler code.

>>If you put the code into L1 SRAM and the data into the internal RAM then you should see precisely the same performance as the simulator.

According to the documentation there is no L1-SRAM. The internal SRAM is referenced as L2-SRAM!

The map file is attached to the post.2364.C6701_MR_FIR_TEST.zip

>>Please elaborate. Is cache enabled or disabled? Where is the code placed? If you place code in L2 with the cache dsiabled ("without any cache") then you will incur a cache miss penalty on each and every single CPU cycle.

As already said... there is no L1 data cache available. Only L2-Cache if using the external memory via EMIF

Processor type is TMS320C6701 - no L1 SRAM, no L1-Cache!

best regards
Bernhard

0 Brad Griffis over 14 years ago in reply to Bernhard Fuchs

TI__Guru*** 125430 points

FYI, I'm on vacation right now but the kids are napping. I need to be brief!

I'm looking at the 6701 data sheet right now.

Page 1:

"512K-Bit Internal Program/Cache"

"512K-Bit Dual-Access Internal Data"

In other words, there are separate memories for program and data, 64KB program and 64KB data.

Page 2:

"Program memory consists of a 64K-byte block that is user-configurable as cache or memory-mapped program space."

In other words the program memory can be setup as L1 cache or you can disable the cache and use it as L1 SRAM.

You can find further information related to the internal memory and how to configure it in 620x/670x Program and Data Memory Controller Reference Guide. Specifically it describes how to configure the program memory as cache or as mapped memory through the PCC field of the CSR. However, looking at your map file it looks like you are allocating memory directly to address 0x0000 - 0xFFFF, so I assume you must have PCC configured so as to disable cache. Please confirm by looking at bits 7:5 of CSR in a watch window.

So according to your map file it looks like you are in fact putting your code directly into the program memory and your data directly into the data memory, i.e. you're not using external memory. That said, I assume you should match the simulator performance except perhaps for stalls related to simultaneous bank accesses, though that would never give the performance hit you're seeing.

The only other thing that comes to mind would be to know more about the exact mechanism you are using to make your measurements, i.e. are you measuring incorrectly or something along those lines. For example, if you have a printf anywhere that will kill your performance.

Brad

0 Bernhard Fuchs over 14 years ago in reply to Brad Griffis

Intellectual 320 points

Thanks for answering. Hope you enjoy your holidays :)

The Data Memory Controller reference document is something i was searching for a long time. I know understand how the memory bank are organized.
I checked the cache bits of CSR -> Program Cache is disabled.

We thought that it might be a bank conflict. But if i look at the code i can´t see any parallel load instructions.
Can a bank conflict also occur on interleaved load instructions if X loads are sequential processed like LDW,LDW,LDW. Since a load takes 5 cycles, and there are often many seq. loads can this cause a bank conflict?

The measurement is done in two ways. First using the internal timer which runs at 1/4 clock of the dsp and multiplied the timer register as decimal value * 4. At first only this way. After some unsuccessful days of testing i wanted to confirm the timer result so i used a scope to measure the timing. I'm using the XCLKX1 pin from MCBSP1 as general purpose I/O (theres a TTL converter on the path but thats ok). I could confirm the timer measurement with this technique. The internal clock was also confirmed using a scope at the AF22 pin (CLKOUT of the DSP). The DSP runs at 100MHz clock.

Pseudocode:

clear CNT Reg of timer0
start timer
call function
stop timer

OR

set pin HIGH
call function
set pin LOW

Whats confusing me is that if i trigger the pin seq. H,L,H,L,H,L i can see that a period is 6 Cycles but i am expecting 3 cycles since a STW takes 3 cycles until it is finished and 3 Cycles after that i should go down. But that's some other problem.
Since i could confirm the timer with the scope the timer-cnt value is good.

There are also no printf in the function. Only 1 printf after all tests. But i will remove it.

0 Brad Griffis over 14 years ago in reply to Bernhard Fuchs

TI__Guru*** 125430 points

Bernhard Fuchs said:
Can a bank conflict also occur on interleaved load instructions if X loads are sequential processed like LDW,LDW,LDW.

A conflict only occurs when there are two simultaneous accesses to the same bank, so having sequential loads would not cause a problem.

Bernhard Fuchs said:
Whats confusing me is that if i trigger the pin seq. H,L,H,L,H,L i can see that a period is 6 Cycles but i am expecting 3 cycles since a STW takes 3 cycles until it is finished and 3 Cycles after that i should go down.

The peripheral registers reside off the "configuration bus" which is not nearly as fast as the data bus. For this reason I would expect accesses to the timer and/or GPIO registers to be slow.

Bernhard Fuchs said:

Pseudocode:

clear CNT Reg of timer0
start timer
call function
stop timer

OR

set pin HIGH
call function
set pin LOW

Are you only calling the function once inside this loop? Try calling it 2 times and then 3 times to see how the number changes. In other words, I'd like to figure out the overhead of your benchmark.

0 Bernhard Fuchs over 14 years ago in reply to Brad Griffis

Intellectual 320 points

hi,

sorry for the delay. I was ill the last days.
The problem is solved! I had something to do with the variable placement. I need some time to verify this but i looks like a bug of the memory viewer.
It showed that the data was correct aligned , but it wasnt. So the performance was mainly decreased by memory bank hits. If i have time the next week i will verify this.

nevertheless thanks for your support :)

best regards

Bernhard

0 Brad Griffis over 14 years ago in reply to Bernhard Fuchs

TI__Guru*** 125430 points

I'm glad to hear you've made some break throughs. So do you finally have single-cycle operation where the simulator matches the hardware?

Code Composer Studio™︎

Code Composer Studio forum

C67X Cycle true simulator vs. timer measurement