OMAP-L138 CCS4.1 Simulator vs. Real Board

Alexey Tsirlin

Hi.

I have written a simple function that converts a buffer of fixed-point values into a buffer of floating-point values. I compared the number of cycles the function takes on the simulator (C674X CPU cycle accurate simulator) and on the ZOOM experimenter board (XDS100v1 debugger). Both the code and the data are placed in the L2SRAM of the DSP, and all interrupts are disabled. The simulator reports about half as many cycles as the real board.

Cycle measurement was done using the Target->Clock->Enable option of CCS4.
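As a cross-check on the CCS clock, the cycle count can also be read from inside the program. The following is a minimal sketch, assuming the TI C6000 compiler (which defines _TMS320C6X and exposes the TSCL time-stamp register through c6x.h). The helper names are hypothetical, and the clock() fallback exists only so the file compiles off-target; it is not cycle accurate.

```c
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* TI header: declares the TSCL/TSCH time-stamp registers */

/* Any write to TSCL starts the free-running CPU cycle counter. */
void cycle_counter_start(void) { TSCL = 0; }

/* Low 32 bits of the cycle counter. */
uint32_t cycle_counter_read(void) { return TSCL; }
#else
#include <time.h>

/* Host fallback (assumption: for off-target builds only, not cycle accurate). */
void cycle_counter_start(void) { }
uint32_t cycle_counter_read(void) { return (uint32_t)clock(); }
#endif

/* Wrap-safe difference between two 32-bit counter readings. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    /* Unsigned subtraction stays correct across a single wraparound. */
    return (uint32_t)(end - start);
}
```

Typical use: call cycle_counter_start() once, read the counter immediately before and after the function under test, and pass the two readings to elapsed_cycles(). The result includes the few cycles spent reading the counter itself.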

How can this be explained? Is it possible that enabling L1 cache will solve the problem?

over 15 years ago

0 Mariana over 15 years ago

TI__Mastermind 24340 points

Hi Alexey,

Are you still having this issue?

Can you attach your code so I can take a look?

0 Alexey Tsirlin over 15 years ago in reply to Mariana

Intellectual 555 points

Yes, I still have the same problem with different functions, for example:

void TestFunc(WORD wLength, PWORD restrict pwOutput)
{
    WORD i;
    int a, b;

    for (i = 0; i < wLength; i++)
    {
        /* use intrinsic to convert the number to integer */
        a = _spint(sfIn1[i]);
        b = _spint(sfIn2[i]);

        /* saturate the number if required */
        a = _sshl(a, 32-10);
        b = _sshl(b, 32-10);

        /* fetch the lower 16 bits of the result */
        //*(pwOutput++) = _mpyhlu(b,1);
        //*(pwOutput++) = _mpyhlu(a,1);

        *(pwOutput++) = _sshvr(b, 32-10);
        *(pwOutput++) = _sshvr(a, 32-10);
    }
}

The simulator shows about 1000 cycles (with wLength set to 512) and the real system shows about 1600 cycles. sfIn1 and sfIn2 are global variables; data and code are placed in L2RAM.
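For checking results on a host PC, the saturation step of the loop can be modelled in portable C. This is only a sketch under assumptions: it takes _sshl to be a saturating left shift and _sshvr with a positive count to be an arithmetic right shift, so the pair clamps a value to the signed 10-bit range. The function names are hypothetical, and the float-to-int step (_spint) is left out.

```c
#include <stdint.h>

/* Model of a saturating left shift (what _sshl is assumed to do); n < 32. */
int32_t model_sshl(int32_t x, unsigned n)
{
    /* Multiply in 64 bits instead of shifting a possibly negative value. */
    int64_t wide = (int64_t)x * ((int64_t)1 << n);
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}

/* Arithmetic right shift written as floor division, so it is portable C. */
int32_t model_shr(int32_t x, unsigned n)
{
    int64_t d = (int64_t)1 << n;
    if (x >= 0) return (int32_t)(x / d);
    return (int32_t)(-(((-(int64_t)x) + d - 1) / d));
}

/* Shift up with saturation, then back down: clamps v to [-512, 511]. */
int32_t model_saturate10(int32_t v)
{
    return model_shr(model_sshl(v, 32 - 10), 32 - 10);
}
```

Under these assumptions, values already inside the 10-bit range pass through unchanged, while anything outside it is clamped to the nearest limit.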

0 Nizamudheen A over 15 years ago in reply to Alexey Tsirlin

TI__Intellectual 1380 points

Hi Alexey,

Can you confirm whether Target->clock->view counts the cycles.Total event? It may be counting some other event, perhaps cycle.CPU, in which case the counts would differ.

Regards,
Nizam

0 Alexey Tsirlin over 15 years ago in reply to Nizamudheen A

Intellectual 555 points

Nizam,

With the XDS100 emulator, Target->clock->setup is set to "cycles"; in the simulator, it is set to cycle.CPU (I don't have the cycles.Total option available in this setup).

Thanks, Alexey.

0 Nizamudheen A over 15 years ago in reply to Alexey Tsirlin

TI__Intellectual 1380 points

Hi Alexey,

I recommend using the C6747 Device Cycle Accurate simulator configuration for your experiments. Reason: the C674X CPU cycle accurate simulator configuration does not model the memory system and its related stall cycles.

I am a little worried about your application (the one that reports 1600 cycles on the emulator), because it seems to incur about 600 extra cycles of memory-system overhead. Either you have placed data and/or code in external memory, or it is incurring too many cache penalties. There is a lot of scope for improvement by optimizing the memory/code placement.

I hope the profiling in CCS v4 will help you nail down the bottleneck in your code. LMK if you need more help with this.

Regards, Nizam

0 Alexey Tsirlin over 15 years ago in reply to Nizamudheen A

Intellectual 555 points

I replaced the simulator I was using with the C6747 Device Cycle Accurate one and selected cycle.total in the setup. Now I am getting nearly equal results (about 2100 cycles for both; that is more than the 1600 cycles I got last time, but maybe I made a measurement mistake then). Regarding the memory-system overhead, I can't really understand where it comes from: all my data and code sections are placed in L2RAM and the L1 cache is enabled (although disabling the L1 cache does not change anything).

Can I use profiling in CCS4 to find the source of the problem? Is there a tutorial available on how to use the profiling option of CCS?

Thanks, Alexey.

0 Nizamudheen A over 15 years ago in reply to Alexey Tsirlin

TI__Intellectual 1380 points

Hi Alexey,

Great, so you now get the same number on both platforms.

The video-presentation at http://software-dl.ti.com/dsps/dsps_public_sw/sdo_ccstudio/CCSv4/Demos/ccs4-prof_func.htm should help you get started on the CCS function profiler.

The following are some of the key events that you should profile your application for in the function profiler:

L1D misses, conflict misses, bank conflict stalls, CPU stalls (L1D , L1P)

Regards,
Nizam

0 Alexey Tsirlin over 15 years ago in reply to Nizamudheen A

Intellectual 555 points

Nizam,

Thank you very much for the help.

As far as I can see, the 1000 missing cycles (the difference between hardware and simulator) were spent on L1D CPU stalls. Does this mean that I had 1000 L1 cache misses?

0 Nizamudheen A over 15 years ago in reply to Alexey Tsirlin

TI__Intellectual 1380 points

Hi Alexey,

Great to hear that you found good analysis data from the profiler.

The 1000 L1D stall cycles may be due to one of the following reasons:

1. There could be 1000 L1D bank-conflict stalls (profile for the bank_conflict stall event to rule out this possibility). In the case of bank-conflict stalls, you need to ensure that the application code does not access the same bank in the same cycle. You may have to change your assembly code accordingly.

2. There could be some misses in L1D (not necessarily 1000 misses, though). The misses may contribute to the stall events.

- Before you attempt to optimize the miss penalty, you need to know whether these misses are cold misses or conflict misses. Cold misses cannot be optimized. However, if you notice conflict misses (profile for the event L1D.miss.conflict to confirm this), there is a way to avoid this penalty. As you already know, the L1D cache is a 2-way set associative cache, and you can attempt to place the data arrays in non-conflicting memory regions. After the memory-placement change, rerun the simulation with the cache-conflict event and ensure that your changes are taking effect as expected.
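The two checks above (same bank, same cache set) can be sketched as address arithmetic. This is a simplified model, not the device's documented behaviour: it assumes a 32 KB L1D with 64-byte lines and eight 32-bit banks (only the 2-way associativity comes from this thread), and it ignores the hardware's exceptions for accesses that can share a bank without stalling. All names are hypothetical.

```c
#include <stdint.h>

/* Assumed L1D geometry; check the device documentation before relying on it. */
#define L1D_SIZE_BYTES  (32u * 1024u)  /* assumed: L1D configured as 32 KB cache */
#define L1D_LINE_BYTES  64u            /* assumed cache line size */
#define L1D_WAYS        2u             /* 2-way set associative, per the thread */
#define L1D_BANKS       8u             /* assumed: eight banks ... */
#define L1D_BANK_BYTES  4u             /* ... each 32 bits wide */

#define L1D_SETS (L1D_SIZE_BYTES / (L1D_LINE_BYTES * L1D_WAYS))

/* Cache set that the line containing addr maps to. */
unsigned l1d_set_index(uint32_t addr)
{
    return (unsigned)((addr / L1D_LINE_BYTES) % L1D_SETS);
}

/* Memory bank that the 32-bit word containing addr falls in. */
unsigned l1d_bank_index(uint32_t addr)
{
    return (unsigned)((addr / L1D_BANK_BYTES) % L1D_BANKS);
}

/* Nonzero if both addresses compete for the same cache set. */
int same_set(uint32_t a, uint32_t b)
{
    return l1d_set_index(a) == l1d_set_index(b);
}

/* Nonzero if both addresses fall in the same bank (possible stall when
   accessed in the same cycle, under this simplified model). */
int same_bank(uint32_t a, uint32_t b)
{
    return l1d_bank_index(a) == l1d_bank_index(b);
}
```

For example, if the linker map shows two input arrays starting a multiple of 32 bytes apart, element i of each array falls in the same bank for every i under this model, which is a pattern worth checking against the bank-conflict stall event. Likewise, arrays placed a multiple of 16 KB apart map to the same sets; with two ways, conflict misses would be expected only once three or more such streams compete.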

Hope this helps. LMK the outcome of the experiment.

Regards,

Nizam
