
TMDX570LS20SUSB RTI tick counter question

Other Parts Discussed in Thread: HALCOGEN, TMS570LC4357

I'm trying to characterize a software algorithm's performance on this R4 processor, so I did the following:

void main(void)
{
/* USER CODE BEGIN (3) */
    /* Initialize RTI driver */
    rtiInit();
    /* Start RTI Counter Block 0 */
    rtiStartCounter(rtiCOUNTER_BLOCK0);

    uint32 tick_pre = rtiGetCurrentTick(rtiNOTIFICATION_COUNTER0);
    run_my_code();
    uint32 tick_post = rtiGetCurrentTick(rtiNOTIFICATION_COUNTER0);

    printf("RTI Period is: %u\n", rtiGetPeriod(rtiNOTIFICATION_COMPARE0));
    printf("Pre: %u Post: %u\n", tick_pre, tick_post);

/* USER CODE END */
}

Looking at HALCoGen, the counter clock is set to 10 MHz. In my test, the code executes in 2.61M ticks, so I would surmise that the execution time is 2.61M / 10 MHz ≈ 0.261 seconds.
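
In other words, the conversion I'm doing is just this (using the tick variables from the code above):

    uint32 elapsed_ticks = tick_post - tick_pre;              /* ~2.61M in my test */
    float elapsed_sec = (float)elapsed_ticks / 10000000.0f;   /* ticks / 10 MHz counter clock */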

Is this correct? I did the same test on the supposedly faster TMS570LC4357, and the times are almost identical. I would expect the time on the TMS570LC4357 to be shorter.

Am I missing something?

  • Timothy,

    You should really use the PMU instead of the RTI for benchmarking. The PMU is more accurate because it runs at the CPU clock frequency. It is also accessible as a coprocessor register, which means you don't need to load a pointer to read it (unlike the RTI, which is memory-mapped).
    This especially makes a difference with shorter time measurements. Your interval seems rather long, but I still recommend using the PMU.

    HALCoGen will generate code for you to access the PMU. It's very easy to use, and the cycle counter is dedicated.
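
    Something along these lines should do it (a rough, untested sketch using the function names from the HALCoGen-generated sys_pmu files):

        uint32 cycles;

        _pmuInit_();                             /* reset the PMU counters */
        _pmuEnableCountersGlobal_();             /* global enable for the PMU */
        _pmuResetCycleCounter_();                /* clear the dedicated cycle counter */
        _pmuStartCounters_(pmuCYCLE_COUNTER);    /* start counting CPU cycles */

        run_my_code();

        _pmuStopCounters_(pmuCYCLE_COUNTER);     /* stop the cycle counter */
        cycles = _pmuGetCycleCount_();           /* elapsed CPU cycles */

    Divide cycles by your GCLK frequency to get the time in seconds.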

    Hard to comment on your comparison to the 4357 without knowing more details about what the algorithm is doing.

    The 4357 might be faster for two reasons:

      1) higher clock frequency

      2) cached processor 

    But both of these can be disadvantages if your code doesn't map well to them.

    For example, if you are benchmarking code that polls I/O, it usually isn't going to run faster on one processor than another if you include the I/O portion. But the amount of compute you can get done between I/O operations might be a lot more at the higher frequency.

    Likewise, a cached processor works well if your algorithm has loops that fit in the cache. If it's just one long series of sequential accesses, you may not see any lift from the cache. You will definitely see a lift if you are pulling in larger functions from, say, the DSPLIB, though.

  • The code I'm testing should be processor-bound. It is algorithmic and doesn't do any I/O. It is ported code, so it doesn't use any DSP libraries.

    I was thinking that perhaps the processor clock defaults to a low-power value or something, and the R4/R5 have the same default, so their results would be close. Where do I check the processor clock?

  • Timothy,

    If you use HALCoGen, you can check the PLL / GCM tabs to see how your clock is configured.

    Make sure to verify that the input clock frequency matches. We pretty much always put a 16 MHz crystal on our boards, so if you are using one of ours it should be 16 MHz.

    You can output a clock on the ECLK pin to measure the CPU clock. You need to put in a clock divider, though, because 300 MHz or even 160 MHz is too fast to toggle one of the pins on the device.
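
    If you want a software cross-check as well, something like this (again a rough, untested sketch, reusing the HALCoGen PMU calls from my earlier reply and your RTI driver, with the RTI counter already started) tells you how many CPU cycles elapse per RTI counter tick:

        uint32 start_tick, cycles;

        _pmuInit_();
        _pmuEnableCountersGlobal_();
        _pmuResetCycleCounter_();

        start_tick = rtiGetCurrentTick(rtiNOTIFICATION_COUNTER0);  /* use the prototype from your generated rti.h */
        _pmuStartCounters_(pmuCYCLE_COUNTER);

        while ((rtiGetCurrentTick(rtiNOTIFICATION_COUNTER0) - start_tick) < 10000U)
        {
            /* wait for 10000 RTI counter ticks to elapse */
        }

        _pmuStopCounters_(pmuCYCLE_COUNTER);
        cycles = _pmuGetCycleCount_();

        /* cycles / 10000 = CPU cycles per RTI counter tick,
           e.g. roughly 30 if GCLK is 300 MHz and the counter clock is 10 MHz */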

  • For the R5, here are my settings (from HCG):

    Osc Freq: 16 MHz (as you said)

    PLL1: 300 MHz

    GCLK: 300 MHz (I think this is the CPU clock)

    RTI1CLK: 75 MHz (VCLK is the source)

    Counter 0 Clock: 9.375 MHz

    In HCG, that feeds into the Compare block.

    In my code, I'm reading:

    uint32 tick_pre = rtiGetCurrentTick(rtiREG1,rtiCOMPARE0);

    So I think I should be reading the counter counting at 9.375 MHz. Correct?
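
    (The numbers are at least self-consistent: 75 MHz RTI1CLK / 8 = 9.375 MHz, so the counter prescaler appears to be set to divide by 8.)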

  • For the R4, here are my settings:

    Osc: 16 MHz

    PLL: 100 MHz

    GCLK: 100 MHz

    RTI1CLK: 100 MHz (VCLK)

    Counter 0 clock: 10 MHz

  • Based on the above, I would expect the R5 tick count for the same algorithm to be around a third of the R4's, but they are about the same. I'm a little baffled.
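
    (Rough math: if the same algorithm ran about 3x faster at 300 MHz, I'd expect ticks_R5 ≈ ticks_R4 × (100 MHz / 300 MHz) × (9.375 MHz / 10 MHz) ≈ 0.31 × ticks_R4, i.e. roughly 0.8M ticks instead of 2.61M.)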

  • Timothy,

    Please see http://e2e.ti.com/support/microcontrollers/hercules/f/312/t/377178.aspx

    I've seen this a few times recently so it's a likely candidate.

    -Anthony

  • I took from that thread that the problem was caused by an error in the chipset selection for the processor. I'm pretty certain I have the correct one selected.

    I was looking at the MPU settings for the TMS570LC4357ZWT in HalCoGen.

    It shows the RAM region (0x0800_0000) set as NORMAL_OIWTNOWA_NONSHARED. Is that correct for RAM? It also has three entries for 0x0800_0000, which I don't understand. Why are there three?

  • Timothy,

    My understanding is that one of the attributes was wrong, because the memory region also needs to be marked as cacheable in order for the code to get into the cache. Otherwise it will be fetched from the second-level flash or RAM on every access.

    What I don't spot, though, is the error, so I'm checking with some of the folks on the team who found and fixed the issue for another customer to find out what it was.

    The reason for multiple RAM regions is to break the RAM up into areas with different attributes or access rights. The MPU gives higher-numbered regions priority, so by changing base addresses, sizes, or the sub-region disables you can create different areas of RAM with different properties.

    These properties are not really tied to the hardware but to your application requirements. For example, do you need some RAM that is accessible in privileged mode only? Are you planning to use the LDREX / STREX instructions for mutexes, in which case you might need some memory marked shareable? And so on.
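
    For example (the region numbers, sizes, and attributes here are made up purely for illustration, in the style of the HALCoGen MPU table):

        Region 2: base 0x08000000, size 512 kB, NORMAL_OIWBWA_NONSHARED, priv RW / user RW, no execute  (general-purpose RAM)
        Region 3: base 0x08060000, size 128 kB, NORMAL_OIWBWA_NONSHARED, priv RW / user RO, no execute  (data user code may read but not modify)

    Where region 3 overlaps region 2, the higher-numbered region's attributes win.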


  • Anthony:

    Thanks. I understand the idea of breaking the RAM up into regions with different attributes. What I was asking about was that the default MPU configuration in HALCoGen had three regions defined covering the same range of RAM, and I was trying to figure out whether that was significant. I disabled two of them, and the program executed normally.