
TMDX570LS20SUSB RTI tick counter question

Other Parts Discussed in Thread: HALCOGEN, TMS570LC4357

I'm trying to characterize a software algorithm's performance on this R4 processor, so I did the following:

void main(void)
{
/* USER CODE BEGIN (3) */
    /* Initialize RTI driver */
    rtiInit();
    /* Start RTI Counter Block 0 */
    rtiStartCounter(rtiCOUNTER_BLOCK0);

    uint32 tick_pre = rtiGetCurrentTick(rtiNOTIFICATION_COUNTER0);
    run_my_code();
    uint32 tick_post = rtiGetCurrentTick(rtiNOTIFICATION_COUNTER0);

    printf("RTI Period is: %u\n", rtiGetPeriod(rtiNOTIFICATION_COMPARE0));
    printf("Pre: %u Post: %u\n", tick_pre, tick_post);

/* USER CODE END */
}

Looking at HALCoGen, the counter clock is set to 10 MHz. In my test, the code executes in 2.61M ticks, so I would surmise that the execution time is 2.61M / 10 MHz ≈ 0.261 seconds.
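
In other words, the conversion I'm doing is just this (using the tick variables from the code above):

    uint32 elapsed_ticks = tick_post - tick_pre;              /* ~2.61M in my test */
    float elapsed_sec = (float)elapsed_ticks / 10000000.0f;   /* ticks / 10 MHz counter clock */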

Is this correct? I did the same test on the supposedly faster TMS570LC4357, and the times are almost identical. I would expect the time on the TMS570LC4357 to be shorter.

Am I missing something?

  • Timothy,

    You should really use the PMU instead of the RTI for benchmarking. The PMU is more accurate because it runs at the CPU clock frequency. It is also accessible as a coprocessor register, which means you don't need to load a pointer to read it (unlike the RTI, which is memory-mapped).
    This especially makes a difference with shorter time measurements. Your interval seems rather long, but I still recommend using the PMU.

    HALCoGen will generate code for you to access the PMU. It's very easy to use, and the cycle counter is dedicated.
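
    Something along these lines should do it (a rough, untested sketch using the function names from the HALCoGen-generated sys_pmu files):

        uint32 cycles;

        _pmuInit_();                             /* reset the PMU counters */
        _pmuEnableCountersGlobal_();             /* global enable for the PMU */
        _pmuResetCycleCounter_();                /* clear the dedicated cycle counter */
        _pmuStartCounters_(pmuCYCLE_COUNTER);    /* start counting CPU cycles */

        run_my_code();

        _pmuStopCounters_(pmuCYCLE_COUNTER);     /* stop the cycle counter */
        cycles = _pmuGetCycleCount_();           /* elapsed CPU cycles */

    Divide cycles by your GCLK frequency to get the time in seconds.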

    Hard to comment on your comparison to the 4357 without knowing more details about what the algorithm is doing.

    The 4357 might be faster for two reasons:

      1) higher clock frequency

      2) cached processor 

    But both of these can be disadvantages if your code doesn't map well to them.

    For example, if you are benchmarking code that polls I/O, it usually isn't going to run faster on one processor than another if you include the I/O portion. But the amount of compute you can get done between I/O operations might be a lot more at the higher frequency.

    Likewise, a cached processor works well if your algorithm has loops that fit in the cache. If it's just one long series of sequential accesses, you may not see any lift from the cache. You will definitely see a lift if you are pulling in larger functions from, say, the DSPLIB, though.

  • The code I'm testing should be processor-bound. It is algorithmic and doesn't do any I/O. It is ported code, so it doesn't use any DSP libraries.

    I was thinking that perhaps the processor clock defaults to a low-power value or something, and the R4/R5 have the same default, so their results would be close. Where do I check the processor clock?

  • Timothy,

    If you use HALCoGen, you can check the PLL / GCM tabs to see how your clock is configured.

    Make sure to verify that the input clock frequency matches. We pretty much always put a 16 MHz crystal on our boards, so if you are using one of ours it should be 16 MHz.

    You can output a clock on the ECLK pin to measure the CPU clock. You need to put in a clock divider, though, because 300 MHz or even 160 MHz is too fast to toggle one of the pins on the device.
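
    If you want a software cross-check as well, something like this (again a rough, untested sketch, reusing the HALCoGen PMU calls from my earlier reply and your RTI driver, with the RTI counter already started) tells you how many CPU cycles elapse per RTI counter tick:

        uint32 start_tick, cycles;

        _pmuInit_();
        _pmuEnableCountersGlobal_();
        _pmuResetCycleCounter_();

        start_tick = rtiGetCurrentTick(rtiNOTIFICATION_COUNTER0);  /* use the prototype from your generated rti.h */
        _pmuStartCounters_(pmuCYCLE_COUNTER);

        while ((rtiGetCurrentTick(rtiNOTIFICATION_COUNTER0) - start_tick) < 10000U)
        {
            /* wait for 10000 RTI counter ticks to elapse */
        }

        _pmuStopCounters_(pmuCYCLE_COUNTER);
        cycles = _pmuGetCycleCount_();

        /* cycles / 10000 = CPU cycles per RTI counter tick,
           e.g. roughly 30 if GCLK is 300 MHz and the counter clock is 10 MHz */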

  • For the R5, here are my settings (from HCG):

    Osc Freq: 16 MHz (as you said)

    PLL1: 300 MHz

    GCLK: 300 MHz (I think this is the CPU clock)

    RTI1CLK: 75 MHz (VCLK is the source)

    Counter 0 Clock: 9.375 MHz

    In HCG, that feeds into the Compare block.

    In my code, I'm reading:

    uint32 tick_pre = rtiGetCurrentTick(rtiREG1,rtiCOMPARE0);

    So I think I should be reading the counter counting at 9.375 MHz. Correct?
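
    (The numbers are at least self-consistent: 75 MHz RTI1CLK / 8 = 9.375 MHz, so the counter prescaler appears to be set to divide by 8.)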

  • For the R4, here are my settings:

    Osc: 16 MHz

    PLL: 100 MHz

    GCLK: 100 MHz

    RTI1CLK: 100 MHz (VCLK)

    Counter 0 clock: 10 MHz

  • Based on the above, I would expect the R5 tick count for the same algorithm to be around a third of the R4's, but they are about the same. I'm a little baffled.
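
    (Rough math: if the same algorithm ran about 3x faster at 300 MHz, I'd expect ticks_R5 ≈ ticks_R4 × (100 MHz / 300 MHz) × (9.375 MHz / 10 MHz) ≈ 0.31 × ticks_R4, i.e. roughly 0.8M ticks instead of 2.61M.)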

  • Timothy,

    Please see http://e2e.ti.com/support/microcontrollers/hercules/f/312/t/377178.aspx

    I've seen this a few times recently so it's a likely candidate.

    -Anthony

  • I took from that thread that the problem was caused by an error in the chipset selection for the processor. I'm pretty certain I have the correct one selected.

    I was looking at the MPU settings for the TMS570LC4357ZWT in HalCoGen.

    It shows the RAM region (0x0800_0000) set as NORMAL_OIWTNOWA_NONSHARED. Is that correct for RAM? It also has three entries for 0x0800_0000, which I don't understand. Why are there three?

  • Timothy,

    My understanding is that one of the attributes was wrong, because the memory region also needs to be marked as cacheable in order for the code to get into the cache. Otherwise it will be fetched from the second-level flash or RAM on every access.

    What I don't spot, though, is the error, so I'm checking with some of the folks on the team who found and fixed the issue for another customer to find out what it was.

    The reason for multiple RAM regions is to break the RAM up into areas with different attributes or access rights. The MPU gives higher-numbered regions priority, so by changing base addresses, sizes, or the sub-region disables you can create different areas of RAM with different properties.

    These properties are not really tied to the hardware but to your application requirements. For example, do you need some RAM that is accessible in privileged mode only? Are you planning to use the LDREX / STREX instructions for mutexes, in which case you might need some memory marked shareable? And so on.
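
    For example (the region numbers, sizes, and attributes here are made up purely for illustration, in the style of the HALCoGen MPU table):

        Region 2: base 0x08000000, size 512 kB, NORMAL_OIWBWA_NONSHARED, priv RW / user RW, no execute  (general-purpose RAM)
        Region 3: base 0x08060000, size 128 kB, NORMAL_OIWBWA_NONSHARED, priv RW / user RO, no execute  (data user code may read but not modify)

    Where region 3 overlaps region 2, the higher-numbered region's attributes win.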


  • Anthony:

    Thanks. I understand the idea of breaking the RAM up into regions with different attributes. What I was asking about was that the default MPU configuration in HALCoGen had three regions defined covering the same range of RAM, and I was trying to figure out whether that was significant. I disabled two of them, and the program executed normally.