Hello to All
I have some doubts about the internal clock speed of the DSP I am programming.
The original program running on one of my company's boards uses a TMS320C6413-500, and it was not written by me.
This DSP configures the PLL through input pins, and on this board we are using a 50 MHz clock with a x10 multiplier.
I checked the internal PLL registers of the DSP and everything seems OK; at least the x10 value is correctly assigned in the DSP register.
I am making small modifications to this code, but since I am waiting for the emulator, I am testing the performance of the code using GPIO pins and a logic analyzer.
I noticed that a small GPIO toggle from 0 to 1 and back to 0 takes about 60 ns.
This code is the first thing that I execute in the program, but I also tested it in the middle of the code.
I defined new macros, shown below, to speed up the toggle. I intend to use the original ones from the CSL library, but those take more than 200 ns.
Below is the code that I am using to access GPIO 0 and 3:
hGpio = GPIO_open(GPIO_DEV0,GPIO_OPEN_RESET);
GPIO_configArgs(hGpio,
GPGC_GP0_3, // gpgc Global control register value
GPEN_GP0_3, // gpen GPIO Enable register value
GPDIR_GP0_3, // gpdir GPIO Direction register value
GPVAL_GP0_3, // gpval GPIO Value register value
GPHM_GP0_3, // gphm GPIO High Mask register value
GPLM_GP0_3, // gplm GPIO Low Mask register value
GPPOL_GP0_3 // gppol GPIO Interrupt Polarity register value
);
GPIO_pinEnable(hGpio,GPIO_PIN0 | GPIO_PIN3); // old DMA0/1
// from here
GPIO_pinSet(GPIO_PIN0);
GPIO_pinSet(GPIO_PIN3);
error = 1; // dummy code to keep the compiler from optimizing away the 0 -> 1 -> 0 toggle
GPIO_pinReset(GPIO_PIN3);
GPIO_pinReset(GPIO_PIN0);
// to here = 60ns
The macros above are defined as
#define GPIO_pinSet(Pin) *((UINT32 *) 0x01B00008) |= Pin
#define GPIO_pinReset(Pin) *((UINT32 *) 0x01B00008) &= ~Pin
#define
If the clock were running at 50 MHz with an instruction cycle of 20 ns, 60 ns for these instructions could be fair, but at 2 ns per instruction it is difficult for me to believe that it takes 30 instructions to switch from 0->1->0.
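The cycle budget behind that estimate can be worked out explicitly. The following is only a sketch of the arithmetic, using the 50 MHz input clock and x10 multiplier quoted in the post; the helper names `cpu_period_ps` and `cycles_in_interval` are hypothetical and not part of CSL.

```c
#include <stdint.h>

/* Values taken from the post: 50 MHz input clock, x10 PLL multiplier. */
#define INPUT_CLOCK_HZ  50000000u
#define PLL_MULTIPLIER  10u

/* Hypothetical helper: CPU clock period in picoseconds. */
static uint32_t cpu_period_ps(uint32_t cpu_hz)
{
    return (uint32_t)(1000000000000ull / cpu_hz);
}

/* Hypothetical helper: whole CPU cycles that fit in interval_ns. */
static uint32_t cycles_in_interval(uint32_t interval_ns, uint32_t cpu_hz)
{
    return (uint32_t)(((uint64_t)interval_ns * cpu_hz) / 1000000000ull);
}
```

With the PLL at x10, `cycles_in_interval(60u, INPUT_CLOCK_HZ * PLL_MULTIPLIER)` gives 30 cycles, while at the bare 50 MHz input clock the same 60 ns would be only 3 cycles.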
best regards
Nuno
This test is a poor indicator of DSP clock speed for 2 reasons:
Here's a snippet from the data sheet which mentions these things:
If you want to know the clock speed, why not simply look at CLKOUT4 or CLKOUT6 on a scope? That's the best way to know the speed...
Hello Brad
Thanks a lot for your response. Based on it, I assume that the switching time is not constant and may differ depending on the internal state, so I cannot use it for timing control.
Basically I would like to have a good mechanism to control the speed of some blocks of the code, and since these GPIO pins were available ...
The application of the board is time critical, and it must perform all the calculations in less than 2 µs for each firmware cycle.
I tried to profile the code, but since the compiler's -O2 option is used, I was not able to trace the number of cycles of each block in the code.
I am using CCS 4.
Can you suggest a way to make more precise calculations regarding time inside the fw? Perhaps using other pins?
Regarding your idea, I do not have these 2 pins available on my Mictor connectors, so it is difficult for me to measure them.
I looked at the schematics, and there seems to be a test point for CLKOUT4. I will check this on the board.
Nuno Pereira wrote: "Basically I would like to have a good mechanism to control the speed of some blocks of the code, and since these GPIO pins were available ..."
How about using a timer?
Nuno Pereira wrote: "Can you suggest a way to make more precise calculations regarding time inside the fw? Perhaps using other pins?"
I'm attaching some code I wrote a really long time ago which uses a timer to determine how many cycles a function takes. It was written for c6416 but should be pretty close to what you're doing. You can get rid of the cache invalidation code if you like. In other words, invalidating the cache will hurt your performance. I was trying to get a "cold cache" benchmark but of course performance will improve if something is cached.
Hi Brad
I checked your source code and realized that most of the work is already available in my own project.
I also have a timer running, so I decided to create a macro to measure the difference between one read of the timer count and another read a few instructions later.
Meanwhile, I managed to program the CLKOUT4 output and checked the hardware at a test point to confirm that the DSP is running at 500 MHz.
Now I have two issues to understand.
1 - If I take a reading like "Init = TIMER_RGET(CNT2)", then put a dummy instruction in the middle and calculate the difference with a new timer read like "TIMER_RGET(CNT2) - Init", the difference is a number like 16. Since the timer clock is the CPU clock divided by 8, this means that the DSP takes about 16 x 8 x 2 ns = 256 ns to execute the code between the readings. I think this value is huge. Any ideas?
I do not "printf" the value, since "printf" would delay my code a lot, so I basically write the value to memory and read it later.
2 - My code does not have any interrupts programmed, and each time I measure a complete cycle of the main loop I get values like 0x5A for the number of timer counts.
If I change the macro to keep the maximum, using _max2, between the current value in memory and the new timer count, I get values like 0x203.
I noticed that this increase in counts only happens after an EDMA transfer. If I take the initial counter value a few instructions after the EDMA transfer, the values are again near 0x5A, but if I take it just after the instruction that checks the CIPRL bit, the values go up to 0x203.
This does not happen in every cycle: even in this condition, if I remove the _max2 intrinsic, the values are again near 0x5A, meaning that it only happens a few times, because I cannot catch any 0x203. Any ideas how to capture this abnormal increase in counts?
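One possible way to catch the occasional large reading without printf is to keep a few running statistics in memory instead of only the maximum. This is a minimal sketch, not the code from the project: the type `tick_stats_t`, the functions, and the threshold are all hypothetical, and the deltas are assumed to come from two timer reads as described above.

```c
#include <stdint.h>

/* Hypothetical running statistics for timer deltas, kept in memory and
 * inspected later with the debugger instead of being printed. */
typedef struct {
    uint32_t min_ticks;         /* smallest delta seen so far */
    uint32_t max_ticks;         /* largest delta seen so far */
    uint32_t samples;           /* number of deltas recorded */
    uint32_t above_threshold;   /* how many deltas exceeded the threshold */
    uint32_t first_outlier_idx; /* sample index of the first outlier */
} tick_stats_t;

static void tick_stats_init(tick_stats_t *s)
{
    s->min_ticks = UINT32_MAX;
    s->max_ticks = 0u;
    s->samples = 0u;
    s->above_threshold = 0u;
    s->first_outlier_idx = UINT32_MAX; /* UINT32_MAX means "none yet" */
}

static void tick_stats_add(tick_stats_t *s, uint32_t ticks, uint32_t threshold)
{
    if (ticks < s->min_ticks) s->min_ticks = ticks;
    if (ticks > s->max_ticks) s->max_ticks = ticks;
    if (ticks > threshold) {
        if (s->above_threshold == 0u) s->first_outlier_idx = s->samples;
        s->above_threshold++;
    }
    s->samples++;
}
```

With a threshold somewhere between the two observed values (for example 0x100), `above_threshold` shows how often the long case occurs and `first_outlier_idx` shows which loop iteration hit it first.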
Nuno Pereira
Accesses to "configuration space", i.e. any peripheral registers, will be very slow in general. That's why you see such a large count with only a single instruction between timer reads. In other words, the timer reads themselves are taking a number of instructions. There are some limits to the precision with this method. You can improve the accuracy a bit by figuring out the overhead of the benchmark itself by doing a couple timer reads with zero instructions in between. You can subtract that number from all subsequent readings (or just keep it in mind at least).
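The overhead subtraction described above could look roughly like this. It is a sketch under assumptions: the timer is read through a function pointer so the logic can be tried on a host, `fake_timer_read` is a made-up stand-in for the real register read (for example the `TIMER_RGET(CNT2)` used earlier in the thread), and all function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical abstraction over the timer count read. */
typedef uint32_t (*timer_read_fn)(void);

/* Two back-to-back reads with nothing in between; the smallest delta over
 * several trials is taken as the cost of the measurement itself. */
static uint32_t measure_read_overhead(timer_read_fn read_timer, unsigned trials)
{
    uint32_t best = UINT32_MAX;
    unsigned i;

    if (trials == 0u) return 0u;
    for (i = 0u; i < trials; ++i) {
        uint32_t start = read_timer();
        uint32_t end = read_timer();
        uint32_t delta = end - start;
        if (delta < best) best = delta;
    }
    return best;
}

/* Remove the measurement overhead from a raw reading, clamping at zero. */
static uint32_t net_ticks(uint32_t raw_ticks, uint32_t overhead_ticks)
{
    return (raw_ticks > overhead_ticks) ? (raw_ticks - overhead_ticks) : 0u;
}

/* Host-side stand-in for the hardware timer: advances 2 ticks per read.
 * The value 2 is arbitrary and only serves to exercise the code. */
static uint32_t fake_timer_read(void)
{
    static uint32_t count = 0u;
    count += 2u;
    return count;
}
```

On the target, `measure_read_overhead` would be called once at startup with the real read routine, and `net_ticks` applied to every later reading.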
I would guess that accesses to the configuration bus are also responsible for the increase in counts related to EDMA activity. Interrupt overhead is another thing I would generally consider though you said you don't have any interrupts enabled.
Hi guys
Sorry to hijack this thread, but when using Brad's timer code, I take it that if you divide the result of the timer code by the clock speed of the processor, that will give you the execution time of your code in seconds? Sound right?
Cheers
David
Hello David
The problem I was facing appears because I am trying to measure DSP performance over a very short run of instructions, meaning values around a couple of hundred nanoseconds, not seconds.
The problem is that the instruction accessing the TIMER registers itself takes a lot of time, because I believe these are slow registers for the DSP internally.
To avoid all of this, I am currently using another method to measure DSP performance, which works in my case.
I started to use QDMA. With QDMA you can issue a dummy write to a destination address that will generate a CS in an unused memory block.
With an oscilloscope or logic analyzer I was able to measure the time between two CS pulses and get a highly accurate value.
You can initially set up all the values for one triggered QDMA transfer and then, every time you want to measure something, just trigger a new dummy QDMA at the beginning of the code and another at the end.
It works quite nice for me :)
Hi Nuno
Thanks for getting back to me. I'm not using the QDMA of the DSP, I was just looking for a quick way to time my code and I came across this thread. Do you know if I divide the cycles by the CPU clock I will get the time in seconds for the code to execute?
Thanks for your help
Hi David
You have to take into consideration the type of DSP you are using.
In my case, since I use a 500 MHz clock, each instruction cycle takes 2 ns, but the timer clock is the CPU clock divided by 8 (you can check in the datasheet what your case is), meaning that the value you get by reading the CNT register must be multiplied by 2 ns x 8 = 16 ns.
I'm using the C6416T, which runs at 1 GHz with the tick divided by 8. So does that mean I multiply the result by 8 (to get cycles) and then divide by 1x10^9 to get seconds? So for example, the cycle count I got from the timer for my code execution was 5926880 (after being multiplied by 8). So I multiply this by 1x10^-9 to get the time in seconds? Which was 0.0474 seconds. Do you agree?
I agree. Timer cycles x8 is CPU cycles (for 6416). CPU cycles times cpu period gives seconds
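The conversion can be written out as a small sketch; the function names are hypothetical, and the divider of 8 and the clock rates are the ones quoted in the thread. Working the numbers from the exchange above: 0.0474 s corresponds to 5926880 timer ticks x 8 = 47415040 CPU cycles at 1 GHz, whereas 5926880 CPU cycles at 1 GHz would be about 5.9 ms, so it matters whether the figure is a raw timer count or already multiplied by 8.

```c
#include <stdint.h>

/* Timer ticks to CPU cycles; the timer is assumed to be clocked at
 * CPU clock / divider (8 for the devices discussed in this thread). */
static uint64_t ticks_to_cpu_cycles(uint32_t ticks, uint32_t divider)
{
    return (uint64_t)ticks * divider;
}

/* CPU cycles to seconds: cycles times the CPU period, i.e. cycles / f_cpu. */
static double cpu_cycles_to_seconds(uint64_t cycles, double cpu_hz)
{
    return (double)cycles / cpu_hz;
}
```

For the 500 MHz case from earlier in the thread, 16 ticks become 128 CPU cycles, and `cpu_cycles_to_seconds(128, 500e6)` is 256 ns.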
Hi David & Brad
I also agree, but watch out for timer overlaps. You must set the PRD register to a period big enough to avoid overlaps; otherwise the value you read may be only a fraction of your time.
Hi Nuno and Brad
Thanks for getting back to me. I'm a bit of a noob when it comes to timers, so I was just using the setup contained in the code Brad posted. Do you know if that setup is OK and will not result in any overlap?
Thanks again
My code maxes out the period. You would have to run for more than one minute to overflow. I think it would be quite obvious.
If you have many functions that you're benchmarking you might want to tweak the way the code works. Currently the code zeros out the timer and starts it running from 0 at the start of the benchmark. If you wanted to instead benchmark many functions you might instead want to simply leave it running all the time and simply do a difference of the time stamp at the start and end of the function you're benchmarking.
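The free-running approach can be sketched as below. It assumes a 32-bit up-counter whose period is set to the maximum (0xFFFFFFFF), so that it wraps from 0xFFFFFFFF back to 0; the function names are hypothetical. Under that assumption, unsigned subtraction gives the right difference even across one wrap.

```c
#include <stdint.h>

/* Elapsed ticks between two time stamps of a free-running 32-bit counter.
 * Unsigned arithmetic is modulo 2^32, so a single wrap between the two
 * reads is handled automatically. More than one wrap cannot be detected. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Whole seconds until the counter wraps, assuming it counts all 2^32
 * values and is clocked at cpu_hz / divider. */
static uint32_t seconds_until_wrap(uint32_t cpu_hz, uint32_t divider)
{
    uint64_t cycles_per_wrap = ((uint64_t)1 << 32) * divider;
    return (uint32_t)(cycles_per_wrap / cpu_hz);
}
```

Under these assumptions the wrap interval is roughly 68 s at 500 MHz with a divide-by-8 timer clock, and roughly 34 s at 1 GHz, so any single measurement has to be shorter than that.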
I only have the one function I'm benchmarking and I call the start and stop before and after respectively. By maxing out the period do you mean it gives the shortest period and therefore the greatest accuracy?