
Calculation of worst-case execution time on C6678

Hi Team,

We are working on worst-case execution time (WCET) calculation for the C6678.

Configuration used:

CPU frequency: 1000 MHz
Cache configuration:
L1P: 32 KB
L1D: 32 KB
L2 cache: 0 KB (all of L2 used as SRAM)
Memory configuration:
Code: L2 SRAM
Data: L2 SRAM
Stack: L2 SRAM

The code used is attached for reference; we are running it with CCS 5.5.
We did not change any other settings such as bandwidth management, so everything is at its default configuration (including the compiler settings in CCS 5.5).

We grouped all 8 cores into a single group and used the internal core timers (TSCL) to measure the time.
We put a breakpoint at the "TSCL = 0" statement so that all cores halt there before the internal timers are started.

We then resumed the whole group and recorded the measurements.

The excel sheet is attached for reference.

Note: we ensured that all cores try to access the same bank in the MSMC SRAM region.

Now, please answer the following questions:

1. As per the documentation, we can see that each core has a separate port into the MSMC SRAM region. So at what point does serialization happen?

2. Cells E7, K7, Q7, W7, AC7, AI7, AO7 and AU7 hold the difference between each core's measured time (the time taken for the MSMC SRAM access) and the minimum MSMC SRAM access time among all cores.

If we take the first set of data (row 7), core 0 completes the operation in 64 clock cycles whereas core 1 takes 91 cycles.

Looking at all the values, there appears to be a difference of only 3 to 4 clock cycles between each core and the core that completed just before it.

(For example, between core 0 and core 7 the gap is only 3 clock cycles.)

Why is the gap only 3 clock cycles? Can you explain in a bit more detail why it takes this long?

3. As we are running the core group in CCS 5.5 with the "Run" button, does this ensure that all cores are started in parallel, or is this operation serialized?

  • sample_source.txt
    #include "stdio.h"
    #include <ti/sysbios/family/C66/Cache.h>
    #include <c6x.h>
    
    int ch1,ch2;
    #define core_num 0
    #define MPAXL2 0x08000010
    #define MPAXH2 0x08000014
    
    void main()
    {
        volatile int *ptr;
        volatile unsigned int * mpaxl2 = (volatile unsigned int *)MPAXL2;
        volatile unsigned int * mpaxh2 = (volatile unsigned int *)MPAXH2;
        unsigned int st_bef[50]={0};
        unsigned int st_aft[50]={0};
        unsigned int value;
        int i = 0, j = 0;
        Cache_Size *size;
        Cache_getSize(size);
    
        ptr = (volatile unsigned int *)((0x0c000002) + (0x1000 *core_num));
    
        *(mpaxh2) = (0x0c00000b + (0x1000 * core_num));
        *(mpaxl2) = (0x0c0000bf + (0x1000 * core_num));
        *(mpaxh2);
        *(mpaxl2);
        *(ptr) = 100;
    
    	TSCL = 0;
      	Cache_wbInvAll();
        st_bef[i] = TSCL;
       	*(ptr);
    	st_aft[i] = TSCL;
        printf("ch value is 0x%x 0x%x 0x%x\n",st_bef[i],st_aft[i],*ptr);
    }
    

  • C6678_WCET1.xls

  • For questions 1 & 2, I will ask someone to comment here.

    3. As we are running the core group in CCS 5.5 with the "Run" button, does this ensure that all cores are started in parallel, or is this operation serialized?

    Please refer to the thread below about core grouping and execution.

    https://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/468777

    Thank you.

  • Hi Raja,

    To confirm: I can see that from CCSv5 onwards there is no option to "enable synchronous mode", i.e.,

    "CCSv5 does not have a button to enable synchronous operations on all CPUs in debug view. User can create synchronous groups by selecting more than one (or all) CPUs, right clicking and selecting "group". This action will create a grouping of CPUs. Selecting the group node and executing load program, run, halt, step will issue the command to all CPUs that belong to that group."

    Also, for creating fixed groups, I could not see the "Set Debug Scope" feature when right-clicking on the cores. So can we assume that grouped cores form a fixed group in CCS 5.5?

    And I believe the C6678 supports synchronous execution. Please confirm.
  • My apologies, there was an issue with the wiki page; it was redirecting to some old content. We have updated it, and it should have the correct information for v5 now. Thank you.
  • Hi Raja,

    Thanks for the information. Can you please answer questions 1 and 2 as well?

    Also, please confirm whether the C6678 supports synchronous execution.

    Just to double-check: in CCS 5.5, if we group the cores, and provided the controller supports synchronous execution, can we say that all commands (e.g., run) are sent to all cores at the same time, i.e., parallel execution with all cores starting at the same point?

  • I will try to answer your posting, but I need more answers from you first.

    1. I looked at the sample code and I am confused about what you are trying to do with the MPAX registers. If you want each core to access a different bank of memory (which I think is what you are trying to do), it is much easier to assign different addresses to the pointers. Look again at www.ti.com/.../sprugw7a.pdf figure 2-3 and note the base address of each of the memory banks. In the C6678 there are only 4 banks, so you would either assign two cores to each bank or use only 4 cores (see the sketch after point 3).

    2. The instruction *(ptr); is very confusing to me and to a compiler. If ptr were a function pointer, execution would jump there; but it is not, so the code accesses the location without storing the result anywhere. A compiler might delete this instruction as dead code. Did you disable all optimization? Why not do an assignment like yy = *ptr, with yy defined as volatile as well?

    3. The C6678 supports synchronization between cores in hardware and in software (if this is what you are asking). Of course, synchronization from CCS does not matter (you do not run the code on a real system with CCS, right?), but the software supports synchronization of all cores using IPC (see the IPC documentation in the release). In addition, OpenMP (supported by the compiler) has built-in synchronization. The hardware also has multiple mechanisms for building synchronization between cores.
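
    To illustrate point 1, here is a minimal sketch of the direct per-core pointer assignment, without going through MPAX. The bank layout here is an assumption: the stride and interleave factor are placeholders that you must check against figure 2-3 of SPRUGW7A before using this.

    #include <c6x.h>                 /* DNUM: the core-number register */

    #define MSMC_BASE    0x0C000000u
    #define NUM_BANKS    4u          /* the C6678 MSMC SRAM has 4 banks */
    #define BANK_STRIDE  0x20u       /* placeholder: assumes the banks interleave
                                        every 32 bytes; verify against SPRUGW7A
                                        figure 2-3                               */

    /* Every core in the same bank (bank 0), each at a different line. */
    volatile unsigned int *same_bank_ptr(void)
    {
        return (volatile unsigned int *)
               (MSMC_BASE + DNUM * NUM_BANKS * BANK_STRIDE);
    }

    /* Cores spread across banks: cores 0..3 in banks 0..3, cores 4..7 wrap. */
    volatile unsigned int *diff_bank_ptr(void)
    {
        return (volatile unsigned int *)
               (MSMC_BASE + (DNUM % NUM_BANKS) * BANK_STRIDE);
    }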

    Please explain your test code, or modify and explain it, and we will try to answer your posting.

    Ran
  • Hi Ran,

    Please find our answers below:

    1) Initially the code was designed so that different cores access a shared region in bank 0, but I think you can ignore the MPAX configuration for now, since we access the shared region once, at different locations.

    2) I declared ptr as volatile, so I don't think the compiler will optimize it away. I also checked the disassembly and can see that the access is not removed.
        The reason I did not use an assignment is that storing the value costs additional clock cycles, which I want to avoid.
        Did you try the sample code and test it?

    3) So, if I group all the cores and issue the run command, will all cores start at the same point in time?

  • 1. So what are you trying to measure: all cores reading from the same bank of memory, or from different banks? When you say different locations, it is not clear whether you want the accesses to hit the same bank or not.

    2. You understand that each access does not read a single value but a complete cache line. You can run the loop 8 times: the first iteration will be a cache miss and the following 7 will be cache hits. Then, by the way, if you write yy = *p++, the assignment is done in parallel with the read (except for the first read); a sketch of this loop follows after point 3.

    3. I do not think there is a mechanism to synchronize multiple cores to the exact clock. Think about how you would implement a feature like this: you would need a hardware line connected to 8 different interrupts, and even then I am not sure you could guarantee that every core responds to the interrupt on exactly the same clock. For run time, synchronization means that all cores wait at a certain point until every core has arrived, and then all continue. But I think you are not interested in run-time synchronization: you want to use the CCS run command to measure the cycle differences between cores. If you use CCS, however, the limiting factor is how fast the PC can issue the RUN command across multiple cores, and I am sure that the access time from the PC over the USB line is far more than one DSP cycle.
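
    Here is a minimal sketch of the loop measurement from point 2. The loop count of 8 matches the description above; whether all 8 reads really fall within one cache line depends on the line size, so treat that as an assumption to check against your cache configuration.

    #include <stdio.h>
    #include <c6x.h>                 /* TSCL: the free-running core timer */

    #define N 8

    void measure_line(volatile unsigned int *p)
    {
        unsigned int t0[N], t1[N];
        volatile unsigned int yy;    /* volatile so the reads are not dropped */
        int i;

        for (i = 0; i < N; i++) {
            t0[i] = TSCL;
            yy = *p++;               /* after the first (cache-miss) iteration,
                                        the assignment overlaps with the read   */
            t1[i] = TSCL;
        }
        for (i = 0; i < N; i++)      /* expect one long (miss) time, then hits */
            printf("read %d: %u cycles\n", i, t1[i] - t0[i]);
    }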

    Regards, Ran

  • Hi Ran,

    1. We will measure from the same memory bank (bank 0), with the cores accessing different locations,
    e.g., core 0 accesses 0x0c000002, core 1 accesses 0x0c001002, and so on.
    2. I understand that we could add an assignment instruction as well, but since we have already experimented without it, I think it should be OK.
    3. We are not using a common clock here. All we did is use each core's own timer to count how many clock cycles an MSMC SRAM access takes when all cores try to access the region simultaneously. The sample code provided is loaded on each core, with the core number varying from 0 to 7, and a breakpoint is placed at the TSCL = 0 statement so that all cores reach that point. We then issued the run command (to the grouped cores) and measured the time. I hope this triggers execution of the run command on all cores at the same instant.


    Note: our goal is to calculate how much time a core takes when all cores try to access the shared region.
    Here, we directed all accesses to bank 0.

    Please let me know if you need further information.
    Have you tried the code snippet, and did you see time differences similar to those in the attached excel sheet?
  • Hi Again,

    We request you to resolve this issue by answering the questions, as we need this information soon.
    Thanks for understanding.
  • Hi,

    Do you have any updates on this?
  • 1. The priority scheme of the MSMC memory is described in chapter 2.3.3, MSMC Bandwidth Management, of SPRUGW7A; it is a public document.

    2. Each core has two priorities: one for regular reads (pre-fetch) and one for urgent accesses. All cores have the same default priority, but this can be changed; see chapter 8 of PRUGW0C.

    3. Each memory bank grants access based on core priority. This arbitration is done at the bank boundaries.

    4. The run operation is serialized (but only because CCS writes several interrupts). If you want to take CCS out of the picture, see the sketch below.
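
    The cores can align themselves in software with a simple shared-memory barrier before starting their timers. Below is a minimal sketch; the flag address in MSMC SRAM and the function name are assumptions (reserve the area in your linker command file and clear it before the run), and on a real system the IPC library is the proper way to do this.

    #include <xdc/std.h>
    #include <ti/sysbios/family/C66/Cache.h>
    #include <c6x.h>                       /* DNUM, TSCL */

    #define NUM_CORES 8
    /* assumed scratch area in MSMC SRAM, cleared before the cores start */
    volatile unsigned int *arrived = (volatile unsigned int *)0x0C3FFF00;

    void barrier_then_start_timer(void)
    {
        unsigned int c, n;

        arrived[DNUM] = 1;                 /* each core writes only its own
                                              word, so no lock is needed   */
        Cache_wb((Ptr)&arrived[DNUM], sizeof(unsigned int),
                 Cache_Type_ALL, TRUE);    /* push the flag out to MSMC */
        do {
            /* the flags may be cached in L1D: invalidate before re-reading */
            Cache_inv((Ptr)arrived, NUM_CORES * sizeof(unsigned int),
                      Cache_Type_ALL, TRUE);
            n = 0;
            for (c = 0; c < NUM_CORES; c++)
                n += arrived[c];
        } while (n != NUM_CORES);

        TSCL = 0;                          /* every core starts its timer within a
                                              short window of all the others     */
    }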

  • Hi Ran,

    1. I think we have already read this document. Our question is: each core has a separate port for accessing MSMC SRAM, so where exactly does the serialization happen?

    4. How can we achieve parallel execution of the cores using CCS 5.5? As per the information mentioned earlier in this thread, when the controller supports synchronous execution and we use fixed grouping of cores, we can expect the issued commands to execute on all cores at the same instant. In CCS 5.5 there is no separate option for fixed grouping; by default, cores form a fixed group when we group them.

    As questioned earlier, if we take the first set of data (row 7), core 0 completes the operation in 64 clock cycles whereas core 1 takes 91 cycles.
    Looking at all the values, there appears to be a difference of 3 to 4 clock cycles between each core and the core that completed just before it (e.g., between core 0 and core 7 there are only 3 clock cycles).
    We did not set any priorities; everything ran with the default priority configuration.
    So why is there only a 3-cycle difference between core 0 and core 7 (assuming they execute in parallel with the CCS 5.5 setup used, i.e., grouping the cores and issuing the "run" command)? Can you explain in a bit more detail why it takes this long?
  • Hi Again,

    Where can we find the PRUGW0C document?
  • I think it should be sprugw0c,

    http://www.ti.com/lit/sprugw0

    Thank you.

  • Hi Raja,

    As per the information shared earlier in this thread, if we group the cores in CCS 5.5 it is treated as a fixed group, and since the C6678 supports synchronous execution, we can say that when we issue a command like "run" to the grouped cores, it is executed in parallel.

    Can you confirm this?
  • Hi,

    Any updates on this?

  • Sorry

    SPRUGW0C can be found at:

    www.ti.com/.../sprugw0c.pdf

    I believe that I have answered all your questions. Please close the thread.

  • As per the information shared earlier in this thread, if we group the cores in CCS 5.5 it is treated as a fixed group, and since the C6678 supports synchronous execution, we can say that when we issue a command like "run" to the grouped cores, it is executed in parallel.

    Can you confirm this?

    I recommend that you post this question in the CCS forum and close this thread.
    Thank you.
  • Hi Raja,

    Apart from the CCS question, there are other questions that remain unanswered. Please answer those.

  • Hi Again,

    I checked in the CCS forum and got the following information:

    "
    Fixed Groups
    Once a debug session is started, the user may create a more permanent group. This 'Fixed Group' has a specific node in the 'Debug' view that has its own debug context. Selecting this group debug context will cause debug commands to be sent to all group members without the need to select them individually. Note that while the commands will be sent simultaneously, how synchronously the commands are executed depends on whether the HW target itself supports synchronous execution."

    Also, the C6678 supports synchronous execution. So we can say that all cores run simultaneously in the experimental setup we used.
    Now, the questions are:

    1. As per the document (the controller document), we can see that each core has a separate port into the MSMC SRAM region. So at what point does serialization happen?

    2. Cells E7, K7, Q7, W7, AC7, AI7, AO7 and AU7 (in the attached excel sheet) hold the difference between each core's measured time (the time taken for the MSMC SRAM access) and the minimum MSMC SRAM access time among all cores.

    If we take the first set of data (row 7), core 0 completes the operation in 64 clock cycles whereas core 1 takes 91 cycles.

    Looking at all the values, there appears to be a difference of only 3 to 4 clock cycles between each core and the core that completed just before it.

    (For example, between core 0 and core 7 the gap is only 3 clock cycles.)

    Why is the gap only 3 clock cycles? Can you explain in a bit more detail why it takes this long?