Question about the simulation on DM6446

Hi,

I want to measure the performance of a loop like the following.

#include <std.h>

#include <log.h>
#include <clk.h>
#include "Cachecfg.h"

#define N 1024

int a[N], b[N], c[N];

Void myTask();

/*
 * ======== main ========
 */
Void main()
{
    LOG_printf(&trace, "hello world!");

    /* fall into DSP/BIOS idle loop */
    return;
}

Void myTask()
{
    // int a[N], b[N];
    int i, sum;
    // Float timeout, milliSecsPerIntr, cycles;
    LgUns start, stop, result;

    sum = 0;
    for(i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = i;
    }

    start = CLK_gethtime();

    #pragma MUST_ITERATE(N, , N)
    for(i = 0; i < N; i++)
        sum += a[i] * b[i];

    stop = CLK_gethtime();
    result = stop - start;

    LOG_printf(&trace, "The result is: %d\n", result);
    LOG_printf(&trace, "The sum is: %d\n", sum);
}

The compiler reports the following software pipeline information for the loop.

;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : ../clk.c
;* Loop source line : 44
;* Loop opening brace source line : 45
;* Loop closing brace source line : 45
;* Known Minimum Trip Count : 1024
;* Known Maximum Trip Count : 1024
;* Known Max Trip Count Factor : 1024
;* Loop Carried Dependency Bound(^) : 0
;* Unpartitioned Resource Bound : 1
;* Partitioned Resource Bound(*) : 1
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 0 0
;* .D units 1* 1*
;* .M units 1* 0
;* .X cross paths 1* 0
;* .T address paths 1* 1*
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 0 (.L or .S or .D unit)
;* Bound(.L .S .LS) 0 0
;* Bound(.L .S .D .LS .LSD) 1* 1*
;*
;* Searching for software pipeline schedule at ...
;* ii = 1 Schedule found with 10 iterations in parallel
;* Done
;*
;* Loop will be splooped
;* Collapsed epilog stages : 0
;* Collapsed prolog stages : 0
;* Minimum required memory pad : 0 bytes
;*
;* Minimum safe trip count : 1
;*
;* If you know that this loop will always execute at a multiple of <2048> and at least <2048> times, try adding "#pragma MUST_ITERATE(2048, ,2048)" just before the loop.
;*----------------------------------------------------------------------------*
$C$L4: ; PIPED LOOP PROLOG
.dwpsn file "../clk.c",line 44,column 0,is_stmt

SPLOOPD 1 ;10 ; (P)
|| MV .L1 A10,A5
|| MV .L2 B10,B5 ; |44|
|| MVC .S2 B5,ILC

;** --------------------------------------------------------------------------*
$C$L5: ; PIPED LOOP KERNEL
$C$DW$L$_myTask$7$B:
.dwpsn file "../clk.c",line 45,column 0,is_stmt

SPMASK L1
|| MV .L1 A4,A11 ; |41|
|| LDW .D1T1 *A5++,A4 ; |45| (P) <0,0>
|| LDW .D2T2 *B5++,B4 ; |45| (P) <0,0>

NOP 4
MPY32 .M1X B4,A4,A3 ; |45| (P) <0,5>
NOP 2

SPMASK L1
|| ZERO .L1 A6 ; |34|

SPKERNEL 5,0
|| ADD .L1 A3,A6,A6 ; |45| <0,9>

$C$DW$L$_myTask$7$E:
;** --------------------------------------------------------------------------*
$C$L6: ; PIPED LOOP EPILOG
$C$DW$12 .dwtag DW_TAG_TI_branch
.dwattr $C$DW$12, DW_AT_low_pc(0x00)
.dwattr $C$DW$12, DW_AT_name("_CLK_gethtime")
.dwattr $C$DW$12, DW_AT_TI_call
CALL .S2 _CLK_gethtime ; |47|
ADDKPC .S2 $C$RL1,B3,3 ; |47|
MV .L2X A6,B10
$C$RL1: ; CALL OCCURS {_CLK_gethtime} {0} ; |47|

I notice that ii = 1. I then ran the program for the DM6446 on two types of simulator.

1. C64x+ CPU Cycle Accurate Simulator, Little Endian

The result is 1045, which is what I expected.

2. C64x+ Megamodule Cycle Accurate Simulator, Little Endian

The result is 2069, which is much more than I expected.

To avoid possible cache misses, I touched arrays a and b just before the target loop. What causes the difference between the two simulators?
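
For reference, here is a rough sketch in plain C of the arithmetic behind my expectation. It is only an estimate, not anything produced by the tools. It assumes that "10 iterations in parallel" at ii = 1 means the pipeline is 10 stages deep, so the whole loop takes about ii * (trip count + stages - 1) cycles. The function names pipelined_cycles and extra_per_iteration are made up for this illustration.

```c
#include <assert.h>

/* Estimated cycles for a software-pipelined loop.
 * Assumption: a new iteration starts every `ii` cycles and the pipeline
 * needs (stages - 1) additional intervals to fill and drain. */
unsigned pipelined_cycles(unsigned ii, unsigned trip_count, unsigned stages)
{
    return ii * (trip_count + stages - 1u);
}

/* Extra cycles per loop iteration between two measurements of the same loop.
 * `slow` and `fast` are the two measured results, with slow >= fast. */
unsigned extra_per_iteration(unsigned slow, unsigned fast, unsigned trip_count)
{
    assert(trip_count > 0u);   /* avoid division by zero */
    assert(slow >= fast);      /* avoid unsigned wrap-around */
    return (slow - fast) / trip_count;
}
```

With ii = 1, a trip count of 1024 and 10 stages, pipelined_cycles gives 1033, which is close to the 1045 from the CPU simulator (I assume the remaining dozen or so counts come from the measurement itself). With the two measured results, extra_per_iteration(2069, 1045, 1024) gives 1, so the Megamodule simulator seems to add exactly one count per iteration rather than a one-time penalty.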