This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

  • Resolved

TMDSMDSK6455: Reading from memory takes too long

Prodigy 40 points

Replies: 6

Views: 152

Part Number: TMDSMDSK6455

Hi guys, I'm at the beginning of my journey in DSP programming. I have a DSK6455 and I'm going to make the following piece of code optimized.

#pragma MUST_ITERATE(NUMBER_OF_COLUMNS, NUMBER_OF_COLUMNS, 1)
#pragma UNROLL(2)
	for (col = 0; col < NUMBER_OF_COLUMNS; ++col)
	{
		int16_t current_cost = 0;
		for (row = 0; row < NUMBER_OF_ROWS; ++row)
		{
			int16_t ix = (data_lookup_table[row][col] - data_lookup_table[row][which_column]) * 10.0;

			current_cost += sinlut[ix];//sine looking up.
		}

		if (current_cost > current_max)
		{
			current_max = current_cost;
			current_max_index = col;
		}
	}
	t1 = TSCL;

I tried to profile this piece of code, line by line, and noticed that reading from memory (for example data_lookup_table[row][col]) and storing them in a temporary variable, takes 80 to 120 clocks. I watched the C6000 optimization video series and there, It's told that reading from memory takes 4 clocks. Am I wrong? Can I make the program faster?

  • Hello!

    Fragment of code presented is not enough to make certain conclusion. Nevertheless, there are some points to consider.

    Number one is where do you place your data and lookup tables. It that was L2, your number looks so suspicious. If that was external memory, then caching is next to consider.

    We don't see the way your data_lookup_table was allocated. If that was static 2D array, or array of pointers, then in such organization elements of the row are neighbors in memory, whine elements of column are not. Depending on row length it may happen that accessing your matrix column-wise would result in cache miss, read slow multicycle performance.

    Next we don't see what was type of data in your data_lookup_table, why it matters to multiply by floating 10.0 if result was assigned to short.

  • In reply to rrlagic:

    Thanks man for your reply. This is my whole code. (Data are trancated)

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <stdint.h>
    #include <c6x.h>
    
    #define NUMBER_OF_ROWS 6
    #define NUMBER_OF_COLUMNS 100
    
    static const int16_t data_lookup_table[NUMBER_OF_ROWS][NUMBER_OF_COLUMNS] = {
    		977,973,969,979,974,969,980,975,970,980,975,969,978,972,982,975,969,...
    		-1783,-1789,-1796,-1778,-1786,-1795,-1776,-1786,-1794,-1775,-1785,...
    		-1080,-1089,-1099,-1075,-1087,-1100,-1073,-1087,-1099,-1074,-1088,...
    		16,37,58,78,98,117,140,158,179,201,221,240,261,279,303,320,339,362,...
    		-1778,1797,1773,1773,1749,1724,1724,1698,1673,1672,1646,1620,1619,...
    		-1090,-1131,-1171,-1177,-1218,-1259,-1267,-1308,-1350,-1359,-1401,...
    };
    
    #define SINMAX 9000
    static uint8_t sinlut[SINMAX + 1];
    void SinCreate ()
    {
       int i;
       double angle, angleinc;
    
       angleinc = 3.1415926535 / 2.0 / (SINMAX);
       for (i = 0, angle = 0.0; i <= SINMAX; ++i, angle += angleinc)
       {
          sinlut[i] = cos(angle) * 100;
       }
    }
    
    void alg(int which_column)
    {
    	SinCreate();
    	int row, col;
    	int current_max_index = 0;
    	int16_t current_max = 0;
        unsigned int t0, t1;
    
        TSCL = 0;
        t0 = TSCL;
    
    #pragma MUST_ITERATE(NUMBER_OF_COLUMNS, NUMBER_OF_COLUMNS, 1)
    #pragma UNROLL(2)
    	for (col = 0; col < NUMBER_OF_COLUMNS; ++col)
    	{
    		int16_t current_cost = 0;
    		for (row = 0; row < NUMBER_OF_ROWS; ++row)
    		{
    			int16_t ix = (data_lookup_table[row][col] - data_lookup_table[row][which_column]) * 10.0;
    
    			current_cost += sinlut[ix];
    		}
    
    		if (current_cost > current_max)
    		{
    			current_max = current_cost;
    			current_max_index = col;
    		}
    	}
    	t1 = TSCL;
    	printf("%d\n", t1-t0);
    
    	if (current_max_index == which_column)
    		printf("succeed\n");
    	else
    		printf("failed\n");
    }
    
    int main(void)
    {
    	alg(10);
    
    	return 0;
    }
    

  • In reply to saeed dadkhah:

    Hello!

    What you show is 6*100*sizeof(short) = 1200 bytes, and sinlut of 9001 bytes, so total size of data is well below 32KB capacity of l1D cache.

    Next, you data table has 100 columns of shorts, that's 200 bytes. Because cache line on C64+ is just 64 bytes, depending on 'which_column' value calculation of  (data_lookup_table[row][col] - data_lookup_table[row][which_column]) parenthesized difference may land on same cache line but may not. 

    As I mentioned before, because column-wise access to data table, each new row access will result in compulsory miss. So if your data isn't static table, but rather random data coming in run time, then the way they are organized and processed is not the best possible.

    However, my major concern is how did you measure memory access time, we don't see that in your code.

    So I went a bit further and took you example, complement with some thrash data of your size, compiled and run on C6670, as I have no C64+ at my desk.

    On the first attempt it took me 39397 cycles. Then I moved SinCreate(); to main and called alg() several times in main with different arguments, still receiving frustrating 39235 cycles. Next I applied -O3 optimization, and it gives 1339, 1170, 1166 cycles.

    Next I replaced multiplication by 10.0 with integer 10, as your data were integers, now scale down to 823, 672, 667 cycles.

    You did not tell us where your data were located, so I made sure my whole stuff was in L2 with the following linker command file:

    -c
    -heap  0x1000
    -stack 0xa000
    
    MEMORY
    {
        L2SRAM (RWX)   : org = 0x0800000, len = 0x100000
    }
    
    
    SECTIONS
    {
        .sysmem > L2SRAM
        .cinit > L2SRAM
        .stack > L2SRAM
        * > L2SRAM
    }
    

    So once again, the whole double loop of  600 iterations costs me something under 700 cycles. Your device would be somewhat less efficient. Nevertheless, if you insist you see 80 to 250 clocks on just mem access, something is terribly wrong on your side.

  • In reply to rrlagic:

    Hi, I really appreciate you for your responses. This is my final code. I replaced 10.0 with 10, and the number of cycles decreased from 105963 to 3400. Isn't it strange?! I also enabled -O3 optimization flag. One more question. Where should I place linker command file which you mentioned in your previous answer? Should I do anything else? Thanks for your contributions.

    #include <stdio.h> 
    #include <stdlib.h>
    #include <math.h>
    #include <stdint.h>
    #include <c6x.h>
    
    #define NUMBER_OF_ROWS 6
    #define NUMBER_OF_COLUMNS 100
    
    static const int16_t data_lookup_table[NUMBER_OF_ROWS][NUMBER_OF_COLUMNS] = {
    		977,973,969,979,974,969,980,975,970,980,975,969,978,972,982,975,969,978,971,980,973,981,973,982,973,981,972,980,987,977,985,973,980,986,974,980,986,991,977,982,986,990,994,978,981,984,986,988,989,990,991,991,991,990,989,988,986,984,1002,999,996,992,988,1003,998,994,988,1002,996,990,1003,996,1007,1000,992,1003,995,1005,997,1006,997,1006,997,1005,996,1003,1011,1001,1008,998,1005,1011,1001,1006,1012,1001,1006,1012,1001,1006,
    		-1783,-1789,-1796,-1778,-1786,-1795,-1776,-1786,-1794,-1775,-1785,-1795,-1778,-1789,-1771,-1784,-1795,-1778,-1791,-1775,-1788,-1773,-1787,-1771,-1787,-1773,-1790,-1775,-1762,-1781,-1768,-1788,-1776,-1765,-1788,-1777,-1768,-1758,-1783,-1775,-1767,-1760,-1754,-1784,-1779,-1774,-1771,-1768,-1766,-1764,-1763,-1763,-1764,-1766,-1768,-1771,-1774,-1779,-1748,-1753,-1760,-1767,-1775,-1748,-1757,-1766,-1777,-1753,-1764,-1775,-1753,-1766,-1747,-1759,-1774,-1756,-1770,-1753,-1769,-1754,-1769,-1755,-1771,-1758,-1775,-1762,-1749,-1768,-1755,-1773,-1762,-1752,-1771,-1762,-1752,-1772,-1763,-1754,-1772,-1765,
    		-1080,-1089,-1099,-1075,-1087,-1100,-1073,-1087,-1099,-1074,-1088,-1102,-1079,-1096,-1070,-1089,-1106,-1083,-1101,-1079,-1098,-1077,-1098,-1076,-1099,-1079,-1104,-1084,-1066,-1093,-1075,-1104,-1088,-1073,-1105,-1090,-1078,-1064,-1101,-1089,-1079,-1070,-1062,-1104,-1097,-1091,-1087,-1083,-1081,-1079,-1078,-1079,-1080,-1083,-1086,-1091,-1096,-1102,-1060,-1068,-1077,-1088,-1099,-1062,-1075,-1088,-1103,-1070,-1085,-1102,-1072,-1090,-1063,-1081,-1102,-1076,-1097,-1073,-1095,-1074,-1095,-1077,-1099,-1080,-1104,-1087,-1068,-1095,-1078,-1102,-1087,-1073,-1099,-1087,-1073,-1101,-1088,-1076,-1101,-1091,
    		16,37,58,78,98,117,140,158,179,201,221,240,261,279,303,320,339,362,379,402,421,443,460,485,501,524,539,564,587,602,626,639,662,686,696,721,743,767,777,800,823,845,868,873,894,916,937,959,978,999,1018,1037,1056,1074,1093,1110,1127,1143,1182,1197,1212,1226,1240,1278,1291,1303,1316,1352,1362,1373,1408,1418,1453,1461,1470,1504,1511,1544,1551,1584,1589,1622,1626,1658,1662,1694,1725,1728,1758,1760,1790,-1779,-1777,-1748,-1719,-1718,-1689,-1661,-1661,-1633,
    		-1778,1797,1773,1773,1749,1724,1724,1698,1673,1672,1646,1620,1619,1592,1590,1563,1536,1533,1505,1502,1473,1470,1440,1436,1406,1401,1370,1364,1358,1326,1319,1287,1279,1270,1238,1227,1218,1207,1173,1161,1149,1137,1124,1089,1075,1060,1046,1030,1015,998,982,966,948,931,912,894,874,856,854,834,815,795,774,768,747,727,705,698,678,656,648,626,615,595,572,561,540,528,506,493,473,459,438,425,404,389,377,355,341,322,307,292,271,255,242,220,206,191,172,156,
    		-1090,-1131,-1171,-1177,-1218,-1259,-1267,-1308,-1350,-1359,-1401,-1444,-1453,-1496,-1507,-1550,-1594,-1606,-1651,-1663,-1709,-1722,-1769,-1784,1768,1752,1706,1688,1671,1622,1603,1554,1534,1514,1465,1442,1421,1397,1346,1322,1297,1271,1245,1195,1169,1140,1112,1082,1054,1023,994,964,932,902,869,838,804,772,755,723,690,656,623,601,566,534,499,477,444,411,385,351,324,292,257,231,199,170,137,107,77,45,15,-14,-45,-76,-105,-136,-166,-194,-225,-257,-286,-319,-349,-378,-409,-441,-467,-499
    };
    
    #define SINMAX 9000
    static uint8_t sinlut[SINMAX + 1];
    void SinCreate ()
    {
       int i;
       double angle, angleinc;
    
       angleinc = 3.1415926535 / 2.0 / (SINMAX);
       for (i = 0, angle = 0.0; i <= SINMAX; ++i, angle += angleinc)
       {
          sinlut[i] = cos(angle) * 100;
       }
    }
    
    int alg(int which_column)
    {
        int row, col;
        int current_max_index = 0;
        int16_t current_max = 0;
    
    #pragma MUST_ITERATE(NUMBER_OF_COLUMNS, NUMBER_OF_COLUMNS, 1)
    #pragma UNROLL(2)
        for (col = 0; col < NUMBER_OF_COLUMNS; ++col)
        {
            int16_t current_cost = 0;
            for (row = 0; row < NUMBER_OF_ROWS; ++row)
            {
                int16_t ix = (data_lookup_table[row][col] - data_lookup_table[row][which_column]) * 10;
    
                current_cost += sinlut[ix];
            }
    
            if (current_cost > current_max)
            {
                current_max = current_cost;
                current_max_index = col;
            }
        }
    
        return current_max_index;
    }
    
    int main(void)
    {
        SinCreate();
    
        unsigned int t0, t1;
    
        TSCL = 0;
        t0 = TSCL;
        int current = alg(10);
        t1 = TSCL;
        printf("current: %d\n", current);
        printf("%d\n", t1-t0);
        return 0;
    }
    

  • In reply to saeed dadkhah:

    I found the linker command file:

    MEMORY
    {
        L2RAM:      o = 0x00800000  l = 0x00200000  /* 2MB L2 Internal SRAM */ 
        L1PRAM:     o = 0x00E00000  l = 0x00008000  /* 32kB L1 Program SRAM/CACHE */ 
        L1DRAM:     o = 0x00F00000  l = 0x00008000  /* 32kB L1 Data SRAM/CACHE */
        EMIFA_CE2:  o = 0xA0000000  l = 0x00800000  /* 8MB EMIFA CE2 */ 
        EMIFA_CE3:  o = 0xB0000000  l = 0x00800000  /* 8MB EMIFA CE2 */ 
        EMIFA_CE4:  o = 0xC0000000  l = 0x00800000  /* 8MB EMIFA CE2 */ 
        EMIFA_CE5:  o = 0xD0000000  l = 0x00800000  /* 8MB EMIFA CE2 */
        DDR2_CE0:   o = 0xE0000000  l = 0x20000000  /* 512MB EMIFB CE0 */ 
    } 
    
    SECTIONS
    {
        .text          >  L2RAM
        .stack         >  L2RAM
        .bss           >  L2RAM
        .cio           >  L2RAM
        .const         >  L2RAM
        .data          >  L2RAM
        .switch        >  L2RAM
        .sysmem        >  L2RAM
        .far           >  L2RAM
        .args          >  L2RAM
        .ppinfo        >  L2RAM
        .ppdata        >  L2RAM
      
        /* COFF sections */
        .pinit         >  L2RAM
        .cinit         >  L2RAM
      
        /* EABI sections */
        .binit         >  L2RAM
        .init_array    >  L2RAM
        .neardata      >  L2RAM
        .fardata       >  L2RAM
        .rodata        >  L2RAM
        .c6xabi.exidx  >  L2RAM
        .c6xabi.extab  >  L2RAM
    }

    Should I change it according to my board specifications? 

  • Guru 66190 points

    In reply to saeed dadkhah:

    Hi,

    The suggestion from  rrlagic was to put the code/data into L2 memory for faster access than put in external memory (DDR). The memory map is given by the linker command file you shown above. You need to modify the lcf based on your board. For the first glance, the L2 starts from 0x0080_0000 with 2MB size, this looks right.

    Regards, Eric 

This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.