slow execution time for c67fastMath functions

I am using the atansp_i function from c67 fastMath 2.01.00.0 on the DSP core of an L138 Logic PD EVM running at 300MHz with an XDS510 ICE, CCS4.2.4. Build is set to release mode and I have tried all of the speed optimisation settings I can find. The code is running from internal RAM.

According to the benchmark table in the c67xfastRTS user guide this should execute in 19 clock cycles; however, it is taking approx. 0.3 us (timed using one of the h/w timers), or 90 clock cycles. The standard RTS atanf takes 8 us. I cannot find any benchmark source code to verify the execution speed. I have also tried to use the profile clock in CCS4, but this seems to give 20000000 counts for a single instruction for some reason. All other code also seems to be running slower than I would expect. There is a Linux benchmark test supplied with the EVM which shows an 8k sample MATH_atansp running in 19 us, but I do not have the source for this and I am not sure what they mean by 8k; it seems unlikely that it is doing 8k function calls.

Has anyone else seen slower than expected code execution?

Does anyone have any benchmark code with known results which I can run to check execution speed?

Thanks.
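As a quick check of the figures in the question above, a timer reading can be converted to clock cycles with simple integer arithmetic. The following is a minimal sketch in plain C; the helper name `ns_to_cycles` is hypothetical (it is not part of any TI library), and it assumes the 300 MHz core clock quoted in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a TI API): convert an elapsed time in
 * nanoseconds to CPU clock cycles for a core clock given in MHz.
 * cycles = time_ns * f_MHz / 1000, using integer arithmetic only. */
static uint64_t ns_to_cycles(uint64_t time_ns, uint32_t clock_mhz)
{
    assert(clock_mhz > 0u);
    return (time_ns * clock_mhz) / 1000u;
}
```

With the numbers from the post, `ns_to_cycles(300, 300)` gives 90 cycles for the measured 0.3 us, and `ns_to_cycles(8000, 300)` gives 2400 cycles for the 8 us RTS `atanf`, against the 19 cycles listed in the benchmark table.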

  • Keith,

    In your test code, did you achieve inlining of the atansp_i function? Did you use it in a small loop that iterated 128 times?

    These are the comments in the box at the end of the performance table. The fact that your actual performance is similar to that of the non-inlined versions leads me to believe that your code is not achieving this inlined and pipelined performance.

    My assumption is that the appropriate test would be a for (i=0; i<128; i++) loop with a single call to this function in the loop body that steps through an input array and output array or perhaps the output would be a summation - I am not sure which would work out best.

    Regards,
    RandyP

  • Randy,

    The inlining is working; I see no call in the disassembly view. I have tried your suggestions and the fastest I can get is the summation version, which is still taking 50 clock cycles; the output array version takes 59 cycles. Here is my code and the .asm output for that function. Any more ideas?

    Thanks.

    void TestAtansp_iTime(float *TanIn, float *AngleOut)
    {
        int Count = 0;

        for (Count = 0; Count < 128; Count++)
        {
            *AngleOut += atansp_i(*(TanIn + Count));
        }
    }

    ;******************************************************************************

    ;* FUNCTION NAME: TestAtansp_iTime *

    *** Ed: please see second post below for assembly listing.  ***

  • The formatting of your embedded text makes it difficult to visualize the assembly lines. There is a "Paste from Word" button that is supposed to help with formatting from Microsoft Office products. I generally paste everything into an ASCII editor like WordPad or NotePad and then copy/paste from there into the forum. Just FYI.

    There may be additional optimizations you can do with #pragmas and compiler switches, you can use all local variables / stack variables / internal memory, too. Look at the Wiki pages and search for "C6000 optimization" (no quotes) to find additional techniques you might use.

    You have reduced the time from 90 to 50, which is a big decrease. Without the exact test case that the author used, it is hard to predict what else must be done to get to the 19 cycle point.

    But in the end, duplicating a benchmark is not what your project needs. It needs the fastest possible implementation that reads from where you have the data and writes to where you want the data to go.

    How well does your application run at this point?

    Regards,
    RandyP
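The pragma and local-variable suggestions above could look roughly like the sketch below. It is only an illustration under stated assumptions: `atan_stub` is a hypothetical stand-in (a crude approximation that is only reasonable for |x| <= 1) so that the code builds with any C99 compiler; on the DSP the call would be `atansp_i()` from the fastRTS library. The function names `test_atan_array` and `test_atan_sum` are also invented for this example. `MUST_ITERATE` is a TI compiler pragma, so it is guarded to keep other compilers quiet.

```c
#include <assert.h>

#define N_SAMPLES 128

/* Hypothetical stand-in for atansp_i(): atan(x) ~ x / (1 + 0.28 x^2),
 * a rough approximation for |x| <= 1. Replace with the library call. */
static inline float atan_stub(float x)
{
    return x / (1.0f + 0.28f * x * x);
}

/* Array-output variant. `restrict` tells the compiler that the input
 * and output buffers never overlap, which helps software pipelining. */
void test_atan_array(const float *restrict tan_in, float *restrict angle_out)
{
    int i;

    assert(tan_in != 0 && angle_out != 0);
#ifdef __TI_COMPILER_VERSION__
    /* TI-specific hint: the loop always runs exactly 128 times. */
    #pragma MUST_ITERATE(128, 128, 128)
#endif
    for (i = 0; i < N_SAMPLES; i++) {
        angle_out[i] = atan_stub(tan_in[i]);
    }
}

/* Summation variant: accumulate in a local variable and return it,
 * so no store to memory is needed inside the loop. */
float test_atan_sum(const float *restrict tan_in)
{
    float sum = 0.0f;
    int i;

    assert(tan_in != 0);
    for (i = 0; i < N_SAMPLES; i++) {
        sum += atan_stub(tan_in[i]);
    }
    return sum;
}
```

When the loop bound is a compile-time constant, as it is here, the compiler can usually work out the trip count by itself; the pragma matters mainly when the count arrives in a variable.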

  • Randy,

    Thanks for the rapid reply. It is not just this benchmark; I only used it as an example of a "known" execution time. I initially used the CCS simulator, and the timings I obtained from it were approximately 10x faster than those I am achieving on the real hardware, which is why I am wondering if I have done something silly somewhere. Some of this code was implemented on an Analog Devices 21L065 about 10 years ago, and I had to write some of the functions in assembler to make them fast enough; I was hoping not to have to do the same again. I am investigating your processors because it looked as if TI had concentrated more on higher performance and floating point designs in its newer parts, but the results I am getting suggest this may not be the case. I have already searched the wikis, looked at the optimisation section of the C6000 programmers guide and followed the suggestions from the .asm file. I am running from internal memory (L2), but I do have to use some global variables for the isr functions.

    In order to achieve minimum execution time I think that I will also need to use direct interrupt control rather than SYS/BIOS. I have posted another question about how to do this (http://e2e.ti.com/support/dsp/omap_applications_processors/f/42/t/163492.aspx) but have not received any replies.

    I have copied the asm output via notepad below; it is now much easier to read.

    Thanks.

    ;******************************************************************************
    ;* FUNCTION NAME: TestAtansp_iTime                                            *
    ;*                                                                            *
    ;*   Regs Modified     : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,B0,B1,B2,B3,B4,B5,B6,  *
    ;*                           B7,B8,B9,B10,B11,SP,A16,A17,A18,A19,A20,A21,A22, *
    ;*                           A23,A24,A25,A26,A27,A28,A29,B16,B17,B18,B19,B20, *
    ;*                           B21,B22,B23,B24,B25,B26,B27,B28,B29,B30,B31      *
    ;*   Regs Used         : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,B0,B1,B2,B3,B4,B5,B6,  *
    ;*                           B7,B8,B9,B10,B11,DP,SP,A16,A17,A18,A19,A20,A21,  *
    ;*                           A22,A23,A24,A25,A26,A27,A28,A29,B16,B17,B18,B19, *
    ;*                           B20,B21,B22,B23,B24,B25,B26,B27,B28,B29,B30,B31  *
    ;*   Local Frame Size  : 0 Args + 0 Auto + 8 Save = 8 byte                    *
    ;******************************************************************************
    _TestAtansp_iTime:
    ;** --------------------------------------------------------------------------*
    ;**   -----------------------    V$0 = *AngleOut;
    ;**   -----------------------    K$15 = 1.0F;
    ;**   -----------------------    U$8 = TanIn;
    ;** 141 -----------------------    L$1 = 128;
    ;**   -----------------------    K$64 = 3.14159274101257324219F;
    ;**   -----------------------    K$63 = 1.57079637050628662109F;
    ;**   -----------------------    K$56 = -0.0723566934466361999512F;
    ;**   -----------------------    K$51 = 0.0393708795309066772461F;
    ;**   -----------------------    K$48 = -0.0139455096796154975891F;
    ;**   -----------------------    K$46 = 0.00230158213526010513306F;
    ;**   -----------------------    K$42 = -0.333329290151596069336F;
    ;**   -----------------------    K$40 = 0.199893012642860412598F;
    ;**   -----------------------    K$37 = -0.141750767827033996582F;
    ;**   -----------------------    K$34 = 0.105214990675449371338F;
    ;**   -----------------------    K$21 = 2.0F;
    ;**   -----------------------    K$11 = 0.0F;
    ;**   -----------------------    #pragma MUST_ITERATE(128, 128, 128)
    ;**   -----------------------    #pragma LOOP_FLAGS(4096u)
    ;** -----------------------g2:
    ;** 143 -----------------------    a = *U$8++;
    ;** 91 -----------------------    an = a < K$11;  // [7]
    ;** 94 -----------------------    U$9 = a;  // [7]
    ;** 94 -----------------------    if ( _fabsf(U$9) > K$15 ) goto g4;  // [7]
    ;** 86 -----------------------    temp = K$15;  // [7]
    ;** 85 -----------------------    s = 0;  // [7]
    ;**   -----------------------    goto g5;
    ;** -----------------------g4:
    ;** 96 -----------------------    temp = U$9;  // [7]
    ;** 97 -----------------------    a = K$15;  // [7]
    ;** 98 -----------------------    s = 1;  // [7]
    ;** -----------------------g5:
    ;** 62 -----------------------    C$10 = _rcpsp(temp);  // [6]
    ;** 62 -----------------------    C$9 = (K$21-C$10*temp)*C$10;  // [6]
    ;** 62 -----------------------    C$8 = (K$21-C$9*temp)*(C$9*a);  // [6]
    ;** 62 -----------------------    C$7 = C$8*C$8;  // [6]
    ;** 62 -----------------------    g4 = C$7*C$7;  // [6]
    ;** 71 -----------------------    C$6 = C$7*g4;  // [6]
    ;** 71 -----------------------    C$5 = g4*g4;  // [6]
    ;** 71 -----------------------    pol = (C$5*K$34+C$6*K$37+(g4*K$40+C$7*K$42)+((g4*K$46+C$7*K$48+K$51)*(C$5*g4)+C$6*g4*K$56))*C$8+C$8;  // [6]
    ;** 74 -----------------------    C$4 = s == 0;  // [6]
    ;** 74 -----------------------    P$11 = C$4 ? K$64 : K$63;  // [6]
    ;** 74 -----------------------    C$4 ? (C$3 = K$11) : (C$3 = P$11);  // [6]
    ;** 74 -----------------------    an ? (C$2 = -C$3) : (C$2 = C$3);  // [6]
    ;** 74 -----------------------    P$12 = s ? C$2-pol : C$2+pol;  // [6]
    ;** 74 -----------------------    V$0 += (a == K$11) ? K$11 : P$12;  // [6]
    ;** 141 -----------------------    if ( L$1 = L$1-1 ) goto g2;
    ;**   -----------------------    *AngleOut = V$0;
               MVKL    .S2     0x3fc90fdb,B7
               MVKL    .S2     0x3d21435c,B6

               MVKL    .S1     0xbeaaaa23,A16
    ||         MVKL    .S2     0x3e4cb0c1,B5

               MVKH    .S2     0x3fc90fdb,B7
    ||         MVKL    .S1     0xbc647bb5,A21

               MVK     .S2     0x80,B8           ; |141|
    ||         MVKL    .S1     0xbd942fbf,A19

               MVKH    .S2     0x3d21435c,B6
    ||         MVKL    .S1     0x3dd77af5,A24

               MVKH    .S2     0x3e4cb0c1,B5
    ||         MVKL    .S1     0xbe11271d,A23
    ||         ZERO    .L2     B9

               MVKH    .S1     0xbeaaaa23,A16
    ||         SET     .S2     B9,0x17,0x1d,B31
    ||         MV      .L1X    B4,A28            ; |138|
    ||         STW     .D2T2   B11,*SP--(8)      ; |138|

               DINT                              ; interrupts off
    ||         MVKH    .S1     0xbc647bb5,A21
    ||         ZERO    .L2     B4
    ||         MVK     .L1     0x6,A2            ; init prolog collapse predicate
    ||         MV      .D1X    B3,A29            ; |138|

               MV      .L1X    B7,A25
    ||         SET     .S2     B4,0x1e,0x1e,B3
    ||         MVKH    .S1     0xbd942fbf,A19
    ||         ZERO    .D1     A17               ; |74|

               MV      .L1X    B6,A20
    ||         MVKL    .S2     0x3b16d624,B10
    ||         STW     .D2T2   B10,*+SP(4)       ; |138|
    ||         MVKH    .S1     0x3dd77af5,A24
    ||         ZERO    .D1     A18               ; |74|

               MVKH    .S2     0x3b16d624,B10
    ||         MV      .L1X    B5,A22
    ||         LDW     .D1T1   *A28,A26
    ||         MVKH    .S1     0xbe11271d,A23
    ||         ADD     .L2     4,B8,B1
    ||         MV      .D2X    A4,B30

    ;*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*
    ;*      Loop source line                 : 141
    ;*      Loop opening brace source line   : 142
    ;*      Loop closing brace source line   : 144
    ;*      Known Minimum Trip Count         : 128                   
    ;*      Known Maximum Trip Count         : 128                   
    ;*      Known Max Trip Count Factor      : 128
    ;*      Loop Carried Dependency Bound(^) : 4
    ;*      Unpartitioned Resource Bound     : 10
    ;*      Partitioned Resource Bound(*)    : 10
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     1        0    
    ;*      .S units                     1        6    
    ;*      .D units                     0        1    
    ;*      .M units                    10*      10*   
    ;*      .X cross paths              10*       6    
    ;*      .T address paths             0        1    
    ;*      Long read paths              0        0    
    ;*      Long write paths             0        0    
    ;*      Logical  ops (.LS)           8        5     (.L or .S unit)
    ;*      Addition ops (.LSD)          7        5     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             5        6    
    ;*      Bound(.L .S .D .LS .LSD)     6        6    
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 10 Register is live too long
    ;*         ii = 10 Did not find schedule
    ;*         ii = 11 Register is live too long
    ;*         ii = 11 Register is live too long
    ;*         ii = 11 Did not find schedule
    ;*         ii = 12 Register is live too long
    ;*         ii = 12 Register is live too long
    ;*         ii = 12 Did not find schedule
    ;*         ii = 13 Register is live too long
    ;*         ii = 13 Register pressure too high: 5
    ;*         ii = 13 Cannot allocate machine registers
    ;*                   Regs Live Always   :  9/6  (A/B-side)
    ;*                   Max Regs Live      : 32/30
    ;*                   Max Cond Regs Live :  2/3
    ;*         ii = 13 Did not find schedule
    ;*         ii = 14 Register is live too long
    ;*         ii = 14 Schedule found with 7 iterations in parallel
    ;*      Done
    ;*
    ;*      Epilog not entirely removed
    ;*      Collapsed epilog stages       : 5
    ;*      Collapsed prolog stages       : 6
    ;*      Minimum required memory pad   : 20 bytes
    ;*      Minimum threshold value       : -mh24
    ;*
    ;*      Minimum safe trip count       : 1
    ;*----------------------------------------------------------------------------*
    $C$L6:    ; PIPED LOOP PROLOG
    ;** --------------------------------------------------------------------------*
    $C$L7:    ; PIPED LOOP KERNEL
    $C$DW$L$_TestAtansp_iTime$3$B:
     .dwpsn file "../src/hello.c",line 142,column 0,is_stmt

       [!B0]   MV      .L1     A5,A27            ; |74| <0,74>
    || [ B0]   XOR     .S1     A5,A6,A27         ; |74| <0,74>
    ||         MV      .S2     B19,B20           ; |91| <1,60> Split a long life
    ||         MV      .L2     B18,B19           ; |91| <2,46> Split a long life
    ||         MVD     .M2     B23,B24           ; |97| <2,46> Split a long life
    ||         MPYSP   .M1X    B6,A3,A4          ; |71| <2,46>
    ||         MV      .D2     B28,B5            ; |97| <4,18> Split a long life

               ADDSP   .L1X    A3,B8,A6          ; |71| <2,47>
    ||         MVD     .M2     B7,B23            ; |97| <3,33> Split a long life
    ||         MV      .D2     B5,B7             ; |97| <4,19> Split a long life
    ||         ABSSP   .S2     B28,B5            ; |94| <5,5>  ^

       [ A1]   SUBSP   .L2X    A27,B23,B29       ; |74| <0,76>
    ||         MPYSP   .M1X    A22,B6,A8         ; |71| <2,48>
    ||         MV      .D2     B17,B18           ; |91| <3,34> Split a long life
    ||         MPYSP   .M2     B27,B27,B6        ; |62| <3,34>
    ||         CMPGTSP .S2     B5,B31,B0         ; |94| <5,6>  ^

               ADDSP   .L1     A6,A4,A3          ; |71| <1,63>
    ||         MPYSP   .M1X    A24,B8,A6         ; |71| <2,49>
    ||         MV      .L2     B24,B17           ; |91| <4,21> Split a long life
    ||         MPYSP   .M2     B4,B5,B11         ; |62| <4,21>
    || [ B0]   MV      .S2     B28,B4            ; |96| <5,7>  ^
    || [!B0]   MV      .D2     B31,B4            ; |86| <5,7>

       [!A1]   ADDSP   .L2X    B23,A27,B29       ; |74| <0,78>
    ||         MPYSP   .M2     B6,B8,B5          ; |71| <2,50>
    ||         MPYSP   .M1X    B6,A4,A3          ; |71| <2,50>
    ||         MV      .D2     B2,B8             ; |94| <4,22> Post-sched spill
    ||         RCPSP   .S2     B4,B26            ; |62| <5,8>

               MV      .S2X    A3,B4             ; |94| <1,65> Post-sched spill
    ||         MPYSP   .M1     A23,A4,A7         ; |71| <2,51>
    ||         ADDSP   .L1     A20,A6,A9         ; |71| <2,51>
    ||         MV      .S1X    B4,A4             ; |94| <5,9> Post-sched spill
    ||         MV      .D2     B0,B2             ; |94| <5,9> Post-sched spill
    ||         MPYSP   .M2     B4,B26,B6         ; |62| <5,9>

               CMPEQSP .S2X    B11,A18,B0        ; |74| <0,80>
    ||         MPYSP   .M2     B6,B6,B6          ; |62| <3,38>

       [ B0]   MV      .S2X    A18,B11           ; |74| <0,81>
    ||         MPYSP   .M1X    B16,A3,A5         ; |71| <1,67>
    ||         MPYSP   .M2     B7,B5,B24         ; |62| <4,25>
    ||         SUBSP   .L2     B3,B11,B22        ; |62| <4,25>

       [!B0]   MV      .L2     B29,B11           ; |74| <0,82>
    || [ B1]   BDEC    .S2     $C$L7,B1          ; |141| <0,82>
    ||         MVD     .M2     B25,B11           ; |97| <1,68> Split a long life
    ||         MPYSP   .M1     A19,A3,A4         ; |71| <2,54>
    ||         MV      .S1X    B6,A3             ; |62| <3,40> Define a twin register

               MV      .D2     B4,B0             ; |94| <1,69> Post-sched spill
    ||         MVD     .M2     B24,B25           ; |97| <2,55> Split a long life
    ||         ADDSP   .S1     A5,A8,A9          ; |71| <2,55>
    ||         ADDSP   .L1     A7,A6,A7          ; |71| <2,55>
    ||         MPYSP   .M1X    B5,A9,A8          ; |71| <2,55>
    ||         MV      .S2X    A4,B4             ; |94| <5,13> Post-sched spill
    ||         SUBSP   .L2     B3,B6,B5          ; |62| <5,13>

       [!A2]   ADDSP   .L1X    B11,A26,A26       ; |74| <0,84>  ^
    || [!B0]   ZERO    .S1     A1                ; |85| <1,70>
    || [ B0]   MVK     .D1     0x1,A1            ; |98| <1,70>
    ||         MV      .S2     B22,B5            ; |94| <2,56> Split a long life
    ||         MPYSP   .M2     B10,B6,B8         ; |71| <3,42>
    ||         LDW     .D2T2   *B30++,B28        ; |143| <6,0>  ^

       [ A2]   SUB     .D1     A2,1,A2           ; <0,85>
    ||         SET     .S1     A17,31,31,A6      ; |74| <1,71>
    ||         CMPEQ   .L1     A1,0,A0           ; |74| <1,71>
    ||         MV      .D2     B21,B22           ; |94| <3,43> Split a long life
    ||         MPYSP   .M1     A21,A3,A3         ; |71| <3,43>
    ||         MV      .S2     B8,B21            ; |94| <4,29> Split a long life
    ||         MPYSP   .M2     B24,B22,B27       ; |62| <4,29>

       [!A0]   MV      .D1     A25,A5            ; |74| <1,72>
    ||         ADDSP   .L2X    B16,A5,B23        ; |71| <1,72>
    ||         MVD     .M1X    B5,A3             ; |94| <2,58> Post-sched spill
    ||         MV      .D2     B9,B16            ; |62| <2,58> Split a long life
    ||         MV      .S2     B27,B9            ; |62| <3,44> Split a long life
    ||         MPYSP   .M2     B6,B6,B8          ; |71| <3,44>

     .dwpsn file "../src/hello.c",line 144,column 0,is_stmt

               MV      .L2     B20,B0            ; |91| <1,73> Post-sched spill
    || [ A0]   MV      .D1     A18,A5            ; |74| <1,73>
    ||         ADDSP   .S1     A9,A7,A4          ; |71| <2,59>
    ||         ADDSP   .L1     A4,A8,A6          ; |71| <2,59>
    ||         MPYSP   .M1     A16,A3,A5         ; |71| <3,45>
    ||         CMPLTSP .S2X    B28,A18,B24       ; |91| <5,17>
    || [ B2]   MV      .D2     B31,B28           ; |97| <5,17>  ^
    ||         MPYSP   .M2     B26,B5,B5         ; |62| <5,17>

    $C$DW$L$_TestAtansp_iTime$3$E:
    ;** --------------------------------------------------------------------------*
    $C$L8:    ; PIPED LOOP EPILOG
    ;**   -----------------------    return;

               LDW     .D2T2   *+SP(4),B10       ; |145|
    || [!B0]   MV      .L1     A5,A27            ; |74| (E) <6,74>
    || [ B0]   XOR     .S1     A5,A6,A27         ; |74| (E) <6,74>

    ;** --------------------------------------------------------------------------*
               CMPEQSP .S2X    B11,A18,B0        ; |74| (E) <6,80>
       [!A1]   ADDSP   .L2X    B23,A27,B29       ; |74| (E) <6,78>
       [ A1]   SUBSP   .L2X    A27,B23,B29       ; |74| (E) <6,76>
       [ B0]   MV      .L2X    A18,B11           ; |74| (E) <6,81>
               RINT                              ; interrupts on
               NOP             1
       [!B0]   MV      .L2     B29,B11           ; |74| (E) <6,82>
               NOP             1
               ADDSP   .L1X    B11,A26,A3        ; |74| (E) <6,84>  ^
               NOP             3
    $C$DW$48 .dwtag  DW_TAG_TI_branch
     .dwattr $C$DW$48, DW_AT_low_pc(0x04)
     .dwattr $C$DW$48, DW_AT_TI_return

               STW     .D1T1   A3,*A28
    ||         RET     .S2X    A29               ; |145|

               LDW     .D2T2   *++SP(8),B11      ; |145|
     .dwpsn file "../src/hello.c",line 145,column 1,is_stmt
               NOP             4
               ; BRANCH OCCURS {A29}             ; |145|

    $C$DW$49 .dwtag  DW_TAG_TI_loop
     .dwattr $C$DW$49, DW_AT_name("C:\projects\L138 EVM with gel from SD\pgi interp test\Release\hello.asm:$C$L7:1:1329131350")
     .dwattr $C$DW$49, DW_AT_TI_begin_file("../src/hello.c")
     .dwattr $C$DW$49, DW_AT_TI_begin_line(0x8d)
     .dwattr $C$DW$49, DW_AT_TI_end_line(0x90)
    $C$DW$50 .dwtag  DW_TAG_TI_loop_range
     .dwattr $C$DW$50, DW_AT_low_pc($C$DW$L$_TestAtansp_iTime$3$B)
     .dwattr $C$DW$50, DW_AT_high_pc($C$DW$L$_TestAtansp_iTime$3$E)
     .dwendtag $C$DW$49

     .dwattr $C$DW$45, DW_AT_TI_end_file("../src/hello.c")
     .dwattr $C$DW$45, DW_AT_TI_end_line(0x91)
     .dwattr $C$DW$45, DW_AT_TI_end_column(0x01)
     .dwendtag $C$DW$45

  • Keith,

    Someone else can discuss competitors' parts and DSP architecture with you. That is not an area where I will be able to help.

    Your assembly listing is much easier to read now, thank you. I removed the previous listing to make the thread shorter. I hope you do not mind that edit.

    Your inner loop is indeed done using software pipelining. The comments say that ii=14 which means that each pass through the loop uses 14 instruction execution packets. This can be confirmed by counting the execution packets in the loop kernel, 14.

    If the loop is being executed a large number of times and the measured average time per input is still 50, there must be some big memory stalls or other instruction-related stalls. The instruction stalls are usually minimized (mostly removed) by the compiler. Try running this function and the data for it out of L1P/D SRAM instead of cache to see if that is where your problem is. If you are reading sequentially, the L1D cache should have been minimizing that effect, and the loop code would easily be cached in L1P after the first pass through. Also check your linker command file to make sure where your program and data are located.

    Regards,
    RandyP
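To put the ii = 14 figure into numbers, the sketch below estimates the stall-free cycle count of a software-pipelined loop. It is a rough approximation, not a statement of how the TI tools report timing: it assumes one new iteration starts every `ii` cycles and that the last iteration needs a further (stages - 1) kernel passes to drain, and it ignores function setup, interrupts and memory stalls. The function name is hypothetical.

```c
#include <assert.h>

/* Rough estimate (an approximation, not a TI formula): a pipelined loop
 * starts one iteration every `ii` cycles, and the final iteration still
 * needs (stages - 1) more kernel passes before its result is complete. */
static unsigned long pipelined_loop_cycles(unsigned long ii,
                                           unsigned long trip_count,
                                           unsigned long stages)
{
    assert(trip_count > 0ul && stages > 0ul);
    return ii * (trip_count + stages - 1ul);
}
```

For the listing in this thread (ii = 14, trip count 128, 7 iterations in parallel) this gives 14 x 134 = 1876 cycles, or about 14.7 cycles per call. That is far below the 50 cycles measured from L2, which supports the suspicion of memory stalls, and it is close to the 15 cycles reported later in the thread once code and data were placed in L1.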

  • Randy,

    Thank you, I have found the problem, but I have some more questions. When I place all sections in linker_dsp.cmd in a combination of L1P and L1D (they will not fit in just one), the 128x atansp_i() now runs in 15 cycles and my other code is also running much faster. I had assumed that the code would have automatically been loaded into the fastest available RAM as a normal part of the cache operation during execution. Looking at the L138 tech ref manual again, I see that this needs to be set up with the Program and Data Memory Controllers; however, I cannot find anything in the manual to say how to do it. Where is this information? Also, if I need to create more sections in my code rather than just the default .text, I cannot find anything in the compiler manual which describes how to do this.

    May I also make a suggestion - I have used the hello world example project supplied with the eval board as a starting point for my development tests. I suspect this is what many other people will also do, and it obviously has given a very poor impression of the DSP performance. Had it not been for your help I would probably have ended up having to write assembler functions, or possibly abandoning TI altogether. Most people will be using a DSP because they need fast maths, so why not provide example projects set up to show the performance?

    Thanks again.

  • Keith,

    It is interesting that your measurement shows faster than the documentation. This implies that the docs were not run exactly the way that you did it but that they made other tradeoffs.

    Depending on the application, there are different tradeoffs to make with the use of L1P and L1D as cache or SRAM. The same is true for the amount of cache to allocate in L2. Most people will use L1P and L1D as 100% cache, and choose some percentage of L2 that fits their requirements. But there are cases where L1D set as 50% cache can be a powerful way to speed up some operations; for tight loops like the one you have, L1P being 50% cache may not make much difference since the main program loop will quickly get into L1P cache and then will run just as fast as if it were in L1P SRAM.

    Keith Hall said:
    I had assumed that the code would have automatically been loaded into the fastest available RAM as a normal part of the cache operation during execution.

    The cache controller does automatically keep the most recently used program and data in cache so it can be reused. There is overhead getting it into the cache the first time, and this is what you avoid by loading things directly into L1D or L1P SRAM instead of waiting for cache to get it.

    Keith Hall said:
    Also if I need to create more sections in my code rather than just the default .text I cannot find anything in the compiler manual which describes how to do this.

    Check out the C Compiler User's Guide, #pragma DATA_SECTION and CODE_SECTION. You also have to add sections to the linker command file, which is discussed in the Assembly Language Tools Reference Guide.

    Thanks for the suggestions. Your comments will be seen by the right people. I am not sure of all the tradeoffs they make in creating the examples, but your feedback is a big part of the equation.

    Regards,
    RandyP
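The `CODE_SECTION` and `DATA_SECTION` pragmas mentioned above can be used along the following lines. This is a minimal sketch: the section names `.fast_text` and `.fast_data`, the memory range names in the comment, and the function and buffer themselves are all hypothetical. The pragmas are guarded so that the file also builds with a non-TI compiler, where they have no effect.

```c
#include <assert.h>

#define N_SAMPLES 128

#ifdef __TI_COMPILER_VERSION__
/* TI compiler only: assign the function and the buffer to named
 * sections. The pragmas must appear before the symbols are declared.
 * ".fast_text" and ".fast_data" are made-up section names. */
#pragma CODE_SECTION(scale_buffer, ".fast_text")
#pragma DATA_SECTION(work_buffer, ".fast_data")
#endif

float work_buffer[N_SAMPLES];

/* Example time-critical routine operating on the placed buffer. */
void scale_buffer(float gain)
{
    int i;

    assert(gain == gain); /* reject NaN gain */
    for (i = 0; i < N_SAMPLES; i++) {
        work_buffer[i] *= gain;
    }
}

/*
 * Matching fragment for the linker command file. L1P_SRAM and L1D_SRAM
 * are placeholder names; use the ranges defined in your own MEMORY block.
 *
 *   SECTIONS
 *   {
 *       .fast_text > L1P_SRAM
 *       .fast_data > L1D_SRAM
 *   }
 */
```

Placing only the hot function and its buffers this way leaves the rest of the program in L2 or external memory, which is one way around the problem of everything not fitting in L1.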

     


  • Randy,

    I still have the question of how to set up the cache. I have looked at the cache user guide (sprug82a), which says to use CSL functions such as CACHE_L1pSetSize(), but from what I understand you no longer provide a CSL. The BSL which came with the EVM does not seem to contain anything to do with the cache, and I cannot find anything in the GEL file. From other posts which I have found, it appears that the power-on state is L1 cache enabled, but my code execution time (before I relocated to L1) is the same each time I run it, which suggests that L1 cache is not enabled. Also, I cannot see any of the cache control registers (L1PCFG etc.) in the CCS register window to see what they are set to. The L138 data sheet gives a memory address of 0x01840020 for L1PCFG; looking at this in CCS, bits 0-2 are all 1, which according to the C674x megamodule data sheet should mean maximal cache. I am confused!

    Thanks.

  • Keith,

    Please go to the TI Wiki Pages and search for "C67 CSL" (no quotes). You will find the available portion of CSL as part of the PSP.

    Keith Hall said:
    I cannot see any of the cache control registers (L1PCFG etc.) in the CCS register window to see what they are set to. The L138 data sheet gives a memory address of 0x01840020 for L1PCFG

    The Registers view is not for the Memory Mapped Registers but for the DSP core's internal registers. Using the Memory Browser is the correct way to view these registers. There are also structs available in the CSL that can be used to view MMRs in the Expressions window, which is easier for me to do since you get labels for those registers.

    Run to main and set those bits to 0, then see if you get the same performance, after relocating back to L2 of course.

    Regards,
    RandyP
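The experiment suggested above (clearing the mode bits of L1PCFG) can also be done from code instead of the Memory Browser. The sketch below is an assumption-laden illustration, not the CSL implementation: the address and the bit positions are the ones quoted in this thread, the helper name `set_cache_mode` is hypothetical, and the register is passed in as a pointer so the logic can be checked off-target with an ordinary variable.

```c
#include <assert.h>
#include <stdint.h>

/* Address of L1PCFG as quoted in the thread from the L138 data sheet. */
#define L1PCFG_ADDR   0x01840020u
/* Bits 0-2 hold the cache mode, as described in the thread. */
#define L1PMODE_MASK  0x7u

/* Hypothetical helper: write `mode` into bits 0-2 of a cache
 * configuration register, leave the other bits unchanged, and return
 * the value read back from the register afterwards.
 * On the DSP, pass (volatile uint32_t *)L1PCFG_ADDR. */
static uint32_t set_cache_mode(volatile uint32_t *cfg_reg, uint32_t mode)
{
    uint32_t value;

    assert(cfg_reg != 0 && mode <= L1PMODE_MASK);
    value = (*cfg_reg & ~L1PMODE_MASK) | mode;
    *cfg_reg = value;
    return *cfg_reg; /* read back after the write */
}
```

On the target, a call such as `set_cache_mode((volatile uint32_t *)L1PCFG_ADDR, 0u)` early in `main` would correspond to setting those bits to 0; check the cache user guide for any additional steps the hardware requires when changing modes.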

  • Randy,

    This search produces no results, however I did eventually find http://focus.ti.com/docs/toolsw/folders/ which leads to http://software-dl.ti.com/dsps/dsps_public_sw/psp/BIOSPSP/index.html which has a CSL and also example code (QuickStart) including cache setup for the L138. I have downloaded this but not tried it yet.

    I also came across references to Starterware on some of the forum posts saying that it also gives cache control functions. Would it be better to use this as it seems to be more recent?

    Thanks.