TDA4VH-Q1: Help needed to replicate DSPLIB cycle counts on C7x DSP

Part Number: TDA4VH-Q1
Other Parts Discussed in Thread: TDA4VH

We are trying to replicate the cycle counts reported in the performance section of the DSPLIB user guide for vector operations on the C7x, but our results do not match. Our questions are at the end. Below are the details of our setup.

  1. Using J784S4 RTOS SDK 10.01.00.04 on Ubuntu 22.04.1 host
  2. Have followed the DSPLIB build instructions at this link: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-j784s4/10_01_00_04/exports/docs/dsplib/docs/user_guide/build_instructions_linux.html 
  3. Have followed the CCS baremetal instructions at this link: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-j784s4/latest/exports/docs/psdk_rtos/docs/user_guide/ccs_setup_j784s4.html#debugging-without-hlos-running-on-a72-rtos-only-baremetal 
  4. Using a new J784S4XG01EVM rev PROC141E5(001)

We used the TSC register in the example, DSPLIB_add (dsplib/examples/DSPLIB_add/DSPLIB_add_examples.cpp) as shown in the code snippet below and we measured the following.

The results do not match the results published in the DSPLIB user guide here: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-j784s4/10_01_00_04/exports/docs/dsplib/docs/user_guide/performance_summary.html#DSPLIB_grouped 

Now a couple questions.

  1. C71x_0 always runs faster than the other DSP cores (C71x_1/2/3). This is not expected, is it? If not, is this an artifact of the launch.js script? 
  2. DSPLIB_add is a simple example in that the size is only 14. So, I expected EVM cycles to be ~100 per the DSPLIB performance summary, however, what is measured is roughly 3 times larger. Can you please help me figure out what changes are needed in DSPLIB_add_examples.cpp or in build command to achieve the cycle counts in the performance summary?

 

  • Hi,
    I am trying to reproduce the issue from my end. I will update you within a day.

    Regards,
    Shabary

  • Hi,

    DSPLIB_add is a simple example in that the size is only 14. So, I expected EVM cycles to be ~100 per the DSPLIB performance summary, however, what is measured is roughly 3 times larger. Can you please help me figure out what changes are needed in DSPLIB_add_examples.cpp or in build command to achieve the cycle counts in the performance summary?

    In the DSPLIB_add_example file, the cycle counts are measured with TSC, and features such as cache and MMU are not enabled. The test main in d.c, however, uses the TI Profile APIs with cache and MMU enabled, which produces lower cycle counts. You can refer to the TI_profile.h and TI_profile.c files for more details on the profiling setup.

    Could you please validate this by running the DSPLIB_add test on c7x_0 and comparing the warm cycle counts with those provided in the DSPLIB user guide?

    Regards,
    Shabary

  • Shabary, thank you for your reply. It looks like the example for DSPLIB_add (code and data) could fit entirely in L2, so if L1P and L1D are enabled, we should be able to replicate the best-case performance with a simpler code base. This is the goal because it allows us to then extend the TSC-based cycle counting technique to other custom DSP code.
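
    For readers who want to reuse the TSC-based technique on their own kernels, here is a minimal sketch of a timing helper. It is not the code from DSPLIB_add_examples.cpp. It assumes the TI C7000 compiler defines `__C7000__` and exposes the TSC register as `__TSC` through `c7x.h`; the helper names (`readCounter`, `counterOverhead`, `measure`) are made up for illustration. On a host build the counter falls back to a nanosecond clock so the sketch still compiles, but those values are not DSP cycles.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined(__C7000__)
#include <c7x.h>  // assumed to declare the __TSC control register
#endif

// Read a free-running counter. On C7x this is assumed to be the TSC register;
// on a host build it is a nanosecond clock used only as a stand-in.
inline uint64_t readCounter()
{
#if defined(__C7000__)
    return __TSC;
#else
    using clock = std::chrono::steady_clock;
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            clock::now().time_since_epoch()).count());
#endif
}

// Cost of two back-to-back counter reads, taken as the minimum over a few tries.
inline uint64_t counterOverhead(int tries = 16)
{
    assert(tries > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < tries; ++i) {
        const uint64_t t0 = readCounter();
        const uint64_t t1 = readCounter();
        if (t1 - t0 < best) {
            best = t1 - t0;
        }
    }
    return best;
}

// Time one call of fn and subtract the read overhead (clamped at zero).
template <typename Fn>
uint64_t measure(Fn&& fn)
{
    const uint64_t overhead = counterOverhead();
    const uint64_t t0 = readCounter();
    fn();
    const uint64_t t1 = readCounter();
    const uint64_t raw = t1 - t0;
    return raw > overhead ? raw - overhead : 0;
}
```

    A call such as `measure([&] { myKernel(); })`, where `myKernel` is whatever function is being profiled, returns the elapsed count for one invocation. Subtracting the read overhead matters most for very short kernels, where a few cycles of measurement cost are a visible fraction of the result.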

    Regarding your request, 

    Could you please validate this by running the DSPLIB_add test on c7x_0 and comparing the warm cycle counts with those provided in the DSPLIB user guide.

    I ran test_DSPLIB_add (with the MSMC to L2 changes you provided) on both C7X_0 and C7X_1 and I am attaching the outputs. Please show me how the warm cycle counts from these outputs (focus on C7X_0) map to the performance numbers in the DSPLIB user guide. I don't know how to make that correlation because the CIO output only shows the vector size and does not list the data types for each test run. If I just look at the vector sizes and cycle counts, I cannot identify which test run corresponds to a specific row in the DSPLIB user guide.

    [C71X_0] ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                   DSPLIB_add testing starts.
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    | No  | ID  | Status | Num pt  | Kernel Init   | Kernel Compute  | NatC Compute  | Arch. Compute | Efficiency  | Est.  Compute | Accuracy    | Description
    |     |     |        |         |  cyc          |  cyc            |  cyc          | cyc (est.)    | vs Arch.(%) | cyc (est.)    | vs Est.(%)  |            
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    Warning at: row=0, col=0, val1=-1.395429, val2=-1.395429
    |   1 |   1 | PASS   |     256 |           365 |             359 |          5786 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    
    Warning at: row=0, col=9, val1=-7.658801, val2=-7.658801
    |   2 |   2 | PASS   |     512 |           362 |             514 |         11219 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    
    Warning at: row=0, col=2, val1=7.396252, val2=7.396253
    |   3 |   3 | PASS   |    1024 |           264 |             661 |         22439 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    
    Warning at: row=0, col=0, val1=10.839034, val2=10.839033
    |   4 |   4 | PASS   |    2048 |           455 |            1199 |         44760 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    
    Warning at: row=0, col=0, val1=-2.200345, val2=-2.200345
    |   5 |   5 | PASS   |   10240 |           465 |            4761 |        223701 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    |   6 |   6 | PASS   |     256 |           316 |             542 |          3851 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    |   7 |   7 | PASS   |     512 |           228 |             683 |          7365 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    |   8 |   8 | PASS   |    1024 |           467 |            1212 |         15116 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    |   9 |   9 | PASS   |    2048 |           373 |            1995 |         30119 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    |  10 |  10 | PASS   |   10240 |           462 |            9258 |        149454 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    |  11 |  11 | PASS   |     256 |           385 |             303 |          5090 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    |  12 |  12 | PASS   |     512 |           317 |             421 |          9790 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    |  13 |  13 | PASS   |    1024 |           488 |             541 |         19537 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    |  14 |  14 | PASS   |    2048 |           435 |            1030 |         39464 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    |  15 |  15 | PASS   |   10240 |           449 |            3415 |        196488 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    |  16 |  16 | PASS   |     256 |           409 |             273 |          4813 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    |  17 |  17 | PASS   |     512 |           262 |             208 |         10264 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    |  18 |  18 | PASS   |    1024 |           285 |             299 |         20042 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    |  19 |  19 | PASS   |    2048 |           506 |             566 |         40416 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    |  20 |  20 | PASS   |   10240 |           405 |            1827 |        236399 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    |  21 |  21 | PASS   |     256 |           330 |             244 |          4776 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    |  22 |  22 | PASS   |     512 |           219 |             352 |         10374 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    |  23 |  23 | PASS   |    1024 |           225 |             305 |         22045 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    |  24 |  24 | PASS   |    2048 |           481 |             400 |         41144 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    |  25 |  25 | PASS   |   10240 |           554 |             929 |        207183 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    Test Pass!
    Test    0: Cold Cycles =      359, Warm Cycles =      203, Warm Cycles WRB =      265
    Test    1: Cold Cycles =      514, Warm Cycles =      327, Warm Cycles WRB =      363
    Test    2: Cold Cycles =      661, Warm Cycles =      537, Warm Cycles WRB =      550
    Test    3: Cold Cycles =     1199, Warm Cycles =      999, Warm Cycles WRB =     1048
    Test    4: Cold Cycles =     4761, Warm Cycles =     4584, Warm Cycles WRB =     4627
    Test    5: Cold Cycles =      542, Warm Cycles =      327, Warm Cycles WRB =      356
    Test    6: Cold Cycles =      683, Warm Cycles =      539, Warm Cycles WRB =      569
    Test    7: Cold Cycles =     1212, Warm Cycles =      999, Warm Cycles WRB =     1038
    Test    8: Cold Cycles =     1995, Warm Cycles =     1883, Warm Cycles WRB =     1925
    Test    9: Cold Cycles =     9258, Warm Cycles =     9068, Warm Cycles WRB =     9140
    Test   10: Cold Cycles =      303, Warm Cycles =      169, Warm Cycles WRB =      201
    Test   11: Cold Cycles =      421, Warm Cycles =      261, Warm Cycles WRB =      292
    Test   12: Cold Cycles =      541, Warm Cycles =      412, Warm Cycles WRB =      418
    Test   13: Cold Cycles =     1030, Warm Cycles =      745, Warm Cycles WRB =      765
    Test   14: Cold Cycles =     3415, Warm Cycles =     3305, Warm Cycles WRB =     3357
    Test   15: Cold Cycles =      273, Warm Cycles =      143, Warm Cycles WRB =      165
    Test   16: Cold Cycles =      208, Warm Cycles =      169, Warm Cycles WRB =      201
    Test   17: Cold Cycles =      299, Warm Cycles =      262, Warm Cycles WRB =      267
    Test   18: Cold Cycles =      566, Warm Cycles =      412, Warm Cycles WRB =      478
    Test   19: Cold Cycles =     1827, Warm Cycles =     1709, Warm Cycles WRB =     1784
    Test   20: Cold Cycles =      244, Warm Cycles =      111, Warm Cycles WRB =      141
    Test   21: Cold Cycles =      352, Warm Cycles =      143, Warm Cycles WRB =      172
    Test   22: Cold Cycles =      305, Warm Cycles =      171, Warm Cycles WRB =      175
    Test   23: Cold Cycles =      400, Warm Cycles =      264, Warm Cycles WRB =      286
    Test   24: Cold Cycles =      929, Warm Cycles =      891, Warm Cycles WRB =     1088
    |  26 |1000 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  27 |1001 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  28 |1002 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  29 |1003 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  30 |1004 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  31 |1005 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  32 |1006 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    Test Pass!
    
    [C71X_1] ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                   DSPLIB_add testing starts.
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    | No  | ID  | Status | Num pt  | Kernel Init   | Kernel Compute  | NatC Compute  | Arch. Compute | Efficiency  | Est.  Compute | Accuracy    | Description
    |     |     |        |         |  cyc          |  cyc            |  cyc          | cyc (est.)    | vs Arch.(%) | cyc (est.)    | vs Est.(%)  |            
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    Warning at: row=0, col=0, val1=-1.395429, val2=-1.395429
    |   1 |   1 | PASS   |     256 |           438 |             506 |          6961 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    
    Warning at: row=0, col=9, val1=-7.658801, val2=-7.658801
    |   2 |   2 | PASS   |     512 |           362 |             666 |         13542 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    
    Warning at: row=0, col=2, val1=7.396252, val2=7.396253
    |   3 |   3 | PASS   |    1024 |           264 |             800 |         27097 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    
    Warning at: row=0, col=0, val1=10.839034, val2=10.839033
    |   4 |   4 | PASS   |    2048 |           531 |            1330 |         48586 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    
    Warning at: row=0, col=0, val1=-2.200345, val2=-2.200345
    |   5 |   5 | PASS   |   10240 |           541 |            4924 |        271085 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    |   6 |   6 | PASS   |     256 |           316 |             684 |          6186 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    |   7 |   7 | PASS   |     512 |           228 |             802 |          9685 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    |   8 |   8 | PASS   |    1024 |           543 |            1335 |         24369 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    |   9 |   9 | PASS   |    2048 |           449 |            2143 |         48807 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    |  10 |  10 | PASS   |   10240 |           538 |            9409 |        242894 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    |  11 |  11 | PASS   |     256 |           385 |             448 |          6099 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    |  12 |  12 | PASS   |     512 |           317 |             523 |         11220 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    |  13 |  13 | PASS   |    1024 |           502 |             680 |         22686 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    |  14 |  14 | PASS   |    2048 |           435 |            1097 |         48146 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    |  15 |  15 | PASS   |   10240 |           537 |            3578 |        240178 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    |  16 |  16 | PASS   |     256 |           409 |             398 |          5297 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    |  17 |  17 | PASS   |     512 |           262 |             358 |         11130 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    |  18 |  18 | PASS   |    1024 |           288 |             453 |         22678 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    |  19 |  19 | PASS   |    2048 |           585 |             683 |         46630 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    |  20 |  20 | PASS   |   10240 |           481 |            2067 |        236841 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    |  21 |  21 | PASS   |     256 |           406 |             369 |          4866 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 256
    |  22 |  22 | PASS   |     512 |           219 |             508 |         11327 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 512
    |  23 |  23 | PASS   |    1024 |           216 |             437 |         24115 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 1024
    |  24 |  24 | PASS   |    2048 |           482 |             535 |         49195 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 2048
    |  25 |  25 | PASS   |   10240 |           631 |            1100 |        255101 |             0 |           0 |             0 |           0 | STATIC generated input | Data size = 10240
    Test Pass!
    Test    0: Cold Cycles =      506, Warm Cycles =      277, Warm Cycles WRB =      342
    Test    1: Cold Cycles =      666, Warm Cycles =      406, Warm Cycles WRB =      442
    Test    2: Cold Cycles =      800, Warm Cycles =      613, Warm Cycles WRB =      626
    Test    3: Cold Cycles =     1330, Warm Cycles =     1071, Warm Cycles WRB =     1120
    Test    4: Cold Cycles =     4924, Warm Cycles =     4658, Warm Cycles WRB =     4705
    Test    5: Cold Cycles =      684, Warm Cycles =      405, Warm Cycles WRB =      434
    Test    6: Cold Cycles =      802, Warm Cycles =      614, Warm Cycles WRB =      633
    Test    7: Cold Cycles =     1335, Warm Cycles =     1067, Warm Cycles WRB =     1112
    Test    8: Cold Cycles =     2143, Warm Cycles =     1958, Warm Cycles WRB =     2001
    Test    9: Cold Cycles =     9409, Warm Cycles =     9146, Warm Cycles WRB =     9320
    Test   10: Cold Cycles =      448, Warm Cycles =      247, Warm Cycles WRB =      279
    Test   11: Cold Cycles =      523, Warm Cycles =      330, Warm Cycles WRB =      354
    Test   12: Cold Cycles =      680, Warm Cycles =      494, Warm Cycles WRB =      499
    Test   13: Cold Cycles =     1097, Warm Cycles =      831, Warm Cycles WRB =      844
    Test   14: Cold Cycles =     3578, Warm Cycles =     3408, Warm Cycles WRB =     3440
    Test   15: Cold Cycles =      398, Warm Cycles =      213, Warm Cycles WRB =      229
    Test   16: Cold Cycles =      358, Warm Cycles =      246, Warm Cycles WRB =      278
    Test   17: Cold Cycles =      453, Warm Cycles =      359, Warm Cycles WRB =      359
    Test   18: Cold Cycles =      683, Warm Cycles =      494, Warm Cycles WRB =      539
    Test   19: Cold Cycles =     2067, Warm Cycles =     1817, Warm Cycles WRB =     1867
    Test   20: Cold Cycles =      369, Warm Cycles =      179, Warm Cycles WRB =      203
    Test   21: Cold Cycles =      508, Warm Cycles =      228, Warm Cycles WRB =      257
    Test   22: Cold Cycles =      437, Warm Cycles =      251, Warm Cycles WRB =      255
    Test   23: Cold Cycles =      535, Warm Cycles =      356, Warm Cycles WRB =      365
    Test   24: Cold Cycles =     1100, Warm Cycles =      989, Warm Cycles WRB =     1174
    |  26 |1000 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  27 |1001 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  28 |1002 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  29 |1003 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  30 |1004 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  31 |1005 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    |  32 |1006 | PASS   |       0 |             0 |               0 |             0 |             0 |         nan |             0 |         nan | COVERAGE TEST
    Test Pass!
    

  • Hi,

    I don't know how to make that correlation because the CIO output only shows the vector size and it doesn't list the data types for each test run. If I just look at the vector sizes and cycle counts I can not identify which test run corresponds to a specific row in the DSPLIB users guide.

    You can refer to the test_cases_list.csv at ti-processor-sdk-rtos-j784s4-evm-10_01_00_04\dsplib\test\DSPLIB_idat_gen\DSPLIB_add to identify the data types for each test run.

    Regards,
    Shabary

  • Thank you for pointing out that .csv file, Shabary. This is helpful for decoding the test outputs. One other minor doubt I have about the test output is the correlation between the test run numbers 1 to 25 and the cold/warm cycle test run numbers 0 to 24. I will assume that they are just offset by one, so that the warm cycles listed for test run 0 correspond to test run 1 from the earlier part of the output.
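
    The offset-by-one pairing described above can be sketched as a small host-side parser for the CIO log. This is only an illustration under that assumption (table row = test index + 1), which the thread does not confirm; the `WarmResult` struct and `parseWarmCycles` function are hypothetical names, not part of DSPLIB.

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

struct WarmResult {
    int tableRow;  // "No" column of the results table, ASSUMED to be test index + 1
    int cold;      // Cold Cycles
    int warm;      // Warm Cycles
    int warmWrb;   // Warm Cycles WRB
};

// Extract every "Test N: Cold Cycles = ..., Warm Cycles = ..., Warm Cycles WRB = ..."
// line from the log text and pair it with the assumed table row.
std::vector<WarmResult> parseWarmCycles(const std::string& log)
{
    std::vector<WarmResult> out;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        int idx = 0, cold = 0, warm = 0, wrb = 0;
        const int matched = std::sscanf(
            line.c_str(),
            " Test %d: Cold Cycles = %d, Warm Cycles = %d, Warm Cycles WRB = %d",
            &idx, &cold, &warm, &wrb);
        if (matched == 4) {
            assert(idx >= 0);
            out.push_back({idx + 1, cold, warm, wrb});
        }
    }
    return out;
}
```

    Lines such as "Test Pass!" and the table rows are skipped because they do not match all four fields. The resulting rows can then be lined up against test_cases_list.csv to attach a data type to each warm cycle count.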

    I have tabulated the results for your reference below. I am not able to see the correlation between the test output in the blue and green columns versus the DSPLIB User Guide performance cycles in the yellow column. When you run test_DSPLIB_add on your TDA4VH EVM do you observe similar results to the blue and green columns below or do the cycles match the User Guide performance values?

  • Shabary,

    I have an update on this. I have rerun the Release build of test_DSPLIB_add and I am now able to replicate the cycle counts from the DSPLIB user guide on C71X_0 as shown below. The previously reported numbers were from the Debug build profile, so you can disregard them. I have shared an updated table below.

    Also, we made the changes to the example source code to move the data to L2 and now that example using TSC register also gives similar cycle count results on C71X_0 to those reported in DSPLIB user guide performance number. So this is now resolved. 

    However we would still like your help to understand why the cycle counts for C71X_1, 2 and 3 are higher than the cycle counts on C71X_0.

  • Hi,

    why the cycle counts for C71X_1, 2 and 3 are higher than the cycle counts on C71X_0

    I will check on that internally and update you.

    Regards,
    Shabary.

  • Hi Shabary. Have you been able to make any progress on understanding why C71X_1, 2 and 3 cycles are higher than the cycle counts on C71X_0? Thank you.

  • Hi,

    Have you been able to make any progress on understanding why C71X_1, 2 and 3 cycles are higher than the cycle counts on C71X_0?

    Yes, the values should be the same for all the cores. I am checking the priorities at the interconnect level and will get back to you.

    Regards,
    Shabary

  • Hi,

    I was checking whether any cache dependencies might be affecting the performance on core_0 to achieve faster results. However, even after disabling the cache, the issue still persists.
    I'll continue debugging and will update you.

    Best regards,
    Shabary



  • Hi,

    Core_0 is running faster compared to the other cores because the L2SRAM and MSMC addresses used in the linker script belong to Core_0.
    When you run the code on the other cores, they access memory addresses intended for Core_0, which introduces delays.

    You can refer to the following file to find the correct L2SRAM and MSMC addresses for each core:
    ti-processor-sdk-rtos-j784s4-evm-10_01_00_04\vision_apps\platform\j784s4\rtos\(c7x_1, c7x_2, c7x_3, c7x_4)\linker_mem_map.
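
    One way to catch this kind of mismatch early is a startup check that a buffer really lies inside the memory region belonging to the core that is running. The sketch below is only an illustration: `MemRegion`, `contains` and `bufferIsLocal` are hypothetical names, and the base and size values must be taken from the linker_mem_map file for the core in question; none are hard-coded here.

```cpp
#include <cassert>
#include <cstdint>

// A memory window described by its start address and length in bytes.
// The values are expected to come from the per-core linker memory map.
struct MemRegion {
    uint64_t base;
    uint64_t size;
};

// True when [addr, addr + len) lies entirely inside the region.
// Written to avoid overflow in addr + len.
inline bool contains(const MemRegion& r, uint64_t addr, uint64_t len)
{
    return addr >= r.base && len <= r.size && (addr - r.base) <= (r.size - len);
}

// Convenience wrapper for a pointer to a data buffer.
inline bool bufferIsLocal(const void* buf, uint64_t len, const MemRegion& ownRegion)
{
    const uint64_t addr = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(buf));
    return contains(ownRegion, addr, len);
}
```

    Calling `bufferIsLocal(input, bytes, myL2)` before the timed section, with `myL2` filled in for the current core, would flag a build that still uses another core's addresses.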

    Regards,
    Shabary.

  • Shabary,  thank you for chasing down the answer to this. I created a separate build configuration for c7x_2 versus c7x_1 and confirmed that we get identical performance when we use the correct L2 and MSMC addresses for c7x_2. See results pasted below, and note that the core numbers start with 0 offset in the debugger so C71X_0 below maps to c7x_2 from the SDK reference you provided.

  • Shabary thanks again for the on-going help with this. Going back to an earlier post in this thread, I incorrectly stated that we are able to measure similar cycle counts with a simple example. While we have been able to reproduce the DSPLIB published cycle counts by using the test_DSPLIB_add code, we have not been able to get close to these cycle counts with a simple example. We have tried to follow the techniques used in the DSPLIB test to preload the operands into the L1D and warm-up L1D and branch prediction H/W by pre-running the DSPLIB_add kernel several times, however when using DSPLIB_add on 256 floats we still measure 274 cycles instead of the ~100 cycles from the DSPLIB user guide. I am attaching the source file with changes. If you build this with the example/DSPLIB_add configuration the results printed to CIO are "Number of clock cycles elapsed in 274".

    Can you please let us know what else we may need to do to reduce the cycles further to match the DSPLIB test results.
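
    For reference, the warm-up approach described above (run the kernel once cold, then repeat it and keep the best warm time) can be sketched as follows. This is not the DSPLIB test code and not the attached source file; `CycleStats` and `coldAndWarm` are hypothetical names, and the counter is passed in as a callable so the sketch does not depend on any particular register. A warm-up loop of this kind can only show a benefit when the L1D and L1P caches are actually enabled.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

struct CycleStats {
    uint64_t cold;  // first call, nothing pre-loaded by this harness
    uint64_t warm;  // minimum over the repeated calls that follow
};

// kernel: callable that runs the code under test once.
// now:    callable that returns a monotonically increasing counter value.
template <typename Kernel, typename Counter>
CycleStats coldAndWarm(Kernel&& kernel, Counter&& now, int warmRuns = 8)
{
    assert(warmRuns > 0);

    uint64_t t0 = now();
    kernel();
    const uint64_t cold = now() - t0;

    uint64_t warm = std::numeric_limits<uint64_t>::max();
    for (int i = 0; i < warmRuns; ++i) {
        t0 = now();
        kernel();
        const uint64_t elapsed = now() - t0;
        if (elapsed < warm) {
            warm = elapsed;
        }
    }
    return {cold, warm};
}
```

    Taking the minimum rather than the last run makes the warm figure less sensitive to a single disturbed iteration. If cold and warm come out nearly equal on the target, that is a hint that the caches are not doing any work.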

    DSPLIB_add_example_mods_256.cpp

  • Hi,
    I will check and update on that.

    Regards,
    Shabary.

  • Hi,

    Can you please let us know what else we may need to do to reduce the cycles further to match the DSPLIB test result

    To further reduce the clock cycles to the values mentioned in the user guide, you need to enable the cache; see DSPLIB_TEST_init() in the d.c file.
    Could you please check that and also share the linker script corresponding to the .cpp code you shared, so that I can try it from my end?

    Regards,
    Shabary.

  • Thank you Shabary. I didn't realize the caches were not enabled in the DSPLIB example. In that case, my efforts to preload L1D and L1P are in vain. I will look into enabling cache per the DSPLIB_TEST_init code you referenced. The .cpp file I attached earlier can be dropped into the examples/DSPLIB_add folder of the DSPLIB install, and it will build using the usual cmake command below. The linker command file for this is at the relative path, "ti-processor-sdk-rtos-j784s4-evm-10_01_00_04/dsplib/cmake/linkers/C7120" 

    cmake -B build -DTARGET_PLATFORM="" -DBUILD_EXAMPLE="1" -DKERNEL_NAME="DSPLIB_add" -DSOC="j784s4" -DDEVICE="C7120" -DDSPLIB_DEBUGPRINT="0" -DCMAKE_EXPORT_COMPILE_COMMANDS="TRUE" -DCMAKE_BUILD_TYPE="Release"

  • Hi,

    The .cpp file I attached earlier can be dropped into the examples/DSPLIB_add folder of the DSPLIB install, and it will build using the usual cmake command below. The linker command file for this is at the relative path, "ti-processor-sdk-rtos-j784s4-evm-10_01_00_04/dsplib/cmake/linkers/C7120" 

    Thank you, I will check from my end and update you.

    Regards,
    Shabary.

  • Hi,
    In the test code, both cache and MMU functions are enabled, which is why we observe such high performance.
    The example code is intended only to demonstrate the implementation of the APIs used within the kernel, so we cannot expect the same level of performance from an example or standalone code.
    If you need high performance, you can add your use case as a test case in the CSV file and run it as part of the test code.

    Regards,
    Shabary S Sundar.

  • Thanks Shabary. 

    Following your suggestion, we could achieve high performance on the DSPLIB example by adding the following lines from dsplib/test/common/c71/DSPLIB_TEST_init.c and by including the list of files below:
    c7x_simple_l1_l2_msmc_ddr_ptc.c
    DSPLIB_TEST_c7xecr.{h,asm} 
    enable_cache_mmu.{h,c}
    invalidate_tlb.{h,c}

    Just wanted to share this for other developers.

    Thanks,
    Ian

  • Hi Ian,
    Thanks for sharing the information.

    Regards,
    Shabary S Sundar.