AM5726: Slow performance of dsp_1(c66x) compared to a15_0

Part Number: AM5726
Other Parts Discussed in Thread: SYSBIOS

Hi, we measured the execution time of the same piece of code on dsp_1 (C66x) and on a15_0, and it runs about three times slower on the DSP than on a15_0.

Can you advise?

I attached the RTSC memory configuration of the C66x.

I'm also using EXT_RAM for the NDK buffers on the C66x, like this:

Program.sectMap[".far:NDK_PACKETMEM"] = {loadSegment: "EXT_RAM", loadAlign: 128};
Program.sectMap[".far:NDK_MMBUFFER"] = {loadSegment: "EXT_RAM", loadAlign: 128};

  • What software are you using?  Is this RTOS on both ARM and DSP?

    Note that the speed shown in that RTSC configuration screenshot is only there for you to tell the tool what speed the DSP is running at.  Specifying 750 MHz in the tool does not actually configure the DSP to run at that speed.  That is done by the bootloader, e.g. u-boot or SBL.

    We can check the speed of the DSP by running a JTAG script at run-time to collect the state of all the clock-registers.  I can then analyze them separately to see the speed of the DSP.  The script can be downloaded here:

    http://git.ti.com/sitara-dss-files/am57xx-dss-files/blobs/raw/main/am57xx-ctt.dss

    Directions on how to execute the script can be found here:

    http://git.ti.com/sitara-dss-files/am57xx-dss-files/blobs/main/README

    The script will output a *.rd1 file to your desktop.  Please rename to *.txt and attach to the thread.

    Other things to consider are compiler options, memory placement, and cache configuration.

    What code is it that's running so slowly?  Is it the NDK, signal processing, control code, etc.?

  • Hi Brad

    1. We use TI-RTOS on both the A15 and the DSP.

    2. We use the CPU time stamp counter to measure the time between interrupts. It matches the actual interrupt rate if we divide this counter by 750 MHz.  Is that enough to conclude that the CPU speed is 750 MHz?

    3. We posted all of the memory and cache configuration in the original report. Is that enough to conclude that our setup is correct, or are there other settings that we have to make in code?

    4. We run general-purpose code, which wraps control algorithms. We expect it to execute much faster. The control code is a minor part, and we will consider optimizing it later.

    Here is the simplified control flow:

    1) Interrupt ticks every 31.25 uSec

    2) Interrupt samples the CPU time stamp counter

    3) Interrupt posts a semaphore

    4) Task wakes and executes

    5) Simple control loop

    6) Data recorder, which saves the following data to memory:

    a) time stamp stored by the ISR

    b) current time

    c) sample counter that we run in the FPGA

    90% is general-purpose C++ code.

    We see that the difference between time stamps is 31.25 +/- 1 uSec, which means that we do not lose interrupts and that interpreting the time stamp at 750 MHz is correct.

    However, the time spent executing this code (current time (6.b) minus the ISR time stamp (2)) is much longer than for exactly the same code running on the A15.

    Best regards

    Rasty
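
The timing check described in this post can be sketched as follows. This is only an illustration, assuming the time stamp counter increments once per CPU cycle and that the DSP runs at 750 MHz; the names `CPU_HZ`, `cycles_to_usec` and `period_cycles` are hypothetical and not taken from the project.

```c
#include <stdint.h>

/* Assumed DSP clock in Hz (750 MHz, as discussed in the thread). */
#define CPU_HZ 750000000.0

/* Convert the difference between two time stamp counter readings
 * into microseconds. */
double cycles_to_usec(uint64_t start, uint64_t end)
{
    return (double)(end - start) * 1.0e6 / CPU_HZ;
}

/* Number of CPU cycles available within one interrupt period. */
double period_cycles(double period_usec)
{
    return period_usec * 1.0e-6 * CPU_HZ;
}
```

Under these assumptions, one 31.25 uSec interrupt period corresponds to roughly 23,437 CPU cycles, which is the budget the task has to fit into.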

  • Rasty Slutsker said:
    2. We use the CPU time stamp counter to measure the time between interrupts. It matches the actual interrupt rate if we divide this counter by 750 MHz.  Is that enough to conclude that the CPU speed is 750 MHz?

    Yes, that is enough.  However, there is a lot more insight to be gained if you run the script I mentioned.  It allows us to see the configuration of ALL the clocks.

    Rasty Slutsker said:
    3. We posted all of the memory and cache configuration in the original report. Is that enough to conclude that our setup is correct, or are there other settings that we have to make in code?

    You're not using a pointer to a shared memory location or anything like that, are you?  The DSP has a set of registers called "Memory Attribute Registers" (MARs) that control the cacheability of a range of memory.  For memory ranges defined in the platform, SYS/BIOS will automatically mark those regions as cacheable, but if you access data anywhere else, those registers default to 0 (non-cacheable), which will cause a huge performance degradation.

    Rasty Slutsker said:
    However, the time spent executing this code (current time (6.b) minus the ISR time stamp (2)) is much longer than for exactly the same code running on the A15.

    Please see the following application notes on c6000 optimization techniques:

    Introduction to TMS320C6000 DSP Optimization
    http://www.ti.com/lit/sprabf2

    Optimizing Loops on the C66x DSP
    http://www.ti.com/lit/sprabg7

    Best regards,
    Brad

  • Brad, here is the script output as you requested.

    I also got this error when running it:

    js:> loadJSFile C:/am57xx-ctt.dss
    ID_CODE = 0x2b99002f
    AM572x SR2.0 detected.

    Data collection complete.
    Created file C:\Users\vadim.malinovsky/Desktop/am57xx-ctt_2021-02-04_112706.rd1
    Exception occurred while parsing the documentation, please verify the documentation is conforms to the javadoc standard.
    js:>

    Is this normal? Thanks.

    am57xx-ctt_2021-02-04_112706.txt
    DeviceName AM572x_SR2.0_SR1.1
    0x4a005560 0x00000002
    0x4ae06118 0x00000000
    0x4a008920 0x00070000
    0x4ae06190 0x00000000
    0x4a0052e4 0x00000000
    0x4ae061c8 0x00000000
    0x4ae06174 0x00000000
    0x4a009848 0x00030000
    0x4ae07888 0x00030000
    0x4a005228 0x00000208
    0x4a002360 0x00000000
    0x4a0051ec 0x00801301
    0x4a009770 0x00000101
    0x4a009908 0x00030000
    0x4ae07a04 0x00000001
    0x4a005254 0x00000000
    0x4ae06154 0x00000000
    0x4a009858 0x00000002
    0x4a00521c 0x00010a04
    0x4ae061bc 0x00000000
    0x4a009750 0x00000002
    0x4a009780 0x00000101
    0x4a0097a8 0x00030000
    0x4a008778 0x00000001
    0x4a009328 0x03040002
    0x4a005234 0x00000007
    0x4a005244 0x00000201
    0x4a0052b8 0x00000204
    0x4a009388 0x00070000
    0x4a005130 0x00000002
    0x4a005154 0x00000005
    0x4a0098e8 0x00030000
    0x4a008e40 0x00030000
    0x4a005550 0x00030000
    0x4a0093e8 0x00030000
    0x4ae07830 0x00000002
    0x4a00516c 0x00c25807
    0x4a009620 0x00030000
    0x4a005520 0x00040001
    0x4a005580 0x00030000
    0x4a008210 0x0000000f
    0x4a008e28 0x00030000
    0x4ae061a0 0x00000000
    0x4a009738 0x00000002
    0x4a005248 0x00000003
    0x4a009800 0x00000002
    0x4a0052c8 0x0000020a
    0x4a0052d8 0x00000005
    0x4a005140 0x0000003e
    0x4a0052e8 0x00000001
    0x4a0098a8 0x00030000
    0x4a005720 0x00070000
    0x4ae06180 0x00000000
    0x4a009798 0x00000002
    0x4ae061c0 0x00000000
    0x4a0051e0 0x00000007
    0x4ae06108 0x00000000
    0x4a0051f0 0x00000801
    0x4ae07840 0x00000002
    0x4ae06170 0x00000000
    0x4a009200 0x00000003
    0x4a0097d8 0x00030000
    0x4a005210 0x00000007
    0x4a0098f0 0x00030000
    0x4ae07838 0x00000001
    0x4a005220 0x00000202
    0x4a009868 0x00030000
    0x4a008158 0x00000003
    0x4a00519c 0x00000000
    0x4a0086a0 0x00000000
    0x4ae061d4 0x00000000
    0x4ae06138 0x00000000
    0x4ae061c4 0x00000000
    0x4a0051f4 0x00000001
    0x4ae06114 0x00000000
    0x4a008e50 0x02000001
    0x4a009760 0x00000101
    0x4a009720 0x00000002
    0x4ae061d0 0x00000000
    0x4a009878 0x00030000
    0x4a00814c 0x00006004
    0x4a008200 0x00000007
    0x4a005160 0x00000007
    0x4a0093b0 0x00070000
    0x4a005170 0x00000201
    0x4ae061d8 0x00000000
    0x4a009120 0x00070000
    0x4a008b30 0x00000001
    0x4a0052c0 0x00000228
    0x4a0051ac 0x00000000
    0x4ae061a8 0x00000000
    0x4a00815c 0x00000004
    0x4a005568 0x00000002
    0x4a0093d0 0x08000002
    0x4a009820 0x00000002
    0x4a009220 0x00070000
    0x4a002534 0x00000000
    0x4a009020 0x00070000
    0x4a0098d0 0x00030000
    0x4a0056e0 0x00070000
    0x4a0093e0 0x00030000
    0x4a009130 0x00070000
    0x4a009850 0x00030000
    0x4a005158 0x00000004
    0x4a005764 0x00070000
    0x4a009788 0x00030000
    0x4a008160 0x00000004
    0x4ae07878 0x00030000
    0x4a009030 0x00070000
    0x4a009890 0x00030000
    0x4a009870 0x00000002
    0x4a0097f0 0x00000002
    0x4ae06194 0x00000000
    0x4a009340 0x00070000
    0x4a0097c4 0x00030000
    0x4a009810 0x00000101
    0x4a0097b0 0x00000002
    0x4a0051dc 0x00000000
    0x4a008780 0x00000001
    0x4a005144 0x00000005
    0x4a008164 0x00000002
    0x4a008140 0x00000007
    0x4a0097c8 0x00030000
    0x4a008150 0x00000804
    0x4ae0610c 0x00000000
    0x4a0098c8 0x00030000
    0x4ae06198 0x00000000
    0x4ae06168 0x00000000
    0x4ae06184 0x00000000
    0x4ae06148 0x00000000
    0x4a0051a0 0x00000005
    0x4a0051b0 0x00000001
    0x4ae0619c 0x00000000
    0x4a0056a0 0x00070000
    0x4a009740 0x00000002
    0x4a00821c 0x00000000
    0x4a009808 0x00000002
    0x4a009728 0x00000002
    0x4a005290 0x00000000
    0x4ae061cc 0x00000000
    0x4a005558 0x0b000002
    0x4ae0612c 0x00000000
    0x4ae06178 0x00000000
    0x4a005620 0x00070000
    0x4a009830 0x00030000
    0x4a00818c 0x0405401b
    0x4a00820c 0x0402ee09
    0x4a008f28 0x00030000
    0x4a002544 0xf757fdc0
    0x4a009898 0x00030000
    0x4ae061b0 0x00000000
    0x4a009840 0x00030000
    0x4a009768 0x00000101
    0x4a009904 0x00030000
    0x4a00515c 0x00000006
    0x4ae061b8 0x00000000
    0x4a0052bc 0x0000020a
    0x4a0093b8 0x00070000
    0x4a008b38 0x00000001
    0x4ae061b4 0x00000000
    0x4a009778 0x00000101
    0x4a005570 0x00000002
    0x4a0097a0 0x00000002
    0x4a0052a4 0x00000000
    0x4a005660 0x00070000
    0x4ae06110 0x00000002
    0x4a009828 0x00000002
    0x4ae06164 0x00000000
    0x4a008b40 0x00000000
    0x4ae06160 0x00000000
    0x4ae0615c 0x00000000
    0x4ae06158 0x00000000
    0x4a009028 0x00070000
    0x4a0098e0 0x00030000
    0x4a005284 0x00000005
    0x4a005294 0x00000001
    0x4a009330 0x03040002
    0x4ae0614c 0x00000000
    0x4a005420 0x00000001
    0x4a0093f0 0x00070000
    0x4ae06150 0x00000000
    0x4a008180 0x00000005
    0x4a0052b4 0x0000fa04
    0x4a008190 0x00000002
    0x4a00512c 0x00010a04
    0x4a005100 0x00000110
    0x4a0097f8 0x00000002
    0x4a008f20 0x00070000
    0x4ae0618c 0x00000000
    0x4a0098a0 0x00030000
    0x4ae061e0 0x00000000
    0x4a0098c0 0x00030000
    0x4a009790 0x00000002
    0x4a009838 0x00030000
    0x4a009818 0x00000101
    0x4a0097b8 0x00000002
    0x4a008728 0x00000001
    0x4a0097d0 0x00030000
    0x4a008e20 0x00000001
    0x4a0098f8 0x00030000
    0x4ae07880 0x00000002
    0x4ae0616c 0x00000000
    0x4a009860 0x00030000
    0x4ae061ac 0x00000000
    0x4a0098b0 0x00030000
    0x4a0052c4 0x00000208
    0x4a00513c 0x00000204
    0x4ae06120 0x00000000
    0x4a009748 0x00000002
    0x4ae06144 0x00000000
    0x4a005578 0x00030000
    0x4a005240 0x00009603
    0x4a009718 0x00040002
    0x4a009730 0x00000002
    0x4a0052a8 0x00000007
    0x4a005120 0x00000007
    0x4a0086b0 0x00000000
    0x4a009910 0x00000102
    0x4ae06188 0x00000000
    0x4ae061a4 0x00000000
    0x4ae06128 0x00000000
    

  • That's an unusual error, but the rd1 file generated is properly formed.  I was able to have a look using Clock Tree Tool.  I was able to confirm MPU frequency at 1.5 GHz, DSP at 750 MHz, DDR at 532 MHz, GPMC at 266 MHz.  I sanity checked a few others too, e.g. to make sure PER_96M_GFCLK was operating at 96 MHz and PER_48M_GFCLK was operating at 48 MHz.  Everything looks as expected.  It's not a clocking issue...  I think the other items like compiler options, cache configuration, are the next places to check.

  • Hi,

    How do we check the cache configuration?

    What information would you need from us to confirm the cache configuration?

    Thanks

    Rasty

  • Hey Brad,

    We do indeed have a shared memory region, SR_0. Previously I used SharedRegion (from IPC), but we abandoned it.

    SR_0 resides in EXT_RAM, and indeed we did not configure it in the C66x platform, so it is not cacheable. It is also not cacheable on a15_0; we don't want the shared memory to be cacheable because it is volatile memory.

    Just as a test I added SR_0 to the platform. The code ran 20 uSec faster, but it is still very slow, and the data is invalid (no cache coherency), so it's no help.

  • Vadim Malinovsky said:
    The code ran 20 uSec faster, but it is still very slow,

    Can you help quantify?  Previously you mentioned a 31.25us rate so an improvement of 20us sounds gigantic.  Or is the issue that you're nowhere close to meeting real-time requirements for that 31.25us interrupt rate?

    Vadim Malinovsky said:
    Just as a test I added SR_0 to the platform. The code ran 20 uSec faster, but it is still very slow, and the data is invalid (no cache coherency), so it's no help.

    I recommend that you enable cache to these regions and:

    1. Pad any structures to make sure they're a multiple of 128 bytes (cache line size), and also aligned on a 128 byte boundary.

    2. Reads: Perform a block-invalidate prior to the read.

    3. Writes: Perform a block-writeback after the write.
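
The padding and alignment rule in step 1 can be sketched in portable C11 as below. This is a minimal sketch, not the project's actual layout: `SharedBlock`, its 20-word payload and `round_up_to_line` are hypothetical names, and the 128-byte line size is the figure quoted in the steps above. On the target, the invalidate and writeback in steps 2 and 3 would be issued through the RTOS cache API (for example the SYS/BIOS `Cache_inv` and `Cache_wb` calls) over the whole padded block; those calls are left out here because they only exist on the target. Depending on the compiler version, an alignment pragma such as `DATA_ALIGN` may be needed in place of `_Alignas`.

```c
#include <stddef.h>
#include <stdint.h>

/* Cache line size quoted in the thread (bytes). */
#define CACHE_LINE 128

/* Round a byte count up to a whole number of cache lines, so that a
 * block invalidate or writeback never touches a neighbour's data. */
size_t round_up_to_line(size_t nbytes)
{
    return (nbytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Hypothetical shared block. Aligning the first member to the cache
 * line forces the whole struct to start on a 128-byte boundary, and
 * the compiler pads sizeof(SharedBlock) up to a multiple of 128. */
typedef struct {
    _Alignas(CACHE_LINE) uint32_t vars[20];
} SharedBlock;
```

With this layout the 80-byte payload occupies exactly one cache line, so a reader invalidates one line before reading and a writer writes back one line after writing.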

  • Hi,

    As an experiment, Vadim also cached the shared memory, which should not be cached in real life.

    This area is used to exchange a few tens of 32-bit variables, and the impact should not be significant. Once we get something working, we can experiment with cache invalidation.

    The problem:

    We have a piece of code that takes 20 uSec on the DSP, compared to 6 uSec on the ARM. It is generic C++ code plus a simple digital filter.

    We did not expect such poor performance; we were convinced that the DSP would be at least as fast as the ARM.

    There must be some fundamental problem with the setup, the cache, or something else.

    Best regards

    Rasty

  • What were your compiler options?  That makes a huge difference.

  • Also, exactly what compiler version are you using?

  • Hey Brad,

    I use the TI v8.3.8 compiler, and my options are:

    -mv6600 --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/boot/sbl/soc/am57xx" --include_path="C:/ti/ndk_2_26_00_08/packages" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/boot/sbl/board/src/" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/drv/gpio/test/led_blink/src" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/csl/src/ip" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/drv/gpio" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/csl/soc/am572x/src" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/board/src/maxxAM572x/include" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/csl/soc/am572x/src" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/drv/gpio" --include_path="C:/Maxx_Firmware/projects/ti_c66" --include_path="C:/Maxx_Firmware/projects/ti_c66/inc" --include_path="C:/ti/ti-cgt-c6000_8.3.8/include" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/transport/ndk/nimu" --include_path="C:/Maxx_Firmware/inc" --include_path="C:/Maxx_Firmware/" --include_path="C:/Maxx_Firmware/src/app" --include_path="C:/Maxx_Firmware/src/cli" --include_path="C:/Maxx_Firmware/src/app/network" --include_path="C:/Maxx_Firmware/src/app/network/ftp" --include_path="C:/Maxx_Firmware/src/app/common" --include_path="C:/Maxx_Firmware/src/app/common" --include_path="C:/Maxx_Firmware/Expat_2_2_5" --include_path="C:/Maxx_Firmware/inc/HAL" --include_path="C:/Maxx_Firmware/inc/unitest" --include_path="C:/Maxx_Firmware/src/recorderparser" --include_path="C:/Maxx_Firmware/inc/HAL/maxxAM572x" --include_path="C:/Maxx_Firmware/inc/HAL/minGW" --include_path="C:/Maxx_Firmware/src/app/ti_a15_0" --include_path="C:/Maxx_Firmware/inc/Board" --include_path="C:/Maxx_Firmware/src/app/maxx_a15_v0" --include_path="C:/ti/pdk_am57xx_1_0_10/packages/ti/board/src/maxxAM572x/include" --include_path="C:/Maxx_Firmware/src/app/minGW" --include_path="C:/Maxx_Firmware/src/app/ti_c66x" --define=YY_NO_UNISTD_H --define=BUILD_CONF=Debug --define=TI_RTOS --define=USE_BIOS 
--define=SOC_AM572x --define=C66X --define=core1 --define=am5726 -g --diag_suppress=48 --diag_suppress=1051 --diag_suppress=614 --diag_suppress=869 --diag_warning=225 --diag_wrap=off --display_error_number

  • I don't see any optimization enabled.  I recommend bumping it to -o2.  You should still be able to debug but sometimes things might look like they're executing out of order, etc.  That should give a nice bump in performance.  You can get even better performance if you go to -o3 and you disable symbolic debug (remove -g).  You should only do that if everything is working perfectly.

  • Hi Brad,

    I believe that we can get up to 30% with optimization. 

    Instead, we seem to have some fundamental performance problem.

    Can you help us review the cache configuration and other registers that can cause performance issues?

    Thanks

    Rasty

  • Rasty Slutsker said:
    I believe that we can get up to 30% with optimization. 

    Performance with no optimization at all tends to be really terrible.  I suspect it is much more than that.  Have you tested it?

    Rasty Slutsker said:
    Can you help us review the cache configuration and other registers that can cause performance issues?

    Your original post shows your cache configuration in your configuration.  Unlike the CPU frequency, the cache values there actually result in TI-RTOS configuring registers.  If you have it configured for 32K L1P/L1D and 256KB L2, then you're in good shape.  The one caveat I mentioned is using memories that are not in that memory map.  In that case you must make sure the associated cacheability bit is set in the Memory Attribute Register.  TI-RTOS does that for you for memories which you have listed.  I have never seen an instance where this has somehow not been configured, so I don't think it's worthwhile digging through every register to verify these things.

  • A few questions,

    1. When I tried to split OCMC_RAMx so that one part holds code/data and 64 KB is shared memory between a15_0 and the DSP, I ran into a problem.

    It seems that when you use

    var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
    Cache.setMarMeta(0x405F0000, 0x10000, Cache.Mar_DISABLE);  // set shared memory non-cacheable
    

    it turns off the cache for the whole of OCMC_RAM1-3, but we have code there, so nothing is cached and everything runs too slowly to work.

    From the C66x CorePac documentation I see that setMarMeta sets MAR64 to non-cacheable (PC=0) and non-prefetchable,

    and MAR64 covers the 16 MB from 0x4000_0000 to 0x40FF_FFFF.

    Can we set only part of OCMC_RAMx to be non-cache and not the whole 16MB?

    2. Can we share L2SRAM of DSP c66x internal memory with a15_0 somehow?

    With best regards,

    Vadim.

  • Vadim Malinovsky said:
    Can we set only part of OCMC_RAMx to be non-cache and not the whole 16MB?

    No, the DSP cacheability controls have 16MB granularity.  It is not possible to further partition a given block of memory.

    Vadim Malinovsky said:
    2. Can we share L2SRAM of DSP c66x internal memory with a15_0 somehow?

    Yes, DSP L2 memory is visible/accessible to the A15.  It would be possible to use this memory for that purpose.  Cache coherence for the DSP is handled in hardware for the internal memory.  That would give the DSP very fast access to the shared memory and you wouldn't have to worry (from the DSP) about coherence.  You would still need to maintain coherence on the ARM side.

    This strategy is reasonable if it's a very small chunk of shared memory.  If you have to reduce the L2-cache size in order to get enough  shared RAM then you're going to see performance degradation elsewhere.

    Do you have past experience handling cache coherence manually?  I've had a lot of cases where customers were reluctant to manually manage the cache operations, but once they actually did it they realized it wasn't nearly as difficult as they expected.  I still think that's the best option.

    Best regards,
    Brad
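
The 16MB granularity can be illustrated with a little address arithmetic. This is a sketch under stated assumptions: each MAR is taken to cover one 16 MB block selected by the top 8 address bits, and MAR0 is taken to sit at 0x01848000 with one 32-bit register per block, as I understand the C66x CorePac User Guide. The helper names are hypothetical; please verify the base address against the documentation before relying on it.

```c
#include <stdint.h>

/* Assumed address of MAR0; one 32-bit register per 16 MB block. */
#define MAR_BASE  0x01848000u
/* Each MAR covers 2^24 bytes = 16 MB. */
#define MAR_SHIFT 24

/* Index of the MAR that governs a given address. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> MAR_SHIFT);
}

/* Memory-mapped address of that MAR register. */
uint32_t mar_reg_addr(uint32_t addr)
{
    return MAR_BASE + 4u * mar_index(addr);
}

/* First and last byte of the 16 MB block sharing the same MAR. */
uint32_t mar_range_start(uint32_t addr) { return addr & 0xFF000000u; }
uint32_t mar_range_end(uint32_t addr)   { return addr | 0x00FFFFFFu; }
```

For the shared region at 0x405F0000 this gives MAR64, whose block runs from 0x40000000 to 0x40FFFFFF, which is why disabling caching for 64 KB inside OCMC RAM affects everything else in that 16 MB block.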

  • Brad Griffis said:

    Do you have past experience handling cache coherence manually?  I've had a lot of cases where customers were reluctant to manually manage the cache operations, but once they actually did it they realized it wasn't nearly as difficult as they expected.  I still think that's the best option.

    Best regards,
    Brad

    Hey Brad,

    I can just turn off the cache on the ARM side, and maybe manage coherence in the future. But cache and shared memory are only part of the problem. I did some measurements, and even when I mock up the shared memory data in internal OCMC_RAMx on the DSP side, our code runs slowly. Here are the measurements:

    Code performance

    Optimization  Shared data                             Cache  Time [uSec]  Variables  Shared bytes
    O3            Internal (OCMC_RAMx), mockup variables  Yes    19.707348    20         0 bytes
    O3            External (EXT_RAM)                      Yes    18.65922     20         ~80 bytes
    O3            External (EXT_RAM)                      No     37.671332    20         ~80 bytes

    Here is the "Memory allocation of DSP":

    Can you advise?

    With best regards Vadim.

  • Hi Brad, 

    I'd like to summarize what we have.

    With O2/O3 optimization and everything running from internal memory, we achieve performance that suits a Cortex-M3 rather than a 750 MHz floating-point DSP.

    If we exchange 80 bytes via non-cached DDR, we do not meet the deadline at all.

    There must be some fundamental problem, either with the setup, the CPU, or somewhere else.

    Best regards

    Rasty 

  • Rasty,

    Cache and compiler are the areas where I typically see a major issue.  To go further I request that you please put together a small buildable example that demonstrates your issue.  Something simple that runs entirely from DSP L2 memory that I can build and run as well.  That will allow us to compare numbers to see if we have consistency, and then it will give me something more concrete to look at to understand any underlying performance issues.

    Best regards,
    Brad

  • Hi

    Since this could be a hardware or system problem,

    I would prefer to get a benchmark project from TI with guaranteed and documented performance measurements.

    It can be any benchmark for general-purpose C/C++ code, using TI-RTOS tasks if possible.

    We will run it and confirm numbers.

    Thanks

    Rasty

  • Rasty,

    Can you tell me the values of some more registers?  Let's make absolutely sure that things are configured as expected...

    • 0x01840000 L2CFG
    • 0x01840020 L1PCFG
    • 0x01840040 L1DCFG
    • Memory Attribute Value Registers corresponding to each range of memory in use by your application

    Also there's a compiler option I'd like you to try.  Note that this could be "dangerous", so it's really intended more as a diagnostic and not as a long term solution.  Can you please try adding --no_bad_aliases to your compiler options?  That might cause things to break, but I'm hoping that it doesn't.  And furthermore, if you see a huge performance improvement, that will indicate that you might need to use "restrict" keyword more in function definitions to better inform the compiler on whether various pointers overlap/alias or not.  By default the compiler behaves safely and assumes that all pointers might overlap, and that causes serious restrictions on how much optimization can be achieved.  This could be a quick test to see if that's a big part of your issues.

    Best regards,
    Brad
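
To make the aliasing point concrete, here is a small illustration of the `restrict` keyword in C99. It is a hypothetical example, not code from the project: `scale_add` and its parameters are made up. The qualifiers are a promise from the programmer that the three arrays never overlap, which is what allows the compiler to reorder and pipeline the loads and stores. The project is C++, where the spelling and support for this qualifier depend on the compiler, so check the compiler documentation before using it there.

```c
#include <stddef.h>

/* out[i] += in[i] * coeff[i]
 *
 * Without restrict, the compiler must assume that a store to out[i]
 * could change in[] or coeff[], so each iteration has to wait for the
 * previous store. With restrict, the caller guarantees that the three
 * arrays do not overlap, so iterations can be overlapped freely. */
void scale_add(float *restrict out, const float *restrict in,
               const float *restrict coeff, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * coeff[i] + out[i];
    }
}
```

The promise must actually hold: calling such a function with overlapping buffers is undefined behaviour, which is the same reason the blanket compiler option is described as potentially dangerous.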

  • L2CFG = 0x00000003

    L1PCFG = 0x00000004

    L1DCFG = 0x00000004

    Memory Attribute Value Registers corresponding to each range of memory in use by your application

    Where can I see these?

    With best regards,

    Vadim

  • Vadim Malinovsky said:
    L2CFG = 0x00000003

    This corresponds to 128KB L2 cache.  Is that expected?  Your platform screenshot from your original post shows 256KB L2 cache.  Is this a recent change, perhaps related to the cache coherence?  The AM57xx DSPs have 288KB of L2 memory, so it's possible to have the maximum 256KB L2 cache and still have 32KB of RAM for other purposes.  Have you benchmarked the impact on your code of 128KB versus 256KB L2 cache?

    Vadim Malinovsky said:

    L1PCFG = 0x00000004

    L1DCFG = 0x00000004

    These are 32KB each as I expected.  I recommend keeping these values.

    Vadim Malinovsky said:
    Memory Attribute Value Registers corresponding to each range of memory in use by your application

    Please refer to the TMS320C66x DSP CorePac User Guide.  Specifically please see Table 4-20 Memory Attribute Registers.

    Best regards,
    Brad
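
The register values discussed in this reply can be decoded with a small helper. This is a sketch based on my understanding of the mode field encodings in the C66x CorePac User Guide (low three bits of each register); only the two values seen in this thread, 3 for 128KB of L2 and 4 for 32KB of L1, are confirmed here. The function names are hypothetical.

```c
#include <stdint.h>

/* Decode the L2 cache size in KB from L2CFG (mode field, bits 2:0).
 * Returns -1 for encodings this sketch does not decode. */
int l2_cache_kb(uint32_t l2cfg)
{
    static const int kb[7] = {0, 32, 64, 128, 256, 512, 1024};
    uint32_t mode = l2cfg & 0x7u;
    return (mode < 7u) ? kb[mode] : -1;
}

/* Decode the L1P or L1D cache size in KB from L1PCFG / L1DCFG
 * (mode field, bits 2:0). Returns -1 for encodings not decoded here. */
int l1_cache_kb(uint32_t l1cfg)
{
    static const int kb[5] = {0, 4, 8, 16, 32};
    uint32_t mode = l1cfg & 0x7u;
    return (mode < 5u) ? kb[mode] : -1;
}
```

Applied to the values posted above, L2CFG = 0x00000003 decodes to 128KB and L1PCFG = L1DCFG = 0x00000004 decode to 32KB each; a 256KB L2 cache would read back as 0x4 in the mode field.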

  • Yes, it is expected, because I followed your advice and used L2SRAM for the shared memory between a15_0 <=> dsp_0.

    Here is the new configuration:

    I had no success using the additional 32KB of memory. Is this additional space at the higher addresses (top) or the lower addresses (bottom)? I ask because the cache grows from the top,

    so I configured the cache to be 128KB and used the lower 128 KB for the shared memory and also for the code sections of critical functions, so that they run faster...

  • Vadim Malinovsky said:
    Yes, it is expected, because I followed your advice and used L2SRAM for the shared memory between a15_0 <=> dsp_0.

    Here's a quote of what I said:

    This strategy is reasonable if it's a very small chunk of shared memory.  If you have to reduce the L2-cache size in order to get enough  shared RAM then you're going to see performance degradation elsewhere.

    In other words, it is reasonable to use the 32KB of dedicated SRAM for this purpose, but I do not recommend reducing the L2 cache size.  Of course this is highly use case dependent, but in typical use cases where the bulk of code and data resides outside the DSP subsystem, you want to maximize the amount of L2 cache available for best performance.

    Vadim Malinovsky said:
    I had no success using the additional 32KB of memory. Is this additional space at the higher addresses (top) or the lower addresses (bottom)? I ask because the cache grows from the top

    You're correct that the cache always grows from the top (i.e. highest address to lowest address).  This keeps all your available L2 RAM in a contiguous block.  There is always 32KB of SRAM available at 0x0080_0000.
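
The layout described here can be worked out numerically. The sketch below assumes the figures given in this thread: 288KB of L2 in total, starting at 0x00800000 in the DSP's local address map, with the cache carved off the top. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define L2_BASE     0x00800000u  /* start of DSP L2, per the thread */
#define L2_TOTAL_KB 288u         /* total L2 on the AM57xx DSP, per the thread */

/* Bytes of L2 left as directly addressable SRAM for a given cache size. */
uint32_t l2_sram_bytes(unsigned cache_kb)
{
    assert(cache_kb <= 256u);    /* 256KB is the maximum L2 cache mentioned */
    return (L2_TOTAL_KB - cache_kb) * 1024u;
}

/* Last address of the SRAM region, which always starts at L2_BASE
 * because the cache grows downward from the top of L2. */
uint32_t l2_sram_last_addr(unsigned cache_kb)
{
    return L2_BASE + l2_sram_bytes(cache_kb) - 1u;
}
```

With a 256KB cache the SRAM spans 0x00800000 to 0x00807FFF (32KB); with a 128KB cache it spans 0x00800000 to 0x00827FFF (160KB).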

  • Ok

    Brad Griffis said:

    You're correct that the cache always grows from the top (i.e. highest address to lowest address).  This keeps all your available L2 RAM in a contiguous block.  There is always 32KB of SRAM available at 0x0080_0000.

    Thanks, I have now succeeded in using the whole 256KB of SRAM for cache and the lowest 32KB for the shared memory.

    I measured the difference between 128KB and 256KB of cache running the same piece of code: it is between 0 and 3 uSec.

  • Brad Griffis said:
    Can you please try adding --no_bad_aliases to your compiler options?

    Vadim,

    Can you please try the compiler option above?  It can make an order of magnitude difference in some code.

    Thanks,
    Brad

  • TI has provided VXLIB benchmark numbers.

    Please let us know if anything else is needed.