This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

CCS/CC3220SF-LAUNCHXL: Benchmarking CC3220SF -- update, please?

Part Number: CC3220SF-LAUNCHXL

Tool/software: Code Composer Studio

I am getting the same poor results found by Alexander at https://e2e.ti.com/support/wireless_connectivity/simplelink_wifi_cc31xx_cc32xx/f/968/t/629041.  Ben Moore (TI) promised an internal review and updated documentation. 

What is the status of this?

If it turns out that internal XIP flash on I-code/D-code is even worse than executing from external SRAM on sys bus, this is an embarrassing stumble by TI, and I regret to express my disappointment.

15 Replies

  • Cameron,

    Some testing on this has been performed, and we are still planning to update the datasheet with the values. Let me look into it further and see if I can get any more specific info. With the holidays, it will probably be next week before I can get any more info to you.

    Regards,

    VR
  • In reply to Vincent Rodriguez:

    Hi, Vincent,

    Further testing on my part reveals another disappointing, but consistent, result. Using the intrinsic __delay_cycles() function, the compiler inserts a little 3-cycle loop consisting of a subs instruction and a branch. Paragraph 5.8.2.3 of swas035A.pdf states that there is a 128-bit wide instruction prefetch buffer that allows "maximum performance for linear code or loops that fit inside the buffer." This loop is only 32 bits (two Thumb opcodes), so it must fit inside the buffer, if the buffer exists (unless there's an unlucky address-boundary issue). However, as noted before, the loop runs at less than half the expected 80 MHz rate.

    Here is the disassembly; it doesn't look like there should be an unlucky address-boundary problem within the loop. In fact, the entire loop (subs + bne) should have been fetched into the buffer in a single 32-bit request from flash (paragraph 5.8.2.3 states that "Reads and writes can be performed at word (32-bit) level." Does this mean that flash access is ONLY 32-bits?)

    1869 __delay_cycles(4000);
    010019d8: F2405034 movw r0, #0x534
    010019dc: F2C00000 movt r0, #0
    $1_$46:
    010019e0: 1E40 subs r0, r0, #1
    010019e2: D1FD bne $1_$46
    010019e4: BF00 nop
    010019e6: BF00 nop

    So even if the XIP flash memory is inexplicably slow, the alleged prefetch buffer would overcome this slowness for this loop, but it doesn't. This means the brokenness also affects the prefetch buffer.
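As a rough cross-check of the expectation above, the sketch below works out how long the loop should take at 80 MHz. The 3-cycles-per-iteration figure is the one quoted in the post, and the setup instructions and trailing nops are ignored. The measured duration (`measured_us`) is a hypothetical placeholder, since the thread only says "less than half the expected rate" and gives no exact timing.

```python
# Rough cycle budget for the __delay_cycles(4000) loop in the disassembly above.
CPU_HZ = 80_000_000        # nominal core clock stated in the thread
LOOP_COUNT = 0x534         # immediate loaded by "movw r0, #0x534"
CYCLES_PER_ITERATION = 3   # subs + taken bne, as stated in the post


def expected_loop_cycles(count=LOOP_COUNT, cpi=CYCLES_PER_ITERATION):
    """Approximate cycles spent in the subs/bne loop (setup and nops ignored)."""
    return count * cpi


def expected_duration_us(cycles, cpu_hz=CPU_HZ):
    """Microseconds the given cycle count should take at cpu_hz."""
    return cycles * 1_000_000 / cpu_hz


def effective_mhz(cycles, measured_us):
    """Effective execution rate (MHz) implied by a measured duration."""
    return cycles / measured_us


if __name__ == "__main__":
    print(expected_loop_cycles())          # 3996 cycles in the loop itself
    print(expected_duration_us(4000))      # 50.0 microseconds at 80 MHz
    measured_us = 110.0                    # hypothetical scope/timer reading
    print(round(effective_mhz(4000, measured_us), 1))
```

Timing the delay with a GPIO toggle or a hardware timer and feeding the result into `effective_mhz` gives a single number to compare against the nominal 80 MHz.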

    I look forward to hearing the explanation, and especially if there is a work-around to get the performance.

    Thanks,
    Cameron
  • In reply to cameron pike:

    I just updated to SDK 1.6, and associated service pack. I observe no change in performance. (I thought maybe the SDK would correct an erroneous value in a bus wait-state register.)

    When running CoreMark, I'm getting a solid 107 iterations/second while executing from flash, and 133 iterations/second while executing from SRAM. I was expecting something in excess of 200 iterations/second when executing from zero-wait-state flash on an 80 MHz Cortex-M4.

    2K performance run parameters for coremark.
    CoreMark Size : 666
    Total ticks : 56818
    Total time (secs): 56
    Iterations/Sec : 107
    Iterations : 6000
    Compiler version : TI v16.9.6.LTS
    Compiler flags : -mv7M4 --code_state=16 --float_support=vfplib -me -O4 --opt_for_speed=5 -g --define=NORTOS_SUPPORT
    Memory location : code in FLASH, data in SRAM
    seedcrc : 0xe9f5
    [0]crclist : 0xe714
    [0]crcmatrix : 0x1fd7
    [0]crcstate : 0x8e3a
    [0]crcfinal : 0xa14c
    Correct operation validated. See readme.txt for run and reporting rules.

    2K performance run parameters for coremark.
    CoreMark Size : 666
    Total ticks : 45772
    Total time (secs): 45
    Iterations/Sec : 133
    Iterations : 6000
    Compiler version : TI v16.9.6.LTS
    Compiler flags : -mv7M4 --code_state=16 --float_support=vfplib -me -O4 --opt_for_speed=5 -g --define=NORTOS_SUPPORT
    Memory location : code in SRAM, data in SRAM
    seedcrc : 0xe9f5
    [0]crclist : 0xe714
    [0]crcmatrix : 0x1fd7
    [0]crcstate : 0x8e3a
    [0]crcfinal : 0xa14c
    Correct operation validated. See readme.txt for run and reporting rules.
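The two reports above can be recomputed from their tick counts. The sketch below assumes 1 ms ticks (the project output later in this thread prints "microsecs per tick = 1000"). It also suggests, as an inference from the numbers rather than anything confirmed in the thread, that the printed Iterations/Sec values come from whole-second integer division, which slightly flatters both results.

```python
# Recompute CoreMark iterations/second from the tick counts reported above.
ITERATIONS = 6000
TICKS_PER_SECOND = 1000  # assumption: 1 ms per tick


def iterations_per_sec(iterations, total_ticks, ticks_per_second=TICKS_PER_SECOND):
    """Iterations per second using the full tick resolution."""
    return iterations * ticks_per_second / total_ticks


def truncated_iterations_per_sec(iterations, total_ticks,
                                 ticks_per_second=TICKS_PER_SECOND):
    """Iterations per second using whole seconds only (integer arithmetic)."""
    whole_seconds = total_ticks // ticks_per_second
    return iterations // whole_seconds


if __name__ == "__main__":
    for label, ticks in (("flash", 56818), ("SRAM", 45772)):
        exact = iterations_per_sec(ITERATIONS, ticks)
        coarse = truncated_iterations_per_sec(ITERATIONS, ticks)
        print(f"{label}: {exact:.1f} it/s from ticks, {coarse} it/s from whole seconds")
```

With full tick resolution the flash run works out to about 105.6 iterations/second and the SRAM run to about 131.1, so the gap between the two configurations is essentially unchanged.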

    Cameron
  • In reply to cameron pike:

    Hey, Vincent,

    What is the status of this?

    Since TI people did benchmarking last October, what was their finding? How does it compare with Alexander's measurements, and mine reported in this thread?

    Is there any hope of a work-around (bus control register change....)?

    Cameron
  • In reply to cameron pike:

    > I just updated to SDK 1.6, and associated service pack. I observe no change in performance. (I thought maybe the SDK would correct an erroneous value in a bus wait-state register.)

    Regarding wait-state registers: there is an interesting file in the SDK called hw_stack_die_ctrl.h, which defines addresses and offsets for some registers that probably control the on-chip flash and its timings. I tried experimenting with them, but without success; my best efforts only caused the CC3220 core to halt. If you have some spare time, try exploring that. Maybe you will have better luck :)

    Also, studying the datasheets for other TI Cortex microcontrollers gave me some clues about flash wait states and caches, but I have not tried any of this yet. In any case, I am still waiting for a proper answer from TI.
  • In reply to Alexander Podshivalov:

    Hi, Alexander,

    I spent some time this morning looking at this file, and others, sifting through documentation, and poking around with the debugger, but I was unable to find sufficient information to make any progress -- I would just be shooting in the dark.  

    Vincent, or Benjamin if you're watching, will you please procure an answer to our request? This matter is really quite important.

    regards,

    Cameron Pike

  • In reply to cameron pike:

    Hi, Vincent,

    I took some time today to run CoreMark on a Tiva board I have here (EK-TM4C1294XL Rev D). It can run at 120 MHz, but I ran it at 80 MHz to get a fair comparison. The TM4C1294 data sheet describes a memory arrangement similar to the CC3220's: flash on ICODE/DCODE, SRAM on the SYS bus. Here are the results:

    Tiva CoreMark v1.0: SysClock set to 80000000
    2K performance run parameters for coremark.
    CoreMark Size : 666
    Total ticks : 31245
    Total time (secs): 31
    Iterations/Sec : 193
    Iterations : 6000
    Compiler version : TI v16.9.3.LTS
    Compiler flags : -mv7M4 --code_state=16 --float_support=FPv4SPD16 --abi=eabi -me -O3 --opt_for_speed=5 -g --gcc --define=PART_TM4C1294NCPDT --define=TARGET_IS_TM4C129_RA0 --gen_func_subsections=on --ual
    Memory location : code in XIP FLASH, data in SRAM
    seedcrc : 0xe9f5
    [0]crclist : 0xe714
    [0]crcmatrix : 0x1fd7
    [0]crcstate : 0x8e3a
    [0]crcfinal : 0xa14c
    Correct operation validated. See readme.txt for run and reporting rules.

    I would expect the CC3220 to perform very close to this from XIP flash, but as we found earlier, it runs at only about half this speed.

    regards,

    Cameron Pike
  • In reply to cameron pike:

    By the way, I also ran from SRAM on the Tiva, and got a result nearly identical to those of the CC3220SF and CC3200 running from SRAM:

    Tiva CoreMark v1.0: SysClock set to 80000000
    2K performance run parameters for coremark.
    CoreMark Size : 666
    Total ticks : 43976
    Total time (secs): 43
    Iterations/Sec : 139
    Iterations : 6000
    Compiler version : TI v16.9.3.LTS
    Compiler flags : -mv7M4 --code_state=16 --float_support=FPv4SPD16 --abi=eabi -me -O3 --opt_for_speed=5 -g --gcc --define=PART_TM4C1294NCPDT --define=TARGET_IS_TM4C129_RA0 --gen_func_subsections=on --ual
    Memory location : code in SRAM, data in SRAM
    seedcrc : 0xe9f5
    [0]crclist : 0xe714
    [0]crcmatrix : 0x1fd7
    [0]crcstate : 0x8e3a
    [0]crcfinal : 0xa14c
    Correct operation validated. See readme.txt for run and reporting rules.
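To put the four runs reported in this thread side by side, the sketch below normalises the printed Iterations/Sec values to CoreMark/MHz and computes the ratios being discussed. All runs were at 80 MHz according to the posts; the dictionary labels are just illustrative names.

```python
# Normalise the Iterations/Sec figures reported in this thread to CoreMark/MHz.
CLOCK_MHZ = 80

# Values copied from the CoreMark reports above.
REPORTED = {
    "CC3220SF flash": 107,
    "CC3220SF SRAM": 133,
    "TM4C1294 flash": 193,
    "TM4C1294 SRAM": 139,
}


def coremark_per_mhz(iterations_per_sec, clock_mhz=CLOCK_MHZ):
    """CoreMark score divided by the core clock in MHz."""
    return iterations_per_sec / clock_mhz


def ratio(a, b):
    """Score of configuration a relative to configuration b."""
    return REPORTED[a] / REPORTED[b]


if __name__ == "__main__":
    for name, score in REPORTED.items():
        print(f"{name:15s} {coremark_per_mhz(score):.2f} CoreMark/MHz")
    print(f"CC3220SF flash vs TM4C1294 flash: {ratio('CC3220SF flash', 'TM4C1294 flash'):.2f}")
    print(f"CC3220SF SRAM vs TM4C1294 SRAM:  {ratio('CC3220SF SRAM', 'TM4C1294 SRAM'):.2f}")
```

The flash-to-flash ratio comes out near 0.55, while the SRAM-to-SRAM ratio is close to 1, which matches the observation that only the flash path on the CC3220SF is slow. The compiler versions and flags differ slightly between the two ports, so the comparison is approximate.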
  • In reply to cameron pike:

    Hi, All,

    I am attaching a CCSv7 project which is a port of Coremark for cc3220SF.  If any of you are able to repeat my results, please post on this thread. 

    Thank you,

    Cameron Pike

    coremark_v1.0_sdk_1.6_cc3220sf.zip

    Project to perform coremark test on cc3220sf. I ported to cc3220sf launchpad, using CCSv7 on Win7 / 64-bit.

    Instructions:

    1. Unzip this archive into your CCSv7 workspace

    2. Download coremark source archive from:
    www.eembc.org/.../download.php
        You will be asked to register as a user.
        
    3. Open the coremark archive, probably with 7zip. Select all the source files in the archive,
        ignoring the folders, and drag into the coremark_v1.0_sdk_1.6 directory.  (You may also want
        to drag the doc folder over, but the other folders are ports to specific platforms.)
        
    4. In CCSv7, import the project into the workspace, telling CCS to search the workspace for projects.

    5. Build and Debug.  Makefiles are ignored by CCS.  Output will show up in the CIO window of CCSv7.  
        I get the following output:

    [Cortex_M4_0] microsecs per tick = 1000
    2K performance run parameters for coremark.
    CoreMark Size    : 666
    Total ticks      : 56818
    Total time (secs): 56
    Iterations/Sec   : 107
    Iterations       : 6000
    Compiler version : TI v16.9.6.LTS
    Compiler flags   : -mv7M4 --code_state=16 --float_support=vfplib -me -O4 --opt_for_speed=5 -g --define=NORTOS_SUPPORT
    Memory location  : code in FLASH, data in SRAM
    seedcrc          : 0xe9f5
    [0]crclist       : 0xe714
    [0]crcmatrix     : 0x1fd7
    [0]crcstate      : 0x8e3a
    [0]crcfinal      : 0xa14c
    Correct operation validated. See readme.txt for run and reporting rules.

    6. You can change between code in SRAM or code in FLASH by alternating between the two linker command
        files in the project (CC3220SF_LAUNCHXL_NoRTOS.cmd, CC3200SF_LAUNCHXL_NoRTOS_sram.cmd).

        Note that the "Compiler flags", "Compiler version", and "Memory location"
        text printed on the console is NOT automatically generated, but is defined by
        macros in cc3220SF/core_portme.h.
       

    Kind Regards,

    Cameron Pike

  • In reply to cameron pike:

    Cameron, could you run the benchmark on a TM4C1294 with the flash prefetch buffers disabled, as described on p. 608 of the datasheet?
