CCS/CC3220SF-LAUNCHXL: Benchmarking CC3220SF -- update, please?

cameron pike

Part Number: CC3220SF-LAUNCHXL
Other Parts Discussed in Thread: EK-TM4C1294XL, TM4C1294NCPDT, CC3220SF, CC3200, CC2650

Tool/software: Code Composer Studio

I am getting the same poor results found by Alexander at https://e2e.ti.com/support/wireless_connectivity/simplelink_wifi_cc31xx_cc32xx/f/968/t/629041. Ben Moore (TI) promised an internal review and updated documentation.

What is the status of this?

If it turns out that internal XIP flash on I-code/D-code is even worse than executing from external SRAM on sys bus, this is an embarrassing stumble by TI, and I regret to express my disappointment.

over 6 years ago

0 Vincent Rodriguez over 6 years ago

TI__Mastermind 28285 points

Cameron,

Some testing on this has been performed, and we are still planning to update the datasheet with the values. Let me look into it further and see if I can get any more specific info. With it being the holidays, it will probably be next week before I can get any more info to you.

Regards,

VR

0 cameron pike over 6 years ago in reply to Vincent Rodriguez

Prodigy 145 points

Hi, Vincent,

Further testing on my part reveals another disappointing, but consistent result. Using the intrinsic __delay_cycles() function, the compiler inserts a little 3-cycle loop consisting of a subs instruction and a branch. Paragraph 5.8.2.3 of swas035A.pdf states that there is a 128-bit wide instruction prefetch buffer that allows "maximum performance for linear code or loops that fit inside the buffer." This loop is only 32 bits (two Thumb opcodes), so must fit inside the buffer, if it exists (unless there's an unlucky address-boundary issue). However, as noted before, the loop runs at less than half the expected 80 MHz rate.

Here is the disassembly; it doesn't look like there should be an unlucky address-boundary problem within the loop. In fact, the entire loop (subs + bne) should have been fetched into the buffer in a single 32-bit request from flash. (Paragraph 5.8.2.3 states that "Reads and writes can be performed at word (32-bit) level." Does this mean that flash access is ONLY 32 bits wide?)

1869 __delay_cycles(4000);
010019d8: F2405034 movw r0, #0x534
010019dc: F2C00000 movt r0, #0
$1_$46:
010019e0: 1E40 subs r0, r0, #1
010019e2: D1FD bne $1_$46
010019e4: BF00 nop
010019e6: BF00 nop

So even if the XIP flash memory is inexplicably slow, the alleged prefetch buffer should overcome this slowness for this loop, but it doesn't. This suggests that whatever is wrong also affects the prefetch buffer.

I look forward to hearing the explanation, and especially if there is a work-around to get the performance.

Thanks,
Cameron

0 cameron pike over 6 years ago in reply to cameron pike

Prodigy 145 points

I just updated to SDK 1.6, and associated service pack. I observe no change in performance. (I thought maybe the SDK would correct an erroneous value in a bus wait-state register.)

When running CoreMark, I'm getting a solid 107 iterations/second while executing from flash, and 133 iterations/second while executing from SRAM. I'm expecting something in excess of 200 iterations/second when executing from zero-wait-state flash on an 80 MHz Cortex-M4.

2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 56818
Total time (secs): 56
Iterations/Sec : 107
Iterations : 6000
Compiler version : TI v16.9.6.LTS
Compiler flags : -mv7M4 --code_state=16 --float_support=vfplib -me -O4 --opt_for_speed=5 -g --define=NORTOS_SUPPORT
Memory location : code in FLASH, data in SRAM
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.

2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 45772
Total time (secs): 45
Iterations/Sec : 133
Iterations : 6000
Compiler version : TI v16.9.6.LTS
Compiler flags : -mv7M4 --code_state=16 --float_support=vfplib -me -O4 --opt_for_speed=5 -g --define=NORTOS_SUPPORT
Memory location : code in SRAM, data in SRAM
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.

Cameron

0 cameron pike over 6 years ago in reply to cameron pike

Prodigy 145 points

Hey, Vincent,

What is the status of this?

Since TI people did benchmarking last October, what was their finding? How does it compare with Alexander's measurements, and mine reported in this thread?

Is there any hope of a work-around (bus control register change....)?

Cameron

0 Alexander Podshivalov over 6 years ago in reply to cameron pike

Intellectual 530 points

> I just updated to SDK 1.6, and associated service pack. I observe no change in performance. (I thought maybe the SDK would correct an erroneous value in a bus wait-state register.)

Regarding wait-state registers: there is an interesting file in the SDK called hw_stack_die_ctrl.h, which defines addresses and offsets for some registers that probably control the on-chip flash and its timings. I tried to play with them, but without success; my best efforts only caused the CC3220 core to halt. If you have some spare time, try to explore that. Maybe you will be luckier :)

Also, studying the datasheets for other TI Cortex microcontrollers gave me some clues about flash wait states and caches, but I have not tried any of this yet. In any case, I am still waiting for a proper answer from TI.

0 cameron pike over 6 years ago in reply to Alexander Podshivalov

Prodigy 145 points

Hi, Alexander,

I spent some time this morning looking at this file, and others, sifting through documentation, and poking around with the debugger, but I was unable to find sufficient information to make any progress -- I would just be shooting in the dark.

Vincent, or Benjamin if you're watching, will you please procure an answer to our request? This matter is really quite important.

regards,

Cameron Pike

0 cameron pike over 6 years ago in reply to cameron pike

Prodigy 145 points

Hi, Vincent,

I took some time today to run CoreMark on a Tiva board I have here (EK-TM4C1294XL Rev D). It will run at 120 MHz, but I ran it at 80 MHz to get a good comparison. The TM4C1294 data sheet describes a memory arrangement similar to the CC3220's: flash on ICODE/DCODE, SRAM on the SYS bus.... Here are the results:

Tiva CoreMark v1.0: SysClock set to 80000000
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 31245
Total time (secs): 31
Iterations/Sec : 193
Iterations : 6000
Compiler version : TI v16.9.3.LTS
Compiler flags : -mv7M4 --code_state=16 --float_support=FPv4SPD16 --abi=eabi -me -O3 --opt_for_speed=5 -g --gcc --define=PART_TM4C1294NCPDT --define=TARGET_IS_TM4C129_RA0 --gen_func_subsections=on --ual
Memory location : code in XIP FLASH, data in SRAM
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.

I would expect the CC3220 to perform very close to this from XIP flash but, as we found earlier, it runs at only about half this speed.

regards,

Cameron Pike

0 cameron pike over 6 years ago in reply to cameron pike

Prodigy 145 points

By the way, I also ran from SRAM on the Tiva, and got a result nearly the same as the CC3220SF and CC3200 running from SRAM:

Tiva CoreMark v1.0: SysClock set to 80000000
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 43976
Total time (secs): 43
Iterations/Sec : 139
Iterations : 6000
Compiler version : TI v16.9.3.LTS
Compiler flags : -mv7M4 --code_state=16 --float_support=FPv4SPD16 --abi=eabi -me -O3 --opt_for_speed=5 -g --gcc --define=PART_TM4C1294NCPDT --define=TARGET_IS_TM4C129_RA0 --gen_func_subsections=on --ual
Memory location : code in SRAM, data in SRAM
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.

0 cameron pike over 6 years ago in reply to cameron pike

Prodigy 145 points

Hi, All,

I am attaching a CCSv7 project which is a port of Coremark for cc3220SF. If any of you are able to repeat my results, please post on this thread.

Thank you,

Cameron Pike

coremark_v1.0_sdk_1.6_cc3220sf.zip

Project to perform coremark test on cc3220sf. I ported to cc3220sf launchpad, using CCSv7 on Win7 / 64-bit.

Instructions:

1. Unzip this archive into your CCSv7 workspace

2. Download coremark source archive from:
www.eembc.org/.../download.php
   You will be asked to register as a user.

3. Open the coremark archive, probably with 7zip. Select all the source files in the archive,
   ignoring the folders, and drag into the coremark_v1.0_sdk_1.6 directory. (You may also want
   to drag the doc folder over, but the other folders are ports to specific platforms.)

4. In CCSv7, import the project into the workspace, telling CCS to search the workspace for projects.

5. Build and Debug. Makefiles are ignored by CCS. Output will show up in the CIO window of CCSv7.
   I get the following output:

[Cortex_M4_0] microsecs per tick = 1000
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 56818
Total time (secs): 56
Iterations/Sec   : 107
Iterations       : 6000
Compiler version : TI v16.9.6.LTS
Compiler flags   : -mv7M4 --code_state=16 --float_support=vfplib -me -O4 --opt_for_speed=5 -g --define=NORTOS_SUPPORT
Memory location : code in FLASH, data in SRAM
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.

6. You can change between code in SRAM or code in FLASH by alternating between the two linker command
   files in the project (CC3220SF_LAUNCHXL_NoRTOS.cmd, CC3200SF_LAUNCHXL_NoRTOS_sram.cmd).

   Note that the text printed on the console for "Compiler flags", "Compiler version", and
   "Memory location" is NOT automatically generated; it is defined by macros in
   cc3220SF/core_portme.h.

Kind Regards,

Cameron Pike

0 Alexander Podshivalov over 6 years ago in reply to cameron pike

Intellectual 530 points

Cameron, could you run the benchmark on a TM4C1294 with flash prefetch buffers disabled, as described on p. 608 of the datasheet?

0 cameron pike over 6 years ago in reply to Alexander Podshivalov

Prodigy 145 points

Alexander,

Good suggestion. I get Coremark 86 when TM4C1294 is running 80 MHz with Prefetch Off, significantly less than CC3220SF (107). CC3220SF prefetch must be doing something good....

Granted, the Tiva prefetch is advertised to be very spiffy compared to CC3220SF (4 x 256 bit, vs. 1 x 128 bit).

Ben and Vincent, would you get us an answer for the poor performance of CC3220SF? Three months ago, on October 27, 2017, Ben said that TI had completed measurements. You said that the datasheet would need to be updated as a result of your findings, but I don't see any update yet.

kind regards,
Cameron Pike

0 Hnz over 6 years ago in reply to cameron pike

Guru 39650 points

Hi Cameron,

I have one suggestion for you. Please contact Josh Wyatt from TI directly ( https://e2e.ti.com/members/614 ). He is the head of the SimpleLink WiFi App team. Maybe he can help you forward this question to the right person.

BTW... this topic is very interesting to me as well.

Jan

0 Alexander Podshivalov over 6 years ago in reply to cameron pike

Intellectual 530 points

Are the optimisation settings the same for TM4C129 and CC3220SF? I was very surprised today when I unexpectedly got a CoreMark score of 128 for a CC3220SF running from flash, but that was due to the --no-size-constraints optimisation option being enabled in IAR :)

So the best scores for a CC3220 go as follows:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 3263
Total time (secs): 32.630000
Iterations/Sec   : 183.879865
Iterations       : 6000
Compiler version : IAR ARM 8.20.1
Compiler flags   : -Ohs --no-size-constraints
Memory location : Code in SRAM, data in SRAM
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 183.879865 / IAR ARM 8.20.1 -Ohs --no-size-constraints / Code in SRAM, data in SRAM

and

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 4678
Total time (secs): 46.780000
Iterations/Sec   : 128.259940
Iterations       : 6000
Compiler version : IAR ARM 8.20.1
Compiler flags   : -Ohs --no-size-constraints
Memory location : Code in FLASH, data in SRAM
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xa14c
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 128.259940 / IAR ARM 8.20.1 -Ohs --no-size-constraints / Code in FLASH, data in SRAM

0 Benjamin Moore over 6 years ago in reply to Alexander Podshivalov

TI__Mastermind 29085 points

Hi All,

We are working on getting this information formally released as part of our documentation. I will provide an update on E2E when this is completed.

Best Regards,
Ben M

0 cameron pike over 6 years ago in reply to Benjamin Moore

Prodigy 145 points

Hi, Ben,

Thanks for weighing in on this issue.

It seems that we can surmise the following:

1. The XIP flash on cc3220sf is slow, despite the fact that it is on the ICODE/DCODE bus, and the prefetch buffer is too small and improperly designed to compensate for this shortcoming. (e.g. Tiva 1294 has 4x256 bit, cc2650 has 4KB, etc., which provide very good results.)

2. TI cannot or doesn't intend to offer a work-around, register fix, etc., to improve the prefetch buffer performance. If they were going to, they would not have let us spin our wheels fruitlessly for four months.

3. Thus, the benefit of XIP on cc3220sf is simply its size (1 MB), which may be necessary for very large applications, or applications that need to devote all the SRAM to data. Its increased current consumption (~12 mA) and greatly reduced speed are the designer's cost for the large code/const space.

From my perspective, this constitutes a very bad design error on the part of TI, one that I sincerely hope will not be repeated. Having designed major (and minor) systems around TI DSPs and MPUs for over 25 years, C3x, C4x, C5x, C6x, OMAP, DM3xx, MSP43x, CCxx, deployed on land, sea, air, and space, I sincerely regret expressing my disappointment with TI on this issue.

regards,
Cameron Pike
