Help in generating Neon instructions (cortex-A8) and cache setting

Vivek Malhotra

Prodigy 235 points


Hi,

I'm using the Mistral EVM 8148 board for an automotive application. The tool chain is as follows:

av_bios_sdk_00_08_00_00

bios_6_34_02_18

ndk_2_21_01_38

ipc_1_25_00_04

I have distributed some of my image-processing algorithm onto the ARM Cortex-A8 core. I have set the following compiler options:

-mv7A8 --code_state=32 --abi=eabi -me -O3 --opt_for_speed=3 --diag_warning=225 --display_error_number --neon

My algorithm takes a very long time to execute, around 60 sec on the A8, whereas the same algorithm takes only 2 sec on the DSP. I have the following questions:

1> How do I make sure that the A8 cache is enabled? I have set the MMU attributes in the .cfg file as follows:

var attri = {
    type: Mmu.FirstLevelDesc_SECTION, // SECTION descriptor
    bufferable: false,                // bufferable
    cacheable: true,                  // cacheable
    accPerm: 3,                       // read or write permissions
};

/* configure the PRIVATE_DATA_CORE_HOS as cacheable */
for (var i = 0x81100000; i < 0x83400000; i = i + 0x100000)
{
    Mmu.setFirstLevelDescMeta(i, i, attri);
}

/* configure SHARED_FRAME_BUFFER */
for (var i = 0x88b00000; i < 0x8FF00000; i = i + 0x100000)
{
    Mmu.setFirstLevelDescMeta(i, i, attri);
}

var Cache = xdc.useModule('ti.sysbios.family.arm.a8.Cache');
Cache.enableCache = true;

Is this correct?
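
As a rough illustration of what these attributes mean at the hardware level, the sketch below encodes an ARMv7-A short-descriptor first-level section entry from the same three fields. It shows the architectural bit layout only and is not necessarily the exact value SYS/BIOS writes: the domain, TEX, and remaining bits are assumed to be zero, and the helper names make_section_desc and section_count are hypothetical. With TEX = 0 and TEX remap disabled, the architecture defines cacheable = true with bufferable = false as write-through and both set as write-back; how a particular core treats each type is described in its technical reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-A short-descriptor, first-level "section" entry (1 MB per entry).
 * Field positions are architectural; fields not listed are left at zero,
 * which is an assumption made for this illustration. */
#define DESC_TYPE_SECTION 0x2u        /* bits[1:0] = 0b10            */
#define DESC_B_BIT        (1u << 2)   /* B: bufferable               */
#define DESC_C_BIT        (1u << 3)   /* C: cacheable                */
#define DESC_AP_SHIFT     10          /* AP[1:0] lives in bits[11:10] */
#define SECTION_SIZE      0x100000u   /* 1 MB                        */

/* Hypothetical helper: build a section descriptor for a 1 MB-aligned address. */
static uint32_t make_section_desc(uint32_t phys_addr, int cacheable,
                                  int bufferable, uint32_t acc_perm)
{
    uint32_t desc;

    assert((phys_addr & (SECTION_SIZE - 1u)) == 0u); /* must be 1 MB aligned */
    assert(acc_perm <= 3u);                          /* AP is a 2-bit field  */

    desc = phys_addr | DESC_TYPE_SECTION | (acc_perm << DESC_AP_SHIFT);
    if (cacheable) {
        desc |= DESC_C_BIT;
    }
    if (bufferable) {
        desc |= DESC_B_BIT;
    }
    return desc;
}

/* Hypothetical helper: number of 1 MB sections covered by [start, end). */
static uint32_t section_count(uint32_t start, uint32_t end)
{
    return (end - start) / SECTION_SIZE;
}
```

With the ranges from the .cfg above, section_count(0x81100000, 0x83400000) gives 35 entries and section_count(0x88b00000, 0x8FF00000) gives 116 entries.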

2> No NEON instructions are generated in my code, although I have provided the --neon option. For example, the loop

for (i = 0; i < 999; i++)
{
    array[i] = i * i;
}

is generated in assembly as

;* --------------------------------------------------------------------------*
;* BEGIN LOOP ||$C$L2||
;*
;* Loop source line : 132
;* Loop closing brace source line : 136
;* Loop Unroll Multiple : 3x
;* Known Minimum Trip Count : 333
;* Known Maximum Trip Count : 333
;* Known Max Trip Count Factor : 333
;* --------------------------------------------------------------------------*
||$C$L2||:
$C$DW$L$SetupMessageQueue$8$B:
.dwpsn file "F:/Vivek/Projects/DPD/SVN/trunk/Source Code/DPM/Embedded/a8/src/DPD_IPC.c",line 133,column 0,is_stmt,isa 2
;** -----------------------g9:
;** 134 ----------------------- *(U$37 += 3) = _smulbb(i, i);
;** 134 ----------------------- C$11 = i+1;
;** 134 ----------------------- U$37[1] = _smulbb(C$11, C$11);
;** 134 ----------------------- C$10 = i+2;
;** 134 ----------------------- U$37[2] = _smulbb(C$10, C$10);
;** 132 ----------------------- if ( (i += 3) < 999 ) goto g9;
;** 138 ----------------------- return s8Error;
SMULBB LR, V9, V9 ; [DPU_8_PIPE0] |134|
ADD A3, V9, #1 ; [DPU_8_PIPE1] |134|
ADD A4, V9, #2 ; [DPU_8_PIPE0] |134|
ADD V9, V9, #3 ; [DPU_8_PIPE1] |132|
SMULBB A3, A3, A3 ; [DPU_8_PIPE0] |134|
CMP A1, V9 ; [DPU_8_PIPE1] |132|
SMULBB A4, A4, A4 ; [DPU_8_PIPE0] |134|
STR LR, [A2, #12]! ; [DPU_8_PIPE0] |134|
STR A3, [A2, #4] ; [DPU_8_PIPE0] |134|
STR A4, [A2, #8] ; [DPU_8_PIPE0] |134|

without any vector instructions.

Please let me know what is wrong with the settings.
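
If the auto-vectorizer does not pick a loop like this up, one fallback is to write the NEON operations explicitly with intrinsics. The following is a minimal sketch, assuming the toolchain provides arm_neon.h and defines one of the feature macros tested below (the __ARM_NEON names are the GCC/ACLE ones, and __TI_NEON_SUPPORT__ is the macro seen in the TI library sources; whether a given compiler version defines them should be checked). The function name fill_squares is hypothetical. On a build without NEON the code falls back to the plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one of these macros is defined when NEON code generation is on. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(__TI_NEON_SUPPORT__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out[i] = i * i for 0 <= i < n. */
static void fill_squares(int32_t *out, size_t n)
{
    size_t i = 0;

    assert(out != NULL || n == 0);

#if HAVE_NEON
    {
        /* Four 32-bit lanes hold the indices {i, i+1, i+2, i+3}. */
        static const int32_t lane_init[4] = { 0, 1, 2, 3 };
        int32x4_t idx = vld1q_s32(lane_init);
        const int32x4_t step = vdupq_n_s32(4);

        for (; i + 4 <= n; i += 4) {
            vst1q_s32(out + i, vmulq_s32(idx, idx)); /* square and store 4 lanes */
            idx = vaddq_s32(idx, step);              /* advance indices by 4     */
        }
    }
#endif

    /* Scalar tail (or the whole loop when NEON is unavailable). */
    for (; i < n; i++) {
        out[i] = (int32_t)(i * i);
    }
}
```

For the loop in the question this would be called as fill_squares(array, 999), assuming array holds 32-bit integers.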

Additional info

My output type is A8F (since I am using the NDK, for which there is no A8Fnv library)

compiler version is ARM 5.0.1

Library is rtsv7A8_T_le_n_v3_eabi.lib

Regards,

Vivek

over 11 years ago

0 judahvang over 11 years ago

TI__Mastermind 32475 points

Vivek,

I can answer the first question. Yes, what you did will enable the cache.

var Cache = xdc.useModule('ti.sysbios.family.arm.a8.Cache');
Cache.enableCache = true;

Judah

0 Vivek Malhotra over 11 years ago in reply to judahvang

Prodigy 235 points

Hi Judah,

Thanks for the reply. To be specific, I'm not sure whether simply writing

var Cache = xdc.useModule('ti.sysbios.family.arm.a8.Cache');
Cache.enableCache = true;

will enable the cache on the A8. On the DSP we set the MAR bits; on the A8 we have set the MMU memory attributes accordingly. I would like you to comment on whether this is done correctly.

My second question, regarding NEON instructions, is equally important, as the algorithm's execution time on the A8 is ridiculously slow.

Regards,

Vivek

0 Pavel Botev over 11 years ago in reply to Vivek Malhotra

TI__Guru**** 170625 points

Vivek,

Vivek Malhotra said:
My second question, regarding NEON instructions, is equally important, as the algorithm's execution time on the A8 is ridiculously slow.

Please have a look at the links below; they might help:

http://processors.wiki.ti.com/index.php/Using_NEON_and_VFPv3_on_Cortex-A8

http://processors.wiki.ti.com/index.php/Cortex-A8_Neon_Architecture

http://processors.wiki.ti.com/index.php/Cortex_A8#What_is_Neon.3F

http://www.arm.com/products/processors/technologies/neon.php

Best regards,
Pavel

0 Vivek Malhotra over 11 years ago in reply to Pavel Botev

Prodigy 235 points

Hi Pavel,

Thanks for the reply.

I had already gone through the links you provided. I'm using the option

NEON enabled without VFP

For that mode, I have set the compiler options given in my first post. Even so, not a single SIMD instruction is generated by the compiler. It would be great if you could try the example for loop I mentioned in my question. I'm not able to attach the .cfg file, so I am copying its contents below:

/* root of the configuration object model */
var Program = xdc.useModule('xdc.cfg.Program');
var Semaphore = xdc.useModule('ti.sysbios.knl.Semaphore');
var Task = xdc.useModule('ti.sysbios.knl.Task');
var GateHwi = xdc.useModule('ti.sysbios.gates.GateHwi');
var GateAll = xdc.useModule('ti.sysbios.gates.GateAll');
var GateMP = xdc.useModule('ti.sdo.ipc.GateMP');
var MessageQ = xdc.useModule('ti.sdo.ipc.MessageQ');
var Notify = xdc.useModule('ti.sdo.ipc.Notify');
var SharedRegion = xdc.useModule('ti.sdo.ipc.SharedRegion');
var BIOS = xdc.useModule('ti.sysbios.BIOS');
var MultiProc = xdc.useModule('ti.sdo.utils.MultiProc');
var HeapBufMP = xdc.useModule('ti.sdo.ipc.heaps.HeapBufMP');
var HeapMemMP = xdc.useModule('ti.sdo.ipc.heaps.HeapMemMP');
var Ipc = xdc.useModule('ti.sdo.ipc.Ipc');
var ti_sysbios_hal_Cache = xdc.useModule('ti.sysbios.hal.Cache');
var Notify = xdc.useModule('ti.sdo.ipc.Notify');
var Clock = xdc.useModule('ti.sysbios.knl.Clock');
var Timer = xdc.useModule('ti.sysbios.hal.Timer');
var Swi = xdc.useModule('ti.sysbios.knl.Swi');
var Idle = xdc.useModule('ti.sysbios.knl.Idle');
var Memory = xdc.useModule('xdc.runtime.Memory');
var Startup = xdc.useModule('xdc.runtime.Startup');
var System = xdc.useModule('xdc.runtime.System');
var Cache = xdc.useModule('ti.sysbios.family.arm.a8.Cache');
var ti_sysbios_family_arm_a8_intcps_Hwi = xdc.useModule('ti.sysbios.family.arm.a8.intcps.Hwi');
var Mmu = xdc.useModule('ti.sysbios.family.arm.a8.Mmu');
var TimestampProvider = xdc.useModule('ti.sysbios.family.arm.a8.TimestampProvider');
var GateMutex = xdc.useModule('ti.sysbios.gates.GateMutex');
var GateMutexPri = xdc.useModule('ti.sysbios.gates.GateMutexPri');
var GateSwi = xdc.useModule('ti.sysbios.gates.GateSwi');
var GateTask = xdc.useModule('ti.sysbios.gates.GateTask');
var ti_sysbios_timers_dmtimer_Timer = xdc.useModule('ti.sysbios.timers.dmtimer.Timer');
var MessageQ = xdc.useModule('ti.sdo.ipc.MessageQ');
var Mailbox = xdc.useModule('ti.sysbios.knl.Mailbox');
var HeapMem = xdc.useModule('ti.sysbios.heaps.HeapMem');
var Global = xdc.useModule('ti.ndk.config.Global');
var Tcp = xdc.useModule('ti.ndk.config.Tcp');
var Telnet = xdc.useModule('ti.ndk.config.Telnet');
var Http = xdc.useModule('ti.ndk.config.Http');
var Ip = xdc.useModule('ti.ndk.config.Ip');
var Global = xdc.useModule('ti.ndk.config.Global');
var Udp = xdc.useModule('ti.ndk.config.Udp');
Global.IPv6 = false;
var Mmu = xdc.useModule('ti.sysbios.family.arm.a8.Mmu');

Cache.enableCache = true;
/* Configure MMU to access the peripheral register space */

// descriptor attribute structure
var attrs = {
    type: Mmu.FirstLevelDesc_SECTION, // SECTION descriptor
    bufferable: false,                // bufferable
    cacheable: false,                 // cacheable
    accPerm: 3,                       // read or write permissions
};
// Each 'SECTION' descriptor entry spans a 1MB address range
Mmu.setFirstLevelDescMeta(0x4A100000, 0x4A100000, attrs);
Mmu.setFirstLevelDescMeta(0x48140000, 0x48140000, attrs);

var Mmu = xdc.useModule('ti.sysbios.family.arm.a8.Mmu');
Mmu.enableMMU = true;

var attri = {
    type: Mmu.FirstLevelDesc_SECTION, // SECTION descriptor
    bufferable: false,                // bufferable
    cacheable: true,                  // cacheable
    accPerm: 3,                       // read or write permissions
};

/* configure the EDMA - TPTC memory range */
for (var i = 0x49800000; i < 0x49BFFFFF; i = i + 0x100000)
{
    Mmu.setFirstLevelDescMeta(i, i, attrs);
}

/* configure the PRIVATE_DATA_CORE_HOS as cacheable */
for (var i = 0x81100000; i < 0x83400000; i = i + 0x100000)
{
    Mmu.setFirstLevelDescMeta(i, i, attri);
}

/* configure SHARED_FRAME_BUFFER */
for (var i = 0x88b00000; i < 0x8FF00000; i = i + 0x100000)
{
    Mmu.setFirstLevelDescMeta(i, i, attri);
}

/* configure the shared memory as cacheable */
/*for (var i = 0x88b00000; i < 0x8FF00000; i = i + 0x100000)
{
    attrs.bufferable = false;
    attrs.cacheable = true;
    Mmu.setFirstLevelDescMeta(i, i, attrs);
} */

Program.sectMap[".text"] = "CODE_CORE_HOST";
/* Place the MMU table in DDR3 */
var sectionName = "ti.sysbios.family.arm.a8.mmuTableSection";
Program.sectMap[sectionName] = new Program.SectionSpec();
Program.sectMap[sectionName].type = "NOINIT";
Program.sectMap[sectionName].loadSegment = "CODE_CORE_HOST"
/*
* Don't generate any code, this example shows how to code the NDK stack thread
* 'StackTest' manually.
*/
Global.enableCodeGeneration = false;
var procNameAry = MultiProc.getDeviceProcNames();
MultiProc.baseIdOfCluster = 0;
MultiProc.numProcessors = 2;
MultiProc.setConfig("HOST", ["HOST", "DSP"]);
var hostId = MultiProc.getIdMeta("HOST");
Ipc.procSync = Ipc.ProcSync_PAIR;
Ipc.sr0MemorySetup = true;
SharedRegion.setEntryMeta(0,
new SharedRegion.Entry({
name: "IPC_Internal",
base: 0x88100000,
len: 0x00A00000,
ownerProcId: MultiProc.getIdMeta("HOST"),
cacheEnable: true,
isValid: true,
createHeap: true
})
);

//Program.stack = 8192;
Program.stack = 10240;
BIOS.heapSize = 40960;
Program.heap = 4096;
SharedRegion.translate = false;
SharedRegion.numEntries = 1;
MessageQ.traceFlag = true;
Task.defaultStackSize = 2048;
Global.normTaskStackSize = 2048;
Program.sysStack = 4096;
var heapMem0Params = new HeapMem.Params();
heapMem0Params.instance.name = "ARM_Heap";
heapMem0Params.size = 35651584;
heapMem0Params.align = 128;
heapMem0Params.minBlockAlign = 128;
heapMem0Params.sectionName = ".ArmHeap";
Program.global.ARM_Heap = HeapMem.create(heapMem0Params);

Regards,

Vivek

0 Vivek Malhotra over 11 years ago in reply to Vivek Malhotra

Prodigy 235 points

Hi Pavel,

While going through the compiler libraries under C:\ti\ccsv5\tools\compiler\arm_5.0.1\lib, I found the file mklib.c, which has options for generating the RTS library for the ARM Cortex-A8 with NEON support. The following is part of the content of this file:

library_t LIBRARIES[] = {
{ "rtsv7A8_T_le_n_xo_eabi_eh.lib", { "CORTEX_THUMB","V7A8","_32ASMFUNCS","AEABI","EXCEPTIONS","XO","EABI","_T2ASMFUNCS","_16_DUAL_IND_CALL","LITTLE_ENDIAN","EABI_TDEH","THUMB","CORTEX","NEON" } },
{ "rtsv7R4_A_be_xo_eabi.lib", { "_16ASMFUNCS","_32ASMFUNCS","V7R4","_16_DUAL_IND_CALL","AEABI","XO","CORTEX","EABI" } },
{ "rtsv7A8_T_le_xo_tiarm9_eh.lib", { "CORTEX_THUMB","FULL_PORTABLE_EH","V7A8","_32ASMFUNCS","EXCEPTIONS","XO","TI_ARM9_ABI","_T2ASMFUNCS","_16_DUAL_IND_CALL","LITTLE_ENDIAN","THUMB","CORTEX" } },
{ "rtsv7A8_A_be_v3_eabi_eh.lib", { "_16ASMFUNCS","V7A8","_32ASMFUNCS","AEABI_VFP","VFPV3","EXCEPTIONS","AEABI","EABI","VFP","_16_DUAL_IND_CALL","EABI_TDEH","CORTEX" } },
{ "rtsv4_A_be_tiarm9.lib", { "_16ASMFUNCS","_32ASMFUNCS","_16_DUAL_IND_CALL","TI_ARM9_ABI" } },
{ "rts32.lib", { "_16ASMFUNCS","_32ASMFUNCS","_16_DUAL_IND_CALL","TI_ARM9_ABI" } },
{ "rtsv7R4_T_le_tiarm9.lib", { "CORTEX_THUMB","_32ASMFUNCS","TI_ARM9_ABI","_T2ASMFUNCS","V7R4","LITTLE_ENDIAN","_16_DUAL_IND_CALL","THUMB","CORTEX" } },
{ "rtsv7A8_T_be_n_xo_eabi_eh.lib", { "CORTEX_THUMB","V7A8","_32ASMFUNCS","AEABI","EXCEPTIONS","XO","EABI","_T2ASMFUNCS","_16_DUAL_IND_CALL","EABI_TDEH","THUMB","CORTEX","NEON" } },
{ "rtsv7A8_A_be_xo_tiarm9.lib", { "_16ASMFUNCS","V7A8","_32ASMFUNCS","_16_DUAL_IND_CALL","XO","CORTEX","TI_ARM9_ABI" } },
{ "rtsv7A8_A_le_n_eabi_eh.lib", { "_16ASMFUNCS","V7A8","_32ASMFUNCS","AEABI","EXCEPTIONS","EABI","_16_DUAL_IND_CALL","LITTLE_ENDIAN","EABI_TDEH","CORTEX","NEON" } },
{ "rtsv7A8_A_le_n_eabi.lib", { "_16ASMFUNCS","V7A8","_32ASMFUNCS","AEABI","EABI","_16_DUAL_IND_CALL","LITTLE_ENDIAN","CORTEX","NEON" } },

I'm using the library "rtsv7A8_A_le_n_eabi.lib", but even then no NEON instructions are generated. I went through the .asm files used to build this library and found that some of them test the macros __TI_VFP_SUPPORT__ | __TI_NEON_SUPPORT__. I'm not sure whether I need to rebuild the library with these macros enabled to get SIMD instructions for the A8. If you can share any such compiler library, I would appreciate it. Also, let me know whether I'm thinking in the right direction.

Regards,

Vivek

0 K Md Feroz Irfan over 11 years ago in reply to Vivek Malhotra

TI__Expert 7225 points

Hi Vivek,

Please note that NEON code is generated only when the C code lends itself to it; the compiler decides based on the C code used.

Coming to cache enabling, can you please check the ARM reference manual to verify whether the cache has indeed been enabled? I checked online for the register, and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Bgbciiaf.html seems to have the info.
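
One way to perform this check at run time is to read the CP15 c1 Control Register (SCTLR) and look at its enable bits: M (bit 0) for the MMU, C (bit 2) for the data cache, and I (bit 12) for the instruction cache. The sketch below is a hedged illustration, not SYS/BIOS code: decode_sctlr and read_sctlr are hypothetical names, the inline assembly uses GCC syntax (the TI compiler would need its own mechanism for issuing the MRC), and the read only works in a privileged mode.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit positions in the ARMv7-A System Control Register. */
#define SCTLR_M (1u << 0)   /* MMU enable               */
#define SCTLR_C (1u << 2)   /* data cache enable        */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

typedef struct {
    int mmu_on;
    int dcache_on;
    int icache_on;
} cache_state_t;

/* Hypothetical helper: decode the enable bits from a raw SCTLR value. */
static cache_state_t decode_sctlr(uint32_t sctlr)
{
    cache_state_t s;
    s.mmu_on    = (sctlr & SCTLR_M) != 0u;
    s.dcache_on = (sctlr & SCTLR_C) != 0u;
    s.icache_on = (sctlr & SCTLR_I) != 0u;
    return s;
}

#if defined(__GNUC__) && defined(__arm__)
/* Assumption: GCC-style inline assembly, executed in a privileged mode. */
static uint32_t read_sctlr(void)
{
    uint32_t value;
    __asm__ volatile ("mrc p15, 0, %0, c1, c0, 0" : "=r"(value));
    return value;
}
#endif
```

On the target, decode_sctlr(read_sctlr()) would then report whether the MMU and both L1 caches are enabled. The data cache is only effective for regions that the MMU tables mark as cacheable, so both the C bit and the section attributes matter.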

Also, the DSP is a pipelined architecture and so may be more efficient than the ARM. You may also take another look at your code and compiler settings to see whether that helps further.

Best Regards

Feroz

0 Vivek Malhotra over 11 years ago in reply to K Md Feroz Irfan

Prodigy 235 points

Hi Feroz,

I agree with you, but as you can see in my post, I mentioned a simple for loop:

for (i = 0; i < 999; i++)
{
    array[i] = i * i;
}

Here array is a restrict-qualified pointer (I have tried that). Such simple code should be vectorized for NEON. Similarly, my algorithm is optimized on the DSP using restrict and so on. If the same algorithm takes < 5 sec on the DSP, how is 60 sec on the A8 justified?

I will check the cache settings using the link you provided and get back to you.

Regards,

Vivek

0 Pavel Botev over 11 years ago

TI__Guru**** 170625 points

Vivek,

Vivek Malhotra said:

I'm using Mistral EVM 8148 board for automotive following is the tool chain

av_bios_sdk_00_08_00_00

Could you please provide more details on this AV BIOS SDK and tool chain?

I am working on DM814x EZSDK with CodeSourcery Lite GCC tool chain:

http://www.ti.com/tool/linuxezsdk-davinci

http://software-dl.ti.com/dsps/dsps_public_sw/ezsdk/latest/index_FDS.html

What is the frequency of your DSP and Cortex-A8 ARM?

I will try to reproduce your problem on my DM8148 EVM with EZSDK.

Regards,
Pavel

0 Vivek Malhotra over 11 years ago in reply to Pavel Botev

Prodigy 235 points

Hi Pavel,

I think it's a good idea to try and see whether SIMD instructions for NEON are generated in your setup. Please use the for loop I mentioned in my post so that we have a common reference.

Regarding the tool chain, we are using CCS 5.5, the ARM 5.0.1 compiler, and SYS/BIOS on Windows:

av_bios_sdk_00_08_00_00

bios_6_34_02_18

ndk_2_21_01_38

ipc_1_25_00_04

The DSP frequency is 500 MHz and the A8 frequency is 600 MHz.

Regards,

Vivek

0 Pavel Botev over 11 years ago in reply to Vivek Malhotra

TI__Guru**** 170625 points

Vivek,

I will try on my side, but keep in mind that I am using the CodeSourcery GCC ARM compiler, while you are using the TI ARM C/C++ compiler.

http://processors.wiki.ti.com/index.php/Using_NEON_and_VFPv3_on_Cortex-A8

In the above wiki page (which is for TI ARM C/C++ compiler, not GCC), we have:

The --neon option instructs the compiler to automatically vectorize loops to use the NEON instructions. To get benefit from this option you should be using --opt_level=2 or higher and be generating code for performance by using the --opt_for_speed=[3-5] option.

NEON enabled without VFP

In this mode the compiler will generate NEON instructions for SIMD integer operations. It will not generate NEON instructions to vectorize floating point operations. The motivation for not allowing floating point NEON instructions if VFP is not enabled is because it is possible to have an integer only variant of NEON implemented. In order for the NEON unit to support floating point operations the VFPv3 coprocessor must be present.

Can you try with these compiler options (--neon --opt_level=2 --opt_for_speed=5)? Does it make any difference?

Please note that we have a special E2E forum for the TI ARM C/C++ compiler:

http://e2e.ti.com/support/development_tools/compiler/f/343.aspx

I will check (in parallel) with the team there, and see if they can help here.

Regards,
Pavel

0 Vivek Malhotra over 11 years ago in reply to Pavel Botev

Prodigy 235 points

Hi Pavel,

When I posted this issue, I had tried the following compiler options:

-mv7A8 --code_state=32 --abi=eabi -me -O3 --opt_for_speed=3 --diag_warning=225 --display_error_number --neon

I'm on leave for a week and unfortunately do not have the tool chain with me to check with --opt_for_speed=5.

I will be able to reply to you by Thursday next week. In the meantime, if you learn anything about either the CodeSourcery GCC ARM compiler or the TI ARM compiler, please let me know.

Regards,

Vivek

0 Pavel Botev over 11 years ago in reply to Vivek Malhotra

TI__Guru**** 170625 points

Vivek,

When trying, do not forget to also use the --opt_level=2 option:

-mv7A8 --code_state=32 --abi=eabi -me -O3 --opt_level=2 --opt_for_speed=5 --diag_warning=225 --display_error_number --neon

The --opt_level=2 option is stated as "should be used" in the wiki page I referred to in my previous post.

Regards,
Pavel

0 Pavel Botev over 11 years ago in reply to Pavel Botev

TI__Guru**** 170625 points

Vivek,

I tried this loop, for(i=0; i<999; i++) array[i] = i*i;, and it runs in less than a second on my DM8148 EVM Cortex-A8 with EZSDK.

Regarding NEON instructions, I compiled the C file containing the loop with the GCC command below:

arm-none-linux-gnueabi-gcc -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp -o for_loop for_loop.c

But no NEON instructions are generated when I disassemble the resulting executable (ELF 32-bit LSB executable):

$ arm-none-linux-gnueabi-objdump -d for_loop

for_loop:     file format elf32-littlearm

Disassembly of section .init:

000082e0 <_init>:
    82e0:    e92d4010     push    {r4, lr}
    82e4:    eb00001c     bl    835c <call_gmon_start>
    82e8:    e8bd8010     pop    {r4, pc}

Disassembly of section .plt:

000082ec <.plt>:
    82ec:    e52de004     push    {lr}        ; (str lr, [sp, #-4]!)
    82f0:    e59fe004     ldr    lr, [pc, #4]    ; 82fc <_init+0x1c>
    82f4:    e08fe00e     add    lr, pc, lr
    82f8:    e5bef008     ldr    pc, [lr, #8]!
    82fc:    0000830c     .word    0x0000830c
    8300:    e28fc600     add    ip, pc, #0    ; 0x0
    8304:    e28cca08     add    ip, ip, #32768    ; 0x8000
    8308:    e5bcf30c     ldr    pc, [ip, #780]!
    830c:    e28fc600     add    ip, pc, #0    ; 0x0
    8310:    e28cca08     add    ip, ip, #32768    ; 0x8000
    8314:    e5bcf304     ldr    pc, [ip, #772]!
    8318:    e28fc600     add    ip, pc, #0    ; 0x0
    831c:    e28cca08     add    ip, ip, #32768    ; 0x8000
    8320:    e5bcf2fc     ldr    pc, [ip, #764]!

Disassembly of section .text:

00008324 <_start>:
    8324:    e59fc024     ldr    ip, [pc, #36]    ; 8350 <_start+0x2c>
    8328:    e3a0b000     mov    fp, #0    ; 0x0
    832c:    e49d1004     pop    {r1}        ; (ldr r1, [sp], #4)
    8330:    e1a0200d     mov    r2, sp
    8334:    e52d2004     push    {r2}        ; (str r2, [sp, #-4]!)
    8338:    e52d0004     push    {r0}        ; (str r0, [sp, #-4]!)
    833c:    e59f0010     ldr    r0, [pc, #16]    ; 8354 <_start+0x30>
    8340:    e59f3010     ldr    r3, [pc, #16]    ; 8358 <_start+0x34>
    8344:    e52dc004     push    {ip}        ; (str ip, [sp, #-4]!)
    8348:    ebffffef     bl    830c <_init+0x2c>
    834c:    ebffffeb     bl    8300 <_init+0x20>
    8350:    0000843c     .word    0x0000843c
    8354:    000083cc     .word    0x000083cc
    8358:    00008440     .word    0x00008440

0000835c <call_gmon_start>:
    835c:    e59f3014     ldr    r3, [pc, #20]    ; 8378 <call_gmon_start+0x1c>
    8360:    e59f2014     ldr    r2, [pc, #20]    ; 837c <call_gmon_start+0x20>
    8364:    e08f3003     add    r3, pc, r3
    8368:    e7931002     ldr    r1, [r3, r2]
    836c:    e3510000     cmp    r1, #0    ; 0x0
    8370:    012fff1e     bxeq    lr
    8374:    eaffffe7     b    8318 <_init+0x38>
    8378:    0000829c     .word    0x0000829c
    837c:    00000018     .word    0x00000018

00008380 <__do_global_dtors_aux>:
    8380:    e59f2010     ldr    r2, [pc, #16]    ; 8398 <__do_global_dtors_aux+0x18>
    8384:    e5d23000     ldrb    r3, [r2]
    8388:    e3530000     cmp    r3, #0    ; 0x0
    838c:    03a03001     moveq    r3, #1    ; 0x1
    8390:    05c23000     strbeq    r3, [r2]
    8394:    e12fff1e     bx    lr
    8398:    0001062c     .word    0x0001062c

0000839c <frame_dummy>:
    839c:    e59f0020     ldr    r0, [pc, #32]    ; 83c4 <frame_dummy+0x28>
    83a0:    e92d4010     push    {r4, lr}
    83a4:    e5903000     ldr    r3, [r0]
    83a8:    e3530000     cmp    r3, #0    ; 0x0
    83ac:    08bd8010     popeq    {r4, pc}
    83b0:    e59f3010     ldr    r3, [pc, #16]    ; 83c8 <frame_dummy+0x2c>
    83b4:    e3530000     cmp    r3, #0    ; 0x0
    83b8:    08bd8010     popeq    {r4, pc}
    83bc:    e12fff33     blx    r3
    83c0:    e8bd8010     pop    {r4, pc}
    83c4:    00010514     .word    0x00010514
    83c8:    00000000     .word    0x00000000

000083cc <main>:
    83cc:    e52db004     push    {fp}        ; (str fp, [sp, #-4]!)
    83d0:    e28db000     add    fp, sp, #0    ; 0x0
    83d4:    e24ddefb     sub    sp, sp, #4016    ; 0xfb0
    83d8:    e24dd004     sub    sp, sp, #4    ; 0x4
    83dc:    e3a03000     mov    r3, #0    ; 0x0
    83e0:    e50b3008     str    r3, [fp, #-8]
    83e4:    ea00000d     b    8420 <main+0x54>
    83e8:    e51b1008     ldr    r1, [fp, #-8]
    83ec:    e51b2008     ldr    r2, [fp, #-8]
    83f0:    e51b3008     ldr    r3, [fp, #-8]
    83f4:    e0000293     mul    r0, r3, r2
    83f8:    e30f305c     movw    r3, #61532    ; 0xf05c
    83fc:    e34f3fff     movt    r3, #65535    ; 0xffff
    8400:    e1a02101     lsl    r2, r1, #2
    8404:    e24b1004     sub    r1, fp, #4    ; 0x4
    8408:    e0812002     add    r2, r1, r2
    840c:    e0823003     add    r3, r2, r3
    8410:    e5830000     str    r0, [r3]
    8414:    e51b3008     ldr    r3, [fp, #-8]
    8418:    e2833001     add    r3, r3, #1    ; 0x1
    841c:    e50b3008     str    r3, [fp, #-8]
    8420:    e51b2008     ldr    r2, [fp, #-8]
    8424:    e30033e6     movw    r3, #998    ; 0x3e6
    8428:    e1520003     cmp    r2, r3
    842c:    daffffed     ble    83e8 <main+0x1c>
    8430:    e28bd000     add    sp, fp, #0    ; 0x0
    8434:    e8bd0800     pop    {fp}
    8438:    e12fff1e     bx    lr

0000843c <__libc_csu_fini>:
    843c:    e12fff1e     bx    lr

00008440 <__libc_csu_init>:
    8440:    e92d47f0     push    {r4, r5, r6, r7, r8, r9, sl, lr}
    8444:    e1a08001     mov    r8, r1
    8448:    e1a07002     mov    r7, r2
    844c:    e1a0a000     mov    sl, r0
    8450:    ebffffa2     bl    82e0 <_init>
    8454:    e59f1044     ldr    r1, [pc, #68]    ; 84a0 <__libc_csu_init+0x60>
    8458:    e59f3044     ldr    r3, [pc, #68]    ; 84a4 <__libc_csu_init+0x64>
    845c:    e59f2044     ldr    r2, [pc, #68]    ; 84a8 <__libc_csu_init+0x68>
    8460:    e0613003     rsb    r3, r1, r3
    8464:    e08f2002     add    r2, pc, r2
    8468:    e1b05143     asrs    r5, r3, #2
    846c:    e0822001     add    r2, r2, r1
    8470:    08bd87f0     popeq    {r4, r5, r6, r7, r8, r9, sl, pc}
    8474:    e1a06002     mov    r6, r2
    8478:    e3a04000     mov    r4, #0    ; 0x0
    847c:    e1a0000a     mov    r0, sl
    8480:    e1a01008     mov    r1, r8
    8484:    e1a02007     mov    r2, r7
    8488:    e1a0e00f     mov    lr, pc
    848c:    e796f104     ldr    pc, [r6, r4, lsl #2]
    8490:    e2844001     add    r4, r4, #1    ; 0x1
    8494:    e1540005     cmp    r4, r5
    8498:    3afffff7     bcc    847c <__libc_csu_init+0x3c>
    849c:    e8bd87f0     pop    {r4, r5, r6, r7, r8, r9, sl, pc}
    84a0:    ffffff04     .word    0xffffff04
    84a4:    ffffff08     .word    0xffffff08
    84a8:    0000819c     .word    0x0000819c

Disassembly of section .fini:

000084ac <_fini>:
    84ac:    e92d4010     push    {r4, lr}
    84b0:    e8bd8010     pop    {r4, pc}

I will continue to investigate why no NEON instructions are generated.
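
One plausible explanation, offered as an observation rather than a confirmed diagnosis: the GCC command line above has no -O option, and the disassembly of main reloads i from the stack on every iteration, which is what unoptimized GCC output looks like. GCC applies -ftree-vectorize only when optimization is enabled, so adding -O2 or -O3 is worth trying. The sketch below is a hypothetical for_loop.c arranged so that the result is used and cannot be discarded by the optimizer; the function names and the compile line in the comment are assumptions and have not been checked against this CodeSourcery release.

```c
#include <assert.h>
#include <stddef.h>

/* Suggested compile line (assumption: same flags as in the post, plus -O2):
 *
 *   arm-none-linux-gnueabi-gcc -O2 -march=armv7-a -mtune=cortex-a8
 *       -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -c for_loop.c
 *
 * Disassemble the object file afterwards and look for VMUL / VST1. */

/* Hypothetical helper: out[i] = i * i, written as a simple counted loop
 * with no aliasing inputs so the vectorizer has an easy candidate. */
void square_indices(int *out, size_t n)
{
    size_t i;

    assert(out != NULL || n == 0);
    for (i = 0; i < n; i++) {
        out[i] = (int)(i * i);
    }
}

/* Hypothetical helper: consume the array so the stores are not dead code. */
long sum_ints(const int *a, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}
```

Keeping the loop in its own non-static function, rather than inside main with a local array that is never read, also prevents the compiler from removing the whole computation once optimization is switched on.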

Regards,
Pavel

0 Pavel Botev over 11 years ago in reply to Pavel Botev

TI__Guru**** 170625 points

I have also found these two links. Please have a look; they might help:

http://e2e.ti.com/support/development_tools/compiler/f/343/t/271747.aspx

http://infocenter.arm.com/help/topic/com.arm.doc.dht0002a/DHT0002A_introducing_neon.pdf

Regards,
Pavel

0 Pavel Botev over 11 years ago in reply to Pavel Botev

TI__Guru**** 170625 points

Vivek,

I also tried with a pure CCStudio project: no EZSDK, no SYS/BIOS.

I am using CCS 5.4.0 running on a Linux Ubuntu PC. The target configuration is for the DM8148 EVM, using the default DM8148 GEL file. The C file (main.c) has the source code below:

int main(void) {

    int a[200], b[200], c[200];
    int i;

    for (i = 0; i < 200; i++)
    {
        a[i] = b[i] = i + 1;
    }

    for (i = 0; i < 200; i++)
    {
        c[i] = a[i] * b[i];
    }

    return 0;
}

I compile with the TI ARM compiler (not GCC), version 5.0.3.

First, I compile with the default ARM compiler options, which are:

-mv7A8 --code_state=32 --abi=eabi -me -O2 -g --include_path="/home/users/pbotev/ti/ccsv5/tools/compiler/arm_5.0.3/include" --define=dm8146 --define=dm8148 --diag_warning=225 --display_error_number --diag_wrap=off

As a result, I get no NEON assembly instructions, which is expected, as I do not use the --neon option. See the full log of the compile and disassembly output: http://e2e.ti.com/cfs-file.ashx/__key/communityserver-discussions-components-files/716/5758.No_5F00_NEON

Second, I compile with the --opt_level=2, --opt_for_speed=5, and --neon options:

-mv7A8 --code_state=32 --abi=eabi -me -O2 --opt_for_speed=5 -g --include_path="/home/users/pbotev/ti/ccsv5/tools/compiler/arm_5.0.3/include" --define=dm8146 --define=dm8148 --diag_warning=225 --display_error_number --diag_wrap=off --neon

As a result, the NEON assembly instructions are generated:

16                 c[i]= a[i] * b[i];
          $C$DW$L$main$4$B, $C$L2:
40300bc4:   F4200A8D VLD1.32         {D0, D1}, [R0]!
40300bc8:   F4222A8D VLD1.32         {D2, D3}, [R2]!
          $C$DW$L$main$4$E, $C$DW$L$main$5$B:
40300bcc:   F2220950 VMUL.I32        Q0, Q1, Q0
14            for (i = 0; i < 200; i++)
40300bd0:   E25CC001 SUBS            R12, R12, #1
16                 c[i]= a[i] * b[i];
40300bd4:   F4010A8D VST1.32         {D0, D1}, [R1]!

See the full log of the compile and disassembly output: http://e2e.ti.com/cfs-file.ashx/__key/communityserver-discussions-components-files/716/3731.With_5F00_NEON

Best regards,
Pavel

0 Vivek Malhotra over 11 years ago in reply to Pavel Botev

Prodigy 235 points

Hi Pavel,

Thanks for your extended support. I think the options in the following command line will be the important ones:

-mv7A8 --code_state=32 --abi=eabi -me -O2 -g --opt_for_speed=5 --include_path="/home/users/pbotev/ti/ccsv5/tools/compiler/arm_5.0.3/include" --define=dm8146 --define=dm8148 --diag_warning=225 --display_error_number --diag_wrap=off --neon

I will also check the links on cache and NEON that you provided in earlier posts. I, or someone from my team, will get back to you with the findings by Thursday or Friday, 8th Nov 2013.

BTW, were you able to see a difference in performance with and without NEON instructions?

Regards,

Vivek

0 Pavel Botev over 11 years ago in reply to Vivek Malhotra

TI__Guru**** 170625 points

Vivek,

Vivek Malhotra said:
BTW, were you able to see the improvement in performance with and without neon instructions ?

In both cases (with NEON and without NEON), the main.c code runs very fast (in less than a second). But this is with a pure CCS project, where I have only a simple main.c file running directly on the Cortex-A8 (the DSP is powered off and not used).
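
"Less than a second" is too coarse to compare the two builds, since a 200-element multiply finishes in microseconds either way. A hedged way to get a usable number is to repeat the kernel many times and time the whole run. The sketch below uses the standard C clock() for portability; that is an assumption, and on a bare-metal or SYS/BIOS target a hardware timestamp source (such as the timestamp provider already pulled into the .cfg earlier in the thread) would be the natural replacement. The function name time_multiply_ms is hypothetical.

```c
#include <assert.h>
#include <time.h>

#define N 200

/* Hypothetical helper: run the c[i] = a[i] * b[i] kernel `repeats` times
 * and return the elapsed CPU time in milliseconds. One input element is
 * changed per repeat and one output element is accumulated into *sink so
 * that the compiler cannot hoist or drop the work entirely; the optimizer
 * may still simplify it, so check the generated assembly. */
static double time_multiply_ms(int repeats, volatile unsigned *sink)
{
    static int a[N], b[N], c[N];
    clock_t t0, t1;
    int i, r;

    assert(repeats > 0);
    assert(sink != 0);

    for (i = 0; i < N; i++) {
        a[i] = b[i] = i + 1;
    }

    t0 = clock();
    for (r = 0; r < repeats; r++) {
        a[r % N] ^= 1;                   /* vary the input, values stay small */
        for (i = 0; i < N; i++) {
            c[i] = a[i] * b[i];
        }
        *sink += (unsigned)c[r % N];     /* consume one result per repeat */
    }
    t1 = clock();

    return 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
}
```

Building this once with and once without --neon and calling it with a large repeat count (for example 100000) would give two figures that can actually be compared.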

You, however, are on the SYS/BIOS SDK with the DSP also running. I assume the delay is caused by the fact that the DSP and the Cortex-A8 have to communicate with each other. But I am not familiar with SYS/BIOS running on the DSP.

What I can recommend is to ask in our BIOS forum how to improve the Cortex-A8 performance when running SYS/BIOS on the DSP.

http://e2e.ti.com/support/embedded/bios/f/355.aspx

Regards,
Pavel

0 Pavel Botev over 11 years ago in reply to Pavel Botev

TI__Guru**** 170625 points

Vivek,

You can also refer to the two links below; they might help:

http://e2e.ti.com/support/embedded/bios/f/355/p/293453/1023926.aspx

http://processors.wiki.ti.com/index.php/BIOS_6_Real-Time_Analysis_%28RTA%29_in_CCSv4

Regards,
Pavel

0 Vivek Malhotra over 11 years ago in reply to Pavel Botev

Prodigy 235 points

Dear Pavel,

Sorry for the late reply. We were able to generate the NEON instructions as per your suggestions, and the performance has improved by more than 70%!

Please note that when using an RTSC platform, the target has to be A8Fnv for these instructions to be generated.

Regards,

Vivek
