Help in generating Neon instructions (cortex-A8) and cache setting

Hi,

I'm using a Mistral EVM 8148 board for an automotive application. The tool chain is as follows:

av_bios_sdk_00_08_00_00

bios_6_34_02_18

ndk_2_21_01_38

ipc_1_25_00_04

I have moved part of my image processing algorithm to the ARM Cortex-A8 core. These are the compiler options I have set:

-mv7A8 --code_state=32 --abi=eabi -me -O3 --opt_for_speed=3 --diag_warning=225 --display_error_number --neon

My algorithm takes a very long time to execute: around 60 sec on the A8, whereas the same algorithm takes only 2 sec on the DSP. I have the following questions:

1> How do I make sure that the A8 cache is enabled? I believe I have set up the MMU correctly in the .cfg file, as follows:

var attri = {
type: Mmu.FirstLevelDesc_SECTION, // SECTION descriptor
bufferable: false, // bufferable
cacheable: true, // cacheable
accPerm:3, // read or write permissions
}; and 

/* configure the PRIVATE_DATA_CORE_HOS as cacheable */
for (var i= 0x81100000; i < 0x83400000; i = i + 0x100000)

{
Mmu.setFirstLevelDescMeta(i, i, attri);

}
/* configure SHARED_FRAME_BUFFER */
for (var i= 0x88b00000; i < 0x8FF00000; i = i + 0x100000)

{
Mmu.setFirstLevelDescMeta(i, i, attri);

}

var Cache = xdc.useModule('ti.sysbios.family.arm.a8.Cache');

Cache.enableCache = true;

Is this correct?

2> No NEON instructions are generated in my code, although I have passed the --neon option. For example, the loop

for(i = 0; i < 999; i++)
{
array[i] = i*i;
}

is generated in assembly as

;* --------------------------------------------------------------------------*
;* BEGIN LOOP ||$C$L2||
;*
;* Loop source line : 132
;* Loop closing brace source line : 136
;* Loop Unroll Multiple : 3x
;* Known Minimum Trip Count : 333
;* Known Maximum Trip Count : 333
;* Known Max Trip Count Factor : 333
;* --------------------------------------------------------------------------*
||$C$L2||:
$C$DW$L$SetupMessageQueue$8$B:
.dwpsn file "F:/Vivek/Projects/DPD/SVN/trunk/Source Code/DPM/Embedded/a8/src/DPD_IPC.c",line 133,column 0,is_stmt,isa 2
;** -----------------------g9:
;** 134 ----------------------- *(U$37 += 3) = _smulbb(i, i);
;** 134 ----------------------- C$11 = i+1;
;** 134 ----------------------- U$37[1] = _smulbb(C$11, C$11);
;** 134 ----------------------- C$10 = i+2;
;** 134 ----------------------- U$37[2] = _smulbb(C$10, C$10);
;** 132 ----------------------- if ( (i += 3) < 999 ) goto g9;
;** 138 ----------------------- return s8Error;
SMULBB LR, V9, V9 ; [DPU_8_PIPE0] |134|
ADD A3, V9, #1 ; [DPU_8_PIPE1] |134|
ADD A4, V9, #2 ; [DPU_8_PIPE0] |134|
ADD V9, V9, #3 ; [DPU_8_PIPE1] |132|
SMULBB A3, A3, A3 ; [DPU_8_PIPE0] |134|
CMP A1, V9 ; [DPU_8_PIPE1] |132|
SMULBB A4, A4, A4 ; [DPU_8_PIPE0] |134|
STR LR, [A2, #12]! ; [DPU_8_PIPE0] |134|
STR A3, [A2, #4] ; [DPU_8_PIPE0] |134|
STR A4, [A2, #8] ; [DPU_8_PIPE0] |134|

without any vector instructions.

Please let me know what is wrong with the settings.

Additional info

My output type is a8F (since I'm using the NDK, for which there is no A8Fnv library).

The compiler version is ARM 5.0.1.

The library is rtsv7A8_T_le_n_v3_eabi.lib

Regards,

Vivek

  • Vivek,

    I can answer the first question.  Yes, what you did will enable the cache.

    var Cache = xdc.useModule('ti.sysbios.family.arm.a8.Cache');
    Cache.enableCache = true;

    Judah

  • Hi Judah,

    Thanks for the reply. To be specific, I'm not sure whether simply writing

    var Cache = xdc.useModule('ti.sysbios.family.arm.a8.Cache');
    Cache.enableCache = true;

    is enough to enable the cache on the A8. On the DSP we set the MAR bits, and here we have set the MMU memory attributes accordingly. I would like you to comment on whether that is done correctly.

    My second question regarding NEON instructions is equally important, as the algorithm execution time on the A8 is ridiculously slow.


    Regards,

    Vivek

  • Vivek,

    Vivek Malhotra said:
    My second question regarding NEON instructions is equally important, as the algorithm execution time on the A8 is ridiculously slow.

    I can provide the links below; please have a look, they may help:

    http://processors.wiki.ti.com/index.php/Using_NEON_and_VFPv3_on_Cortex-A8

    http://processors.wiki.ti.com/index.php/Cortex-A8_Neon_Architecture

    http://processors.wiki.ti.com/index.php/Cortex_A8#What_is_Neon.3F

    http://www.arm.com/products/processors/technologies/neon.php

    Best regards,
    Pavel

  • Hi Pavel,

    Thanks for the reply,

    I had already gone through the links you provided. I'm using the option

    • NEON enabled without VFP
    The compiler options listed in my first post follow this setting. Even so, the compiler does not generate a single SIMD instruction. It would be great if you could try the example for loop from my question. I'm not able to attach the .cfg file, so I am copying its contents below:

    /* root of the configuration object model */
    var Program = xdc.useModule('xdc.cfg.Program');
    var Semaphore = xdc.useModule('ti.sysbios.knl.Semaphore');
    var Task = xdc.useModule('ti.sysbios.knl.Task');
    var GateHwi = xdc.useModule('ti.sysbios.gates.GateHwi');
    var GateAll = xdc.useModule('ti.sysbios.gates.GateAll');
    var GateMP = xdc.useModule('ti.sdo.ipc.GateMP');
    var MessageQ = xdc.useModule('ti.sdo.ipc.MessageQ');
    var Notify = xdc.useModule('ti.sdo.ipc.Notify');
    var SharedRegion = xdc.useModule('ti.sdo.ipc.SharedRegion');
    var BIOS = xdc.useModule('ti.sysbios.BIOS');
    var MultiProc = xdc.useModule('ti.sdo.utils.MultiProc');
    var HeapBufMP = xdc.useModule('ti.sdo.ipc.heaps.HeapBufMP');
    var HeapMemMP = xdc.useModule('ti.sdo.ipc.heaps.HeapMemMP');
    var Ipc = xdc.useModule('ti.sdo.ipc.Ipc');
    var ti_sysbios_hal_Cache = xdc.useModule('ti.sysbios.hal.Cache');
    var Notify = xdc.useModule('ti.sdo.ipc.Notify');
    var Clock = xdc.useModule('ti.sysbios.knl.Clock');
    var Timer = xdc.useModule('ti.sysbios.hal.Timer');
    var Swi = xdc.useModule('ti.sysbios.knl.Swi');
    var Idle = xdc.useModule('ti.sysbios.knl.Idle');
    var Memory = xdc.useModule('xdc.runtime.Memory');
    var Startup = xdc.useModule('xdc.runtime.Startup');
    var System = xdc.useModule('xdc.runtime.System');
    var Cache = xdc.useModule('ti.sysbios.family.arm.a8.Cache');
    var ti_sysbios_family_arm_a8_intcps_Hwi = xdc.useModule('ti.sysbios.family.arm.a8.intcps.Hwi');
    var Mmu = xdc.useModule('ti.sysbios.family.arm.a8.Mmu');
    var TimestampProvider = xdc.useModule('ti.sysbios.family.arm.a8.TimestampProvider');
    var GateMutex = xdc.useModule('ti.sysbios.gates.GateMutex');
    var GateMutexPri = xdc.useModule('ti.sysbios.gates.GateMutexPri');
    var GateSwi = xdc.useModule('ti.sysbios.gates.GateSwi');
    var GateTask = xdc.useModule('ti.sysbios.gates.GateTask');
    var ti_sysbios_timers_dmtimer_Timer = xdc.useModule('ti.sysbios.timers.dmtimer.Timer');
    var MessageQ = xdc.useModule('ti.sdo.ipc.MessageQ');
    var Mailbox = xdc.useModule('ti.sysbios.knl.Mailbox');
    var HeapMem = xdc.useModule('ti.sysbios.heaps.HeapMem');
    var Global = xdc.useModule('ti.ndk.config.Global');
    var Tcp = xdc.useModule('ti.ndk.config.Tcp');
    var Telnet = xdc.useModule('ti.ndk.config.Telnet');
    var Http = xdc.useModule('ti.ndk.config.Http');
    var Ip = xdc.useModule('ti.ndk.config.Ip');
    var Global = xdc.useModule('ti.ndk.config.Global');
    var Udp = xdc.useModule('ti.ndk.config.Udp');
    Global.IPv6 = false;
    var Mmu = xdc.useModule('ti.sysbios.family.arm.a8.Mmu');

    Cache.enableCache = true;
    /* Configure MMU to access the peripheral register space */

    // descriptor attribute structure
    var attrs = {
    type: Mmu.FirstLevelDesc_SECTION, // SECTION descriptor
    bufferable: false, // bufferable
    cacheable: false, // cacheable
    accPerm:3, // read or write permissions
    };
    // Each 'SECTION' descriptor entry spans a 1MB address range
    Mmu.setFirstLevelDescMeta(0x4A100000, 0x4A100000, attrs);
    Mmu.setFirstLevelDescMeta(0x48140000, 0x48140000, attrs);

    var Mmu = xdc.useModule('ti.sysbios.family.arm.a8.Mmu');
    Mmu.enableMMU = true;

    var attri = {
    type: Mmu.FirstLevelDesc_SECTION, // SECTION descriptor
    bufferable: false, // bufferable
    cacheable: true, // cacheable
    accPerm:3, // read or write permissions
    };

    /* configure the EDMA - TPTC memory range */
    for (var i= 0x49800000; i < 0x49BFFFFF; i = i + 0x100000)

    {
    Mmu.setFirstLevelDescMeta(i, i, attrs);

    }
    /* configure the PRIVATE_DATA_CORE_HOS as cacheable */
    for (var i= 0x81100000; i < 0x83400000; i = i + 0x100000)

    {
    Mmu.setFirstLevelDescMeta(i, i, attri);

    }
    /* configure SHARED_FRAME_BUFFER */
    for (var i= 0x88b00000; i < 0x8FF00000; i = i + 0x100000)

    {
    Mmu.setFirstLevelDescMeta(i, i, attri);

    }

    /* configure the shared memory as cacheable */
    /*for (var i= 0x88b00000; i < 0x8FF00000; i = i + 0x100000)

    {
    attrs.bufferable = false;
    attrs.cacheable = true;
    Mmu.setFirstLevelDescMeta(i, i, attrs);

    } */


    Program.sectMap[".text"] = "CODE_CORE_HOST";
    /* Place the MMU table in DDR3 */
    var sectionName = "ti.sysbios.family.arm.a8.mmuTableSection";
    Program.sectMap[sectionName] = new Program.SectionSpec();
    Program.sectMap[sectionName].type = "NOINIT";
    Program.sectMap[sectionName].loadSegment = "CODE_CORE_HOST";
    /*
    * Don't generate any code, this example shows how to code the NDK stack thread
    * 'StackTest' manually.
    */
    Global.enableCodeGeneration = false;
    var procNameAry = MultiProc.getDeviceProcNames();
    MultiProc.baseIdOfCluster = 0;
    MultiProc.numProcessors = 2;
    MultiProc.setConfig("HOST", ["HOST", "DSP"]);
    var hostId = MultiProc.getIdMeta("HOST");
    Ipc.procSync = Ipc.ProcSync_PAIR;
    Ipc.sr0MemorySetup = true;
    SharedRegion.setEntryMeta(0,
    new SharedRegion.Entry({
    name: "IPC_Internal",
    base: 0x88100000,
    len: 0x00A00000,
    ownerProcId: MultiProc.getIdMeta("HOST"),
    cacheEnable: true,
    isValid: true,
    createHeap: true
    })
    );


    //Program.stack = 8192;
    Program.stack = 10240;
    BIOS.heapSize = 40960;
    Program.heap = 4096;
    SharedRegion.translate = false;
    SharedRegion.numEntries = 1;
    MessageQ.traceFlag = true;
    Task.defaultStackSize = 2048;
    Global.normTaskStackSize = 2048;
    Program.sysStack = 4096;
    var heapMem0Params = new HeapMem.Params();
    heapMem0Params.instance.name = "ARM_Heap";
    heapMem0Params.size = 35651584;
    heapMem0Params.align = 128;
    heapMem0Params.minBlockAlign = 128;
    heapMem0Params.sectionName = ".ArmHeap";
    Program.global.ARM_Heap = HeapMem.create(heapMem0Params);
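As a quick cross-check of the section loops in this configuration, the following C sketch counts the 1 MB sections each loop maps and tests whether a given region falls inside a range marked cacheable. The helper names (section_count, is_marked_cacheable) are hypothetical and exist only for illustration; the addresses are copied from the .cfg above. With these numbers, SharedRegion 0 (base 0x88100000, len 0x00A00000) ends exactly where the SHARED_FRAME_BUFFER loop starts, so neither cacheable loop covers it. Whether that matters depends on the Mmu module's default attributes for unlisted sections, which the configuration above does not show.

```c
#include <stdint.h>
#include <stdbool.h>

#define SECTION_SIZE 0x00100000u  /* one first-level SECTION entry maps 1 MB */

/* Half-open address range [start, end), matching the .cfg for-loops. */
typedef struct {
    uint32_t start;
    uint32_t end;
} Range;

/* The two ranges the .cfg maps with the cacheable 'attri' descriptor. */
static const Range cacheable_ranges[] = {
    { 0x81100000u, 0x83400000u },  /* PRIVATE_DATA_CORE_HOS */
    { 0x88B00000u, 0x8FF00000u },  /* SHARED_FRAME_BUFFER   */
};

/* Number of descriptors written by "for (i = start; i < end; i += 1 MB)". */
uint32_t section_count(Range r)
{
    return (r.end - r.start + SECTION_SIZE - 1u) / SECTION_SIZE;
}

/* True when [base, base + len) lies entirely inside one cacheable range. */
bool is_marked_cacheable(uint32_t base, uint32_t len)
{
    uint64_t end = (uint64_t)base + len;  /* 64-bit so base + len cannot wrap */
    unsigned i;

    for (i = 0; i < sizeof cacheable_ranges / sizeof cacheable_ranges[0]; i++) {
        if (base >= cacheable_ranges[i].start && end <= cacheable_ranges[i].end) {
            return true;
        }
    }
    return false;
}
```

For example, section_count(cacheable_ranges[0]) gives the number of 1 MB entries the PRIVATE_DATA_CORE_HOS loop writes, and is_marked_cacheable(0x88100000u, 0x00A00000u) reports whether SharedRegion 0 is covered.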

    Regards,
    Vivek
  • Hi Pavel,

    While going through the compiler libraries under C:\ti\ccsv5\tools\compiler\arm_5.0.1\lib I found the file mklib.c, which has options for generating the RTS library for the ARM Cortex-A8 with NEON support. The file contains the following:

    library_t LIBRARIES[] = {
    { "rtsv7A8_T_le_n_xo_eabi_eh.lib", { "CORTEX_THUMB","V7A8","_32ASMFUNCS","AEABI","EXCEPTIONS","XO","EABI","_T2ASMFUNCS","_16_DUAL_IND_CALL","LITTLE_ENDIAN","EABI_TDEH","THUMB","CORTEX","NEON" } },
    { "rtsv7R4_A_be_xo_eabi.lib", { "_16ASMFUNCS","_32ASMFUNCS","V7R4","_16_DUAL_IND_CALL","AEABI","XO","CORTEX","EABI" } },
    { "rtsv7A8_T_le_xo_tiarm9_eh.lib", { "CORTEX_THUMB","FULL_PORTABLE_EH","V7A8","_32ASMFUNCS","EXCEPTIONS","XO","TI_ARM9_ABI","_T2ASMFUNCS","_16_DUAL_IND_CALL","LITTLE_ENDIAN","THUMB","CORTEX" } },
    { "rtsv7A8_A_be_v3_eabi_eh.lib", { "_16ASMFUNCS","V7A8","_32ASMFUNCS","AEABI_VFP","VFPV3","EXCEPTIONS","AEABI","EABI","VFP","_16_DUAL_IND_CALL","EABI_TDEH","CORTEX" } },
    { "rtsv4_A_be_tiarm9.lib", { "_16ASMFUNCS","_32ASMFUNCS","_16_DUAL_IND_CALL","TI_ARM9_ABI" } },
    { "rts32.lib", { "_16ASMFUNCS","_32ASMFUNCS","_16_DUAL_IND_CALL","TI_ARM9_ABI" } },
    { "rtsv7R4_T_le_tiarm9.lib", { "CORTEX_THUMB","_32ASMFUNCS","TI_ARM9_ABI","_T2ASMFUNCS","V7R4","LITTLE_ENDIAN","_16_DUAL_IND_CALL","THUMB","CORTEX" } },
    { "rtsv7A8_T_be_n_xo_eabi_eh.lib", { "CORTEX_THUMB","V7A8","_32ASMFUNCS","AEABI","EXCEPTIONS","XO","EABI","_T2ASMFUNCS","_16_DUAL_IND_CALL","EABI_TDEH","THUMB","CORTEX","NEON" } },
    { "rtsv7A8_A_be_xo_tiarm9.lib", { "_16ASMFUNCS","V7A8","_32ASMFUNCS","_16_DUAL_IND_CALL","XO","CORTEX","TI_ARM9_ABI" } },
    { "rtsv7A8_A_le_n_eabi_eh.lib", { "_16ASMFUNCS","V7A8","_32ASMFUNCS","AEABI","EXCEPTIONS","EABI","_16_DUAL_IND_CALL","LITTLE_ENDIAN","EABI_TDEH","CORTEX","NEON" } },
    { "rtsv7A8_A_le_n_eabi.lib", { "_16ASMFUNCS","V7A8","_32ASMFUNCS","AEABI","EABI","_16_DUAL_IND_CALL","LITTLE_ENDIAN","CORTEX","NEON" } },

    I'm using the library "rtsv7A8_A_le_n_eabi.lib", yet NEON instructions are still not generated. I went through the .asm files used to build this library and found that some of them test the macro __TI_VFP_SUPPORT__ | __TI_NEON_SUPPORT__. I'm not sure whether I need to rebuild the library with this macro enabled to generate SIMD instructions for the A8. If you can share such a compiler library, I would appreciate it. Also, please let me know whether I'm thinking in the right direction.

    Regards,

    Vivek

  • Hi Vivek,

     

    Please note that NEON code may not be generated unless the C code lends itself to it. The compiler decides whether to emit NEON code based on the C code used.

     

    Regarding cache enabling, can you please check the ARM reference manual to verify that the cache has indeed been enabled? I checked online for the register, and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Bgbciiaf.html seems to have the info.
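As a rough illustration of that check: on ARMv7-A the enable bits live in the CP15 System Control Register (SCTLR), where bit 0 (M) enables the MMU, bit 2 (C) the data cache and bit 12 (I) the instruction cache. The sketch below only decodes a value that has already been read. decode_sctlr and CacheState are hypothetical names, and how the register is read on the target (a privileged MRC instruction, the debugger's register view, or some other route) is left as an assumption.

```c
#include <stdint.h>
#include <stdbool.h>

/* ARMv7-A System Control Register (CP15 c1) enable bits. */
#define SCTLR_M (1u << 0)   /* MMU enable               */
#define SCTLR_C (1u << 2)   /* data cache enable        */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

typedef struct {
    bool mmu;          /* M bit set */
    bool dcache;       /* C bit set */
    bool icache;       /* I bit set */
    bool data_cached;  /* M and C both set */
} CacheState;

/*
 * Decode a raw SCTLR value. In a privileged mode the value can be read with
 *     MRC p15, 0, <Rt>, c1, c0, 0
 * With the MMU off, ARMv7-A treats data accesses as Strongly-ordered, so the
 * data cache is only effective when both M and C are set. The Cortex-A8 L2
 * cache has a separate enable that this sketch does not cover.
 */
CacheState decode_sctlr(uint32_t sctlr)
{
    CacheState s;

    s.mmu         = (sctlr & SCTLR_M) != 0u;
    s.dcache      = (sctlr & SCTLR_C) != 0u;
    s.icache      = (sctlr & SCTLR_I) != 0u;
    s.data_cached = s.mmu && s.dcache;
    return s;
}
```

A value such as 0x00001005 decodes to MMU, data cache and instruction cache all enabled; if the C bit is clear, the Cache.enableCache setting did not take effect for data.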

     

    Also, the DSP is a pipelined architecture and so may be more efficient than the ARM. You may also revisit your code and compiler settings to see whether that helps further.

    Best Regards

    Feroz

  • Hi Feroz,

    I agree with you, but as you can see in my post, I mentioned a simple for loop:

    for(i = 0; i < 999; i++)
    {
    array[i] = i*i;

    }

    Here array is a restrict-qualified pointer (I have tried that as well). Such simple code should be vectorized for NEON. Similarly, my algorithm is optimized on the DSP using restrict access and so on. If the same algorithm takes < 5 sec on the DSP, how is 60 sec on the A8 justified?

    I will check for the cache settings which you have provided the link for and get back to you.

    Regards,

    Vivek

  • Vivek,

    Vivek Malhotra said:

    I'm using a Mistral EVM 8148 board for an automotive application. The tool chain is as follows:

    av_bios_sdk_00_08_00_00

    Could you please provide more details on this AV BIOS SDK and tool chain?

    I am working on DM814x EZSDK with CodeSourcery Lite GCC tool chain:

    http://www.ti.com/tool/linuxezsdk-davinci

    http://software-dl.ti.com/dsps/dsps_public_sw/ezsdk/latest/index_FDS.html

    What is the frequency of your DSP and Cortex-A8 ARM?

    I will try to reproduce your problem on my DM8148 EVM with EZSDK.

    Regards,
    Pavel

  • Hi Pavel,

    I think it's a good idea to try and see whether SIMD instructions for NEON are generated in your setup. Please use the for loop I mentioned in my post so that we have a common reference.

    Regarding the tool chain, we are using CCS5.5, the ARM 5.0.1 compiler and SYS/BIOS on Windows:

    av_bios_sdk_00_08_00_00

    bios_6_34_02_18

    ndk_2_21_01_38

    ipc_1_25_00_04

    The DSP frequency is 500 MHz and the A8 frequency is 600 MHz.

    Regards,

    Vivek

  • Vivek,

    I will try on my side, but keep in mind that I am using the CodeSourcery GCC ARM compiler, while you are using the TI ARM C/C++ compiler.

    http://processors.wiki.ti.com/index.php/Using_NEON_and_VFPv3_on_Cortex-A8

    In the above wiki page (which is for TI ARM C/C++ compiler, not GCC), we have:

    The --neon option instructs the compiler to automatically vectorize loops to use the NEON instructions. To get benefit from this option you should be using --opt_level=2 or higher and be generating code for performance by using the --opt_for_speed=[3-5] option.

    • NEON enabled without VFP
    In this mode the compiler will generate NEON instructions for SIMD integer operations. It will not generate NEON instructions to vectorize floating point operations. The motivation for not allowing floating point NEON instructions if VFP is not enabled is because it is possible to have an integer only variant of NEON implemented. In order for the NEON unit to support floating point operations the VFPv3 coprocessor must be present.


    Can you try with these compiler options (--neon --opt_level=2 --opt_for_speed=5)? Does it make any difference?

    Please note that we have a special E2E forum for the TI ARM C/C++ compiler:

    http://e2e.ti.com/support/development_tools/compiler/f/343.aspx

    I will check (in parallel) with the team there, and see if they can help here.

    Regards,
    Pavel

  • Hi Pavel,

    When I posted this issue I had tried the following compiler options:

    -mv7A8 --code_state=32 --abi=eabi -me -O3 --opt_for_speed=3 --diag_warning=225 --display_error_number --neon

    I'm on leave for a week and unfortunately do not have the tool chain with me to check with --opt_for_speed=5.

    I will be able to reply by Thursday next week. In the meantime, if you find out anything on either the CodeSourcery GCC ARM compiler or the TI ARM compiler, please let me know.

    Regards,

    Vivek

  • Vivek,

    When you try it, do not forget to also use the --opt_level=2 option:

    -mv7A8 --code_state=32 --abi=eabi -me --opt_level=2 --opt_for_speed=5 --diag_warning=225 --display_error_number --neon

    This --opt_level=2 option is stated as "should be used" in the wiki page I referred to in my previous post.

    Regards,
    Pavel


  • Vivek,

    I tried this loop, for(i=0; i<999; i++) array[i] = i*i;, and it runs in less than a second on my DM8148 EVM Cortex-A8 with EZSDK.

    Regarding NEON instructions, I compiled the loop C file with the GCC command below:

    arm-none-linux-gnueabi-gcc -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp -o for_loop for_loop.c

    But no NEON instructions are generated when I disassemble the resulting executable (ELF 32-bit LSB executable):

    $ arm-none-linux-gnueabi-objdump -d for_loop

    for_loop:     file format elf32-littlearm


    Disassembly of section .init:

    000082e0 <_init>:
        82e0:    e92d4010     push    {r4, lr}
        82e4:    eb00001c     bl    835c <call_gmon_start>
        82e8:    e8bd8010     pop    {r4, pc}

    Disassembly of section .plt:

    000082ec <.plt>:
        82ec:    e52de004     push    {lr}        ; (str lr, [sp, #-4]!)
        82f0:    e59fe004     ldr    lr, [pc, #4]    ; 82fc <_init+0x1c>
        82f4:    e08fe00e     add    lr, pc, lr
        82f8:    e5bef008     ldr    pc, [lr, #8]!
        82fc:    0000830c     .word    0x0000830c
        8300:    e28fc600     add    ip, pc, #0    ; 0x0
        8304:    e28cca08     add    ip, ip, #32768    ; 0x8000
        8308:    e5bcf30c     ldr    pc, [ip, #780]!
        830c:    e28fc600     add    ip, pc, #0    ; 0x0
        8310:    e28cca08     add    ip, ip, #32768    ; 0x8000
        8314:    e5bcf304     ldr    pc, [ip, #772]!
        8318:    e28fc600     add    ip, pc, #0    ; 0x0
        831c:    e28cca08     add    ip, ip, #32768    ; 0x8000
        8320:    e5bcf2fc     ldr    pc, [ip, #764]!

    Disassembly of section .text:

    00008324 <_start>:
        8324:    e59fc024     ldr    ip, [pc, #36]    ; 8350 <_start+0x2c>
        8328:    e3a0b000     mov    fp, #0    ; 0x0
        832c:    e49d1004     pop    {r1}        ; (ldr r1, [sp], #4)
        8330:    e1a0200d     mov    r2, sp
        8334:    e52d2004     push    {r2}        ; (str r2, [sp, #-4]!)
        8338:    e52d0004     push    {r0}        ; (str r0, [sp, #-4]!)
        833c:    e59f0010     ldr    r0, [pc, #16]    ; 8354 <_start+0x30>
        8340:    e59f3010     ldr    r3, [pc, #16]    ; 8358 <_start+0x34>
        8344:    e52dc004     push    {ip}        ; (str ip, [sp, #-4]!)
        8348:    ebffffef     bl    830c <_init+0x2c>
        834c:    ebffffeb     bl    8300 <_init+0x20>
        8350:    0000843c     .word    0x0000843c
        8354:    000083cc     .word    0x000083cc
        8358:    00008440     .word    0x00008440

    0000835c <call_gmon_start>:
        835c:    e59f3014     ldr    r3, [pc, #20]    ; 8378 <call_gmon_start+0x1c>
        8360:    e59f2014     ldr    r2, [pc, #20]    ; 837c <call_gmon_start+0x20>
        8364:    e08f3003     add    r3, pc, r3
        8368:    e7931002     ldr    r1, [r3, r2]
        836c:    e3510000     cmp    r1, #0    ; 0x0
        8370:    012fff1e     bxeq    lr
        8374:    eaffffe7     b    8318 <_init+0x38>
        8378:    0000829c     .word    0x0000829c
        837c:    00000018     .word    0x00000018

    00008380 <__do_global_dtors_aux>:
        8380:    e59f2010     ldr    r2, [pc, #16]    ; 8398 <__do_global_dtors_aux+0x18>
        8384:    e5d23000     ldrb    r3, [r2]
        8388:    e3530000     cmp    r3, #0    ; 0x0
        838c:    03a03001     moveq    r3, #1    ; 0x1
        8390:    05c23000     strbeq    r3, [r2]
        8394:    e12fff1e     bx    lr
        8398:    0001062c     .word    0x0001062c

    0000839c <frame_dummy>:
        839c:    e59f0020     ldr    r0, [pc, #32]    ; 83c4 <frame_dummy+0x28>
        83a0:    e92d4010     push    {r4, lr}
        83a4:    e5903000     ldr    r3, [r0]
        83a8:    e3530000     cmp    r3, #0    ; 0x0
        83ac:    08bd8010     popeq    {r4, pc}
        83b0:    e59f3010     ldr    r3, [pc, #16]    ; 83c8 <frame_dummy+0x2c>
        83b4:    e3530000     cmp    r3, #0    ; 0x0
        83b8:    08bd8010     popeq    {r4, pc}
        83bc:    e12fff33     blx    r3
        83c0:    e8bd8010     pop    {r4, pc}
        83c4:    00010514     .word    0x00010514
        83c8:    00000000     .word    0x00000000

    000083cc <main>:
        83cc:    e52db004     push    {fp}        ; (str fp, [sp, #-4]!)
        83d0:    e28db000     add    fp, sp, #0    ; 0x0
        83d4:    e24ddefb     sub    sp, sp, #4016    ; 0xfb0
        83d8:    e24dd004     sub    sp, sp, #4    ; 0x4
        83dc:    e3a03000     mov    r3, #0    ; 0x0
        83e0:    e50b3008     str    r3, [fp, #-8]
        83e4:    ea00000d     b    8420 <main+0x54>
        83e8:    e51b1008     ldr    r1, [fp, #-8]
        83ec:    e51b2008     ldr    r2, [fp, #-8]
        83f0:    e51b3008     ldr    r3, [fp, #-8]
        83f4:    e0000293     mul    r0, r3, r2
        83f8:    e30f305c     movw    r3, #61532    ; 0xf05c
        83fc:    e34f3fff     movt    r3, #65535    ; 0xffff
        8400:    e1a02101     lsl    r2, r1, #2
        8404:    e24b1004     sub    r1, fp, #4    ; 0x4
        8408:    e0812002     add    r2, r1, r2
        840c:    e0823003     add    r3, r2, r3
        8410:    e5830000     str    r0, [r3]
        8414:    e51b3008     ldr    r3, [fp, #-8]
        8418:    e2833001     add    r3, r3, #1    ; 0x1
        841c:    e50b3008     str    r3, [fp, #-8]
        8420:    e51b2008     ldr    r2, [fp, #-8]
        8424:    e30033e6     movw    r3, #998    ; 0x3e6
        8428:    e1520003     cmp    r2, r3
        842c:    daffffed     ble    83e8 <main+0x1c>
        8430:    e28bd000     add    sp, fp, #0    ; 0x0
        8434:    e8bd0800     pop    {fp}
        8438:    e12fff1e     bx    lr

    0000843c <__libc_csu_fini>:
        843c:    e12fff1e     bx    lr

    00008440 <__libc_csu_init>:
        8440:    e92d47f0     push    {r4, r5, r6, r7, r8, r9, sl, lr}
        8444:    e1a08001     mov    r8, r1
        8448:    e1a07002     mov    r7, r2
        844c:    e1a0a000     mov    sl, r0
        8450:    ebffffa2     bl    82e0 <_init>
        8454:    e59f1044     ldr    r1, [pc, #68]    ; 84a0 <__libc_csu_init+0x60>
        8458:    e59f3044     ldr    r3, [pc, #68]    ; 84a4 <__libc_csu_init+0x64>
        845c:    e59f2044     ldr    r2, [pc, #68]    ; 84a8 <__libc_csu_init+0x68>
        8460:    e0613003     rsb    r3, r1, r3
        8464:    e08f2002     add    r2, pc, r2
        8468:    e1b05143     asrs    r5, r3, #2
        846c:    e0822001     add    r2, r2, r1
        8470:    08bd87f0     popeq    {r4, r5, r6, r7, r8, r9, sl, pc}
        8474:    e1a06002     mov    r6, r2
        8478:    e3a04000     mov    r4, #0    ; 0x0
        847c:    e1a0000a     mov    r0, sl
        8480:    e1a01008     mov    r1, r8
        8484:    e1a02007     mov    r2, r7
        8488:    e1a0e00f     mov    lr, pc
        848c:    e796f104     ldr    pc, [r6, r4, lsl #2]
        8490:    e2844001     add    r4, r4, #1    ; 0x1
        8494:    e1540005     cmp    r4, r5
        8498:    3afffff7     bcc    847c <__libc_csu_init+0x3c>
        849c:    e8bd87f0     pop    {r4, r5, r6, r7, r8, r9, sl, pc}
        84a0:    ffffff04     .word    0xffffff04
        84a4:    ffffff08     .word    0xffffff08
        84a8:    0000819c     .word    0x0000819c

    Disassembly of section .fini:

    000084ac <_fini>:
        84ac:    e92d4010     push    {r4, lr}
        84b0:    e8bd8010     pop    {r4, pc}

    I will continue to investigate why no NEON instructions are generated.
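One possible explanation, offered as an assumption rather than a confirmed diagnosis: the command above passes no -O level, and GCC skips most optimization passes, including the vectorizer, when optimization is off, even if -ftree-vectorize is given. The disassembly of main is consistent with that, since the loop counter is reloaded from the stack ([fp, #-8]) on every use. The sketch below puts the loop in a function that writes through a caller-supplied pointer, so an optimizing build cannot simply discard the stores. fill_squares and squares.c are hypothetical names, and whether this particular GCC version then vectorizes the loop for NEON is not verified here.

```c
/*
 * Build with an optimization level so the vectorizer can run, for example:
 *
 *   arm-none-linux-gnueabi-gcc -std=c99 -O3 -march=armv7-a -mtune=cortex-a8 \
 *       -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -c squares.c
 *
 * squares.c is a hypothetical file name for this sketch.
 */

/*
 * Store i*i into array[i] for 0 <= i < n.
 * 'restrict' tells the compiler that array does not alias anything else.
 * n must stay small enough that i*i fits in an int (n = 999 is fine).
 */
void fill_squares(int *restrict array, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        array[i] = i * i;
    }
}
```

Disassembling the resulting object file with arm-none-linux-gnueabi-objdump -d and looking for VLD1/VMUL/VST1 shows whether the loop was vectorized.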

    Regards,
    Pavel


  • I have also found these two links; please have a look, they may help:

    http://e2e.ti.com/support/development_tools/compiler/f/343/t/271747.aspx

    http://infocenter.arm.com/help/topic/com.arm.doc.dht0002a/DHT0002A_introducing_neon.pdf

    Regards,
    Pavel

  • Vivek,

    I also tried with a pure CCStudio project: no EZSDK, no SysBIOS.

    I am using CCS5.4.0 running on a Linux Ubuntu PC. The target configuration is for the DM8148 EVM, using the default DM8148 GEL file. The C file (main.c) has the source code below:

    int main(void) {

        int a[200],b[200],c[200];
        int i;

        for (i = 0; i < 200; i++)
         {
            a[i]= b[i]=i+1;
         }

        for (i = 0; i < 200; i++)
         {
             c[i]= a[i] * b[i];
         }
        
        return 0;
    }

    I compile with the ARM compiler (not GCC), version 5.0.3.

    First I compile with the default ARM Compiler options, which are:

    -mv7A8 --code_state=32 --abi=eabi -me -O2 -g --include_path="/home/users/pbotev/ti/ccsv5/tools/compiler/arm_5.0.3/include" --define=dm8146 --define=dm8148 --diag_warning=225 --display_error_number --diag_wrap=off

    As a result I get no NEON assembly instructions, which is expected, as I do not use the --neon option. See the full log of compile and disassembly messages output: http://e2e.ti.com/cfs-file.ashx/__key/communityserver-discussions-components-files/716/5758.No_5F00_NEON

    Second, I compile with --opt_level=2, --opt_for_speed=5 and --neon options:

    -mv7A8 --code_state=32 --abi=eabi -me -O2 --opt_for_speed=5 -g --include_path="/home/users/pbotev/ti/ccsv5/tools/compiler/arm_5.0.3/include" --define=dm8146 --define=dm8148 --diag_warning=225 --display_error_number --diag_wrap=off --neon

    As a result, the NEON assembly instructions are generated:

    16                 c[i]= a[i] * b[i];
              $C$DW$L$main$4$B, $C$L2:
    40300bc4:   F4200A8D VLD1.32         {D0, D1}, [R0]!
    40300bc8:   F4222A8D VLD1.32         {D2, D3}, [R2]!
              $C$DW$L$main$4$E, $C$DW$L$main$5$B:
    40300bcc:   F2220950 VMUL.I32        Q0, Q1, Q0
    14            for (i = 0; i < 200; i++)
    40300bd0:   E25CC001 SUBS            R12, R12, #1
    16                 c[i]= a[i] * b[i];
    40300bd4:   F4010A8D VST1.32         {D0, D1}, [R1]!

    See the full log of compile and disassembly messages output: http://e2e.ti.com/cfs-file.ashx/__key/communityserver-discussions-components-files/716/3731.With_5F00_NEON
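For comparison, the same multiply loop can be written by hand with NEON intrinsics. This is a minimal sketch, assuming the toolchain provides arm_neon.h and defines __ARM_NEON (GCC-style compilers do; whether the TI compiler does is not confirmed in this thread). mul_arrays is a hypothetical name. The three intrinsics correspond to the VLD1.32, VMUL.I32 and VST1.32 instructions in the listing above, and the scalar loop handles any tail elements, or the whole array when NEON is unavailable.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* c[i] = a[i] * b[i] for 0 <= i < n, four 32-bit lanes at a time if possible. */
void mul_arrays(const int32_t *a, const int32_t *b, int32_t *c, int n)
{
    int i = 0;

#if HAVE_NEON
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);      /* VLD1.32 {Dn, Dn+1}, [Rx] */
        int32x4_t vb = vld1q_s32(b + i);      /* VLD1.32 {Dm, Dm+1}, [Ry] */
        vst1q_s32(c + i, vmulq_s32(va, vb));  /* VMUL.I32, then VST1.32   */
    }
#endif

    /* Scalar tail; also the full loop when NEON is not available. */
    for (; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}
```

With the arrays from main.c above, the call would be mul_arrays(a, b, c, 200). Letting the compiler auto-vectorize, as in the listing, is usually preferable because the C source stays portable.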

    Best regards,
    Pavel


  • Hi Pavel,

    Thanks for your extended support. I think the following options will be important:

    -mv7A8 --code_state=32 --abi=eabi -me -O2 -g --opt_for_speed=5 --include_path="/home/users/pbotev/ti/ccsv5/tools/compiler/arm_5.0.3/include" --define=dm8146 --define=dm8148 --diag_warning=225 --display_error_number --diag_wrap=off --neon

    I will also check the links on cache and NEON that you provided in earlier posts. I, or someone from my team, will get back with the findings by Thursday or Friday, 8th Nov 2013.

    BTW, were you able to see any difference in performance with and without NEON instructions?

    Regards,

    Vivek

  • Vivek,

    Vivek Malhotra said:
    BTW, were you able to see any difference in performance with and without NEON instructions?

    In both cases (with NEON and without NEON), the main.c code runs very fast (in less than a second). But this is with a pure CCS project, where I have only a simple main.c file running directly on the Cortex-A8 (the DSP is powered off and not used).
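A loop that finishes this quickly is hard to compare by eye. A minimal way to get numbers is to repeat the workload many times and time the whole run, as sketched below with the standard C clock() function. time_calls and sample_work are hypothetical names. On a SYS/BIOS target clock() may not be backed by a real timer, in which case a timestamp API such as Timestamp_get32() from xdc.runtime.Timestamp would be the natural substitute (an assumption, not tested here).

```c
#include <time.h>

static volatile int sink;  /* written so the compiler cannot drop the work */
static int call_count;     /* how many times sample_work has run */

/* Workload taken from the main.c above: fill a and b, then multiply. */
void sample_work(void)
{
    int a[200], b[200], c[200];
    int i;

    for (i = 0; i < 200; i++) {
        a[i] = b[i] = i + 1;
    }
    for (i = 0; i < 200; i++) {
        c[i] = a[i] * b[i];
    }
    sink = c[199];
    call_count++;
}

/* Run fn() 'repeats' times and return the elapsed CPU time in seconds. */
double time_calls(void (*fn)(void), int repeats)
{
    clock_t start = clock();
    int r;

    for (r = 0; r < repeats; r++) {
        fn();
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(sample_work, 100000) once in a build with --neon and once in a build without it gives two figures that can be compared directly.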

    You, on the other hand, are on the SysBIOS SDK running on the DSP. I assume the delay is caused by the fact that the DSP and the Cortex-A8 have to communicate with each other. But I am not familiar with SysBIOS running on the DSP.

    What I can recommend is to ask in our BIOS forum how to improve the Cortex-A8 performance when running SysBIOS on the DSP.

    http://e2e.ti.com/support/embedded/bios/f/355.aspx

    Regards,
    Pavel

  • Vivek,

    You can also refer to the two links below; they may help:

    http://e2e.ti.com/support/embedded/bios/f/355/p/293453/1023926.aspx

    http://processors.wiki.ti.com/index.php/BIOS_6_Real-Time_Analysis_%28RTA%29_in_CCSv4

    Regards,
    Pavel

  • Dear Pavel,

    Sorry for the late reply. We were able to generate the NEON instructions as per your suggestions, and the performance has improved by more than 70%!

    Please note that when using an RTSC platform, the target has to be fnV to generate these instructions.

    Regards,

    Vivek