
DM8148 peripheral access is slower than expected

Other Parts Discussed in Thread: SYSBIOS, SYSCONFIG

I have a DM8148 dev kit project with SYS/BIOS and am noticing more CPU cycles spent than expected to access some of the peripherals.  For example, accessing a GPIO set register:

*(volatile int *)(0x48032190)=0x00000008;

I'm finding that the fastest I can toggle this discrete is 5 MHz, or 200 ns per register access.  My L3 is running at 200 MHz and my ARM core is running at 720 MHz.  I'm new to this processor, so is there something that can be done to improve this kind of access time for GPIO?  Taking 144 clock cycles to access an on-die discrete seems excessively long; 30-50 would be more reasonable to me.
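The measurement is essentially this loop (a minimal sketch; on hardware the two pointers would be the GPIO_SETDATAOUT/GPIO_CLEARDATAOUT registers at offsets 0x194/0x190 from the module base, and the mask selects the pin under the scope probe):

```c
#include <stdint.h>

/* On hardware these would be the GPIO set/clear data registers,
 * e.g. (volatile uint32_t *)0x48032194 and 0x48032190. For a
 * host-side sanity check, any pair of 32-bit locations works. */
static void toggle_burst(volatile uint32_t *set_reg,
                         volatile uint32_t *clr_reg,
                         uint32_t pin_mask, int n)
{
    for (int i = 0; i < n; i++) {
        *set_reg = pin_mask;   /* pin high: one strongly ordered L4 write */
        *clr_reg = pin_mask;   /* pin low:  a second L4 write             */
    }
}
```

On the scope, each high or low phase is then one store's worth of latency, which is how the 200 ns figure is read off.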

Can anyone shed any light on this topic?

  • Justin,

    GPIO is connected to the L4 interconnect. Assuming you are running the SYS/BIOS code from the M3, which runs at a lower speed than 720 MHz, 200 ns seems reasonable.

  • I don't understand why I would assume my code is running from the M3.  My code is running from the A8, and then needs to go through the interconnect to get to the GPIO.  I understand the GPIO is on the L4, which is running at 100 MHz.  How many clock cycles does it take to simply write a GPIO value over the L4?  Like I said, right now that adds up to about 15-20 clock cycles on the L4 just to write a GPIO value.  This does not seem right, or at the very least it seems like I should be able to configure the processor to improve this time.  Is there any configuration of the L3 or L4 arbitration scheme that might help my access times from the ARM core?

  • You can check the registers for configuring the L3 and L4 in sections 1.12.2.5 and 1.12.3.5 of document SPRUGZ8B.

  • Hi Justin,

    With this piece of code, *(volatile int *)(0x48032190)=0x00000008;, you are writing to the GPIO_CLEARDATAOUT register (offset 0x190). How exactly do you measure the 200 ns write access time? Is it with some debug tool? Do you see the same 200 ns write access time with the GPIO_SETDATAOUT register (offset 0x194)?

    Please note that the DSP has a 128-bit read/write port, while the Cortex-A8 has a 64-bit read/write port, so DSP read/write accesses should be faster. Here is one thread explaining the DSP interconnect bandwidth:
    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/t/177428.aspx

    If you increase the Cortex-A8 clock speed to 1GHz, do you have better (than 200ns) write time? This thread is explaining how to increase the Cortex-A8 clock speed:
    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/204555.aspx

    I have checked the L3 interconnect and L4 interconnect descriptions and registers. The only section explaining the interconnect bandwidth is 1.12.2.3.3, Bandwidth Regulators.
    Bandwidth regulators are mainly used to give priority to the following masters: HDVICP, TPTC_RD2/4, TPTC_WR2/4, MMU, ISS and SGX. So we need to check whether these registers (L3_BW_R_BANDWIDTH and L3_BW_R_WATERMARK) are programmed to give some of these masters priority over the Cortex-A8.  Can you provide me the register values?

    Another thing we can try is to increase the L3/L4 slow interconnect speed and the GPIO peripheral speed.

    Best Regards,

    Pavel

  • Pavel,

    Yes, I have a string of code that alternates writes to the clear and set registers, and I am watching on an oscilloscope how fast the GPIO is toggling.  If I do the same thing with GPIO_0_SET, GPIO_1_SET, GPIO_0_CLR, GPIO_1_CLR, etc., a given discrete then only toggles every 400 ns.

    Having a wider bus on the DSP vs. the ARM shouldn't impact access time to a 32-bit resource.  I agree that the bandwidth could be different, but single-word access time shouldn't be affected by the width of the bus.

    Yes, if I increase the A8 core clock it gets marginally better.  If I increase the L3 clock speed it gets better as well, but I'm hoping there is a better solution than using a sledgehammer to solve a performance issue.  There are obviously other issues caused by increasing clock speeds beyond the specified frequency.

    We are not setting any bandwidth registers, so they are at reset/default values.  I was having a hard time understanding what the default state/behavior would be, and whether it would even affect single-access latencies or just high-bandwidth data movement.  Is there arbitration involved here that is configured by these bandwidth registers?  I should also mention that, at this time, I don't have anything else active on any other core.

    Is there a way to configure interconnect speeds beyond just the L3 PLL?

  • Justin,

    Given the details in the thread, I would like you to configure the “pressure input to interconnect” as follows:

    • Set bits 1:0 of INIT_PRIORITY_0 register in the control module to “11”

     This should help prioritize ARM traffic within the interconnect. Please let me know if this helps in reducing the response latency for your test case.
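    A sketch of that read-modify-write (the register pointer here is a placeholder; check the TRM for the actual control-module address of INIT_PRIORITY_0, and note that the bits 1:0 field position is taken from the suggestion above):

```c
#include <stdint.h>

/* Set a 2-bit priority field to 0b11 in a control-module register.
 * On hardware 'reg' would point at INIT_PRIORITY_0; the ARM
 * initiator's field is assumed to be bits 1:0 per the post above. */
static void set_init_priority(volatile uint32_t *reg)
{
    uint32_t v = *reg;
    v &= ~0x3u;      /* clear bits 1:0            */
    v |=  0x3u;      /* "11" = highest "pressure" */
    *reg = v;
}
```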

    Thanks and Regards,

    Rahul

  • Justin,

    I had missed reading in your response earlier that there is "no other" concurrent traffic in your system from any other core. The above configuration parameter would help only if there is competing traffic within the interconnect. 

    Thanks and Regards,

    Rahul

     

  • Rahul,

    Just to confirm, changing the priority did not change the time required to access a peripheral.

    This wouldn't be a huge concern if the core were capable of out-of-order execution, but my understanding of the A8 is that a strongly ordered access like this is going to stall the pipeline, sacrificing about 150 clock ticks to set a peripheral register. 

    Also, gpio is my simple scapegoat of an example.  My real problem is Ethernet register access times.  

  • Here is some timing data I have for a few different peripherals, with non-cached, non-buffered, non-shared properties in the MMU.

    [CortexA8] EMAC CPPI Write ( 1 0x4a102000): 174ticks 241ns Adj:230ns
    [CortexA8] EMAC CPPI Read ( 2 0x4a102000): 180ticks 251ns Adj:239ns
    [CortexA8] DDR3 uncached Write ( 3 0x82000000): 100ticks 140ns Adj:128ns
    [CortexA8] DDR3 uncached Read ( 4 0x82000000): 116ticks 161ns Adj:150ns
    [CortexA8] OCM RAM write ( 5 0x40300000): 119ticks 165ns Adj:154ns
    [CortexA8] OCM RAM read ( 6 0x40300000): 108ticks 150ns Adj:139ns
    [CortexA8] A8 SRAM write ( 7 0x402f1000): 44ticks 62ns Adj:50ns
    [CortexA8] A8 SRAM read ( 8 0x402f1000): 42ticks 59ns Adj:47ns
    [CortexA8] GPIO write ( 9 0x481ae13c): 159ticks 221ns Adj:209ns
    [CortexA8] GPIO read (10 0x481ae138): 152ticks 211ns Adj:199ns

  • The simple question is: how many A8 clock cycles does it take to get an access onto the L3 interconnect, then how many L3 clock cycles to issue a write request to the L4, and finally how many L4 clock cycles to complete the write to a register resource?

    A8 = 720

    L3 = 200

    L4 = 100 (implicitly L3/2 I think ?)

    I might guess 4 L4 cycles, maybe 5 L3, and maybe 10 on the A8, adding up to ~(30+16+10) = 56 A8 cycles.  That would be a number I could easily accept, but 150+ cycles seems like I might have something configured less than optimally.

    I have all caches enabled, and the cache, buffer, and share properties on the previous entries are disabled.  My understanding is that this makes these strongly ordered accesses.  And since this is an A8, all accesses to strongly ordered resources will stall the pipeline for those 150+ clock cycles.
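    Writing the conversion out explicitly (a quick host-side sanity check of the arithmetic above, using the clock values listed in this post):

```c
#include <stdint.h>

/* Convert a cycle count in a slower clock domain into A8 cycles.
 * Frequencies in MHz, per the numbers above: A8=720, L3=200, L4=100. */
static int to_a8_cycles(int cycles, int domain_mhz)
{
    return (cycles * 720) / domain_mhz;
}

static int estimated_write_latency(void)
{
    int l4 = to_a8_cycles(4, 100);  /* 4 L4 cycles -> 28 A8 cycles */
    int l3 = to_a8_cycles(5, 200);  /* 5 L3 cycles -> 18 A8 cycles */
    int a8 = 10;                    /* A8-side overhead guess      */
    return l4 + l3 + a8;            /* ~56 A8 cycles total         */
}
```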

  • Justin,

    I'm not 100% sure, but could it be because of idle mode? I'm not an expert in the micro-architecture, but could there be a delay in the transaction because of idle-mode support? Can you set GPIO_SYSCONFIG to no-idle and try?
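    A sketch of that change (assuming the OMAP-style GPIO_SYSCONFIG layout, with the register at offset 0x10 and the IDLEMODE field at bits 4:3; verify both against the DM8148 TRM before relying on this):

```c
#include <stdint.h>

#define GPIO_SYSCONFIG_WORD       (0x10u / 4u)  /* offset 0x10 as a word index */
#define SYSCONFIG_IDLEMODE_MASK   (0x3u << 3)   /* IDLEMODE field, bits 4:3    */
#define SYSCONFIG_IDLEMODE_NOIDLE (0x1u << 3)   /* 0x1 = no-idle               */

/* Force the GPIO module to no-idle. On hardware 'base' would be the
 * GPIO instance base address, e.g. (volatile uint32_t *)0x48032000. */
static void gpio_force_no_idle(volatile uint32_t *base)
{
    uint32_t v = base[GPIO_SYSCONFIG_WORD];
    v &= ~SYSCONFIG_IDLEMODE_MASK;
    v |= SYSCONFIG_IDLEMODE_NOIDLE;
    base[GPIO_SYSCONFIG_WORD] = v;
}
```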

  • Renjith,

    Thanks for the reply, but no, the interface is not in idle mode.  If I let it go to idle mode, the access time is closer to 500 ns.

  • Justin,

    This is interesting info. But if you look at the table you've posted for the different peripherals, GPIO is not that bad compared to un-cached DDR access :)

    Can you try another experiment? Run the A8 on the bypass clock (~20 MHz) and see whether the access time is still close to 200 ns or whether it goes up. If it doesn't go up much, then the real issue has to be within the GPIO controller itself; if it goes up, then we have to suspect the L3/L4.

  • I'll give it a try, but I've tried changing the frequency of the A8 with little change.  I've changed the frequency of the L3 with significant change.  Unfortunately, the datasheet says the L3 is limited to 200 MHz.  And like I said, GPIO was just my L3/L4 scapegoat to figure out whether I have the system configured optimally for accessing L3 and L4 resources.  They all seem slower than I expected, and I want a document or confirmation that it's the best it can be before just accepting the performance as-is.  The only exception: looking at the architecture, it doesn't look like the DDR3 goes through the L3.  The L3 is causing me performance problems with my Ethernet drivers accessing registers and CPPI RAM.

  • Ok, I placed it into bypass, and this is what I see for resource access times.  I'm guessing this means it is mostly the sheer number of ticks to get through the A8 subsystem and MMU.  Is 50-60 ticks expected?  Of my ~150 ticks to access EMAC or GPIO, likely 50 of them are due to the ARM core and 100 ticks due to the L3 (or 28 ticks in that clock domain).  Still seems higher than I would expect for both.  These are the types of numbers I thought TI would have published, but I can't find them anywhere.

    [CortexA8] Frequency 20000000Hz

    [CortexA8] EMAC CPPI Write ( 1 0x4a102000): 59ticks 2958ns Adj:2555ns
    [CortexA8] EMAC CPPI Read ( 2 0x4a102000): 53ticks 2674ns Adj:2271ns
    [CortexA8] DDR3 uncached Write ( 3 0x82000000): 54ticks 2740ns Adj:2336ns
    [CortexA8] DDR3 uncached Read ( 4 0x82000000): 55ticks 2798ns Adj:2395ns
    [CortexA8] OCM RAM write ( 5 0x40300000): 52ticks 2638ns Adj:2235ns
    [CortexA8] OCM RAM read ( 6 0x40300000): 48ticks 2438ns Adj:2035ns
    [CortexA8] A8 SRAM write ( 7 0x402f1000): 44ticks 2248ns Adj:1845ns
    [CortexA8] A8 SRAM read ( 8 0x402f1000): 50ticks 2500ns Adj:2097ns
    [CortexA8] GPIO write ( 9 0x481ae13c): 64ticks 3248ns Adj:2845ns
    [CortexA8] GPIO read (10 0x481ae138): 52ticks 2636ns Adj:2233ns

  • Justin,

    There are a couple of things here. Since your code, *(volatile int *)(0x48032190)=0x00000008;, uses volatile, at least three assembly instructions will be generated for this statement. The exact instructions depend on the compiler and optimization level. The GPIO toggle happens only after the load and move instructions. Can you profile the exact time taken from the store until the GPIO toggle occurs?

    a. LDR (load the address from instruction memory)

    b. MOV (move the value 0x8 to register)

    c. STR (store the value 8 to the address)

    Also, you can check the following.

    1. Instruction cache is enabled (assuming to be enabled)

    2. Data cache is enabled (assuming to be enabled)

    3. The optimization levels of the compiler

  • I'm not sure how you would profile the time from the start of STR to the toggle actually occurring.  I'm working on learning to use Trace to capture the time to execute the STR instruction.

    Instruction cache is enabled.

    Data cache is enabled.

    Optimization is -O3

  • The STR is when the actual write starts; all the other instructions run as a prelude to it. We need not take the other instructions into account, as they can vary with factors such as optimization level, use of the volatile type qualifier, etc. I feel that profiling only the STR instruction will give the exact latency.

  • Ah, you just want the number of clock cycles for the STR instruction.  It sounded like you wanted it correlated to the state change on the pin.

    But unfortunately, today is the first time I've used Trace, and I'm not seeing data that makes sense to me.  I have a block of code that my timestamp says takes 1700+ clock cycles, while the trace cycle-count column says ~300.  So the raw trace number is 19 cycles for the STR GPIO access, but I don't trust that the definition of a cycle is the same as my A8 core frequency.

  • One more experiment: in the auto-idle case, ~300 ns gets added to the STR instruction only. If you then look at the cycle count, it should become clear what one trace cycle translates to.

  • I looked into it; it looks like the extra time was due mostly to instruction caching.  I had the GEL file forcefully waking up the GPIO clock:

    WR_MEM_32(CM_ALWON_GPIO_0_CLKCTRL, 2)

  • Rahul and Pavel,

    I'm still waiting for some TI support here on this topic.  What is the expected clock cycles to get from the A8 to the L3, and how many L3 clock cycles can I expect for EMAC registers and CPPI RAM?

  • If you re-run the same instruction again, how much time does it take, given that it should now be cached?

  • Here is my simple test code for timing EMAC CPPI RAM.  The GPIO access time was just the ah-ha moment for why our Ethernet was not performing as we might expect.  The first call to profile_write_time returns 295 ticks (presumably uncached instructions), and 160 to 168 ticks for subsequent calls.

    /* Reads the Cortex-A8 cycle counter (PMCCNTR via CP15). This relies on
       the calling convention: MRC leaves the result in r0, which is also the
       return register, so no explicit return statement is needed. The PMU
       cycle counter must already be enabled. */
    unsigned int time32(void)
    {
        asm(" mrc p15, #0, r0, c9, c13, #0");
    }

    int profile_write_time(register int *address)
    {
        register int time1;
        register int time2;
        register int time3;
        volatile int retVal;
        time1 = time32();
        time2 = time32();          /* time2 - time1 = overhead of one time32() call */
        *address = (int)address;   /* the write under test */
        time3 = time32();
        /* measured interval minus the timer-read overhead */
        retVal = (time3 - time2) - (time2 - time1);
        return retVal;
    }

    printf("%d CPU ticks\n",profile_write_time((int *)0x4A102000u));

    And yet, this is the trace I am seeing for a ~160-tick iteration.  22 cycles is clearly not 160.  It's as if the cycle index is based on a 100 MHz clock, but I don't know where that would be coming from.

    Instruction Instr Addr Read Addr Write Addr Cycle Index Cycle delta
    MOV             R12, R0 0x80106B54 697 0
    BL              0x80106B48 0x80106B58 697 3
    MRC             P15, #0, R0, C9, C13, #0 0x80106B48 700 0
    BX              R14 0x80106B4C 700 5
    MOV             R2, R0 0x80106B5C 705 0
    BL              0x80106B48 0x80106B60 705 2
    MRC             P15, #0, R0, C9, C13, #0 0x80106B48 707 1
    BX              R14 0x80106B4C 708 4
    MOV             R1, R0 0x80106B64 712 22
    STR             R12, [R12] 0x80106B68 734 1
    0x4A102000 735
    735 8
    BL              0x80106B48 0x80106B6C 743 1
    MRC             P15, #0, R0, C9, C13, #0 0x80106B48 744 3
    SUB             R12, R0, R1, LSL #1 0x80106B70 747 1
    ADD             R12, R2, R12 0x80106B74 748 0
  • Justin,

    What is the real problem? Are you facing a throughput issue in Ethernet driver? If so, what is the current throughput that you are getting and what is the expected throughput?

  • I'm trying to optimize our Ethernet stack to be the best it can be, so yes, in a way it's a throughput issue because the driver consumes more CPU cycles than I would have expected.  I don't want to talk about things at a high level because that just muddies the waters for my use case.  I want to know whether it is expected and normal for an access to a specified resource to take 200 clock cycles, be it a GPIO register, CPPI RAM, EMAC registers, OCM RAM, etc.  That information has a huge bearing on how I write my application and drivers.  I'm concerned that I don't have something configured or set up right, or just not optimized for the ARM core to access those resources, and I'm trying to figure out whether the latency on access time is normal/expected/explainable.

    160 clock cycles to access an on-die RAM resource just doesn't seem right.  If that's the best I can do given the architecture of the processor, I'd like to know that, and that closes my issue.  If I can do better, I'd like to figure out what I have configured incorrectly so we don't abandon this processor over a potential configuration issue.

  • Justin,

    Can you just let me know the current throughput? I believe you might have explored the use of DMA, etc. Also, have you explored the use of burst transfers from the ARM itself?

  • This is why I started the thread with GPIO as the focus: I didn't want things to get muddied and sidetracked away from my core concern/question, so I'd rather not publish any high-level throughput number.  Ethernet performance is not the direct issue in question, because the same issue exists for all the L3 resources.  I don't want help solving a larger-scale Ethernet stack performance question; I'll start a different thread if it comes to that.  For this thread, I want help understanding whether the access time to L3 resources is what I should be seeing.

  • Justin,

    I'm sorry, I won't be able to help further. Somebody who is familiar with the micro-architecture of the 8148 might be able to help you. Hopefully a TI person responds.

  • In your experience, is the trace cycle count supposed to reflect the full-frequency CPU cycle count?

    So ignore what I said about Ethernet; let's go back to GPIO access time.  Do you have any further ideas on how I could have it configured wrong?

  • Justin,

    I haven't tried measuring trace cycle count.

    Have you tried a simple loop where you keep toggling a GPIO line back to back and looked at the waveform on an oscilloscope, without idle mode? I guess that will clearly show what exactly the delay is.

  • That is the exact test I did before I posted anything to the forum.  The fastest I can toggle a GPIO pin is 200 ns.  And with a 720 MHz CPU and a 200 MHz L3 interconnect, I expected much faster than that; somewhere around 50 ns would have been more reasonable and would not have grabbed my attention that I might have something configured wrong.  Not only that, but a co-worker tells me they can toggle the same GPIO from the DSP running at 500 MHz at a 70 ns rate.

  • Justin,

    Can you just write assembly code with multiple STR instructions to keep toggling the GPIO? The problem with C code is that you keep executing additional instructions as well. 
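    A sketch of what that looks like from C (a hedged alternative to hand-written assembly: with the mask and both addresses held in registers, an optimizing compiler reduces each statement to a single STR; on hardware the pointers would again be GPIO_SETDATAOUT/GPIO_CLEARDATAOUT):

```c
#include <stdint.h>

/* Eight back-to-back stores with no loop overhead: each statement
 * compiles to one STR once the mask and addresses are
 * register-resident, so only store latency is measured. */
static void toggle_unrolled(volatile uint32_t *set_reg,
                            volatile uint32_t *clr_reg,
                            uint32_t mask)
{
    *set_reg = mask; *clr_reg = mask;
    *set_reg = mask; *clr_reg = mask;
    *set_reg = mask; *clr_reg = mask;
    *set_reg = mask; *clr_reg = mask;
}
```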

  • Justin,

    Can you try the same experiment on the M3 core and the DSP core as well? That will give an idea of whether the L3/L4 bus is causing the trouble or the ARM-L3 interface is the bottleneck here. In the DM8148, people are doing high-bandwidth video processing using DMA; if such an issue existed, it should have shown up there at least.

  • The writes to GPIO from the DSP are much faster: 40 ns per access (until the pipeline fills up).  It appears as though the DSP on this chip allows a certain amount of out-of-order execution.  I had all operations on the same D1 instruction bank and see the pin toggle at a 40 ns rate for a burst of 8 to 10 toggles.  That is a very different number than the ARM's roughly 200 ns per write, but I know the A8 doesn't allow out-of-order execution, and not only that, this access would be categorized as strongly ordered, and I don't know what that is going to do to the ARM pipeline.  I don't know whether it flushes it or just stalls it; either way, it's slow.

    However, this does imply that the DSP is arbitrating for the L3 interconnect, and if it can keep ownership of the interconnect, it takes about 6 or 7 L3 clock cycles per access.  The ARM, on the other hand, may not be able to issue requests fast enough, such that it somehow goes idle or something, in which case each request takes 40 clock cycles on the L3.  So it certainly seems the L3 can perform better than what I'm seeing for single ARM accesses, but I can't figure out why, or what to do about it.

    As for the M3s on chip, I haven't figured out whether I can use them for general programming.  They don't show up on the JTAG ICEPick.

  • Justin,

    That is very good info!  If you select TI814x in the CCS target configuration, it will show all the cores properly.

  • Right, and the only ones it shows are the A8 and the DSP; the other connection points are things like the STM and ETB.  There is no connection point for the M3.

  • Can you share a snapshot of your CCS configuration? I'm able to see all the cores here if I select "TI814x" instead of DM814x, EVM814x, etc. 

  • That is not a valid configuration in my install.  Just for clarity, which configuration are you talking about?  

  • Justin,

    I'm not able to upload the snapshot here as my Silverlight plugin is failing. If you can send me an email, I can mail it to you.

    I'm talking about this: when you open the .ccxml file of your project, you'll see three tabs, "Basic", "Advanced" and "Source". You can select a target in the "Basic" window, where you'll see a list of target platforms. If you type "TI814x" in the filter, you'll see a list item. If you select it and go to the "Advanced" tab, you'll see all the cores. I see 3 ARM9s, 1 DSP, 1 Cortex-A8, and 3 Cortex-M3s (8 cores total) in my list.

  • Thanks,

    No, they weren't an option, but dm8148.xml had them there, just commented out, so now I'm seeing them.  I'm wondering which ones are available for custom programming...

  • Try M3-ISS or M3-Video. You might have to run a GEL file to connect to it. 

  • The GEL doesn't appear to be set up to take those cores out of reset; I don't think it's worth continuing down the M3 route at the moment.  

    Why would I be able to write at 40 ns from the DSP when it takes 200 ns to execute a similar instruction on the ARM?

    The reads on the two cores have very similar access times, both in the neighborhood of 200 ns.  It's almost like the ARM waits for some sort of ack, while the DSP doesn't and just moves on to the next instruction once it has issued the interconnect request.

  • It makes sense to try the M3 because both the A8 and the M3 have a 64-bit-wide bus interface, whereas the DSP has a 128-bit-wide interface.

  • Jansen,

    Is there any update about the issue?

  • No luck actually getting any custom code to run on the M3.  I've seen other posts on the forum from TI saying this should not be done anyway; it is intended to be a black-box part of the system.

  • I really don't think you should consider the M3 a black box. If you have the EZSDK or AVBIOS package, there are lots of example applications available. You can just compile one of them and try to execute the app. From there you can try your code easily.

  • Justin,

    Did you ever solve this? I'm seeing +200ns for the DSP to read a single value from the CPPI.

    - Andrew

  • The simple answer is: that's just how long it takes... live with it. ;)

    I actually found that for Ethernet CPPI it was faster to use OCM for the descriptor space.  But in general, 150-200 ns is a reasonable number for a single read access across the L3 interconnect on this processor.  You can speed the L3 up to 220 MHz, but beyond that there's not much you can do for reads.  So, read from it ONLY when you HAVE to.  
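    A rough sketch of that workaround (the four-word descriptor layout follows the usual CPPI buffer-descriptor shape, and the addresses in the comment come from the timing tables earlier in the thread; field names here are illustrative, not the driver's actual definitions):

```c
#include <stdint.h>
#include <stddef.h>

/* Four-word CPPI-style buffer descriptor (illustrative layout). */
typedef struct cppi_desc {
    uint32_t next;       /* pointer to next descriptor   */
    uint32_t buffer;     /* data buffer pointer          */
    uint32_t off_len;    /* buffer offset / length       */
    uint32_t flags_len;  /* SOP/EOP flags, packet length */
} cppi_desc_t;

/* Link a pool of descriptors into a chain. On hardware the pool would
 * be placed in OCM RAM (0x40300000 in the tables above) rather than
 * the EMAC's CPPI RAM (0x4a102000), trading the ~200 ns CPPI reads
 * for the somewhat faster OCM accesses measured earlier. */
static void init_desc_chain(cppi_desc_t *pool, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        pool[i].next = (uint32_t)(uintptr_t)&pool[i + 1];
    pool[n - 1].next = 0;   /* terminate the chain */
}
```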

  • Thanks for the response, Justin. Like you, perhaps, I'm doing custom stuff with Ethernet descriptors. I worked out that my read was taking 156 cycles. Well, if my numbers are "expected", I guess I'll take another approach and rewrite a bunch of code :-).