Looking for details about AMBA AXI and SCR

Hi,

I am requesting some help regarding AXI and SCR components.

The TMS570 documentation refers to the generic AMBA AXI documentation, but the implementation specifics are not detailed, nor is the way the bus is used by the Cortex-R4F:

  • Which AXI revision is implemented (AXI3 or AXI4)?
  • What is the data bus width (64-bit, as far as I understand)?
  • Are the address and data buses shared or not? Is one address bus shared among several data buses? Is a multilayer interconnect used?
  • Which burst type(s) are used (normal/wrapping/streaming)? For how many words? In which cases?
  • What is the transfer latency for the various accesses, especially for DMA transfers and EMIF accesses?

Regarding the SCR, how does arbitration priority management impact delayed requests?

These last points are crucial for me: we must produce hard real-time avionics software, and under our certification constraints this is only possible if the processor behavior is deterministic. We therefore need to model the processor's behavior and prove that every task has a bounded worst-case execution time (WCET).

Where can I find such information?

Thanks in advance for any help

Best regards

Christophe

[edit]: one more question: how does arbitration work between the DMA and the CPU when both need to access the TCM? (Is there a difference between ATCM and BTCM access? Fixed priority, round robin, ...?)

  • Hi Christophe,

    The Hercules MCUs are built for real-time, deterministic behavior. We go the extra mile to ensure cycle accuracy among family members as well. Thanks for explaining your rationale for needing this information. I'm going to ask one of our experts to respond to you. Could you please confirm that you are using a TMS570LS20x or 'LS10x MCU?

  • Hi Brian,

    Thanks for your help.

    I am using a Keil evaluation board MCBTMS570 (CPU+IO boards) with a TMS570LS20216ZWT.

    I know that this processor family is designed to be highly deterministic, but the avionics DO-178 standard requires detailed knowledge of the processor's internal operation for the certification process.

    E.g., if we want to use the BIST facilities, we must demonstrate that the associated embedded TI firmware complies with DO-178 rules. If we do not want to use them, we must demonstrate that the BIST controller will never interfere with our software... (maybe that will be another forum topic, coming soon ;-) )

     

    Best regards

    Christophe

  • Hello Christophe,

     

    You should have a look at our safety manual; some of the questions related to certification are addressed in it.

    Concerning your questions:

    The Cortex-R4F and Cortex-R5F implement AXI3. For details of the implementation, the "Level 2 Interface" section of the CR4F and CR5F TRMs describes the subset of the bus that is used.

    The bus is 64-bit wide on the AXI master and slave ports, as well as on the TCMs.

    The main SCR is multilayer, and the burst types are the ones specified in the CR4F TRM.

    The CPU/DMA arbitration at the TCM boundary (each TCM has its own arbiter) is as specified in the TRM: the CPU accesses have the higher priority, and if the CPU is occupying the full bandwidth, it grants the DMA one transaction every 15 transactions.
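
    As a rough worked example of what that policy implies (assuming the CPU issues one TCM transaction per cycle and fully loads the port): the DMA gets roughly one slot in every 16 transactions, so in the worst case a single DMA access can wait on the order of 15 TCM cycles before being serviced.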

    On the R4 BTCM, we interleave the RAM addresses between the two banks, so we are sure that the DMA sees at most one cycle of latency getting to the RAM.

    Let me know, if you need more information.

    Regards,

    Alex.

     

     

  • Hi Alex,

    Thanks a lot for your answer.

    I still have some questions regarding AXI/SCR nevertheless...

    I have just read some of the documents you recommended (all except the safety manual so far), and in particular I still cannot figure out how the EMIF is serviced. I ran simple tests reading and writing data from/to external RAM through the EMIF:

    dataAddr
        .word 0x60000000            ; base address of the external memory range (EMIF)

    testFunc
        ldr    r7, dataAddr         ; r7 = external RAM base address

        ; 2000 halfword loads, one every 4 bytes (16 bits out of each 32-bit word)
        ldrh   r0, [r7,#0]
        ldrh   r1, [r7,#4]
        ldrh   r2, [r7,#8]
        ldrh   r3, [r7,#12]
        ldrh   r0, [r7,#16]
        ldrh   r1, [r7,#20]
        ldrh   r2, [r7,#24]
        ldrh   r3, [r7,#28]
        ...x2000

        ; 2000 halfword stores, same address pattern
        strh   r0, [r7,#0]
        strh   r1, [r7,#4]
        strh   r2, [r7,#8]
        strh   r3, [r7,#12]
        strh   r0, [r7,#16]
        strh   r1, [r7,#20]
        strh   r2, [r7,#24]
        strh   r3, [r7,#28]
        ...x2000

    Note: I read/write only 16 bits out of each 32-bit word to avoid a potential linefill effect.

    When I count cycles (with the PMU cycle counter), I cannot get below:
    - 28 cycles per LDRH instruction (an average of 27.8 cycles/instruction over 2000 LDRH instructions)
    - 18 cycles per STRH instruction (an average of 18.3 cycles/instruction over 2000 STRH instructions)

    The EMIF is configured with:
    SETUP TIME     0 => 1 cycle
    STROBE TIME    1 => 2 cycles
    HOLD TIME      0 => 1 cycle
    TURN ARND CYC  0 => 1 cycle
    which should give 5 EMIF cycles for an individual read/write and 4 EMIF cycles for consecutive reads/writes, according to the TMS570 TRM.

    Because the EMIF clock frequency is half the core frequency (160 MHz), I expected an individual read/write to take 10 core cycles, and consecutive reads/writes 8 core cycles each.

    I also tried executing instructions located in this external RAM (copied from flash to RAM, then branched to directly or after a POM redirection). I measured 10 core cycles per 16-bit Thumb-2 instruction (and 20 core cycles per 32-bit instruction), which is a bit closer to theory (even though I expected to get close to 8 cycles per 16-bit instruction).

    How can we explain such a gap (28 vs. 8)?
    Is it a delay due to the AXI?
    To the SCR?
    To the EMIF? (I don't think so, given the "correct" values while executing instructions.)
    Is it a matter of priority (although I cannot imagine any conflict)?
    Of the AXI clock frequency? (I have assumed it is equal to the core clock frequency, but I have not been able to find this information so far... could you confirm?)
    Or something else I did not imagine?

    Sorry for the long explanation; I tried to preempt further questions from your side about this experiment.

    Thanks again for your support!

    Best regards

    Christophe

  • Hi,

    No answer? No clue at all? Anyone?

    I ran more EMIF tests, similar to the tests above, with different data sizes:

    - LDRB, LDRH ==> 27.8 cycles/instruction
    - LDR ==> 35.8 CPI
    - LDRD ==> 51.7 CPI
    - STRB at contiguous addresses ==> from 1 to 6 CPI, depending on the number of distinct addresses
    - STRH at contiguous addresses ==> 14.8 CPI
    - STR at contiguous addresses ==> 17.3 CPI
    - STRB, STRH and STR at non-contiguous addresses ==> 21.6 CPI
    - STRD at contiguous addresses ==> 33.3 CPI
    - STRD at non-contiguous addresses ==> 27.7 CPI

     

    I also ran the same tests with internal RAM. For all read instructions (LDRx), I get an average of 1 CPI, which conforms to the specification.

    But for write operations, I get:
    - STRB, STRH and STR ==> 2.5 CPI
    - STRD ==> 1.68 CPI

     

    I am really lost with all these results:
    - why does it take so long to read data from external RAM through the EMIF?
    - how does address contiguity affect store operations?
    - how can STRB through the EMIF write data at down to 1 CPI (well below 8 CPI!)?
    - how can we explain such STRx results with internal RAM?

     

    I am interested in understanding how it works and in obtaining the best performance, but above all I really need to know the worst case for each of these accesses.

    Thanks for any help.

    Best regards

    Christophe

     

  • Hello Christophe,

    Sorry for the delayed response. The EMIF used on the TMS570LS20x/10x series of products was originally intended for flash overlay memory for firmware calibration. It is optimized to provide a wide range of connectivity options at low silicon cost rather than to serve as a primary application memory. In addition, the level 2 AXI interface is designed by ARM for high sustained throughput, but it is not optimized for latency. The AXI is deterministic, but it is quite complex compared to many interconnect systems, and it can be difficult to calculate the exact cycles for a given transaction.

    There are fixed cycles consumed in the datapath for a full transaction:

    1. Delay from LSU/PFU to initiation of transaction by L2 AXI master (internal to CPU)
    2. Pipeline buffers on the SCR
    3. Clock domain crossing and data width reduction between SCR and EMIF
    4. Delays internal to the EMIF controller and external memory
    5. Clock domain crossing and data width change between EMIF and SCR
    6. Pipeline buffers on SCR
    7. Delay from L2 AXI master to LSU/PFU (internal to CPU)

    If you do a single transaction, you will see the full latency, which can be 20+ cycles as you are seeing.  For bursts/pipelined transactions, this will typically reduce to less than half of the single transaction (on average) when the software is optimized for the system.

    To get best performance out of the EMIF, you should consider:

    • Check the memory protection unit and review the memory attributes set for the region. In particular, if write buffering is not set, the L2 AXI interface will wait for the completion of one write before issuing a second write. For the EMIF you generally want Normal memory type, shared and buffered.
    • Confirm the clock configuration of the EMIF relative to the CPU and interconnect. For the lowest latency you want the divider between the clock domains to be as small as the datasheet allows.
    • Take advantage of the 64b interface - use word or doubleword accesses rather than the byte or half-word transactions in your example code. A 16b transaction takes the same amount of time as a 64b transaction from the CPU's and SCR's perspective (see ARM R4 TRM r1p3 sections 9.3.5, non-cacheable reads, and 9.3.6, non-cacheable writes).
    • Consider the impact of write merging - when you issue multiple byte or half-word writes, the CPU may combine them into a single 64b bus transaction (ARM TRM r1p3 section 9.3.8). This saves power and reduces memory transactions, but it can impact latency and result in non-dependent writes occurring out of order.
    • Use bursts when possible rather than singles (i.e. LDM, not LDR) - the interface is optimized for burst transactions. A sketch contrasting the access types follows this list.
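
    As a rough illustration of the last two points (the base register and offsets are only examples, not taken from your test code):

        ; four separate non-cacheable halfword reads: one bus transaction each
        ldrh   r0, [r7,#0]
        ldrh   r1, [r7,#2]
        ldrh   r2, [r7,#4]
        ldrh   r3, [r7,#6]

        ; the same 8 bytes as a single 64b transaction
        ldrd   r0, r1, [r7,#0]

        ; 16 bytes as one burst transaction
        ldm    r7, {r0-r3}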

     

    I would also recommend that you use the PMU for your cycle measurements rather than the RTI. The latency of a PMU access is lower, and you have the ability to monitor many events generated by the L2 AXI controller to better understand your system. A minimal counting sequence is sketched below.
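
    A minimal sketch of the cycle-counter sequence, as a starting point (CP15 encodings as documented in the ARM Cortex-R4 TRM; this must run in a privileged mode):

        mov    r0, #5                     ; PMCR: set E (enable) and C (cycle counter reset)
        mcr    p15, 0, r0, c9, c12, 0
        mov    r0, #0x80000000            ; PMCNTENSET: enable the cycle counter (bit 31)
        mcr    p15, 0, r0, c9, c12, 1
        ; ... code under measurement ...
        mrc    p15, 0, r0, c9, c13, 0     ; PMCCNTR: elapsed CPU cycles -> r0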

     

    Best Regards,

    Karl

  • Hello Karl,

    Thanks a lot for all those explanations, I begin to better understand a lot of stuff (I hope so at least...).

    However, could you tell me a bit more about those various delays and/or mechanisms involved by CPU/EMIF transactions :

    - Regarding the L2 AXI, what is the write-merging capability (when enabled)? As far as I understand, a write transaction is only issued when a write request does not fall within the same 32-byte aligned area (256 bits) as a previous write request. Could you confirm?

    - Regarding the SCRs (the primary SCR and the EMIF-dedicated SCR), what are the transfer durations (assuming no conflict with another master)? Are they constant? In both directions? Do they depend on the data width? Is there also a write-merging capability? Anything else...?

    - Regarding the clock domain crossing between the SCR and the EMIF, can we model its behavior? How does it deal with clock (mis-)alignment? Is there any timing diagram available?

     

    Another topic I asked about above is internal RAM access: for write operations, I get:
    - STRB, STRH and STR ==> 2.5 CPI (average)
    - STRD ==> 1.68 CPI (average)
    I guess read/modify/write operations are involved for STRB, STRH and STR, but how can we explain such figures (1.68 and 2.5 CPI)?

     

    Finally, one more question concerns flash access: at 160 MHz, when I read constant data located in flash with my instruction code located in internal RAM (this way I know I can execute one instruction per cycle, and the LSU is the only element accessing the flash), I can execute LDRx instructions at an average of 6 CPI (reading non-contiguous data). I configured the flash in pipeline mode with 1 address wait state and 3 data wait states, so I wonder why I do not reach 5 CPI. Could you explain that? Is it due to the access coming from the LSU rather than the PFU? What is the initial pipeline-fill penalty when executing code in this configuration? In particular, how many clock cycles does a branch take (e.g. 9 instruction cycles + 6 cycles for filling the flash pipeline?...)?

    With a "usual" SW (with both code and const located in flash) I also wonder how accessing a single constant data may impact the flash pipeline continuity ? Is continuity broken ? In other words, is the flash pipeline reconfigured when read request comes from LSU ? Does a flash pipeline chronogram exist ?

    I must confess that I tried many combinations of code accessing constant data (both located in flash), and I get results from 1.33 CPI to 7 CPI that I can hardly explain...

    Thanks again for your help

    Best regards

    Christophe

  • Hi Christophe,

    I will try to answer your questions one at a time :)

    The write-merging capability of the Cortex-R4F on L2 AXI transactions is documented by ARM in the Cortex-R4F r1p3 TRM, section 9.3.8, "Normal Write Merging". Basically, if the CPU detects that there are multiple transactions requested for the same 64b aligned word, it merges them into a single transaction. This is done for two reasons: to reduce the number of transactions initiated on the external interconnect and to improve power efficiency.
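
    A rough illustration (the offsets are only examples; whether merging actually happens depends on the state of the store buffer):

        strh   r0, [r7,#0]         ; same 64b-aligned doubleword as the next store:
        strh   r1, [r7,#2]         ; a candidate for merging into one transaction
        strh   r2, [r7,#8]         ; next doubleword: a separate transaction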

    Regarding the SCRs, there are fixed pipeline-stage delays on the main SCR. I do not have the details of the cycle delays, but they are constant. All transactions are 64b; only the valid byte strobes change for smaller transactions. If a transaction is not aligned to a 64b boundary, it must be broken into multiple 64b transactions. There is no write merging done by the interconnect system; only bus masters merge writes.
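
    For example (addresses chosen only for illustration): a doubleword access at 0x60000004 straddles a 64b boundary and so appears on the interconnect as two 64b transactions, whereas the same access at 0x60000008 is a single one.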

    Regarding the clock domain crossing from the SCR to the EMIF, this is a synchronous crossing. The minimum time through the clock bridge should be one master clock plus one slave clock. The EMIF itself manages the synchronization to asynchronous memories; there is no asynchronous clock bridge on this path in the interconnect.
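
    As a rough worked example with your clock settings (assuming the 2:1 ratio you mentioned, CPU at 160 MHz and EMIF at 80 MHz): one master clock plus one slave clock is 1 + 2 = 3 CPU cycles per crossing, so the two crossings alone add at least about 6 CPU cycles on top of the EMIF strobe timing, before counting the CPU-internal and SCR pipeline stages.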

    Regarding the timing you see to the TCM memories, you are correct: if you write in a size smaller than 64b with ECC enabled, a read/modify/write operation is performed automatically in order to update the 64b word and keep the ECC correct. Please also take into account the effects of register congestion in your experiments; in some cases a delay could be due to a pipeline bubble inserted to avoid a register-use conflict.
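
    As a rough sanity check (not a cycle-accurate model): with ECC enabled, each sub-64b store occupies the TCM port for at least a read cycle plus a write cycle, so a back-to-back stream of STRB/STRH/STR cannot sustain 1 CPI, and something in the 2-3 CPI range, as you measured, is plausible. A 64b STRD needs no read-back, which is consistent with it being faster.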

    I will continue shortly on the example given.

     

    Regards,

    Karl

  • Hi Christophe,

    Regarding the example given, you also need to consider a few points which can influence the CPI:

    • Branch prediction - the CPU predictively prefetches branch targets based on the results of the last 256 branches encountered
    • Prefetch - the CPU prefetches as necessary to fill a prefetch buffer; up to 4 instructions can be fetched in a single clock cycle
    • Limited dual issue - depending on the instruction sequence, it is possible to execute two instructions in a single clock cycle
    • Flash wrapper word size - every flash wrapper clock, a 128b word is fetched and stored in a local buffer; this can hold up to 8 instructions (see the arithmetic below)
    • Flash local buffers - separate buffers exist for instructions and for data fetched from flash
    • Instruction alignment - instructions should be aligned on word boundaries to ensure the most efficient prefetching
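
    As a quick back-of-the-envelope illustration of the wrapper width (using the wait-state figures from your own configuration, not a TI specification): a 128b fetch can hold eight 16-bit Thumb instructions, so even with several wait states per flash access the sustained fetch rate can approach one instruction per CPU cycle once the pipeline and local buffers are primed - which is why the measured CPI depends so strongly on alignment, branches, and buffer hits.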

    The scheme is deterministic, but the large number of dependent variables makes it quite difficult to explain. This is an area where we are trying to improve the detail of our future customer documentation without overwhelming customers.

     

    Regards,

    Karl

  • Hello Karl,

    About write merging, you wrote:

    "if the CPU detects that there are multiple transactions requested for the same 64b aligned word, it will merge these into a single transaction."

    Are you sure about 64b? The Cortex-R4F r1p3 TRM, section 9.3.8, states: "The STB can detect when it contains more than one write request to the same cache line." That is why I thought it concerned 256 bits (32 bytes), and why I asked for confirmation.

    Regarding read/modify/write operations, I confess I did not think about the ECC configuration. I did not explicitly enable ECC, but it seems to be active by default, contrary to what I thought. I am going to explore this avenue... Nevertheless, I am still a bit stuck with the 1.68 CPI for STRD into internal RAM.

    Regarding pipeline bubbles, I do my best to avoid them: I carefully read the "Cycle Timings and Interlock Behavior" chapter of the Cortex-R4 TRM on this subject. For my measurements I also take care to avoid branches and dual-issuing pairs of instructions, and the instructions are aligned on word boundaries, so I really try to feed the execution unit one instruction per cycle.

    I remind you that I execute roughly 4000 STRx (or LDRx for reading) instructions, count the cycles with the PMU, and then divide the result by 4000 to get an average CPI. That limits the impact of:

    • the PMU start/stop instructions (about 23 clock cycles)
    • the initial register initialization (also a few clock cycles)
    • the initial/final pipeline fill due to the branch made to call the test function (whose clock timings are still "mysterious"...)

    I have inserted below a short listing (reading data from flash) where you can see the addresses/instructions.

    Regarding flash local buffers, you wrote: "separate buffers exist for both instruction and data fetched from flash."

    Is this documented anywhere (buffer type, size, delays...)? I cannot remember having read anything about that so far, but it may help explain some of my results.

    Regarding the flash pipeline (initial fill delay, "burst" strategies, branch impact), do you have any more detailed answers?

    Thanks for helping

    Best regards

    Christophe

     

     

  • You were right about ECC: it is enabled by default. After disabling it, I finally reached 1 CPI for every write access into internal RAM!

    Knowing this, I will try to make sense of my previous measurements (1.68 CPI for STRD and 2.5 CPI for the other STRx). Soon...

    Best regards

    Christophe

  • Hi

    I am working on a motor control project and want to simulate it first, before prototyping. My motor network is not very big, and what I want to sense is the current drawn and the torque delivered through the shaft for each motor. Can you suggest which microcontroller would best suit my needs? Would a Hercules ARM MCU be a good fit, or should I look for a lower-end controller? Is a Proteus simulation model available for this microcontroller?
