AM3359: ARM latencies to PRU

Jiri Biel

Part Number: AM3359
Other Parts Discussed in Thread: SYSBIOS

I would like to ask for the latencies on AM335x from the ARM side to the PRU memories and HW blocks (in cycles).

Something similar to what you provided for the AM335x PRU latencies would be ideal (best case is enough).

Are the latencies the same for all AM335x variants when the clock speed and other configuration are the same, or is there any difference?

With best regards

Jiri Biel

over 7 years ago

0 Biser Gatchev-XID over 7 years ago

TI__Guru**** 393215 points

The factory team have been notified. They will respond here.

0 Jiri Biel over 7 years ago

Intellectual 865 points

Part Number: AM3359

Hello,

I would like to ask what the PRU-to-ARM interrupt latency is for the following cases:

1. The expected minimum from a HW point of view

2. On TI RTOS

3. On Linux RT

4. On Linux

With best regards

Jiri Biel

0 JJD over 7 years ago in reply to Jiri Biel

TI__Guru* 86820 points

Jiri, we don't have this specific data, but based on the architecture, I believe access from the ARM into the PRUSS should be on the order of 40 cycles. The latency will be the same across all AM335x devices.
On the subject of interrupt latency, you will have to check the forums for each of those software offerings. I believe the latency numbers may be available in their release notes.

Thanks,
James

0 Jiri Biel over 7 years ago in reply to JJD

Intellectual 865 points

Hi James,
the question was about interrupt latency from PRU to ARM (not ARM to PRU; as far as I know there is no real interrupt on the PRU side, only a PRU INTC bit can be set). From the Linux point of view, I found via the following link:

processors.wiki.ti.com/index.php

that the measured minimum was 10 us. This seems far from 40 cycles.

Could you please provide the figure for the RTOS? I looked in the release information but was not able to find it.

With best regards
Jiri

0 JJD over 7 years ago in reply to Jiri Biel

TI__Guru* 86820 points

Jiri, the 40 cycles I stated did not refer to interrupt latency; they referred to the ARM's latency when accessing components in the PRU.

You can find benchmark data for the RTOS in the BIOS_INSTALL_DIR\packages\ti\sysbios\benchmarks directory.

Regards,
James

0 Jiri Biel over 7 years ago in reply to JJD

Intellectual 865 points

Hi James,
now I understand what happened. I sent two separate questions:

One related to ARM->HW block latencies.
The second related to interrupt latencies on various OSes.

We can close the discussion about interrupt latencies on the OSes (here I have at least some picture).

Your answer to the first question was 40 cycles from ARM to PRUSS. Here I need a more precise answer. I assume that these are 40 ARM cycles (the ARM can run at 800 MHz or another frequency) and not PRU cycles (200 MHz). What do you mean by all AM335x devices? Do you mean the AM335x peripherals, or that all processors of the AM335x platform have this latency?
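As a rough illustration of why the clock domain matters for this question, the sketch below converts the quoted 40 cycles into time under the two clock assumptions mentioned here (800 MHz for the ARM, 200 MHz for the PRU). The helper name `cycles_to_ns` is purely illustrative, and which domain the 40-cycle figure actually refers to is not confirmed at this point in the thread.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds for a clock given in MHz."""
    # One cycle at f MHz lasts 1000 / f nanoseconds.
    return cycles * 1000 / clock_mhz


# Assumption: the same 40-cycle figure, interpreted in each clock domain.
arm_ns = cycles_to_ns(40, 800)  # if these are ARM cycles at 800 MHz
pru_ns = cycles_to_ns(40, 200)  # if these are PRU cycles at 200 MHz

print(f"40 ARM cycles at 800 MHz: {arm_ns:.0f} ns")
print(f"40 PRU cycles at 200 MHz: {pru_ns:.0f} ns")
```

The two readings differ by a factor of four, which is why the answer needs to state the clock domain.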

On the other hand, I was not asking only about the PRUSS, but about memories (such as DDR3, IRAM, OCMC, SharedRAM, ...) and all HW blocks (such as PRU CTRL, PRU INTC, PRU UART, ...). At least from the PRU side it is visible that there is a difference, and I cannot believe that any block can be reached from the ARM side in 40 cycles.

Could you please ask your HW or RTOS designers to provide this information (I believe it is needed for any SW design)?

With best regards
Jiri

0 Mukul Bhatnagar over 7 years ago in reply to Jiri Biel

TI__Guru* 83985 points

Hi Jiri
Garrett pointed this thread to me.
As James mentioned we have not archived stand alone access latencies from ARM or other initiators (apart from the PRU latency wiki data) to/from various end point peripherals or memory.
Usually there are several factors that come into play and the "standalone" data gives limited mileage, as with multiple initiators and different access size, cache/non cache, things will vary further.
In previous devices we collected data like the following:
processors.wiki.ti.com/.../AM1x
However I have not really seen much use of this.

If you are seeing some issues or read/write access latencies that are not making sense, please feel free to share those too.

I will follow up with James on the 40 cycles from ARM to PRU - whether it was quoted on CPU or PRU cycles etc. I believe it is in terms of CPU cycles - but if you are not seeing this - please clarify your test setup and what you are seeing in terms of the access latencies

Given we do not have any data that I can readily share with you, if you have some very specific use case or area of interest, we can dig deeper once I have a better understanding of your intended use case or how you plan to leverage this data. However, we will not be able to support an exhaustive data set request on this.

Regards
Mukul

0 Jiri Biel over 7 years ago in reply to Mukul Bhatnagar

Intellectual 865 points

Hi Mukul,
I'm not yet in the implementation stage; I'm in the design stage for a communication stack on the ARM which needs to be able to answer within 1 us when the PRU receives a frame (via UART). If the response time were in milliseconds, I would not need to care so much about the latencies, but here it is a little different and I need to know the latencies from both sides (ARM and PRU) to be fast enough. The frame size can vary from 1 byte to 255 bytes.

Let's say I need to execute code from DDR3, which means the cache needs to be enabled for both code and data to be fast on the ARM side.

Now I need to know which memory is the best to read frame data from on the ARM side (here OCMC seems to be the best) and which on the PRU side (there SharedRAM seems to be the best). I tried to put the data into DDR3 via EDMA, but that requires calling a cache invalidate function, which slows down the response afterwards. I also need to use the PRU->ARM interrupt, which from a timing point of view seems to be the real bottleneck (at least in the tests I have run).

So knowing the latencies is the best way to create a proper design, at least from my side.

If latency data is not available for the ARM side, I need to either prove the design concept with my own tests (really time-consuming) or ask TI.

So, based on what I wrote:
1. Frame is received on PRU side (UART) and needs to be handled on ARM side in 1 us.
2. Frame size can vary from 1 byte to 255 bytes.
3. Data and instruction cache are enabled.

Questions:
1. Is it possible to answer in 1 us from ARM or not?
2. If yes, what is the best way to do that? (direct read of memories or EDMA setup to some memories?)

With best regards
Jiri

0 Garrett Ding over 7 years ago in reply to Jiri Biel

TI__Mastermind 43296 points

Jiri,

The timing benchmark file:///C:/ti/bios_6_52_00_12/packages/ti/sysbios/benchmarks/doc-files/TI_A8Fnv_ti_platforms_evmAM3359_time.html
may help you determine the latency from the RTOS point of view, where it states that the interrupt latency is 447 cycles and the Hardware Interrupt to Blocked Task latency is 941 cycles. You can find the detailed description of the timing benchmark measurement in Appendix B of SPRUEX3T, bios_6_52_00_12\docs\Bios_User_Guide.pdf. With the 600 MHz Cortex-A8 in AM335x, the RTOS latency for interrupt handling will prevent the ARM from processing the frames received from the PRU within 1 us, even though the IP fabric provides much shorter latency.
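To put these benchmark figures next to the 1 us requirement, here is a minimal sketch that converts them to nanoseconds and subtracts them from the budget. It assumes the benchmark cycles are CPU cycles at the 600 MHz quoted above; the names `budget_report`, `BENCHMARK_CYCLES`, and `BUDGET_NS` are illustrative only.

```python
CPU_MHZ = 600     # Cortex-A8 clock quoted in this post (assumed to apply to the benchmark)
BUDGET_NS = 1000  # 1 us response budget discussed in the thread

# Cycle counts quoted from the SYS/BIOS timing benchmark page.
BENCHMARK_CYCLES = {
    "Interrupt latency": 447,
    "Hardware interrupt to blocked task": 941,
}


def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds for a clock given in MHz."""
    return cycles * 1000 / clock_mhz


def budget_report(benchmarks, clock_mhz, budget_ns):
    """Return {name: (latency_ns, remaining_ns)} for each benchmark entry."""
    report = {}
    for name, cycles in benchmarks.items():
        latency_ns = cycles_to_ns(cycles, clock_mhz)
        report[name] = (latency_ns, budget_ns - latency_ns)
    return report


report = budget_report(BENCHMARK_CYCLES, CPU_MHZ, BUDGET_NS)
for name, (latency_ns, remaining_ns) in report.items():
    print(f"{name}: {latency_ns:.0f} ns, remaining budget {remaining_ns:.0f} ns")
```

Under these assumptions the bare interrupt latency already uses roughly three quarters of the budget, and the path to a blocked task exceeds it, before any frame processing is counted.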

If your application doesn't keep the ARM too busy and can poll instead of using an interrupt to check for frame availability from the PRU, the latency should be reduced, along with using EDMA to transfer data from PRU shared data memory to DDR.

Regards,
Garrett

0 Jason Reeder over 7 years ago in reply to Garrett Ding

TI__Genius 10440 points

Jiri,

Can you clarify what it means for the ARM to 'handle' the frame? Does this mean that the ARM core receives the data, makes some determination on what to do (based on the data), and then responds through the UART?

Do all frames require a response in 1us, or just a specific subset of frames? If only a subset of the frames require the immediate response, and the PRU can detect these frames, it may be possible to have the PRU respond to these frames (almost) immediately. The rest of the frames could be passed up to the ARM core for normal processing.

At what point does the 1us frame handling countdown begin? Some back of the napkin calculations show that even at 12Mbps it takes 666ns to receive 8 bits in the PRU hardware. A 255 byte frame would take 170us to receive, right? Since this is the case, I'm assuming that your countdown begins after the full frame has been received by the PRU. If this is correct, you may be able to have the PRU 'pre-interrupt' the ARM as UART data begins arriving. The ARM could then poll until the frame has been fully received in order to mitigate the interrupt latency discussed before.
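The back-of-the-napkin numbers above can be reproduced with a short sketch. The function name `uart_time_ns` is illustrative; the default counts 8 data bits per byte as in the estimate above, and the 10-bit variant is an added assumption of standard 8N1 framing (one start and one stop bit), which the thread does not confirm for this protocol.

```python
def uart_time_ns(n_bytes, baud_mbps, bits_per_byte=8):
    """Time on the wire for n_bytes, in nanoseconds.

    bits_per_byte=8 counts data bits only, matching the estimate above.
    Pass 10 to include one start and one stop bit (assumed 8N1 framing).
    """
    # At b Mbps, one bit lasts 1000 / b nanoseconds.
    return n_bytes * bits_per_byte * 1000 / baud_mbps


one_byte_ns = uart_time_ns(1, 12)
full_frame_us = uart_time_ns(255, 12) / 1000
full_frame_8n1_us = uart_time_ns(255, 12, bits_per_byte=10) / 1000

print(f"1 byte at 12 Mbps: {one_byte_ns:.0f} ns")
print(f"255 bytes at 12 Mbps: {full_frame_us:.0f} us")
print(f"255 bytes at 12 Mbps with start/stop bits: {full_frame_8n1_us:.1f} us")
```

Either way, receiving a full frame takes far longer than the 1 us response window, which is what makes a pre-interrupt during reception plausible.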

Jason Reeder

0 Jiri Biel over 7 years ago in reply to Jason Reeder

Intellectual 865 points

Hi Jason, Garrett,
sorry for the long delay in answering; I needed to work on other projects too.

We have a communication stack with several complex state machines.

The 1st state machine (PRU0 SM) can't answer directly; it always has to forward the frame to the 2nd state machine (PRU1 SM). A frame can be forwarded only once it has been fully received, because the SM can drop the frame, for example on a wrong FCS, on any error from the UART, or on a timeout of some timer.

The 2nd state machine (PRU1 SM, which is so complex that we are thinking about splitting it) can answer several frames directly back to the PRU0 SM without forwarding them to the 3rd state machine (ARM SM). But the rest of the frames need to be forwarded.

The 3rd state machine (ARM SM) takes the forwarded frames and provides them to the higher layers.

My problem is that at least some frames need to go via the ARM SM and the application layers. Of course, 1 us is only the minimal response time, but these small details normally decide whether the end product can be sold.

The measurement starts at the last byte of the request and ends at the first byte of the response, and this should be close to 1 us. That means you have 1 us from the fully received frame to handle the answer. And due to the complexity of the state machines on the PRUs, you can't send the response immediately. You always need to go through several states from the indication of a prepared response until you are able to send it (another delay).

On the other hand, polling should not be the way. Normally there is not only the stack; other applications need to run as well, and you can't predict now how much time they will need. So I'm trying to find a really fast way of signalling and data exchange between the PRU and the ARM.

With best regards
Jiri

0 Garrett Ding over 7 years ago in reply to Jiri Biel

TI__Mastermind 43296 points

Jiri,

Were you able to figure out an alternative solution to address the latency issue? 1us budget is too tight for the inter processor communication with RTOS scheduling on host processor.

Regards,
Garrett

0 Jiri Biel over 7 years ago in reply to Garrett Ding

Intellectual 865 points

Hi Garrett,
for now, some of the frames will be pre-prepared in shared memory.
In that case we are not limited by 1 us on ARM-PRU interface here.
Protocol standard allows it.

We will see whether we can use this approach for each type of frame or not.

With best regards
Jiri
