AM2434: PRU application - XFR2VBUS Optimization

Part Number: AM2434


Hi experts,

With reference to the topic :

LP-AM243: The Timing Question of " PRU - FSI Bandwidth Optimization " Document's Implement - Arm-based microcontrollers forum - Arm-based microcontrollers - TI E2E support forums

In SDK version 8.6.0.45, when I used XFR2VBUS to process 32-byte FSI data, the processing time did not show a significant improvement compared to memcpy(). Based on the discussion above, XFR2VBUS is recommended primarily for handling 64-byte data blocks.

May I ask whether the latest SDK or compiler provides any specific optimizations for the XFR2VBUS application, such as processing 32-byte data or reducing the wait time for XIN/XOUT to complete?

Best Regards

Bolt

  • Hello Bolt,

    I do not have much experience with XFR2VBUS yet, so we will be learning together.

    Point #1: MCU+ SDK releases will not affect the efficiency of PRU code

    MCU+ SDK has updates to the MCU+ core drivers. We also bundle in some PRU firmware. However, the MCU+ SDK is unrelated to the PRU's hardware design, the PRU assembly instructions that run on that hardware, or the PRU C compiler that can be used to generate assembly code from C code.

    Any improvements in PRU performance would come from learning to more efficiently use the existing hardware & software tools, not an SDK update.

    Should we expect that the latency (the time to complete the read or write) is significantly different between XFR2VBUS & directly reading/writing?

    I am following up with the rest of the team for more details.

    However, my high-level understanding is that NO, we should expect the actual time to complete the data transfer to be about the same between XFR2VBUS and SBBO/LBBO/SBCO/LBCO. The actual benefit is that the XFR2VBUS offloads long reads or writes, which frees up the PRU core to spend more clock cycles on other tasks. For more details, refer to my response here:  RE: AM625: SPI bit-bangned data to DDR 
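
    To make the offloading point concrete, here is a minimal sketch of the difference between the two styles. SRC_ADDR and the register choices are illustrative, and the XFR2VBUS configuration value mirrors the fir_filter.asm fragment later in this thread:

    ; blocking read: the PRU core stalls until all 32 bytes have arrived
        ldi32   r15, SRC_ADDR
        lbbo    &r2, r15, 0, 32             ; core stalls for the full round trip

    ; offloaded read: XFR2VBUS fetches the data in the background
        ldi     r18, 0x5                    ; config: auto read mode on, read size 32 bytes
        ldi32   r19, SRC_ADDR               ; source address
        xout    XFR2VBUSP_RD0_XID, &r18, 8  ; kick off the read (config + address)
        ; ... other useful instructions execute here while the data is in flight ...
        ; (poll the status before the xin, as in the example below, to be safe)
        xin     XFR2VBUSP_RD0_XID, &r2, 32  ; collect the 32 bytes into r2-r9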

    Please ignore the current version of the AM243x PRU Academy's XFR2VBUS documentation. (Future readers: by the time you read this, hopefully I have corrected the page. If you still see references to "Normal PRU Reads", that means I have not updated the documentation yet.) This page was written by a new team member, and I suspect most of the information is wrong.

    Are there any details that you might not have looked at in your investigations 2 years ago? 

    I have not tested DMA yet, but I am interested in benchmarking DMA performance as a part of the current read/write validation effort discussed in your other thread. We are focusing on SBBO/LBBO/SBCO/LBCO at the moment, but if you are interested in running some tests I can collaborate with you to define that test code.

    Some things for us to investigate based on skimming through the technical reference manual (TRM):

    * Does XFR2VBUS have a lower latency path to the on-chip SRAM than the "normal" PRU data bus? "The XFR2VBUS is a simple hardware accelerator which is used to get the lowest read round trip latency from MSMC". It would be interesting to see whether SRAM reads are substantially different from reads to DDR or to R5F subsystem TCM memory.

    * "32 Byte optimization mode available" - does this have different latencies than the "regular" 4/32/64 byte reads?

    Regards,

    Nick

  • Hello Nick,

    Thank you for your response.

    It appears that the newer SDK versions are still unable to reduce the data processing time for XFR2VBUS with 32-byte transfers.

    Regarding the SBBO/LBBO/SBCO/LBCO use cases, we tested a hybrid approach combining C and assembly two years ago.

    However, compared with directly using the C library function memcpy(), there was no noticeable timing improvement during FSI communication data exchange.

    Below, I have listed the FSI measured times and the functions used for copying words for your reference. Is there still potential for further optimization?

    FSI Test: PRU tick log.

    This assembly routine can be invoked directly from C code. ( memcpy( ) -> MemCpyAsm( ) )
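
    For reference, a minimal sketch of a routine with this shape, assuming the clpru calling convention (r14 = dst, r15 = src, r16 = byte count, return address in r3.w2) and a byte count that is a multiple of 4; the actual MemCpyAsm attachment may differ:

    ; void MemCpyAsm(void *dst, const void *src, uint32_t numBytes);
        .global MemCpyAsm
    MemCpyAsm:
    copy_loop:
        lbbo    &r17, r15, 0, 4     ; load 4 bytes from src
        sbbo    &r17, r14, 0, 4     ; store 4 bytes to dst
        add     r14, r14, 4         ; advance dst pointer
        add     r15, r15, 4         ; advance src pointer
        sub     r16, r16, 4         ; decrement remaining byte count
        qbne    copy_loop, r16, 0   ; loop until the count reaches zero
        jmp     r3.w2               ; return to the C caller

    Each lbbo above still stalls the core for the full access round trip; larger bursts (for example 32 bytes through r17-r24 per iteration) reduce the loop overhead but not the per-access latency.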


    Best Regards

    Bolt

  • Hello Bolt,

    Ok, I see you were testing with SRAM.

    Do you still have your XFR2VBUS test code? I would be curious to see it.

    Regards,

    Nick

  • Hi Nick,

    Do you have any suggestions for optimizing SRAM code?

    About XFR2VBUS, you may refer to the thread - LP-AM243: The Timing Question of " PRU - FSI Bandwidth Optimization " Document's Implement - Arm-based microcontrollers forum - Arm-based microcontrollers - TI E2E support forums.

    When I raised this question, I had already posted my XFR2VBUS implementation code.

    Regards,
    Bolt 

  • Hello Bolt,

    Ok, I have heard back from my team members. It sounds like my understanding was wrong. A team member who has tested XFR2VBUS on AM64x (think AM243x, but with additional A53 cores) reported that they saw a non-linear relationship between the number of bytes sent with XFR2VBUS to MSRAM and the time the transfer took.

    I don't have any context for these results, so please take them with a grain of salt for now:

    MSRAM to MSRAM using XFR2VBUS:
    
    1B : 1540 ns
    4B : 1220 ns
    32B : 1217 ns
    64B : 1268 ns
    101B : 1952 ns
    128B : 1616 ns
    256B : 2105 ns

    Regards,

    Nick

  • Hi Nick,

    Thank you for your response regarding XFR2VBUS.

    Regarding another question about assembly code optimization, do you have any recommendations?

    I have rewritten it below.
    This assembly routine can be invoked directly from C code. ( memcpy( ) -> MemCpyAsm( ) )

    Can this assembly code be further optimized to reduce execution time?

    Regards,
    Bolt

  • Nick, 

    In this E2E post, Bolt is looking for ways to optimize the C and ASM hybrid code.

    He is looking for suggestions on how to improve the latency of memcpy().

    Could the team check the code he wrote and advise on ways to optimize?

    BR, Rich 

  • Nick, 

    Can anyone from the team help?

    BR, Rich

    "FSI measured times and the functions used for copying words for your reference. Is there still potential for further optimization?"

    I do not see any obvious ways to optimize, as the latency is due to PRU/ICSSG access latency over the interconnect to a slow and far FSI module. Here are a few general techniques customers can use to optimize the latency:

    1) Read access from a slow peripheral over the SoC interconnect is going to be slow from the PRUs. In such cases, use a DMA widget (to reduce stall cycles in the PRU) rather than LBBO/LBCO, or use system DMA to copy the data into ICSS DMEM/SMEM.

    2) Read access to PRU memory from the R5F is much slower than the R5F accessing its own TCM or OCRAM, so have the PRU copy data directly to memory closer to the R5F/CPU for an optimal system partition.

    3) As much as possible, avoid unaligned accesses from the PRU cores, since an unaligned access gets broken into multiple aligned accesses at the bus level and adds extra latency (see the sketch after this list).

    4) As much as possible, avoid very large burst accesses from the PRU, since they will block the other PRU cores if all cores operate simultaneously; instead, use DMA widgets like XFR2VBUS or XFR2PSI, or system DMA, as much as you can.
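
    To illustrate point 3, here is a minimal sketch of keeping a PRU-accessed buffer aligned so that a 32-byte burst stays a single aligned access at the bus level (the label and sizes are illustrative; the directives follow standard clpru assembler syntax):

        .data
        .align  32                  ; place the buffer on a 32-byte boundary
    rx_frame:
        .space  32                  ; 32-byte frame buffer

        .text
        ldi32   r15, rx_frame
        lbbo    &r2, r15, 0, 32     ; aligned 32-byte burst: a single access at the bus level
        ; the same burst starting at rx_frame + 1 would be broken into several
        ; aligned accesses by the interconnect, adding extra latency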

  • Hello Bolt,

    I am getting together benchmarking code to try running some tests myself, but I probably won't have time to actually do tests on board until next week or later.

    Let's double-check your program flow with XFR2VBUS.

    The primary benefits of XFR2VBUS that I am aware of are:

    1) The broadside interface means that it only takes a single clock cycle to move up to 64 bytes into the PRU's internal registers

    2) Since the XFR2VBUS is handling the data transfer, the PRU cores are freed up to do other instructions while the data transfer is taking place

    Are you doing anything with your 32-byte data after copying it into the PRU's internal registers? If so, you should be processing FSI frame n while the XFR2VBUS is transferring FSI frame n+1.

    We actually have an example being added to the open-pru repo which uses XFR2VBUS to transfer data in 32-byte chunks. Let's take a look at the logic in fir_filter.asm:
    https://github.com/TexasInstruments/open-pru/pull/97

    ; NICK_NOTE
    ; the code uses XFR2VBUS to load 64 32-bit values into the PRU internal registers
    ; 32 bytes of data are loaded over the broadside interface in a single instruction
    ; so we should see 8 XFR2VBUS reads
    
    	;*** configure xfer2vbus first time (reconfigured at the end of every FIR loop)***
        ldi r18, 0x5                            ; config: auto read mode on, read size-32 Byte
        ldi32 r19, FIR_COEF_MEM                 ; start address of window coefficients
        xout XFR2VBUSP_RD0_XID, &r18, 8         ; set configuration and updated address
    
    ; NICK_NOTE
    ; this XFR2VBUS is configured to trigger the next read after XIN
    ; so draining the FIFO with XIN also initiates the first read
        xin XFR2VBUSP_RD0_XID, &r2, 32          ; drain fifo to clear already loaded values
    
    ; NICK_NOTE
    ; waiting for XFR2VBUS read #1 to finish
    ; this is the only read where we wait more than 1 clock cycle for the read to execute
    ; for all the other reads, the read happens concurrently with data processing
        ; wait till coefficient data is loaded
    	; (non-deterministic wait only in the beginning)
    WAIT_DATA_READY:
        xin     XFR2VBUSP_RD0_XID, &r18, 1
        qbne    WAIT_DATA_READY, r18.b0, 5
    
    FIR_CYCLE_START:
    ;********* FIR cycle (for every new input) ******************
    
    ...
    
    ;*************
    ;* 	x64MAC	 *
    ;*************
    ...
    ; *** perform multiplications for all inputs in SPAD_B0 ***
    	loop FIR_MAC24_B0, 3			; loop over input values in SPAD_B0, 3 x 8
    
    ; NICK_NOTE
    ; this xin is XFR2VBUS reads #1, 2, 3
    ; note that the next read is triggered by the XIN, and then the PRU code executes the
    ; data processing code while the XFR2VBUS read is happening in the background
    ; the programmer verified that their data processing code would not complete before
    ; the XFR2VBUS transfer happened. So even though the XFR2VBUS read takes time, from the
    ; PRU core's perspective, it only takes 1 clock cycle after that initial XFR2VBUS read
    ;
    ; If the other PRU cores or XFR2VBUS instances are also
    ; doing reads and writes out of the PRU subsystem, they could potentially cause arbitration
    ; delay, and make the XFR2VBUS read take longer.
    ; You could make this code safer by adding a WAIT_DATA_READY check before the xin
    ; in order to double check that the read has completed
    	; no data-ready check needed except in the first load
    	; timing requirements are met for configured clock
    	xin XFR2VBUSP_RD0_XID, &r2, 32 	; load 8 coefficient values
    ...
    FIR_MAC8_B0: 
    ; (max cycles per FIR_MAC8_B0 loop : 12)
    ...
    	qbge FIR_MAC8_B0, mvid_reg_ptr, 36	; loop till 8 coefficient values are covered
    
    FIR_MAC24_B0:	; 3x8 loop ends
    
    ; *** perform multiplications for all inputs in SPAD_B1 ***
    	loop FIR_MAC24_B1, 3			; loop over input values in SPAD_B1, 3 x 8
    
    ; NICK_NOTE
    ; this xin is XFR2VBUS reads #4, 5, 6
    	xin XFR2VBUSP_RD0_XID, &r2, 32	; load 8 coefficient values
    	...
    FIR_MAC8_B1:
    ; (max cycles per FIR_MAC8_B1 loop : 12)
    ...
    	qbge FIR_MAC8_B1, mvid_reg_ptr, 36	; loop till 8 coefficient values are covered
    FIR_MAC24_B1:	; 3x8 loop ends
    ...
    
    ; *** perform multiplications for all inputs in SPAD_B2 (loop unrolled * 2)***
        ;*** unrolled loop 1 ***
    
    ; NICK_NOTE
    ; this xin is XFR2VBUS read #7
    	xin XFR2VBUSP_RD0_XID, &r2, 32	; load 8 coefficient values
      	ldi mvid_reg_ptr, 8				; point to first coefficient in R2
    FIR_MAC8_B2_L1:
    ...
    	qbge FIR_MAC8_B2_L1, mvid_reg_ptr, 36	; loop till 8 coefficient values are covered
    
    	;*** unrolled loop 2 ***
    
    	; reconfigure and trigger XFR2VBUS for next FIR cycle
        ldi r18, 0x5                            ; config: auto read mode on, read size-32 Byte
        ldi32 r19, FIR_COEF_MEM                 ; start address of window coefficients
        xout XFR2VBUSP_RD0_XID, &r18, 8         ; set configuration and updated address
    
    ; NICK_NOTE
    ; this xin is XFR2VBUS read #8
    ; since the XFR2VBUS got reconfigured, read #8 also triggers read #1 of the next cycle
    ; so if "FIR cycle" was a while(1) loop, read #1 would also act like a 1 clock cycle
    ; read in the next loop
        xin XFR2VBUSP_RD0_XID, &r2, 32  ; load last set of 8 coefficient values in buffer
    	; Note: last step drains FIFO for next set of transfer we triggered
    	...
    	FIR_MAC8_B2_L2:
    	...
    		qbge FIR_MAC8_B2_L2, mvid_reg_ptr, 36	; loop till 8 coefficient values are covered
    ;FIR_MAC24_B2:	; 3x8 loop ends
    
    ;**** x64 MAC ends ****
    ...
    ;********* FIR cycle ends ******************

    So the "correct flow" that you identified in your old thread is a good safeguard to make sure that the read has completed. But I would initiate the next XFR2VBUS read before you start to process the FSI data, and only do the wait_till_read_busy loop after the previous batch of data has been processed.
    LP-AM243: The Timing Question of " PRU - FSI Bandwidth Optimization " Document's Implement

    Regards,

    Nick

  • Hello Pratheesh,

    Thank you for your response.

    We will apply the recommendations you provided (for example, aligned access) to optimize our program.

    Regards,
    Bolt

  • Hello Nick,

    Thank you for your response.

    At present, our use of the PRU indeed follows a strictly sequential approach: we wait until the data transfer is fully completed before proceeding to process that data, and there is no parallel processing of other data during the transfer interval. We will review whether any parallel operations can be introduced (our previous assessment indicated that this option was already exhausted, since the parallel tasks run on the four R5F cores). Thank you for your suggestion.

    However, our primary focus remains on confirming whether there are any methods to further reduce the pure data transfer time itself.

    Regards,
    Bolt

  • Hello Bolt,

    I have spent some more time looking at the FSI peripheral. I see that both the TX and RX FSI peripherals have a 16-word circular buffer, which I assume you are reading. So if your programming flow is something like the following, then the auto-read option to parallelize data transfer would not make sense:

    1) FSI RX receives a single 32-byte frame
    2) An interrupt is triggered to the PRU core
    3) The PRU initiates a single read of the 32-byte frame
    4) Once the frame lands in the PRU internal registers, the frame data can be processed

    Since the PRU is reading data from SRAM instead of from the FSI memory space, I assume you have a different data flow.

    Other system design brainstorming:

    If the interrupt to read in the next frame is likely to arrive while the PRU is still processing the previous frame, you could still start an XFR2VBUS read when the interrupt is received. (For future readers: this would probably require using the task manager to interrupt the currently running code, so a setup like this would only work on ICSSG devices like AM243x or AM64x, which have a task manager.) The PRU could then finish processing the current frame and move on to the next frame.

    Regards,

    Nick