TDA4VH-Q1: MSMC 1 Extended Memory and Streaming Engine

srikar varaganti

Part Number: TDA4VH-Q1

Hello,

We are trying to develop a kernel that runs on the C7 DSP using the streaming engine. To reduce memory complexity, we want to use the MSMC 1 extended memory, which has 3 MB of capacity.

The idea is to move the images from DDR to MSMC and back from MSMC to DDR using the streaming engine. Could you provide any sample code that could help us achieve this? We looked at the C7 training material, which explains how to use the streaming engine, but it does not mention anything about streaming the data back to DDR memory.

As mentioned in the image above, SE cannot be used to write instructions but to input instructions. What exactly does this mean in terms of data streaming? Please elaborate.

Here is the MSMC we want to invoke for the kernel:

Thank you.

over 1 year ago

0 Asha Bhandarkar over 1 year ago

TI__Genius 10170 points

Hi Srikar,

The streaming engine (SE) is an interface that reads a vector-width amount of data at a time from memory for the C7x CPU to process. The streaming engine interface cannot write back to memory. The streaming address generator (SA) works differently and can be used for both reads from and writes to memory.
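To make the address-generator idea more concrete, here is a minimal plain-C model of a two-dimensional address generator whose offsets drive both a load and a store. It is only a sketch for reasoning on a host machine: `addr_gen_t`, `addr_gen_open`, `addr_gen_next`, and `copy_region` are hypothetical names, not the C7x intrinsics or any TI API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-D address generator: walks rows x cols elements of a buffer
 * whose rows start `pitch` elements apart. Illustrative model only. */
typedef struct {
    size_t cols, rows, pitch; /* region shape and row stride, in elements */
    size_t c, r;              /* current column and row */
} addr_gen_t;

static void addr_gen_open(addr_gen_t *g, size_t cols, size_t rows, size_t pitch)
{
    assert(cols > 0 && rows > 0 && pitch >= cols);
    g->cols = cols;
    g->rows = rows;
    g->pitch = pitch;
    g->c = 0;
    g->r = 0;
}

/* Return the current element offset, then advance to the next element. */
static size_t addr_gen_next(addr_gen_t *g)
{
    size_t off = g->r * g->pitch + g->c;
    if (++g->c == g->cols) {
        g->c = 0;
        ++g->r;
    }
    return off;
}

/* Copy a cols x rows region. One generator produces read offsets and a
 * second one produces write offsets, so the same mechanism serves both
 * directions. */
void copy_region(const uint16_t *src, size_t src_pitch,
                 uint16_t *dst, size_t dst_pitch,
                 size_t cols, size_t rows)
{
    addr_gen_t rd, wr;
    addr_gen_open(&rd, cols, rows, src_pitch);
    addr_gen_open(&wr, cols, rows, dst_pitch);
    for (size_t i = 0; i < cols * rows; ++i) {
        size_t from = addr_gen_next(&rd);
        size_t to = addr_gen_next(&wr);
        dst[to] = src[from];
    }
}
```

For example, calling `copy_region(src, 4, dst, 2, 2, 3)` packs the first two columns of a three-row buffer with a pitch of 4 into a dense 2x3 destination.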

srikar varaganti said:
The idea is to move the images in DDR to MSMC and back from MSMC to DDR using streaming engine.

Could you clarify the type of data movement you are trying to achieve?

Is it:

1) MSMC ⇔ DDR (without any processing from C7x)

2) MSMC → C7x CPU (for some processing of the image) → DDR

3) Something else

Best,

Asha

0 srikar varaganti over 1 year ago in reply to Asha Bhandarkar

Intellectual 280 points

Asha Bhandarkar said:
Could you clarify the type of data movement you are trying to achieve?

Is it:

1) MSMC ⇔ DDR (without any processing from C7x)

2) MSMC → C7x CPU (for some processing of the image) → DDR

3) Something else

It's option two. We are developing a custom kernel that generates a 2x2 CFA pattern for a 4x4 RGBIR image sensor. The kernel applies convolution and other mathematical operations to raw images on the C7.

DDR -> MSMC -> C7x -> DDR (This way, instead of using the 512 KB L2 cache of the C7, the C7 can operate directly on data held in the 3 MB of MSMC, skipping L2.)
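The size argument can be checked with a little arithmetic. The sketch below uses the 512 KB and 3 MB figures quoted in this thread and computes how many image rows fit in each memory. The 1920x1080, 2-bytes-per-pixel frame in the usage note is an assumption for illustration (the thread does not state the sensor resolution), and the function names are hypothetical. It also treats the full capacity as usable, which will not hold if part of either memory is reserved for cache or other buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Capacities as quoted in the thread; the usable amount may be smaller. */
#define L2SRAM_BYTES ((size_t)512 * 1024)
#define MSMC_BYTES   ((size_t)3 * 1024 * 1024)

/* Total size of one frame in bytes. */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Number of complete image rows that fit in a memory of the given size. */
size_t rows_that_fit(size_t capacity_bytes, size_t width, size_t bytes_per_pixel)
{
    assert(width > 0 && bytes_per_pixel > 0);
    return capacity_bytes / (width * bytes_per_pixel);
}
```

With the assumed 1920x1080 frame at 2 bytes per pixel, `frame_bytes` gives 4,147,200 bytes, which exceeds even the 3 MB of MSMC, so the image would have to be processed in tiles either way: about 136 rows at a time in L2SRAM versus about 819 rows in MSMC.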

0 Asha Bhandarkar over 1 year ago in reply to srikar varaganti

TI__Genius 10170 points

Hi Srikar,

Thank you for the clarification!

If you are planning to use the streaming engine, note that its interface is tightly coupled with the C7x L2SRAM. Using the streaming engine is beneficial for maximum throughput between the C7x core and memory. I have outlined a flow similar to what you described above.

For transfers between MSMC and DDR, you will want to utilize DRU (streaming engine will have significant latency). We have examples of utilizing DRU within PSDK RTOS in pdk/packages/ti/drv/udma (examples/udma_dru_test might be particularly helpful to look at).

For writing back to DDR, you can utilize SA to determine the address offsets and write to L2SRAM or MSMC (depending on image sizes and sizes of memory). Utilizing DRU to write to DDR from there should be faster than writing directly to DDR using SA.

Best,

Asha

0 srikar varaganti over 1 year ago in reply to Asha Bhandarkar

Intellectual 280 points

Hello,

Based on your response I have a query (this might be very basic but we need to understand in order to design an efficient pipeline).

1. I understand that you've included L2SRAM in order to use streaming engine, but is this a necessary step to use SE?

2. If the plan is to use L2SRAM, what is the purpose of MSMC? Can we not move the data directly from DDR -> L2SRAM, or does using MSMC make this movement more efficient?

3. Can we use SE to move data from MSMC to C7 at all?

Thank you.

0 Asha Bhandarkar over 1 year ago in reply to srikar varaganti

TI__Genius 10170 points

Hi Srikar,

Let me follow up internally on these points. But to clarify, based on your reply and original post: you effectively want to find the most efficient pipeline from DDR to the C7x? You don't necessarily need to use MSMC?

Best,

Asha

0 srikar varaganti over 1 year ago in reply to Asha Bhandarkar

Intellectual 280 points

The only reason I want to use MSMC instead of L2SRAM is that MSMC1 offers 3 MB of space compared to the 512 KB available in L2SRAM. This amount of space can help us run the kernel more quickly.

0 Asha Bhandarkar over 1 year ago in reply to srikar varaganti

TI__Genius 10170 points

Hi Srikar,

Thank you for the clarification! I've reached out internally to our development team to see if we have a concrete answer for you on what would be the most optimal path overall. Please expect a delay of a couple of days due to time zone differences.

Thanks,

Asha

0 Asha Bhandarkar over 1 year ago in reply to Asha Bhandarkar

TI__Genius 10170 points

Hi Srikar,

I am very sorry for the delay. I've given an overall response to your questions below after further discussion internally.

The most optimal solution (and to answer your second question) would be to organize the data movement as follows:

DDR -> L2SRAM -> C7x -> MSMC1 -> DDR (utilizing DRU, SE between L2SRAM and C7x)
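The shape of this flow can be prototyped on a host before touching the hardware. The following is a minimal sketch under stated assumptions: `dma_copy` is a hypothetical stand-in for a DRU transfer (here a synchronous `memcpy`), `process_tile` is a placeholder kernel, the two static arrays merely model ping-pong buffers in L2SRAM and MSMC, and `TILE_ELEMS` is an arbitrary tile size. None of these names come from the PSDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_ELEMS 256u /* hypothetical tile size, in elements */

/* Stand-in for a DRU transfer. On hardware this would be asynchronous. */
static void dma_copy(uint16_t *dst, const uint16_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Placeholder kernel: halves every sample. */
static void process_tile(const uint16_t *in, uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = (uint16_t)(in[i] >> 1);
    }
}

static size_t tile_len(size_t total, size_t off)
{
    size_t left = total - off;
    return left < TILE_ELEMS ? left : TILE_ELEMS;
}

/* DDR -> L2 (input) -> kernel -> MSMC (output) -> DDR, with ping-pong buffers. */
void run_pipeline(const uint16_t *ddr_in, uint16_t *ddr_out, size_t total)
{
    static uint16_t l2_in[2][TILE_ELEMS];    /* models input tiles in L2SRAM */
    static uint16_t msmc_out[2][TILE_ELEMS]; /* models output tiles in MSMC */
    size_t ntiles = (total + TILE_ELEMS - 1) / TILE_ELEMS;

    if (ntiles == 0) {
        return;
    }
    /* Prime the pipeline with the first input tile. */
    dma_copy(l2_in[0], ddr_in, tile_len(total, 0));

    for (size_t t = 0; t < ntiles; ++t) {
        size_t cur = t & 1u;
        size_t off = t * TILE_ELEMS;
        size_t n = tile_len(total, off);

        /* Fetch the next tile into the other buffer. With a real
         * asynchronous transfer this would overlap with the compute below. */
        if (t + 1 < ntiles) {
            size_t next_off = off + TILE_ELEMS;
            dma_copy(l2_in[cur ^ 1u], ddr_in + next_off, tile_len(total, next_off));
        }

        process_tile(l2_in[cur], msmc_out[cur], n);

        /* Drain the finished output tile back to DDR. */
        dma_copy(ddr_out + off, msmc_out[cur], n);
    }
}
```

Because the stand-in copy is synchronous, this version shows only the buffer rotation and the tail-tile handling. On the device, completion of each transfer would have to be awaited before a buffer is reused.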

To get a better understanding of how SE interacts with L2SRAM and MSMC, I would recommend looking at Figure 12, "UMC Interfaces and Block Diagram", in the C71x DSP Corepac Technical Reference Manual.

srikar varaganti said:
1. I understand that you've included L2SRAM in order to use streaming engine, but is this a necessary step to use SE?

srikar varaganti said:
3. Can we use SE to move data from MSMC to C7 at all?

L2SRAM is technically not necessary; SE can read from MSMC or DDR as well. However, SE is tightly coupled with L2SRAM, so reading from L2SRAM gives the lowest latency and the highest throughput. If you look at the diagram and how L2SRAM is organized into 4 banks, you can have two 512-bit read paths (using both SEs) to the C7x core. However, lower in the diagram, the connection between the UMC and MSMC has only one read/write path, so you would get only half the throughput when reading from MSMC (as well as some additional latency).
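The factor of two follows directly from the path widths. As a rough sketch (peak figures only, ignoring latency, bank conflicts, and any contention on the shared path), the helpers below convert the number of 512-bit paths into bytes per cycle and into a lower bound on the cycles needed to stream a buffer. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PATH_BITS 512u /* width of one read path, as described above */

/* Peak bytes delivered per cycle with the given number of 512-bit paths. */
uint64_t peak_bytes_per_cycle(unsigned paths)
{
    return (uint64_t)paths * PATH_BITS / 8u;
}

/* Lower bound on cycles needed to stream `bytes` over `paths` paths. */
uint64_t min_cycles(uint64_t bytes, unsigned paths)
{
    uint64_t bpc = peak_bytes_per_cycle(paths);
    assert(paths > 0);
    return (bytes + bpc - 1) / bpc; /* round up to whole cycles */
}
```

Two SE paths from L2SRAM give a peak of 128 bytes per cycle and the single MSMC path gives 64, so streaming a 512 KB buffer takes at least 4096 cycles from L2SRAM versus 8192 from MSMC.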

From a write perspective, there is only one 512-bit write path back from the C7x core, so the choice between L2SRAM and MSMC matters less. In the case above, it might be worth writing the output to MSMC because of its larger size, as you mentioned, and to keep as much of L2SRAM as possible available for the input processing.

Let me know if that clarifies this for you. If you scroll past the figure in the document I mentioned, it explains the interface in more detail. Thank you for your patience on this topic.

Best,

Asha
