
TMS320C6747: low EMIFA async read speed caused by SCR multi-level cache request

Part Number: TMS320C6747

Hi,

    I work on a C6747 board in our project; a Xilinx Artix-7 FPGA is connected to the C6747 via EMIFA CS2 in async mode. I wrote a demo program to test the EMIF read and write timing. The write timing looks right, but the read timing does not.

    In my case the CPU runs at 375 MHz, the EMIFA clock is 125 MHz, and the FPGA sample clock is 40 MHz. I set CE2CFG as:

    wr/rd Setup -- 4 EMIFA clocks

    wr/rd Strobe -- 4 EMIFA clocks

    wr/rd Hold -- 4 EMIFA clocks

    TA -- 4 EMIFA clocks
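    For reference, a CE2CFG value for those timings could be built roughly as below. This is only a sketch: the register address (offset 0x10 from the EMIFA base 0x68000000), the field positions, and the "cycles minus one" encoding are my assumptions and should be checked against the C6747 EMIFA documentation.

```c
#include <stdint.h>

/* Assumed EMIFA CE2CFG (A1CR) address on C6747 -- verify in the datasheet. */
#define EMIFA_CE2CFG_ADDR 0x68000010u

/* Build a CE2CFG word for async mode; each timing field holds (cycles - 1).
   Assumed layout: W_SETUP[29:26] W_STROBE[25:20] W_HOLD[19:17]
                   R_SETUP[16:13] R_STROBE[12:7]  R_HOLD[6:4]
                   TA[3:2]        ASIZE[1:0] (1 = 16-bit bus) */
static uint32_t ce2cfg_value(uint32_t setup, uint32_t strobe,
                             uint32_t hold, uint32_t ta)
{
    uint32_t s = setup - 1, st = strobe - 1, h = hold - 1, t = ta - 1;
    return (s << 26) | (st << 20) | (h << 17)   /* write timings */
         | (s << 13) | (st << 7)  | (h << 4)    /* read timings  */
         | (t << 2)                             /* turnaround    */
         | 1u;                                  /* 16-bit bus    */
}

/* usage on target (not run here):
   *(volatile uint32_t *)EMIFA_CE2CFG_ADDR = ce2cfg_value(4, 4, 4, 4); */
```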

    My test code is shown below:

int i;
volatile unsigned short data;
while(1){
    // test FPGA EMIFA interface: A[13:0]
    for (i = 0; i < 14; i++){ // write
        *((unsigned short *)(0x60000000 + (1<<(i+1)))) = (1<<(13-i));
    }
    for (i = 0; i < 14; i++){ // read
        data = *((volatile unsigned short *)(0x60000000 + (1<<(i+1))));
    }
}

    My test results are shown below (the FPGA sample clock is 40 MHz in the image):

    It shows that the write timing is right, but the read timing is not. After each read there is a long delay. The delay slows the reads down so much that it wastes a lot of CPU time.

    I searched for "emifa read" on the forum and found many threads like this:

    e2e.ti.com/.../11649

    In this thread, Brad gives an explanation: the multi-level SCR request path causes the delay.

    e2e.ti.com/.../494280

    In this thread, Joerg Seiler said that the user can change the master priority.

    I also read the two wiki below:

    processors.wiki.ti.com/.../AM1x_SoC_Level_Optimizations
    processors.wiki.ti.com/.../AM1x_SoC_Architectural_Overview

    

    I don't think changing the master priority can solve the problem, because in my case there is no PRU, EDMA3, EMAC, USB, LCDC, or HPI; the only remaining masters are DSP MDMA and DSP CFG. So I don't think changing the master priority will help.

    Some engineers use EDMA3 to avoid this, but I don't want to use EDMA3, and in my project the read addresses in the CS2 FPGA region are not contiguous.

    So, is there a clear solution to this problem? I want to reduce the long read delay in a standalone program without using EDMA3.

    Or is there a way to prevent the CPU from caching the CS2 address region?
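    On the cacheability question: on C674x devices, whether the CPU caches an external window is controlled per 16 MB region by the MAR registers. Below is a sketch of clearing the bit for the CS2 window; the MAR base address (0x01848000), the one-register-per-16MB mapping, and the PC bit in bit 0 are assumptions to verify against the C674x megamodule reference guide (note that external regions are typically non-cacheable by default).

```c
#include <stdint.h>

/* Assumed base of the C674x MAR (memory attribute) registers. */
#define MAR_BASE 0x01848000u

/* One 32-bit MAR per 16 MB of address space; region n covers n * 16 MB,
   so 0x60000000 (EMIFA CS2) would fall in MAR96 at MAR_BASE + 0x180. */
static uint32_t mar_reg_addr(uint32_t ext_addr)
{
    return MAR_BASE + 4u * (ext_addr >> 24);
}

/* on target (not run here), clearing the PC bit marks the region uncached:
   *(volatile uint32_t *)mar_reg_addr(0x60000000u) &= ~1u; */
```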

    I really need your help reducing this long read delay in a standalone program without EDMA3, as the delay wastes so much CPU time in my project.

    Thanks.

  • My test code is shown below:

    int i;
    volatile unsigned short data;

    while(1){
        // test FPGA EMIFA interface: A[13:0]
        for (i = 0; i < 14; i++){ // write
            *((unsigned short *)(0x60000000 + (1<<(i+1)))) = (1<<(13-i));
        }
        for (i = 0; i < 14; i++){ // read
            data = *((volatile unsigned short *)(0x60000000 + (1<<(i+1))));
        }
    }

  • I have not changed the L1P/L1D/L2 RAM settings; they are at their defaults. I use L2 RAM as normal RAM, and my program is loaded into and runs from L2 RAM.

  • Hello,

    It does appear you are running into read latency due to the SCR and other latencies, unfortunately. As you've already read, when the CPU issues a read instruction, it stalls until the data is returned from the EMIF, unlike writes, which are "fire and forget."

    Have you already tried experimenting with smaller values for read strobe, hold, and turnaround (if your system allows for it)?

    Does using a long data type make a difference?

    Regards,
    Sahin

  • Performing 32-bit accesses will significantly increase your throughput, i.e. in between the two 16-bit sub-accesses you will see the programmed timings.  In fact, you could go one step further and use type 'long long' to perform a 64-bit access.

    What will your accesses look like?  Are they mostly linear accesses or will you be jumping around between a lot of addresses?  Will there be large amounts of data transferred?  Will it be "bursty"?
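    To illustrate the suggestion above: a single wide read stalls the CPU once, and the 16-bit sub-accesses then run back-to-back at the programmed EMIFA timings. The helper below is my sketch (the names are mine, and the byte-lane order assumes a little-endian configuration), unpacking one 64-bit read into four 16-bit FPGA words.

```c
#include <stdint.h>

/* One stalling 64-bit read instead of four stalling 16-bit reads.
   Assumes little-endian lane ordering -- verify on the actual device. */
static void read4(volatile const unsigned long long *src,
                  unsigned short out[4])
{
    unsigned long long v = *src;          /* single CPU read, single stall */
    out[0] = (unsigned short)(v);
    out[1] = (unsigned short)(v >> 16);
    out[2] = (unsigned short)(v >> 32);
    out[3] = (unsigned short)(v >> 48);
}
```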

  • I tried 16-bit, 32-bit, and 64-bit accesses; the results are shown below:

    This really does help: there is no latency between the sub-accesses, although the latency still exists after the 2 or 4 sub-accesses.

    I think the latency cannot be avoided, so I will use 32-bit/64-bit accesses in my program wherever the read addresses are contiguous.

    I think the reason this latency cannot be avoided is that the EMIFA controller is connected to the async BR7, so the read data must travel EMIFA -> BR7 (async) -> SCR2 -> BR3/BR4 -> SCR1 -> DSP MDMA, a long path that costs about 80 CPU clocks, estimated from the test timing above.
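    Putting the numbers from this thread into a rough model (all assumptions: 12 EMIFA cycles per 16-bit strobe from the 4/4/4 settings, a 3:1 CPU-to-EMIFA clock ratio, and the ~80 CPU cycles of SCR/bridge latency estimated above) shows why wider accesses amortize the fixed cost:

```c
/* Back-of-envelope cost model; constants taken from this thread's setup. */
static unsigned cycles_per_halfword(unsigned halfwords_per_access)
{
    const unsigned emifa_cycles = 12;  /* setup + strobe + hold = 4 + 4 + 4  */
    const unsigned cpu_ratio    = 3;   /* 375 MHz CPU / 125 MHz EMIFA       */
    const unsigned scr_latency  = 80;  /* estimated fixed cost per CPU read */
    unsigned per_access =
        halfwords_per_access * emifa_cycles * cpu_ratio + scr_latency;
    return per_access / halfwords_per_access;
}
/* ~116 CPU cycles per halfword for 16-bit reads vs ~56 for 64-bit reads */
```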

  • But write accesses go through the same DSP MDMA -> EMIFA route. Perhaps writes are pipelined along the route, while reads must complete one by one: the first read finishes its trip from the EMIFA to the DSP MDMA before the second read starts. In a 32/64-bit access, perhaps the sub-accesses can be pipelined. Is this pipelining explanation right?

  • I don't think the pipelining explanation is right. I think the read data travels the EMIFA -> DSP MDMA route in about the same number of cycles as a write. I think the latency occurs after the EMIFA -> DSP MDMA route, inside the CPU core, where something else, perhaps the cache, causes it. Can you give me a more detailed explanation of the hardware-level mechanism that causes the latency?

  • The performance difference between reads and writes is due to a significant difference in how pipelining works for writes vs. reads. For a write, the data is no longer needed, so the CPU doesn't have to wait for the write to "land" in memory. Writes are "fire and forget", i.e. the processor can immediately move on to the next instruction. Reads are much different because future instructions depend on having access to the data being read. For that reason, reads will stall the pipeline until the read has "landed".