EDMA3 transfer preemption by a DSP on EMIF-16 problem

Pawel Dabrowski

Other Parts Discussed in Thread: TMS320C6678

Hi,

A few words about our system:
We have a TMS320C6678 chip (using only one core) connected to a Cyclone V FPGA through EMIF-16.

We are able to send and receive data by writing to and reading from the appropriate memory addresses on the C6678.

Our system works right now in the following way:

1. Every 4 ms the FPGA generates an interrupt to the DSP through a GPIO interrupt.
   This informs us that data is ready to be read out.
2. After the interrupt we trigger EDMA3 through EDMA Channel Controller (TPCC) 1 to move the data
   to the processor. It takes about 3 ms to move the data.
   The EMIF is currently set to fairly slow timings, so we could speed this up. However, we expect
   the amount of data to grow, so we would ultimately end up at about 3 ms anyway.
3. We need to asynchronously read and write some FPGA registers during this 3 ms window.

Our problem is at step 3, because our reads and writes are stalled until the EDMA finishes its transfer.

We thought that the DSP has higher priority than the EDMA, so it could access the EMIF and FPGA even while the transfer is in progress.

We even tried changing QUEPRI (Queue Priority Register), setting everything to the lowest priority (7 instead of 0) on every channel controller. However, we suspect that this priority may only apply between channel controllers.

We also thought about splitting one big transfer into, say, two EDMA transfers, with one triggering the other. That would give the DSP room to read or write the registers sooner. However, we don't like this solution :).

We would appreciate any help on the matter.

Best regards,
Pawel Dabrowski

over 11 years ago

0 Pawel Dabrowski over 11 years ago

Prodigy 120 points

Hi,

In the EDMA3 documentation for KeyStone devices I found the RDRATE register for the transfer controllers.

According to the documentation:

"The EDMA3 transfer controller issues read commands at a rate controlled by the read
rate register (RDRATE). The RDRATE defines the number of idle cycles that the read
controller must wait before issuing subsequent commands. This applies both to
commands within a transfer request packet (TRP) and for commands that are issued
for different transfer requests (TRs). For instance, if RDRATE is set to 4 cycles between
reads, there are 3 inactive cycles between reads.

RDRATE allows flexibility in transfer controller access requests to an endpoint. For an
application, RDRATE can be manipulated to slow down the access rate, so that the
endpoint may service requests from other masters during the inactive EDMA3TC
cycles."

I have set RDRATE to 32 cycles between reads by writing to the appropriate memory address (for all TCs, to be safe).

This doesn't change anything in our RTA Raw logs. An access from DSP Core 0 still waits for the EDMA to finish, for the same amount of time. This is really strange.

What else can I do?

Best,

Paweł

0 Pubesh over 11 years ago in reply to Pawel Dabrowski

TI__Mastermind 20790 points

Pawel,

Are you using the EDMA LLD package in your application?
Please refer to the E2E post below; it is quite similar to your issue and may help you:
http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/248412.aspx

You can find more detailed information about EDMA transfers on the following wiki pages:
http://processors.wiki.ti.com/index.php/EDMA3#Types_of_EDMA3_transfer
http://processors.wiki.ti.com/images/b/b8/Eindhoven_JAN_12-10_IntroTo_Edma.pdf
http://processors.wiki.ti.com/index.php/Programming_the_EDMA3_using_the_Low-Level_Driver_(LLD)
http://processors.wiki.ti.com/index.php/Programming_EDMA_without_EDMA3LLD_package

0 Pawel Dabrowski over 11 years ago in reply to Pubesh

Prodigy 120 points

Hi Pubesh,

Thanks for the reply.

Yes, we are using the EDMA LLD package.

Our transfer works fine; we are able to send data back and forth.

I have three questions:

1) Can the processor preempt a large EDMA transfer? If not, can I set any priorities to change that?

As I wrote earlier, we have tried QUEPRI, RDRATE, etc. I even checked that the TC responsible for moving the data has the lowest PRI.

2) If the answer to the first question is no, is the only solution or workaround to use transfers with intermediate chaining?

As specified here: http://www.ti.com/lit/pdf/sprugs5 on page 99

"Breaking Up Large Transfers with Intermediate Chaining"

3) How does the cache come into play? Does it make it more difficult for the processor to gain access to the EMIF?

Best,

Paweł

0 RandyP over 11 years ago in reply to Pawel Dabrowski

TI__Guru* 84110 points

Pawel,

Can you put some specific details to the timing information you have listed, please:

What is the DSP clock speed?
What is the EMIF clock speed?
How many total cycles have you set for each async EMIF read and write? Please show the four EMIF AnCR registers' values.
What is the width of the EMIF to the FPGA?
How many bytes are being read by the EDMA from the FPGA in the large transfer? Please show the PARAM values for the transfer.
When during the 3 ms EDMA transfer does the DSP code try to read or write from the FPGA?
How much is the DSP reading and writing?
How close are the addresses that the EDMA is reading compared to what the DSP is reading and writing?
Since you mentioned cache, is any of this range of the EMIF addressing space cacheable by the DSP? I would suspect it should not be cached.
To what memory endpoint is the EDMA copying the data from the FPGA? DDR3, MSMCSRAM, L2?

Sorry for all the questions, but the answers will clear up a lot of possible confusion over what is happening in your system.

Regards,
RandyP

0 Pawel Dabrowski over 11 years ago in reply to RandyP

Prodigy 120 points

Hi Randy,

Of course, I'll be glad to add more information.

RandyP said:
What is the DSP clock speed?

1GHz

What is the EMIF clock speed?

EMIFK_CLK = CPU/6 = 166.67 MHz

How many total cycles have you set for each async EMIF read and write? Please show the four EMIF AnCR registers' values.
void ConfigEmif16()
{
    /* Config nand FCR reg. 16 bit , 4-bit HW ECC */
    hEmif16Cfg->A1CR = (0
        | (0 << 31)     /* selectStrobe */
        | (0 << 30)     /* extWait (never with NAND) */
        | (0xf << 26)   /* writeSetup  10 ns */
        | (0x3f << 20)  /* writeStrobe 40 ns */
        | (7 << 17)     /* writeHold   10 ns */
        | (0xf << 13)   /* readSetup   10 ns */
        | (0x3f << 7)   /* readStrobe  60 ns */
        | (7 << 4)      /* readHold    10 ns */
        | (3 << 2)      /* turnAround  40 ns */
        | (1 << 0));    /* asyncSize   16-bit bus */

    /* Config nand FCR reg. 8 bit NAND, 4-bit HW ECC */
    hEmif16Cfg->A3CR = (0
        | (0 << 31)     /* selectStrobe */
        | (0 << 30)     /* extWait (never with NAND) */
        | (0x2 << 26)   /* writeSetup  10 ns */
        | (0x8 << 20)   /* writeStrobe 40 ns */
        | (0x2 << 17)   /* writeHold   10 ns */
        | (0x2 << 13)   /* readSetup   10 ns */
        | (0xA << 7)    /* readStrobe  60 ns */
        | (0x2 << 4)    /* readHold    10 ns */
        | (0x2 << 2)    /* turnAround  40 ns */
        | (0x1 << 0));  /* asyncSize   16-bit bus */

    /* Set the wait polarity */
    hEmif16Cfg->AWCCR = (0x80  /* max extended wait cycle */
        | (0 << 16)     /* CS2 uses WAIT0 */
        | (0 << 28));   /* WAIT0 polarity low */

    /*
    Wait Rise.
    Set to 1 by hardware to indicate rising edge on the
    corresponding WAIT pin has been detected.
    The WP0-3 bits in the Async Wait Cycle Config register have
    no effect on these bits.
    */

    /*
    Asynchronous Timeout.
    Set to 1 by hardware to indicate that during an extended
    asynchronous memory access cycle, the WAIT signal did not
    go inactive within the number of cycles defined by the
    MAX_EXT_WAIT field in Async Wait Cycle Config register.
    */

    hEmif16Cfg->IRR = (1    /* clear async timeout */
        | (1 << 2));        /* clear wait rise */
}
What is the width of the EMIF to the FPGA?

16 bits

How many bytes are being read by the EDMA from the FPGA in the large transfer? Please show the PARAM values for the transfer.

I was mistaken at the beginning: there are a couple of transfers chained together. For this question and the next I will give you new
values, because we have limited the transfer size.

So now there are 4 EDMA transfers occurring:
----------------------------------
| size | src        | dst        |
----------------------------------
| 1536 | 0x78010000 | 0x80A01480 |
| 1024 | 0x78018000 | 0x80A07480 |
| 128  | 0x78020000 | 0x80A0B480 |
| 224  | 0x78024000 | 0x80A0BB80 |
----------------------------------
* size specified in bytes

Params:
edma3Result = EDMA3_DRV_setSrcParams(transfer->hedma, transfer->edma_chan, transfer->src,
                                     EDMA3_DRV_ADDR_MODE_INCR, EDMA3_DRV_W8BIT);
if (edma3Result != EDMA3_DRV_SOK) return edma3Result;
edma3Result = EDMA3_DRV_setDestParams(transfer->hedma, transfer->edma_chan, transfer->dst,
                                      EDMA3_DRV_ADDR_MODE_INCR, EDMA3_DRV_W8BIT);
if (edma3Result != EDMA3_DRV_SOK) return edma3Result;
edma3Result = EDMA3_DRV_setTransferParams(transfer->hedma, transfer->edma_chan,
                                          4, transfer->size>>2, 1, 0, EDMA3_DRV_SYNC_AB);
if (edma3Result != EDMA3_DRV_SOK) return edma3Result;
When during the 3 ms EDMA transfer does the DSP code try to read or write from the FPGA?

How much is the DSP reading and writing?

After lowering the amount of data to be transferred, the time from the event to EDMA transfer completion is now 789 us (all timings were measured using logs with a Live session).
And yes, during this period the DSP tries to read and write some control registers. It reads 56 bytes and writes 122 bytes.
With those settings, both the read and the write occur after the EDMA transfer finishes.

I then played with the transfer params, i.e.:
edma3Result = EDMA3_DRV_setTransferParams(transfer->hedma, transfer->edma_chan, 8, transfer->size>>3, 1, 0, EDMA3_DRV_SYNC_AB);
This now allows the DSP to access the FPGA to some extent. However, the whole EDMA transfer time increased from 789 us to 931 us, the DSP read of those 56 bytes takes 300 us, and the subsequent write of 122 bytes takes 82 us. I tried different values, but this was the best scenario.

With the old transfer param settings, when the read and write occur after the EDMA transfer (so the EMIF is free), it takes 21 us to read the 56 bytes and 23 us to write the 122 bytes.

How close are the addresses that the EDMA is reading compared to what the DSP is reading and writing?

DSP reads and writes to 0x78000000.

Since you mentioned cache, is any of this range of the EMIF addressing space cacheable by the DSP? I would suspect it should not be cached.

To what memory endpoint is the EDMA copying the data from the FPGA? DDR3, MSMCSRAM, L2?

We are transferring data to DDR3, which is set as cacheable; the EMIF is not. L2 is set to all-cache mode.

Best,

Paweł

0 RandyP over 11 years ago in reply to Pawel Dabrowski

TI__Guru* 84110 points

Pawel,

This is very helpful, and I have a few follow-on questions. I will start with some comments that may not quite be answers, but at least they are not questions.

Internal bus priorities are not the same as Quality of Service implementations. If two internal bus requests get to a decision point, the higher priority one will win the tie. But if a lower priority request gets there first, it will go first. Unfortunately, "gets there first" is not always easy to define or determine or predict.
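To make the "gets there first" point concrete, here is a minimal sketch of a non-preemptive priority arbiter. It is only a toy model under stated assumptions: the `BusRequest` struct and `arbitrate()` function are hypothetical names invented for illustration, time is in arbitrary units, and nothing here describes the actual TeraNet implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one bus request; not a real TeraNet structure. */
typedef struct {
    unsigned arrival;   /* time the request reaches the decision point */
    unsigned priority;  /* 0 = highest, 7 = lowest (same sense as QUEPRI) */
    unsigned duration;  /* time the endpoint stays busy with this request */
    unsigned start;     /* output: time service begins */
    int      done;      /* output: set once the request has been served */
} BusRequest;

/* Non-preemptive arbitration: each time the endpoint becomes free, serve
 * the highest-priority request among those that have already arrived.
 * A request that is already in service is never interrupted. */
static void arbitrate(BusRequest *req, size_t n)
{
    unsigned now = 0;
    size_t served = 0;

    assert(req != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        req[i].done = 0;

    while (served < n) {
        size_t pick = n;               /* n means "nothing pending" */
        unsigned next_arrival = ~0u;

        for (size_t i = 0; i < n; i++) {
            if (req[i].done)
                continue;
            if (req[i].arrival <= now) {
                if (pick == n || req[i].priority < req[pick].priority)
                    pick = i;
            } else if (req[i].arrival < next_arrival) {
                next_arrival = req[i].arrival;
            }
        }
        if (pick == n) {               /* idle until the next request arrives */
            now = next_arrival;
            continue;
        }
        req[pick].start = now;
        req[pick].done = 1;
        now += req[pick].duration;
        served++;
    }
}
```

In this model a priority-7 request of duration 100 that arrives at time 0 delays a priority-0 request arriving at time 1 until time 100. The higher priority only helps the later request win against other requests that are pending when the endpoint frees up.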

Pawel Dabrowski said:

1) Can processor preempt large EDMA transfer ? If not can I set any priorities to change it?

2) If the first question is no, then the only solution or workaround is to use transfers with Intermediate chaining ?

1) No preemption can occur, but commands from one bus master (TPCC) can interleave with commands from another bus master (DSP). This interleaving will happen within the TeraNet 3_A bus switch shown in the C6678 Data Manual.

2) We always recommend breaking up long transfers into multiple smaller transfers. When accessing a fast endpoint like the DDR3, this is not a big problem. But when accessing a very slow peripheral like your FPGA, this can become a serious problem, such as what you have found in this situation.

If that is the conclusion you were looking for and that is the end of it, you are welcome to mark this thread Answered and go back to getting your application working.

Or if you want to discuss options, you are welcome to continue this. I will go ahead with some comments and questions that you can choose to use or not.

Your comments in the EMIF A3CR register show different times than my count. The value that is written into the timing fields represents 1 less than the actual value, such as for the read-strobe where 0xA = 11 EMIF cycles = 66ns.

The total delay per EMIF read should be 17 EMIF clocks per 16 bits = 102ns per 16 bits or 51ns per byte. When I divide 789us by ( 1536 + 1024 + 128 + 224 ) = 2912, I get 271ns per byte. I am not sure what the measurement tool is that you are using to get your delays, but you may want to look at the EMIF control signals to see how they are actually behaving.
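The arithmetic above can be reproduced with a short sketch. The field positions and widths are inferred from the shifts used in ConfigEmif16() earlier in the thread, the 6 ns clock period follows from the stated 166.67 MHz EMIF clock, and the helper names are hypothetical. Turnaround cycles and any wait-state extension are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Extract a bit field of `width` bits starting at bit `shift`. */
static unsigned emif_field(uint32_t reg, unsigned shift, unsigned width)
{
    assert(width > 0 && width < 32 && shift + width <= 32);
    return (unsigned)((reg >> shift) & ((1u << width) - 1u));
}

/* EMIF clocks for one asynchronous read: setup + strobe + hold.
 * Each field stores (cycles - 1), hence the +1 per phase. */
static unsigned emif_read_cycles(uint32_t ancr)
{
    return (emif_field(ancr, 13, 4) + 1)    /* readSetup  */
         + (emif_field(ancr, 7, 6) + 1)     /* readStrobe */
         + (emif_field(ancr, 4, 3) + 1);    /* readHold   */
}

/* EMIF clocks for one asynchronous write, same encoding. */
static unsigned emif_write_cycles(uint32_t ancr)
{
    return (emif_field(ancr, 26, 4) + 1)    /* writeSetup  */
         + (emif_field(ancr, 20, 6) + 1)    /* writeStrobe */
         + (emif_field(ancr, 17, 3) + 1);   /* writeHold   */
}

/* Theoretical time per byte, in ns, for a bus `bus_bytes` wide. */
static unsigned emif_ns_per_byte(unsigned cycles, unsigned clk_period_ns,
                                 unsigned bus_bytes)
{
    assert(bus_bytes > 0);
    return cycles * clk_period_ns / bus_bytes;
}

/* Measured average time per byte, in ns (integer division truncates). */
static unsigned measured_ns_per_byte(unsigned total_ns, unsigned bytes)
{
    assert(bytes > 0);
    return total_ns / bytes;
}
```

With the A3CR value posted above, emif_read_cycles() yields 17 clocks, and emif_ns_per_byte(17, 6, 2) yields 51 ns per byte, against a measured average of roughly 271 ns per byte for 2912 bytes in 789 us.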

There may be an undocumented register bit in the EMIF that causes extra delays. I am not sure if this was changed in the C6678 or not, but you can search this forum and the C64x Single Core DSP forum for something like "undocumented EMIF slow" (no quotes) to find some discussions about how to fix it, if it exists and is needed.

What is the DSP doing with the FPGA reads and writes? Is it just reading a series of sequential registers or data locations and then writing some sequential locations? Is it doing read-test-repeat for status or read-modify-write for bit setting/clearing?

Because the EMIF timing is so much slower than the DSP clock, the RDRATE is not likely to have much effect. 32 cycles would be counted at the CPU/3 rate which is 96ns, which is less than a single EMIF 16-bit cycle.
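As a rough check of that comparison, the sketch below computes the RDRATE idle time in picoseconds and compares it with one EMIF access. The CPU/3 divider and the 102 ns access time are the figures quoted in this reply; the function names are hypothetical, and the comparison is only a simplified model of why the gap is unlikely to be visible.

```c
#include <assert.h>

/* Idle time between TC read commands, in picoseconds.
 * tc_clk_div is the divider from the CPU clock to the TC clock. */
static unsigned long rdrate_idle_ps(unsigned rdrate_cycles, unsigned cpu_mhz,
                                    unsigned tc_clk_div)
{
    assert(cpu_mhz > 0 && tc_clk_div > 0);
    unsigned long tc_period_ps = 1000000UL * tc_clk_div / cpu_mhz;
    return rdrate_cycles * tc_period_ps;
}

/* Returns 1 if the RDRATE gap is longer than one EMIF access, i.e. long
 * enough to open an idle slot on the EMIF in this simplified model. */
static int rdrate_gap_exceeds_access(unsigned long idle_ps,
                                     unsigned long access_ps)
{
    return idle_ps > access_ps;
}
```

For 32 cycles at 1 GHz / 3, rdrate_idle_ps(32, 1000, 3) gives 96000 ps, which is below the 102000 ps of a single 16-bit EMIF read, so the second function returns 0.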

If you use self-chaining or any chaining to pace the DMA reads, be sure to use the TCCMOD=NORMAL setting to make sure the DMA Transfer Requests do not get bunched up even with chaining.

We can talk more, if you like.

Regards,
RandyP

0 Pawel Dabrowski over 11 years ago in reply to RandyP

Prodigy 120 points

Hi Randy,

Thanks for the answers, especially for clearing up 1) and 2).

4) Here's another question:
In the application report "Throughput Performance Guide for C66x KeyStone Devices"
www.ti.com/lit/pdf/sprabk5

On page 17:

"There is no contention between TCs, but because the DDR3 EMIF cannot do a read
from DDR3 and a write to DDR3 in parallel, the EDMA needs to wait until the previous
operation is completed."

There are no specifics about the EDMA transfer; however, if I broke the transfer into multiple smaller transfers, could I both read from and write to the DDR3 EMIF? Or would this only work if the EDMA was reading from one endpoint and writing to a different one?

The DDR3 EMIF is much more complex than EMIF16, but isn't it similar?

5)

RandyP said:
There may be an undocumented register bit in the EMIF that causes extra delays. I am not sure if this was changed in the C6678 or not, but you can search this forum and the C64x Single Core DSP forum for something like "undocumented EMIF slow" (no quotes) to find some discussions about how to fix it, if it exists and is needed.

Yes, I have already seen this bit and tested it. It seems to add some performance to the EMIF. Does it cause any issues?

6) Regarding the wrong comments: this is code that I copied from our bootloader.

I suppose that someone simply took it from an example, changed the values and didn't update the comments.

RandyP said:

What is the DSP doing with the FPGA reads and writes? Is it just reading a series of sequential registers or data locations and then writing some sequential locations? Is it doing read-test-repeat for status or read-modify-write for bit setting/clearing?

The DSP is simply reading sequential registers; the information from these registers is used for some decisions. After that it writes sequential registers to set the operation of the FPGA.

TCCMOD is set to NORMAL.

Is there anything else that I can do to determine the issue?

Best,

Paweł

0 RandyP over 11 years ago in reply to Pawel Dabrowski

TI__Guru* 84110 points

Pawel,

Pawel Dabrowski said:
4) "... the EDMA needs to wait until the previous
operation is completed."

I am really not sure what you are asking with the reference and the statements here. Please continue with more clarification of your concern or question.

Pawel Dabrowski said:
DDR3 EMIF is much more complex than EMIF16, but isn't it similar ?

Yes, they are similar. Both are EMIFs. Again, I am not sure what is the concern or question.

Pawel Dabrowski said:

5) Yes, I have already seen this bit and tested it. It seems to add some performance to the EMIF. Does it cause any issues?

No, this fixes a problem. It does not add others.

Pawel Dabrowski said:
6) Regarding wrong comments . .

For your benefit and any other posters, the quality of the advice you get is proportional to the quality of information you present. I tend to read comments more often than numerical code values, so I suspect others do, too. I tend to lose heart when I realize the lack of value of my time spent on a thing. Of course, I would say something similar if there were no comments at all.

Pawel Dabrowski said:

7) The DSP is simply reading sequential registers; the information from these registers is used for some decisions. After that it writes sequential registers to set the operation of the FPGA.

My recommendation is to use a DMA operation (or QDMA) to read the FPGA registers into a buffer, then another DMA/QDMA to do the writes, too. This would be the most efficient way to get those operations done even if you were not running into this problem.

DSP reads will run one-at-a-time, waiting for the first to complete before moving to the next read. In the case where each read has to slip in between DMA read bursts, the delay for each DSP read will get extended a long time.
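A back-of-the-envelope sketch shows how far the measured DSP access times are from the bus minimum. It assumes the 16-bit bus and the 102 ns read time quoted earlier in the thread, plus a 90 ns write time derived the same way from the A3CR write fields (15 EMIF clocks at 6 ns). The function names are hypothetical, and turnaround cycles and CPU overhead are ignored.

```c
#include <assert.h>

/* Number of bus accesses needed to move `bytes` over a bus `bus_bytes` wide. */
static unsigned bus_accesses(unsigned bytes, unsigned bus_bytes)
{
    assert(bus_bytes > 0);
    return (bytes + bus_bytes - 1) / bus_bytes;
}

/* Lower bound on the bus time for a block, in ns. */
static unsigned min_block_ns(unsigned bytes, unsigned bus_bytes,
                             unsigned ns_per_access)
{
    return bus_accesses(bytes, bus_bytes) * ns_per_access;
}

/* Average time per access implied by a measured block time, in ns. */
static unsigned avg_ns_per_access(unsigned measured_ns, unsigned bytes,
                                  unsigned bus_bytes)
{
    unsigned n = bus_accesses(bytes, bus_bytes);
    assert(n > 0);
    return measured_ns / n;
}
```

The 56-byte read needs 28 bus accesses, so the bus minimum is 2856 ns. The reported 21 us with the EMIF idle corresponds to 750 ns per access, and the reported 300 us during the DMA corresponds to about 10.7 us per access, which is consistent with each serialized DSP read waiting behind DMA traffic.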

Pawel Dabrowski said:
Is there anything else that I can do to determine the issue ?

To definitely state what the problem is, we would need a lot more detail on exactly when the DSP instructions occur relative to when the associated EMIF operation occurs. Since you have not been offering any logic analyzer traces showing these relationships, that may not be an option in your lab. So we will have to go more with educated speculation. What I mean is that I agree that you are getting stalls on the DSP accesses due to the DMA reads from the slow EMIF.

How often does the WAIT0 line go active extending the FPGA reads and writes? This could affect my understanding so far of the timing relationships. If WAIT0 is used often, or only when certain things happen, or when switching from one address range to another, etc., that can matter for a solution.

In general terms, it would be good to understand the timing relationships between the DMA requests and the duration of those transfers. I would expect that if you created a self-chaining DMA that will break up the transfers into lots of short bursts, the timing should not get hurt by much. But there will be some fixed delay between the bursts. I would run experiments with shrinking the burst size (ACNT for A sync or ACNT*BCNT for AB sync) down to pretty small.
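For planning such experiments, a small sketch can estimate how long one transfer request keeps the EMIF busy for a given ACNT/BCNT and sync type. It uses the theoretical 51 ns per byte from earlier in the thread, so real bursts will be longer (the measured average was closer to 271 ns per byte). The `SyncType` enum and function names are hypothetical and are not the LLD definitions.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the EDMA3 LLD enums. */
typedef enum { SYNC_A, SYNC_AB } SyncType;

/* Bytes moved per transfer request (one trigger or chain event). */
static unsigned bytes_per_tr(unsigned acnt, unsigned bcnt, SyncType sync)
{
    return (sync == SYNC_AB) ? acnt * bcnt : acnt;
}

/* Number of transfer requests needed to move the whole block. */
static unsigned trs_for_block(unsigned total_bytes, unsigned per_tr)
{
    assert(per_tr > 0);
    return (total_bytes + per_tr - 1) / per_tr;
}

/* Estimated time one transfer request occupies the EMIF, in ns. */
static unsigned burst_ns(unsigned per_tr, unsigned ns_per_byte)
{
    return per_tr * ns_per_byte;
}
```

For the 1536-byte transfer with ACNT=4, BCNT=384 and AB sync, the whole block is one request of about 78 us at the theoretical rate. With an arbitrary illustrative choice of ACNT=64, BCNT=24 and A sync with self-chaining, it becomes 24 requests of about 3.3 us each, which bounds how long another access has to wait for the current burst to complete.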

If you use a DMA/QDMA for the DSP accesses and use the same EDMACC and EDMATC, you will get that transfer to slip in between the FPGA reads once the current short burst completes.

Regards,
RandyP

0 Pawel Dabrowski over 11 years ago in reply to RandyP

Prodigy 120 points

Hi Randy,

Sorry that I haven't replied earlier; we are pretty busy here, with a deadline coming.

My question in 4) is why there is no interleaving of those transfers. Is it because src and dst
are both in one endpoint?

RandyP said:

For your benefit and any other posters, the quality of the advice you get is proportional to the quality of information you present. I tend to read comments more often than numerical code values, so I suspect others do, too. I tend to lose heart when I realize the lack of value of my time spent on a thing. Of course, I would say something similar if there were no comments at all.

I tried to prepare my replies as thoroughly as possible. Sorry for that; we are under great pressure here, so I didn't have time to check it.

Thanks for your tips and answers. I got all I wanted, so I'll mark the answer as verified. I won't be able to investigate further now, and probably not in the future (because we'll switch to PCIe).

Best regards,
Pawel
