EMIF16 to transfer data from FPGA to C6678

Andrey Savinkov

Other Parts Discussed in Thread: TMS320C6678

Hello all!

We have the TMS320C6678 EVM and want to transfer a large amount of data (16-bit words) from an external FPGA. We plan to use EMIF16 for this. Unfortunately, I'm a beginner in this area. I have read several discussions of EMIF16 on this forum. As far as I understand, the most effective way is:

1. EMIF16 CEn

2. CEn -> DDR by EDMA

So, in my project I have configured EMIF16 and tried to read data. My simple code is:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
/* The CSL EMIF16 register definitions from the PDK are also required. */

#define EMIF16_DATA_REG_ADR 0x70000000
CSL_Emif16Regs EMIF16_REGS;

void main ()
{
    volatile uint16_t *pData = (volatile uint16_t *) EMIF16_DATA_REG_ADR;
    volatile uint16_t  EMIF_DATA;
    clock_t            t_start, t_stop, t_overhead;

    // Little Endian mode
    CSL_FINS(EMIF16_REGS.RCSR, EMIF16_RCSR_BE, 0);

    // enable 16-bit mode
    CSL_FINS(EMIF16_REGS.A1CR, EMIF16_A1CR_ASIZE, 1);

    /* FOR CHIP SELECT 0 */
    EMIF16_REGS.A0CR = (0
              | (1u << 31)    /* selectStrobe */
              | (0u << 30)    /* extWait */
              | (1u << 26)    /* writeSetup  12 ns */
              | (3u << 20)    /* writeStrobe 24 ns */
              | (0u << 17)    /* writeHold    6 ns */
              | (1u << 13)    /* readSetup   12 ns */
              | (7u << 7)     /* readStrobe  48 ns */
              | (0u << 4)     /* readHold     6 ns */
              | (1u << 2)     /* turnAround  12 ns */
              | (1u << 0));   /* asyncSize   16-bit bus */

    /* Set the wait polarity */
    CSL_FINS(EMIF16_REGS.AWCCR, EMIF16_AWCCR_WP0, CSL_EMIF16_AWCCR_WP0_WAITLOW);
    CSL_FINS(EMIF16_REGS.AWCCR, EMIF16_AWCCR_CE0WAIT, CSL_EMIF16_AWCCR_CE0WAIT_WAIT0);
    EMIF16_REGS.AWCCR = (0x80       /* max extended wait cycle */
              | (0u << 16)    /* CE0 uses WAIT0 */
              | (0u << 28));  /* WAIT0 polarity low */

    /* Measure the cost of two back-to-back clock() calls */
    t_start = clock();
    t_stop = clock();
    t_overhead = t_stop - t_start;

    /* Time a single 16-bit read from the EMIF16 data space */
    t_start = clock();
    EMIF_DATA = *pData;
    t_stop = clock();

    printf("\tIN DATA VALUE: %d\n", EMIF_DATA);
    printf("\tIN DATA: %ld clock cycles\n", (long)((t_stop - t_start) - t_overhead));
}

Here I read the data once and measure how long the read takes (the statement EMIF_DATA = *pData;). I found that it takes about 600 clock cycles! This is very slow. The CPU frequency is 1 GHz, so 600 cycles is about 0.6 µs per value (2 bytes). In total that is only about 1.67 million words per second (about 3.3 MB/s). To speed up reading I have to use EDMA3, but I have no idea how to do this. Could anybody help me? How do I configure and use EDMA? Is there any sample code?
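The throughput arithmetic can be checked with a small host-side helper. This is only a sketch: the function name is made up for illustration and is not part of any TI library. For 600 cycles per 2-byte read at 1 GHz it gives roughly 3.3 million bytes per second.

```c
#include <assert.h>

/* Bytes per second for back-to-back reads that each cost
 * `cycles_per_read` CPU cycles and return `bytes_per_read` bytes.
 * Hypothetical helper, for illustration only. */
double read_throughput_bytes_per_s(double cpu_hz, double cycles_per_read,
                                   unsigned bytes_per_read)
{
    assert(cpu_hz > 0.0 && cycles_per_read > 0.0);
    return cpu_hz / cycles_per_read * (double)bytes_per_read;
}
```

Calling it with (1e9, 600.0, 2) reproduces the figure for the single-read measurement above.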

over 12 years ago

0 Alberto Chessa over 12 years ago

Mastermind 6670 points

Hi,

I'm not sure how clock() is implemented. The TI libc redirects it to the HOSTclock() function. To take small, precise timings you have to use the built-in CPU timestamp counter TSCL/TSCH (resolution: one CPU clock, that is 1 ns at 1 GHz):

#include <c6x.h>

...

TSCL=0; //start

volatile unsigned long t_start=TSCL;

EMIF_DATA=*pData;

volatile unsigned long t_stop=TSCL;

Note that, even so, measuring a single read can carry a relatively large overhead, depending on the optimization level and the surrounding code.
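The bookkeeping around two counter samples can be sketched in portable C as below. The helper names are hypothetical; on the DSP the samples would come from TSCL (and TSCH for the upper half). The unsigned subtraction stays correct even if the 32-bit counter wraps once between the two samples.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the high and low halves of a 64-bit free-running counter. */
uint64_t counter64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Net cost of the measured code: raw delta minus the cost of taking the
 * two samples. The subtraction is modulo 2^32, so a single wrap of the
 * counter between the samples does not matter. */
uint32_t net_cycles(uint32_t t_start, uint32_t t_stop, uint32_t t_overhead)
{
    uint32_t raw = t_stop - t_start;
    assert(raw >= t_overhead);  /* otherwise the overhead sample is suspect */
    return raw - t_overhead;
}
```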

0 Andrey Savinkov over 12 years ago in reply to Alberto Chessa

Intellectual 375 points

Hi Alberto.

In my case clock() is working more or less correctly. I have checked it.

I think the problem is how to configure EDMA. Any ideas ?

0 Johannes over 12 years ago in reply to Andrey Savinkov

Mastermind 6240 points

Hi,

The DMA transfer will be done after

EMIF_DATA = *pData;

right?

One thing you can do to measure the execution time from the C code is to take the average over a number of reads.
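A rough sketch of such an averaged measurement follows. The counter is passed in as a function pointer so the code can be tried on a host; on the DSP that function would simply return TSCL. All names here (tick_fn, fake_now, average_read_ticks) are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*tick_fn)(void);

/* Host-side stand-in for the hardware counter: advances 1000 ticks per call. */
uint32_t fake_now(void)
{
    static uint32_t t;
    t += 1000u;
    return t;
}

/* Average ticks per 16-bit read over n reads of the same location.
 * The loop overhead is included in the result. */
uint32_t average_read_ticks(volatile const uint16_t *src, uint32_t n, tick_fn now)
{
    uint32_t i, start, stop;
    uint16_t sink = 0;

    assert(src != 0 && n > 0u && now != 0);
    start = now();
    for (i = 0; i < n; i++)
        sink = (uint16_t)(sink ^ *src);  /* volatile read, never optimized away */
    stop = now();
    (void)sink;
    return (stop - start) / n;
}
```

With the stand-in counter the result only reflects the fake tick step, so the numbers are meaningful only when the real counter is plugged in on the target.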

For the DMA, take a look at the manual www.ti.com/general/docs/lit/getliterature.tsp?baseLiteratureNumber=sprugs5 and also see this wiki: http://processors.wiki.ti.com/index.php/Programming_the_EDMA3_using_the_Low-Level_Driver_%28LLD%29

At the bottom of the page you will find a link to download it. Once it is installed, import the example project stored in edmaInstallationDIR\examples\edma3_driver\evm6678 into CCS.

There's another example in pdk_C6678InstallDIR\packages\ti\csl\example\edma

0 Alberto Chessa over 12 years ago in reply to Andrey Savinkov

Mastermind 6670 points

Hi,

Your message reminded me that I had not yet measured my FPGA access time, so I have done it: for me too the access time seems to be very high. My EMIF16 setup is a bit faster; a single read should take about 72 ns, but I measured 240 cycles when storing the data into MSMC.

To verify the measurement method I ran the same code addressing only MSMC memory (with cache enabled) and I got 7 cycles, which corresponds to the cycle count calculated from the assembler listing (1 for the read + 4 delay + 1 for the write + 1 for reading TSCL).

So, can someone from TI give us some advice?

0 Andrey Savinkov over 12 years ago in reply to Johannes

Intellectual 375 points

Thanks for the information. I'm reading about DMA now.

I will describe my results later.

0 Andrey Savinkov over 12 years ago in reply to Alberto Chessa

Intellectual 375 points

Hi Alberto,

Do you have any sample code? I'm just not sure that I have configured EMIF16 correctly.

0 RandyP over 12 years ago in reply to Andrey Savinkov

TI__Guru* 84110 points

Andrey,

Andrey Savinkov said:
In my case clock() is working more or less correctly. I have checked it.

Please explain how you checked the clock() function and determined that it is working correctly. I have seen other users who incorrectly used the clock() function, which is a wrapper for the HOSTclock() function, which is not going to give the results that you need.

Please try the TSCL method that Alberto Chessa describes above, and report back the number of cycles from that method. That is a known accurate method.

Andrey Savinkov said:
I'm not sure that I have configured EMIF16 correctly.

Please look at the EMIF16 pins (EMIFCE0n and EMIFOEn, to start) with an oscilloscope to see how they are behaving. This is needed to find out if you have configured the EMIF16 correctly.

A quick look through my computer did not turn up any examples, although I might not have been searching for the right thing. The NOR Flash examples should be perfect, though. You should search this E2E forum for other discussions of the EMIF16, and also search TI.com and the Wiki.

Make sure cache is turned off by clearing the MAR bit associated with your EMIF16 address region. If cache is enabled, it will read an entire cache line as part of a single read request. In that case, a second read of the same location would be much faster.

You can read more about the MAR bits in the C66x CorePac User Guide and the C66x DSP Cache User Guide.
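As a quick illustration of which MAR register is "associated with" an address: each MAR covers a 16 MB region, so the index is the address shifted right by 24 bits. The helper names below are hypothetical; for the EMIF16 address 0x70000000 used in this thread the index works out to 112.

```c
#include <assert.h>
#include <stdint.h>

#define MAR_REGION_SHIFT 24u   /* each MAR covers 2^24 bytes = 16 MB */

/* Index of the MAR register that controls cacheability of `addr`. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> MAR_REGION_SHIFT);
}

/* First address of the 16 MB region controlled by MAR `index`. */
uint32_t mar_region_base(unsigned index)
{
    assert(index < 256u);
    return (uint32_t)index << MAR_REGION_SHIFT;
}
```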

Regards,
RandyP

0 Alberto Chessa over 12 years ago in reply to Andrey Savinkov

Mastermind 6670 points

Andrey Savinkov said:

Do you have any sample code? I'm just not sure that I have configured EMIF16 correctly.

My code is inside a bigger application, so I don't have a stand-alone sample. Anyway, my setup is essentially the same as yours (just a small difference in the cycle counts due to the different devices). I think your setup should be good.

0 Alberto Chessa over 12 years ago in reply to RandyP

Mastermind 6670 points

RandyP said:

Make sure cache is turned off [...]

My setup was with cache off. When I turn it on (and invalidate the cache before starting to read) and read, for instance, 512K, I find more or less the expected number: in my case 70 ticks (expected 68), against 250 ticks with cache disabled (when reading 512K). In this case (cache on), a single read costs as much as 64 reads (L2 cache, 128-byte cache line).

So it seems there is an additional (undocumented?) overhead on a single, non-cached read.

I suppose the cached read time is more or less equal to the DMA time, but a cached CPU read is useless when reading a FIFO (I have not tried reading multiple times from the same address).

0 RandyP over 12 years ago in reply to Alberto Chessa

TI__Guru* 84110 points

Alberto,

In your case (not ignoring Andrey's case), what are you counting as "ticks"? TSCL counts or EMIFCLK's?

Let me ask if I understand clearly:

1. With cache on and reading 512K (Bytes?) using DSP code, you measure 70 ticks average per each of the 256K 16-bit read operations. This represents 4K cache-line reads due to cache read misses, each followed by 63 cache read hits.

2. With cache off and reading 512K (Bytes?) using DSP code, you measure 250 ticks average per each of the 256K 16-bit read operations.

3. With cache on and reading 2 Bytes, you measure 70 ticks * 64 reads (cache line) or 4480 ticks total.

Are these correct?

What is your DSP clock speed, your EMIFCLK speed, and your EMIF read timing parameters (programmed or +1 included)?

Regards,
RandyP

0 Alberto Chessa over 12 years ago in reply to RandyP

Mastermind 6670 points

RandyP said:

In your case (not ignoring Andrey's case), what are you counting as "ticks"? TSCL counts or EMIFCLK's?

TSCL count, with DSP clock 1GHz (on a custom board, verified with an external oscilloscope).

I checked the bypass of the clock source for the other SYSCLKs and I suppose it is correct (not bypassed, so EMIF16 uses CPU clock / 6).

With E=6ns, RS=2 (12ns), RST=4 (24ns), RH=1 (6ns) I expect (RS+RST+RH+3)*E+3 = 63 ns (= 63 TSCL ticks), turn around=3 (18ns)

My interpretation of the datasheet is that I do not have to add 1 to the clock count, but 0 is considered a 1.
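The expected read time can be worked through in a few lines of C. This is a sketch that evaluates the expression exactly as written above, (RS+RST+RH+3)*E+3, with the values quoted in this post; the function names are made up, and whether RS/RST/RH should be the programmed field values or the cycle counts is precisely the interpretation question raised here.

```c
#include <assert.h>

/* EMIF16 clock period in ns for a CPU clock in MHz and a clock divider
 * (integer arithmetic, so the result is truncated). */
unsigned emif16_cycle_ns(unsigned cpu_mhz, unsigned divider)
{
    assert(cpu_mhz > 0u);
    return 1000u * divider / cpu_mhz;
}

/* Read cycle time in ns using the expression quoted in the post. */
unsigned emif16_read_ns(unsigned rs, unsigned rst, unsigned rh, unsigned e_ns)
{
    assert(e_ns > 0u);
    return (rs + rst + rh + 3u) * e_ns + 3u;
}
```

With a 1 GHz CPU clock divided by 6 the period is 6 ns, and RS=2, RST=4, RH=1 gives the 63 ns stated above.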

1. With cache on and reading 512K (Bytes?) using DSP code, you measure 70 ticks average per each of the 256K 16-bit read operations. This represents 4K cache-line reads due to cache read misses, each followed by 63 cache read hits.

I measure a total of 18481160 ticks / 256K 16bits words = 70 ticks.

The cache line should be 128 bytes (L2 enabled), that is 64 16-bit words.

2. With cache off and reading 512K (Bytes?) using DSP code, you measure 250 ticks average per each of the 256K 16-bit read operations.

Yes, I measure 66060223 ticks (= 251 ticks for each 16 bit read)

3. With cache on and reading 2 Bytes, you measure 70 ticks * 64 reads (cache line) or 4480 ticks total.

One read (2 bytes) = 2071 ticks

8 reads (2 bytes each) = 2126 ticks

These two seem too low. Maybe I have to verify them.

These times are for a magnetic RAM. I have also collected data on a Flash and an FPGA (I have 3 devices connected to the EMIF). The results for 512K cached are coherent (considering the difference in the CS setup), but there is something strange in the single access with cache that I have to check.
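For reference, the per-read averages and the cache-line figures above can be reproduced with integer arithmetic. A small sketch with hypothetical helper names, assuming the 128-byte L2 line and the 512K-byte (256K-word) block quoted in this post:

```c
#include <assert.h>
#include <stdint.h>

#define L2_LINE_BYTES 128u

/* Number of words of `word_bytes` bytes that fit in one L2 cache line. */
uint32_t words_per_line(uint32_t word_bytes)
{
    assert(word_bytes > 0u);
    return L2_LINE_BYTES / word_bytes;
}

/* Number of cache-line fills needed to read `total_bytes` sequentially. */
uint32_t line_fills(uint32_t total_bytes)
{
    return total_bytes / L2_LINE_BYTES;
}

/* Average ticks per read, truncated to an integer. */
uint32_t ticks_per_read(uint64_t total_ticks, uint32_t reads)
{
    assert(reads > 0u);
    return (uint32_t)(total_ticks / reads);
}
```

With 262144 reads, 18481160 ticks gives 70 per read and 66060223 ticks gives 251 per read, matching the numbers reported above; 512K bytes corresponds to 4096 line fills of 64 words each.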

0 RandyP over 12 years ago in reply to Alberto Chessa

TI__Guru* 84110 points

Alberto,

Good data, and I look forward to hearing back on your re-check.

Alberto Chessa said:

With E=6ns, RS=2 (12ns), RST=4 (24ns), RH=1 (6ns) I expect (RS+RST+RH+3)*E+3 = 63 ns (= 63 TSCL ticks), turn around=3 (18ns)

My interpretation of the datasheet is that I do not have to add 1 to the clock count, but 0 is considered a 1.

Our documentation sometimes makes things very clear and sometimes it does not. The RS/RST/RH in your quote above and the datasheet's timing table are referring to the value programmed into the field of the AnCR registers. You supplied enough information to make it clear. I prefer the way Andrey commented his code, where a value of 1 in the RS field is commented to be 12ns which is the true width of the Read Setup time. The "+3" in the datasheet's equation tries to account for the "0 is considered a 1" issue. I appreciate that you supplied plenty of information to make it very clear what is correct. I suspect you have been doing this for a while; your experience shows.
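To make the "programmed value versus true width" point concrete, here is a host-side sketch that packs the fields at the bit positions used in Andrey's first post and converts a programmed field value to nanoseconds as (field + 1) * E. The struct and function names are hypothetical and are not CSL APIs; the field widths are inferred from the shift positions in that post.

```c
#include <assert.h>
#include <stdint.h>

/* Programmed (raw) field values for one async chip-select register. */
typedef struct {
    uint32_t select_strobe, ext_wait;
    uint32_t w_setup, w_strobe, w_hold;
    uint32_t r_setup, r_strobe, r_hold;
    uint32_t turnaround, asize;
} emif16_async_cfg;

/* Pack the fields using the shifts from the first post. */
uint32_t emif16_pack_ancr(const emif16_async_cfg *c)
{
    assert(c != 0);
    return ((c->select_strobe & 0x1u)  << 31)
         | ((c->ext_wait      & 0x1u)  << 30)
         | ((c->w_setup       & 0xFu)  << 26)
         | ((c->w_strobe      & 0x3Fu) << 20)
         | ((c->w_hold        & 0x7u)  << 17)
         | ((c->r_setup       & 0xFu)  << 13)
         | ((c->r_strobe      & 0x3Fu) << 7)
         | ((c->r_hold        & 0x7u)  << 4)
         | ((c->turnaround    & 0x3u)  << 2)
         |  (c->asize         & 0x3u);
}

/* True width in ns of a timing field: a programmed 0 means one cycle. */
unsigned emif16_field_ns(unsigned field, unsigned e_ns)
{
    return (field + 1u) * e_ns;
}
```

With E = 6 ns, a programmed read setup of 1 gives 12 ns and a programmed read strobe of 7 gives 48 ns, which agrees with the comments in Andrey's code.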

Alberto Chessa said:
These times are for a magnetic RAM. I have also collected data on a Flash and an FPGA (I have 3 devices connected to the EMIF). The results for 512K cached are coherent (considering the difference in the CS setup), but there is something strange in the single access with cache that I have to check.

Not that this is news to you, but other than when programming the Flash, it should be okay to have cache enabled for the RAM and Flash devices. With a 16MB region size per MAR register, it is usually very inconvenient to separate some parts of the FPGA's address space to be cacheable and some parts to be non-cacheable.

Andrey,

Even if you get the same number for your timing measurement with TSCL, please do report it for analysis. Please also include your DSP clock speed and your EMIF clock speed.

Regards,
RandyP

0 RandyP over 12 years ago in reply to RandyP

TI__Guru* 84110 points

Andrey, Alberto,

In some cases of use of the EMIF16, we have found that an unused internal feature can cause unintended delays between some EMIF16 accesses. This feature can be disabled by setting the msb of 0x20C00008 to 1. I recommend setting this in all cases for the C6678, and placing it near the top of your main() function.

*(volatile Uint32 *)0x20C00008 |= 0x80000000u; // Disable unused internal EMIF feature
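The same read-modify-write can be wrapped in a small helper that takes a volatile pointer, which also lets it be exercised on a host against an ordinary variable. This is only a sketch with a made-up function name; on the target the argument would be the address given above.

```c
#include <assert.h>
#include <stdint.h>

/* Set bit 31 of a 32-bit register while preserving the other bits. */
void set_register_msb(volatile uint32_t *reg)
{
    assert(reg != 0);
    *reg |= 0x80000000u;
}
```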

Regards,
RandyP

0 Alberto Chessa over 12 years ago in reply to RandyP

Mastermind 6670 points

RandyP said:

Andrey, Alberto,

In some cases of use of the EMIF16, we have found that an unused internal feature can cause unintended delays between some EMIF16 accesses. This feature can be disabled by setting the msb of 0x20C00008 to 1. I recommend setting this in all cases for the C6678, and placing it near the top of your main() function.

*(volatile Uint32 *)0x20C00008 |= 0x80000000u; // Disable unused internal EMIF feature

Hi,

I have no time right now to read all the other forum messages about the EMIF16, or to check and write up the results for all three CS I use, but I did a run with the undocumented setting you suggest. At first glance the single access time is improved, at least on the flash device. I also notice that the access times are now more or less the same on the real device (custom board) and on the EVM.

I also tried wider data accesses (32 and 64 bits): in this case the performance of a single access is improved.

Single 16bits FLASH access without recommended setting: 345 ticks, with recommended setting: 256 ticks

Single 32bits FLASH access with recommended setting: 386 ticks, that is 196 ticks per single 16bits access

256K 16bits FLASH access cached with recommended setting: 37208480 ticks, that is 141 ticks per single 16bits access

My FLASH CS setup is:

CORECLOCK = 50MHz -> PLL( DIV=0, MUL=39) = 1GHz PLLOUT, so SYSCLK7 = PLLOUT/6 = 166.66MHz

AWCCR = reset value (extended wait states not used)

0 Andrey Savinkov over 12 years ago in reply to RandyP

Intellectual 375 points

RandyP said:

Please explain how you checked the clock() function and determined that it is working correctly. I have seen other users who incorrectly used the clock() function, which is a wrapper for the HOSTclock() function, which is not going to give the results that you need.

I used the same method to calculate clocks for GPIO data reading:

#define GPIO_IN_DATA_REG_ADR 0x02320020

pData = (uint32_t*) GPIO_IN_DATA_REG_ADR;

volatile uint32_t GPIO_DATA = 0;

...

t_start = clock();
GPIO_DATA = *pData;
t_stop = clock();

...

I found that "GPIO_DATA = *pData" takes about 100 cycles. Then I drove a pattern like "1, 2, 3, 4, 5, ... 4095, 0, 1, 2, ..." onto the GPIO pins at different rates: 1.5 MB/sec, 3 MB/sec and faster. As a result I found that the estimate of 100 ns for the GPIO read time is close to reality. EMIF16 shows about 600 cycles in my case, which is 6 times slower. So I can probably use this method to estimate the read time for EMIF16, but now I'm not sure about it.
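One way such a check might look in code (a sketch only, not the exact test I ran; the function name is made up): capture a buffer of samples and count the places where a sample is not the previous one plus 1, modulo 4096. It assumes exactly one sample is captured per pattern step, so repeated values also count as breaks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count positions where the 12-bit counter pattern 0..4095 is broken. */
size_t count_sequence_breaks(const uint16_t *samples, size_t n)
{
    size_t i, breaks = 0;

    assert(samples != 0 || n == 0);
    for (i = 1; i < n; i++) {
        uint16_t expected = (uint16_t)((samples[i - 1] + 1u) & 0x0FFFu);
        if (samples[i] != expected)
            breaks++;
    }
    return breaks;
}
```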

RandyP said:

Please try the TSCL method that Alberto Chessa describes above, and report back the number of cycles from that method. That is a known accurate method.

I tried to use the TSCL method. There are good examples in the TI wiki: http://processors.wiki.ti.com/index.php/Porting_GPP_code_to_DSP_and_Codec_Engine

However, I got an error message when I tried to compile this code:


#include <c6x.h> 
...
TSCL = 0;

unsigned long t1, t2;

t1 = TSCL; EMIF_DATA = *pData; t2 = TSCL;

I got the error message "error: identifier "TSCL" is undefined". So strange! TSCL is declared in <c6x.h>. My board is a TMS320C6678. Maybe I have to use another method for the C6678? I also tried to count cycles like this:

TSC_enable();
...
unsigned long long t1, t2;

t1 = TSC_read();
EMIF_DATA = *pData;
t2 = TSC_read();

In this case I get an error message from the linker:

<Linking>

 undefined   first referenced
  symbol         in file
 ---------   ----------------
 TSC_enable  ./main.obj
 TSC_read    ./main.obj

error: unresolved symbols remain

Probably I have to link some module in the .cfg file. Could you give me some advice on the TSCL method? Maybe I have to add another #include to define TSCL? Which module should I add to my project to use TSC_read() and TSC_enable()?

0 Andrey Savinkov over 12 years ago in reply to Andrey Savinkov

Intellectual 375 points

TSCL is declared in <c6x.h> as extern:

extern __cregister volatile unsigned int TSCL;

So this is just a declaration of TSCL. TSCL is probably defined in another file or library... Does anyone know where?

0 RandyP over 12 years ago in reply to Andrey Savinkov

TI__Guru* 84110 points

Andrey,

Which version of the Code Generation Tools are you using?

Which device version are you building for?

In the CCS Debug perspective, find your project, then right-click Properties. Click on the General item in the left-hand pane. Please report the Compiler version (and possibly Effective compiler version) and also the Device->Family and ->Variant selections.

In your compiler output spew, there will be a command such as -mv6600.

Regards,
RandyP

0 Johannes over 12 years ago in reply to Andrey Savinkov

Mastermind 6240 points

Hi,

To use the two registers you only need c6x.h.

Another way of doing this is to use the CSL functions (note that in this case you have to add the CSL library to the project; you can find it in the csl directory in the pdk folder)

CSL_Uint64 time0, time1;
CSL_tscEnable();
time0 = CSL_tscRead();

// your code here

time1 = CSL_tscRead();

printf("Time = %lld\n", time1 - time0);

0 Andrey Savinkov over 12 years ago in reply to RandyP

Intellectual 375 points

RandyP said:

Which version of the Code Generation Tools are you using?

Which device version are you building for?

I use Code Generation Tools v7.2.1.

As for the device, I was building for C6000. Then I changed the device to C66xx and everything became OK. Now I can use the TSCL register. By the way, TSCL also shows ~600 cycles for reading one 16-bit word. I don't have the FPGA now; another team is working on it. I am studying how to read data from the FPGA using EMIF16 so that I am prepared when the FPGA is ready. Maybe reading from 0x70000000 is so slow because nothing is connected to the EVM board?

0 Andrey Savinkov over 12 years ago in reply to Johannes

Intellectual 375 points

Johannes said:

Another way of doing this is to use the CSL functions (note that in this case you have to add the CSL library to the project; you can find it in the csl directory in the pdk folder)

CSL_Uint64 time0, time1;
CSL_tscEnable();
time0 = CSL_tscRead();

// your code here

time1 = CSL_tscRead();

printf("Time = %lld\n", time1 - time0);

Could you tell me how to add the CSL library to my project? I'm just a beginner.

0 Johannes over 12 years ago in reply to Andrey Savinkov

Mastermind 6240 points

Right Click on project -> Properties -> CCS Build (or just Build, depending on if advanced settings are shown or not) -> C6000 Linker -> File Search Path

In "Include library file or..." add ti.csl.ae66, it's located in your pdk installation dir, something like: ti\pdk_C6678_1_0_0_21\packages\ti\csl\lib\ti.csl.ae66

.ae66 is for little endian (default) and .ae66e is for big endian

Regards

0 Andrey Savinkov over 12 years ago in reply to Johannes

Intellectual 375 points

Johannes said:

Right Click on project -> Properties -> CCS Build (or just Build, depending on if advanced settings are shown or not) -> C6000 Linker -> File Search Path

In "Include library file or..." add ti.csl.ae66, it's located in your pdk installation dir, something like: ti\pdk_C6678_1_0_0_21\packages\ti\csl\lib\ti.csl.ae66

Thanks ! It works in my project !

0 RandyP over 12 years ago in reply to Andrey Savinkov

TI__Guru* 84110 points

Andrey,

Does this complete your question for now?

Regards,
RandyP

0 Johannes over 12 years ago in reply to RandyP

Mastermind 6240 points

Randy,

Before closing the thread, I have another question based on one of Andrey's earlier questions. He asked whether the times he gets could be because nothing is connected to the EVM. My question is: is there any (electrical) problem with leaving outputs (such as EMIFA and EMIFD) floating, with nothing connected, while they are being toggled?

Thanks

0 RandyP over 12 years ago in reply to Johannes

TI__Guru* 84110 points

Johannes,

Any normal CMOS output can be left unconnected with no problem. That just means it is not loaded.

Any normal CMOS input must not be left floating because of potential noise created if it happens to be settled around the switching region.

Many of our inputs have internal pull-ups or pull-downs on them. If these are active, then that relieves the problem with floating inputs.

SERDES signals may need to be handled differently, and the datasheet will generally address those.

Always check the datasheet and errata and relevant User Guides and Hardware Design Guide for specific requirements for any of our devices.

The EMIFWAIT0 and EMIFWAIT1 signals could cause problems if they are not at the right level during EMIF operations, but that can also be controlled by software.

Regards,
RandyP

0 Andrey Savinkov over 12 years ago in reply to RandyP

Intellectual 375 points

RandyP said:

Does this complete your question for now?

Before closing the thread I'd like to ask two more questions about EMIF.

1. What influences the access time? I defined the EMIF16 settings as shown in my first post and turned the cache off:

#include <ti/csl/csl_cacheAux.h>

CACHE_disableCaching(112); // MAR = 112

As far as I understand the TI wiki and RandyP's posts, I have to disable caching. So now I have about 600 cycles for a 2-byte word read. Is there any recommendation for how to decrease the access time?

2. If nothing is connected to the EVM, can this influence the EMIF16 access time?
