DMA/IDMA restrictions

Chris Thomas

I am looking into ways to accelerate some algorithms I have to deal with. One of the issues is that the data is processed from arrays but in a non-contiguous manner, so I am unable to gain from mem8-type commands.

Now I am wondering whether it would be quicker to rearrange the data before processing, write the results in an easier format, and then shuffle the result after I am done.

The DMA controller looks like it can do all kinds of trick addressing and will do what I need. But supposing I wanted to do this from L1Data to L1Data, is this allowed? There is an internal DMA controller that seems to handle internal stuff; is that because the EDMA cannot? The IDMA controller cannot do CFG to CFG or SRAM to SRAM; are there similar rules for the EDMA (SDMA)?

Also, I am a little short on L1 RAM. I see the EMAC controller has 8K I could use; can I use that with EDMA/IDMA? Is it L1 speed, and do I need to power up the EMAC to use it?

I realize I could steal a bit of L2 cache to be SRAM, but it seems to have a nasty advisory in the errata in just this area (DM6433 1.3.11), so I am staying clear of that.

Chris

over 12 years ago

0 RandyP over 12 years ago

TI__Guru* 84110 points

Chris,

There is a DM64x forum, but for now this is probably the best place to leave this thread. Someone else may decide it should be moved there, so do not be surprised.

The DM6433 has extra L1D SRAM that is very useful. Are you using all of that, in addition to the 32K L1D cache?

I did not realize the IDMA could not transfer L1D to L1D. It can at least transfer L1D to L2 SRAM, but it would not be done in a non-contiguous manner like the EDMA3 can do.

The EMAC RAM could be used, but it would not be as fast as the L1D SRAM. You might be able to access it using the CFG path of the IDMA but I have not tested that myself. You can try it and let us know what you see. Yes, the EMAC would have to be powered and clocked, if it is not that way by default.

The EDMA can access L1D SRAM and is a great tool for doing reordering for many applications. You will need to take care for how that fits with your system and the Advisory on the SDMA/IDMA stalls. But it is certainly the easiest way to do this without a lot of DSP overhead. Using the EDMA3 for L1D to L1D transfers should keep you away from the SDMA/IDMA Advisory's stall situation, from my understanding of that Advisory. Avoiding L2 for that is good, if you would have other things that might be sensitive to that Advisory.
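The reordering mentioned above can be pictured with a small host-side model in plain C. This is only a sketch of a 2D (A/B-indexed) copy with independent source and destination strides; `model_2d_copy` and `gather_channel` are hypothetical names, not CSL or driver calls, and the TDM layout (one byte per channel per frame) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of one 2D transfer.
 * acnt:     bytes in each contiguous array
 * bcnt:     number of arrays to move
 * src_bidx: byte stride between arrays on the source side
 * dst_bidx: byte stride between arrays on the destination side */
static void model_2d_copy(uint8_t *dst, const uint8_t *src,
                          size_t acnt, size_t bcnt,
                          ptrdiff_t src_bidx, ptrdiff_t dst_bidx)
{
    for (size_t b = 0; b < bcnt; ++b) {
        memcpy(dst + (ptrdiff_t)b * dst_bidx,
               src + (ptrdiff_t)b * src_bidx,
               acnt);
    }
}

/* Gather one channel out of an interleaved buffer laid out as
 * [frame0: ch0 ch1 ... chN-1][frame1: ch0 ch1 ...], one byte per sample.
 * The result is a contiguous run of that channel's samples. */
static void gather_channel(uint8_t *out, const uint8_t *tdm,
                           size_t channel, size_t num_channels,
                           size_t num_frames)
{
    assert(channel < num_channels);
    model_2d_copy(out, tdm + channel,
                  1, num_frames,
                  (ptrdiff_t)num_channels, 1);
}
```

On the real device the same four numbers would go into a transfer descriptor instead of a loop, so the DSP does not spend cycles on the shuffle.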

Regards,
RandyP

0 Chris Thomas over 12 years ago in reply to RandyP

Expert 1310 points

Hi Randy,

Sorry about the wrong forum, not sure I have the power to move it.

My application is for telecoms. I have to move some data from a hardware interface that runs at a fixed speed (via McBSP as TDM) to a custom network stack which is just soft real time but has lots of small packets. Overall I have to keep several channels of data in order and meet latency targets for each.

I do this by DMA from McBSP to L1, process, and DMA from L1 to EMIF (FPGA). This is fine if I am just passing the data on, but in some circumstances I need to do on-the-fly compression on a per-channel basis. It is a custom compression scheme that relies on knowledge of previous data, so I have to do some buffering.

The total data buffered is around 256KB, so I would hope to be mainly in L2 cache. It does not feel that way to me; the performance is not where I hoped it would be. Turning off the caches slows things right down, so I am sure I have turned them on.

I am using every byte of L1 data already. I was thinking I could evict something to the EMAC memory area to make room for a scratch buffer where I could DMA in a channel's worth of data at a time and then DMA it back into the big buffer in DDR.

If the EMAC RAM is about the same as L2 SRAM that would be fine. I just thought it would be easier to ask than to try to benchmark it in the middle of a complex system (that and my dev system is dead right now!).

Also, is it cacheable?

Chris

0 RandyP over 12 years ago in reply to Chris Thomas

TI__Guru* 84110 points

Chris,

Either you figured out a way to move the thread or someone did it for us.

Comparing EMAC RAM to L2 SRAM will require you to benchmark it in your system. The answer may depend on how you are using it. But in general I would say it is much faster to use L2 SRAM than EMAC RAM for the DSP to access, but it might be a wash for EDMA3.

The best performance aid that I can recommend is to be very aware of how your DDR is being accessed. The best performance is when you use EDMA3 and access contiguous bytes so it can do bursts. The DSP's cache controller also does bursts, but it is less processing overhead on the DSP to use the EDMA3 plus the EDMA3 is usually faster.

I do not believe the EMAC RAM is cacheable. It is in config space, so it is also going to be VERY slow for the DSP to access it. But you can use IDMA0 to do transfers but not channel sorting.

Regards,
RandyP

0 Chris Thomas over 12 years ago in reply to RandyP

Expert 1310 points

Well, some good news: I have just squeaked over the line for my performance goal by packing and trimming the data structures, which confirms my problems are related to visiting the main memory.
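As a rough illustration of what packing a structure can buy, the sketch below reorders the fields of a made-up per-packet record so that the compiler inserts less padding. The field names and both structs are hypothetical and are not taken from the application discussed here; exact sizes depend on the compiler's alignment rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with fields in "as they came to mind" order.
 * With 4-byte alignment for uint32_t the compiler has to pad after
 * flag and after channel. */
typedef struct {
    uint8_t  flag;
    uint32_t timestamp;
    uint8_t  channel;
    uint16_t length;
} LooseRecord;

/* Same fields, largest first, so no padding is needed between them. */
typedef struct {
    uint32_t timestamp;
    uint16_t length;
    uint8_t  channel;
    uint8_t  flag;
} PackedRecord;

/* Bytes saved over n records by using the reordered layout. */
static size_t bytes_saved(size_t n)
{
    return n * (sizeof(LooseRecord) - sizeof(PackedRecord));
}
```

Smaller records mean more of them fit in each cache line, which is where the gain comes from when the working set is larger than the cache.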

Just as well, because the refactoring to use the EMAC RAM for 8 KB of RAM would have been hard to justify.

However, I had another thought. I have a DM6433 not because I need the video stuff; it just happens to be cheaper (so I am told) than the same part without. But it does mean I have some video hardware lying around, which means frame buffers.

Looking at the docs, it looks like the shared buffer logic (SBL) is a big block of RAM, but as I read it I can only use it via some dedicated, specialized DMA accesses. So it would be hard to use; would that be a fair assessment?

Supposing I could change chip but had to keep to the same 376BGA layout, are there any alternatives beyond the DM6431/3/5/7? The chip selector pages are a bit hit and miss about offering the pin-out. Basically: same chip, more cache?

Chris

0 RandyP over 12 years ago in reply to Chris Thomas

TI__Guru* 84110 points

Chris,

The C6424 is based on the same architecture and uses a 376-ball ZDU package. I do not know if it is pin-compatible with the DM6431/3/5/7 but you can compare if it would be a possible fit for you.

I am pretty sure that the DM6437 is the superset of the features of all of these devices, so it will have the maximum of anything in this package layout.

Regards,
RandyP

0 Chris Thomas over 12 years ago in reply to RandyP

Expert 1310 points

Hi Randy,

That one is the version with the video hardware disabled, which seems to cost more?

I was wondering if there is one of these with an ARM core alongside, and whether that would be any use to me?

Chris

0 Chris Thomas over 12 years ago in reply to Chris Thomas

Expert 1310 points

I had a play with the EMAC RAM anyway but got nowhere, with code like this:

    CSL_IdmaRegsOvly idma1 = (CSL_IdmaRegsOvly)CSL_IDMA_0_REGS;

    idma1->IDMA1_SOURCE = (uint32_t)dblk1;
    idma1->IDMA1_DEST   = (uint32_t)CSL_EMAC_DSC_BASE_ADDR;    // 0x01C8 2000 - 0x01C8 3FFF EMAC Control Module Descriptor Memory
    idma1->IDMA1_COUNT  = 10;

    // this line makes no difference in or out - while (idma1->IDMA1_STAT) {;}

    idma1->IDMA1_SOURCE = (uint32_t)CSL_EMAC_DSC_BASE_ADDR;
    idma1->IDMA1_DEST   = (uint32_t)dblk2;
    idma1->IDMA1_COUNT  = 10;

    while (idma1->IDMA1_STAT) {;}

When I now compare dblk1 and dblk2 they are different. I tried using RAM in DDR and in L1 SRAM, and I tried with and without powering up the EMAC and/or MDIO.

If I look at CSL_EMAC_DSC_BASE_ADDR in the RAM window it is junk. Can it see config space?

What am I missing?

Chris

0 RandyP over 12 years ago in reply to Chris Thomas

TI__Guru* 84110 points

Chris,

Your labels are mixing IDMA0 and IDMA1. Only IDMA0 can reach the Config bus, and the other address must be from L1D SRAM or L2 SRAM.

The EMAC RAM space is visible from the Memory Window. You can test it by trying to write to a location in that window (double-click on a location), and test it before and after you do any power, clock, and reset setup for it, as a sanity test.

Regards,
RandyP

0 Chris Thomas over 12 years ago in reply to RandyP

Expert 1310 points

Hi Randy,

I tried that first and got nowhere. I thought I had messed up the mask, so I just moved to idma1. I have just changed back to idma0 and tried again, again to no avail. I can read and write the block freely even without powering up the EMAC.

The current mechanism is:

McBSP <--> EDMA <--> L1Data <--> memcpy (amem8 version) <--> DDR buffer.

I am thinking I can gain a bit more easy DMA destination/source RAM from the EMAC buffer, so I can push more buffers into L1Data and relieve the load on the cache further. So before I refactor greatly, can this EMAC ram interoperate with EDMA?

I may as well skip the IDMA and just copy it straight to DDR.

McBSP <--> EDMA <--> EMAC <--> memcpy (32 bit version) <--> DDR buffer.

or even possibly

McBSP <--> EDMA <--> EMAC (8K buffer) <--> EDMA <--> L1Data (1K buffer) <--> memcpy (32 bit version) <--> DDR buffer.

Maybe this is just madness. This RAM is CPU/6; I could just make some DDR non-cached and get the same sort of speed.

Chris

(I realize that I can DMA to DDR, but then I have to do all the cache consistency and, worse, I have to pad the data structures, which decreases performance by using more cache.)
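The padding cost mentioned here comes from giving every DMA buffer whole cache lines, so that a writeback or invalidate on one buffer cannot touch its neighbours. A minimal sketch of the size arithmetic follows, assuming a 128-byte line (the figure usually quoted for C64x+ L2; check the device documentation). `CACHE_LINE_BYTES` and both function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 128-byte cache line. Must be a power of two for the
 * mask arithmetic below to be valid. */
#define CACHE_LINE_BYTES 128u

/* Round a buffer size up to a whole number of cache lines. */
static size_t round_up_to_line(size_t nbytes)
{
    return (nbytes + CACHE_LINE_BYTES - 1u) & ~(size_t)(CACHE_LINE_BYTES - 1u);
}

/* Bytes of cache that hold padding rather than data for one buffer. */
static size_t padding_overhead(size_t nbytes)
{
    return round_up_to_line(nbytes) - nbytes;
}
```

For many small buffers the overhead adds up quickly, which matches the concern about padded structures occupying more cache.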

0 RandyP over 12 years ago in reply to Chris Thomas

TI__Guru* 84110 points

Chris Thomas said:
can this EMAC ram interoperate with EDMA?

Yes, the EDMA can access the EMAC RAM.

RandyP

0 RandyP over 12 years ago in reply to RandyP

TI__Guru* 84110 points

Chris,

IDMA0 works for accessing the EMAC RAM. You have to clear the ->IDMA0_MASK register to enable all the words to copy, and each count is 32 words.
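A host-side model may make the window and mask behaviour easier to picture. In this sketch a window is 32 words and a cleared mask bit lets the corresponding word through, as described above. The mapping of mask bit i to word i of each window is an assumption, the function name is hypothetical, and the way the COUNT register value translates into a number of windows is left to the reference guide rather than modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_WINDOW 32u

/* Hypothetical simulation of a masked, windowed word copy.
 * mask:    bit i set   -> word i of every window is skipped
 *          bit i clear -> word i of every window is copied
 * windows: number of 32-word windows to process
 * Returns the number of words actually copied. */
static size_t model_idma0_copy(uint32_t *dst, const uint32_t *src,
                               uint32_t mask, size_t windows)
{
    size_t copied = 0;
    for (size_t w = 0; w < windows; ++w) {
        for (unsigned i = 0; i < WORDS_PER_WINDOW; ++i) {
            if ((mask >> i) & 1u) {
                continue;               /* masked word: leave dst untouched */
            }
            size_t idx = w * WORDS_PER_WINDOW + i;
            dst[idx] = src[idx];
            ++copied;
        }
    }
    return copied;
}
```

With a mask of zero every word in every window is copied, which is the plain block-copy case.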

Sorry about my confusion on the label names, but the actual problem there was that you should use IDMA0 and not IDMA1 for the transfer.

Regards,
RandyP

0 Chris Thomas over 12 years ago in reply to RandyP

Expert 1310 points

Hi Randy,

I have got it worked out now. I did not realize that L1Data appears twice in the memory map, and both source and destination need to be 32-byte aligned. I also suspect the examples in spru871k have the mask inverted.

Once I did all that I could see the data changing as expected.

In my application the data flowing through was corrupted, I suspect because the slow RAM was being overrun.

Chris.

0 Chris Thomas over 12 years ago in reply to Chris Thomas

Expert 1310 points

One more question. Using the CFG RAM was too slow for me, but the IDMA controller does seem a useful fire-and-forget memcpy or memset.

I assume I can memset in L1 with no issues.

Can I memcpy within L1? The docs say I will not get full bandwidth if I use the same dst and src port; does that really mean don't attempt it?

Chris

0 RandyP over 12 years ago in reply to Chris Thomas

TI__Guru* 84110 points

Chris,

IDMA is a sadly underused feature of the C64x+ and later DSP cores. It is very useful, but I think the biggest concern people have is how to allocate this single resource. But that just means always testing whether there is a pending transfer.

You can memset in L1 with no issues.

Copying from/to the same port or memory module simply takes more time. "to obtain full throughput" seems like a true but misleading statement to me, since if you want to memcpy within the same memory module you will still do well to use IDMA1.

By the way, I was surprised to read in the MegaModule Reference Guide that IDMA1 source and destination addresses could include CFG. This is not correct, and I will try to report it to whoever maintains the document. IDMA1 can only access the L1P, L1D, and L2 memories.

Regards,
RandyP

0 Chris Thomas over 12 years ago in reply to RandyP

Expert 1310 points

Perhaps a rename...

IDMA0 and IDMA1 sounds like you have 2 internal DMA engines, what you really have is more like CFGDMA and LXDMA.

Chris

0 Chris Thomas over 12 years ago in reply to Chris Thomas

Expert 1310 points

One more slight query.

I worried about something else using the IDMA at the same time as me, so I did a scan of the TI source. I thought odds were I would be fighting for it with the EDMA package, but I could find no mention in the sources.

Is this a part of the plug in mechanism (like semaphores) that I need to plug into EDMA to improve its efficiency, or does it use a more direct approach to avoid dependencies between libraries?

(But I cannot rule out Windows 7's strange search function.)

Chris

0 RandyP over 12 years ago in reply to Chris Thomas

TI__Guru* 84110 points

Chris,

Chris Thomas said:
IDMA0 and IDMA1 sounds like you have 2 internal DMA engines, what you really have is more like CFGDMA and LXDMA.

IDMA0 and IDMA1 are two "orthogonal" channels of the internal DMA engine, designed to run concurrently. I have not tried running them simultaneously but if you run a benchmark, show us your results and code. I like your names better than ours since I always have to go to the documentation to figure out which is which.

Chris Thomas said:

... something else using the IDMA at the same time as me ... fighting for it with the EDMA package ...

Is this a part of the plug in mechanism (like semaphores) that I need to plug into EDMA ... or a more direct approach to avoid dependencies between libraries?

There is no designed-in protection mechanism for the IDMAn resource. This may be one reason why the CSL does not use IDMA: to avoid designed-in conflicts. Like you, I have not found any instances of our libraries using IDMA, but there is no direct relationship between IDMA and EDMA, so your questions are confusing to me.

For protection, you can set up and use a hardware semaphore to "protect" this resource, but that is only protected by software convention and not by any direct method: all users of IDMA must agree to and follow the use of the same semaphore.

If you find no other uses of IDMA, then you only need to protect yourself from your own code. But the safest thing would be to poll the ACTV bit before starting to write to the IDMA registers, then disable interrupts just before writing to the IDMA registers, and then restore interrupts just after writing to the COUNT register. Interrupts are stalled only for a short time, so it should not hurt anything in the system.
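The sequence described above can be sketched on a host with a mock register block. Everything named here is illustrative: `FakeIdmaRegs`, `FAKE_STAT_ACTV` (assumed to be bit 0 of the status register) and the `fake_*_interrupts` helpers are hypothetical stand-ins for the CSL register overlay and for whatever interrupt save/restore primitives the toolchain provides. The second check of ACTV after interrupts are disabled is an extra precaution that is not part of the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block; a real build would use the CSL overlay. */
typedef struct {
    volatile uint32_t stat;
    volatile uint32_t source;
    volatile uint32_t dest;
    volatile uint32_t count;
} FakeIdmaRegs;

#define FAKE_STAT_ACTV 0x1u   /* assumption: ACTV is bit 0 of STAT */

/* Stand-ins for the platform's interrupt save/restore primitives. */
static unsigned g_irq_enabled = 1;

static unsigned fake_disable_interrupts(void)
{
    unsigned old = g_irq_enabled;
    g_irq_enabled = 0;
    return old;
}

static void fake_restore_interrupts(unsigned old)
{
    g_irq_enabled = old;
}

/* Wait for the channel to go idle, then program it with interrupts off.
 * COUNT is written last, as in the description above. */
static void idma_start_guarded(FakeIdmaRegs *r,
                               uint32_t src, uint32_t dst, uint32_t count)
{
    unsigned key;

    for (;;) {
        while (r->stat & FAKE_STAT_ACTV) {
            /* poll until no transfer is active */
        }
        key = fake_disable_interrupts();
        if (!(r->stat & FAKE_STAT_ACTV)) {
            break;                      /* still idle: safe to program */
        }
        fake_restore_interrupts(key);   /* an ISR got in first: wait again */
    }

    r->source = src;
    r->dest   = dst;
    r->count  = count;

    fake_restore_interrupts(key);
}
```

If every user of the channel goes through one such helper, the software convention is enforced in a single place.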

Regards,
RandyP
