
Codec Speed and L1 Memory Question



I've written a codec that inputs a video frame, does some processing, and outputs a video frame.  This I/O goes to regular memory, I believe, and not the fast (L1?) memory that I haven't studied yet.  Nevertheless, it's fast enough.  DM6467T, Arago Linux.

The processing involves referencing a third memory buffer that contains a lookup table.  In theory I could use 24 bits to address a table containing 16,777,216 bytes.  However, the codec runs quite slowly in that case, presumably because of the size of the table and accesses to slow memory.

When I back off my specs a little and use 18 bits to address the table instead, the table is only 262,144 bytes and my codec runs very fast.  Adding just a single bit, doubling the table to 524,288 bytes, causes a big slowdown.  So there's some threshold I'm crossing.

I'm not intentionally using L1D.  And L1D is only 32K anyway, which is significantly smaller than even my faster 262,144 byte table.

Any clues what's going on here?  Is Linux or something else doing some caching for me?  My speed is slightly video-frame-data dependent, so I may only be visiting a portion of the table at a time, and might visit more of it with the bigger table.  But someone would have to be managing associative caching for me, because I'm not.

Where is the magic elf?

I'm getting ready to need a bigger table anyway, so I need to understand the elf!

Thanks,

Helmut

 

  • Any comments would be sincerely appreciated...

  • I'll put in my 2 cents. You're wandering into an area I've never had to deal with; I've never had to use such relatively large amounts of memory. I am guessing that you may be running into paging and swap files/partitions. It's rare that embedded systems have swap space. Googling around turned up the command "swapon -s" to show what swap space you are using.
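    A couple of quick checks along those lines (standard Linux commands; whether Arago ships all of them, I'm not sure):

```shell
# Show active swap areas (empty output means no swap configured)
swapon -s
cat /proc/swaps

# Memory totals; "Cached" is the kernel's page cache of file data
grep -E 'MemTotal|^Cached|SwapTotal' /proc/meminfo
```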

  • Interesting.  Must be a difference in what we think is large!  I'm processing NTSC 720x480 YUV422SP.  I think that multiplies out to a 675 KiB frame buffer.  (YUV422SP has a net 2 bytes per pixel, right?  A 4-byte YYUV group is 2 pixels.)  I guess that seems "large" to me, but it's par or small for video (much smaller than HD).  That's also bigger than the DM6467T L1D memory size, of course, which is only 32K.  32K seems small to me, however, in this context.  However, I recently sliced my code into 17-pixel rows, overlapped 2 pixels.  17x480 pixels at 2 bytes each is 15.9 KiB per slice.  Seems like TWO slices will fit in 32 KiB.

    Just barely enough!

    Meanwhile, the CANNY_TI_do1DDma() chunk size is 0xFFFC, or 63.9 KiB.  That chunk is much larger than my slice, or even than L1D in general.  Therefore, CANNY_TI_do1DDma() can have its loop REMOVED altogether (along with an assertion that the size is small enough).  Thus, splitting CANNY_TI_do1DDma() into MEMCPYstart() and MEMCPYwait() should be trivial.  Also, with TWO slices fitting in L1D memory, I can start a transfer into one, then wait for and process the earlier transfer into the other, and vice versa.

    It all appears to work out perfectly.  (Dunno why chunk size is 0xFFFC, but I don't need it even that large, so I won't worry about it.)

    BTW, thanks for interacting with me on this and thus helping me think it through.  Such is often more valuable than simply being given the answer!

    -Helmut
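The ping-pong scheme described above can be sketched as follows.  MEMCPYstart() and MEMCPYwait() are the names proposed in this thread for the split of CANNY_TI_do1DDma(); here they are stubbed with a plain synchronous memcpy so the sketch runs anywhere, whereas the real versions would kick off and await an EDMA transfer into L1D.  The slice overlap is ignored for simplicity.

```c
#include <string.h>

#define SLICE_BYTES (17 * 480 * 2)   /* one 17-row slice, ~15.9 KiB */

/* Stand-ins for the proposed split of CANNY_TI_do1DDma(): a real
 * MEMCPYstart() would start an asynchronous DMA transfer and return
 * immediately; MEMCPYwait() would block until it completes. */
static void MEMCPYstart(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);             /* synchronous stub */
}
static void MEMCPYwait(void) { /* nothing to wait for in the stub */ }

static void process_slice(unsigned char *slice, size_t n)
{
    for (size_t i = 0; i < n; i++)
        slice[i] ^= 0xFF;            /* placeholder for the real kernel */
}

/* Ping-pong over the frame: start the transfer of slice k+1, then
 * process slice k while (on real hardware) the DMA runs behind it.
 * Two SLICE_BYTES buffers total 31.9 KiB, just inside 32 KiB of L1D. */
void process_frame(const unsigned char *frame, unsigned char *out, int nslices)
{
    static unsigned char ping[SLICE_BYTES], pong[SLICE_BYTES];
    unsigned char *bufs[2] = { ping, pong };

    MEMCPYstart(bufs[0], frame, SLICE_BYTES);
    for (int k = 0; k < nslices; k++) {
        MEMCPYwait();                          /* slice k has arrived */
        if (k + 1 < nslices)                   /* prefetch slice k+1 */
            MEMCPYstart(bufs[(k + 1) & 1],
                        frame + (size_t)(k + 1) * SLICE_BYTES, SLICE_BYTES);
        process_slice(bufs[k & 1], SLICE_BYTES);
        memcpy(out + (size_t)k * SLICE_BYTES, bufs[k & 1], SLICE_BYTES);
    }
}
```

With the stubs replaced by real start/wait calls, the processing of slice k and the transfer of slice k+1 overlap in time, which is the whole point of the split.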

  • More 2 cents. Well...large as in the old-school embedded systems where QVGA RGB565 was about as large as it got. I guess frame sizes of 675 KiB are common now. I assume that the 32 KB of L1D memory is cache. That actually sounds large to me. Cache is an odd game changer; it might make unrolling a loop slower than retaining a tight loop.

    I would guess that the CANNY_TI_do1DDma() chunk size is 0xFFFC because of 16 bits of addressing minus a 4-byte DMA descriptor. And that chunk size is probably a maximum; you don't have to use the maximum.
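If the 0xFFFC ceiling ever matters again (say, for a transfer bigger than 63.9 KiB), splitting it into maximal chunks is straightforward.  A sketch, assuming a hypothetical dma_copy_chunk() limited to 0xFFFC bytes per call and stubbed here with memcpy:

```c
#include <string.h>

#define DMA_MAX_CHUNK 0xFFFCu   /* per-chunk limit seen in CANNY_TI_do1DDma() */

/* Hypothetical single-chunk DMA copy, stubbed with memcpy; a real one
 * would program one transfer of at most DMA_MAX_CHUNK bytes. */
static void dma_copy_chunk(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Copy an arbitrary-sized buffer using maximal chunks. */
void dma_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n > 0) {
        size_t chunk = n < DMA_MAX_CHUNK ? n : DMA_MAX_CHUNK;
        dma_copy_chunk(d, s, chunk);
        d += chunk;
        s += chunk;
        n -= chunk;
    }
}
```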