The difference between library file .lib and .ae66?

tianxing hou

Other Parts Discussed in Thread: MATHLIB

Hello,

I would like to know the difference between *.lib and *.ae66 files. Could you explain it?

Also, how do I use *.ae66 files in a CCS v5 project?

I added tsu_a.ae66 and tsu_c.ae66 to the project, but when I build the project I get the following errors:

I have added the source and header files to the project as follows:

Thank you.

over 13 years ago

0 Chad Courtney over 13 years ago

TI__Mastermind 30825 points

They are both extensions for library files. In general, .ae66 is used for ELF-format libraries and .lib for COFF-format libraries, so that the two can be told apart.

Best Regards,

Chad

0 Hongmei Gou over 13 years ago in reply to Chad Courtney

TI__Expert 4335 points

Hi tianxing,

It looks like you are using the TSU component from MCSDK Video 2.0. If that is true, please define tsuContext (and its entries) in your application to address the last linking error.

The first two linking errors can be due to project settings. If it is possible, please provide the complete compilation log or the CCS project so that we can take a look.

Thanks,

Hongmei

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Thank you, I have resolved the question.

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

I used the TSU component from MCSDK Video 2.0. However, I found that it takes too much time.

Could you tell me the expected performance of the TSU component?

Thank you.

0 Hongmei Gou over 13 years ago in reply to tianxing hou

TI__Expert 4335 points

Hi tianxing,

The number of cycles consumed by TSU depends on input/output resolutions, as well as memory/cache configuration for the application.

Below are the cycle counts, in millions of cycles per frame, from our benchmarking.

1080p to 720p	720p to 1080p	1080p to D1	D1 to 1080p	720p to D1	D1 to 720p
14M cycles	18M cycles	8M cycles	16M cycles	5M cycles	7M cycles

These numbers are obtained with:

L1D cache: 32K

L1P cache: 32K

L2 cache: 64K

DDR: cache enabled; pre-fetch enabled.

TSU scratch: placed in local L2

Program: placed in MSMC

We are also optimizing the cycle performance of TSU. The optimized TSU will be packaged in the next MCSDK Video release.
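
To put the per-frame cycle counts above in perspective, they can be converted into a frame rate. The sketch below is only an illustration: the 1 GHz core clock is an assumption (the thread does not state the clock used for benchmarking), and the function and macro names are hypothetical.

```c
#include <stdint.h>

/* Assumed core clock in Hz. This value is NOT stated in the thread;
 * replace it with the clock of your own device setup. */
#define ASSUMED_CORE_CLOCK_HZ 1000000000ULL

/* Frames per second one core could sustain if every frame costs
 * cycles_per_frame cycles and the core does nothing else. */
double tsu_frames_per_second(uint64_t cycles_per_frame, uint64_t core_clock_hz)
{
    if (cycles_per_frame == 0) {
        return 0.0; /* avoid division by zero */
    }
    return (double)core_clock_hz / (double)cycles_per_frame;
}
```

Under that clock assumption, the 14M cycles reported for 1080p to 720p would correspond to roughly 71 frames per second.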

Thanks,

Hongmei

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Thanks, Hongmei.

I see that the TSU has two interpolation algorithms, one based on bicubic interpolation and the other on a polyphase filter. What is the difference between them in performance?

I have tried them without any memory/cache configuration, and the results are not satisfactory. Could you provide an example for us?

In the TSU component directory, I cannot find a datasheet with benchmarking data or further information.

I have some other questions about TSU in the forum thread below. Could you give me some advice?

http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/198056.aspx

Thank you for your help again.

Tianxing

0 Hongmei Gou over 13 years ago in reply to tianxing hou

TI__Expert 4335 points

Hi Tianxing,

Attached below please find our optimized TSU (CPU copy replaced with EDMA transfers) along with the TSU unit test application. Please unzip it into <mcsdk_video_2_0_0_10_install_dir>\components\ti\mas\tsu and try it out.

1830.tsu.zip

The unit test application is a CCSv5 project, which is located at tsu\test\ccsProject. Several notes:

1) Current TSU supports resolution up to 1920x1088

2) The unit test uses EDMA, as "USE_EDMA" is pre-defined for the project, so performance will be better than the memcpy-based numbers we posted earlier

3) Configuration of unit test is defined in tsu\test\testVecs\config\testVecs.cfg. Please modify it according to your application:

Line 1)Input YUV 4:2:0 clip name

Line 2)Output YUV 4:2:0 clip name

Line 3)Input Image Width

Line 4)Input Image Height

Line 5)Output Image Width

Line 6)Output Image Height

Line 7)TSU Algorithm (TSU_POLYPHASE or TSU_BICUBIC)

Line 8)READ_INPUT_FROM_FILE or READ_INPUT_FROM_DDR2

4) For file IO, use READ_INPUT_FROM_FILE in Line 8). Also place your yuv input at tsu\test\testVecs\input. The output will be generated at tsu\test\testVecs\output.

5) To avoid slow data IO, please use READ_INPUT_FROM_DDR2 in Line 8), and pre-load the input to 0x85000000 (#define DDR_READ_ADDR 0x85000000 as in tsu\test\src\main.c). In this mode, the number of frames is set to 10 in the code (tsuTask()). Please change that if needed.

6) Cycles taken by each frame will be printed out in CCS console. They are recorded in global cycleArray[] also. Please use this for performance evaluation.
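
The eight configuration lines listed in note 3 could be held in a small structure like the one below. This is only a sketch: the struct, its field names, and the parser are hypothetical, and it assumes each line of testVecs.cfg holds exactly one bare value or keyword, which the thread does not confirm.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of the eight configuration lines. */
typedef struct {
    const char *inputClip;   /* Line 1: input YUV 4:2:0 clip name  */
    const char *outputClip;  /* Line 2: output YUV 4:2:0 clip name */
    int inWidth;             /* Line 3 */
    int inHeight;            /* Line 4 */
    int outWidth;            /* Line 5 */
    int outHeight;           /* Line 6 */
    int usePolyphase;        /* Line 7: 1 = TSU_POLYPHASE, 0 = TSU_BICUBIC */
    int readFromDdr;         /* Line 8: 1 = READ_INPUT_FROM_DDR2, 0 = READ_INPUT_FROM_FILE */
} TestVecsConfig;

/* Fills cfg from eight text lines (assumed format: one value per line).
 * Returns 0 on success, -1 on an unknown keyword or a non-positive size. */
int parse_testvecs(const char *lines[8], TestVecsConfig *cfg)
{
    cfg->inputClip  = lines[0];
    cfg->outputClip = lines[1];
    cfg->inWidth    = atoi(lines[2]);
    cfg->inHeight   = atoi(lines[3]);
    cfg->outWidth   = atoi(lines[4]);
    cfg->outHeight  = atoi(lines[5]);

    if (strcmp(lines[6], "TSU_POLYPHASE") == 0) {
        cfg->usePolyphase = 1;
    } else if (strcmp(lines[6], "TSU_BICUBIC") == 0) {
        cfg->usePolyphase = 0;
    } else {
        return -1;
    }

    if (strcmp(lines[7], "READ_INPUT_FROM_DDR2") == 0) {
        cfg->readFromDdr = 1;
    } else if (strcmp(lines[7], "READ_INPUT_FROM_FILE") == 0) {
        cfg->readFromDdr = 0;
    } else {
        return -1;
    }

    if (cfg->inWidth <= 0 || cfg->inHeight <= 0 ||
        cfg->outWidth <= 0 || cfg->outHeight <= 0) {
        return -1;
    }
    return 0;
}
```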

Thanks,

Hongmei

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

Thank you for your reply.

I have built the project and can run it successfully with an image resolution of 176*144. If my image resolution is 1920*1088, what should the value of SIU_TSU_SCRATCH_SIZE be? I don't know the relationship between the scratch size and the image resolution.

When the image resolution is 1920*1088, I cannot read the file successfully; the program hangs at line 339 of main.c. My file is attached below:

4760.yuv420_1080p.rar

In addition, I have some questions about using EDMA3. You used the ECPY APIs; where can I find documentation for them? What should I do if I want to use EDMA3?

Thanks,

Tianxing.

0 Hongmei Gou over 13 years ago in reply to tianxing hou

TI__Expert 4335 points

Hi Tianxing,

Glad to know that you can build and run the TSU unit test application.

The TSU unit test provides two ways for data IO: 1) fread and fwrite; 2) read input data from pre-loaded DDR (starting from 0x85000000) and save output data in DDR (starting from 0x88000000). For testing HD, e.g., 1920x1088 as you tried, please use method 2 to avoid slow fread and fwrite, as follows:

1) Use READ_INPUT_FROM_DDR2 in Line 8) of tsu\test\testVecs\config\testVecs.cfg

2) Pre-load input YUV to 0x85000000 through "Memory Browser"

3) Run .out file

4) Save output YUV from 0x88000000 to PC through "Memory Browser"

The program is not getting stuck when running with 1920x1088. Instead, it is reading the input, which can take ~18 minutes for a single 1920x1088 frame when using XDS560v2. With an XDS100 USB emulator, it will take even longer.

SIU_TSU_SCRATCH_SIZE in the unit test has the same value as TSU_SCRATCH_SIZE in tsu\src\tsuinit.c. Currently this scratch size is defined to support up to 1920x1088. The cross check on the scratch size is in tsu\src\polyphase\tsuPolyphaseScaling.c, lines 374-376.

As for your question about EDMA, the optimized TSU uses the ECPY/RMAN/IRES modules from framework components to achieve EDMA-based data transfers. Underneath, it still uses the EDMA3 peripheral on C6678. Hope this clarifies. For details of ECPY/RMAN/IRES, please refer to the framework components page at http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/fc/index.html.

Thanks,

Hongmei

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

Following your instructions, I ran the project successfully. I tried scaling 1920*1088 to 720*480, and it consumes 6784480 cycles on average. Thank you very much for your help.

I have a question about the TSU. It currently supports resolutions only up to 1080p, but my image resolution is 2432*2048. What should I do to resize the image to other resolutions, for example 1080p, D1, and so on?

I have modified the code as follows:

#define GG_TSU_BLOCK_SIZE 24064 --> #define GG_TSU_BLOCK_SIZE 35840
#define IN_OUT_SIZE 3133440 --> #define IN_OUT_SIZE 7471104

I don't know how to modify SIU_TSU_SCRATCH_SIZE. I tried changing its value, but the program did not run successfully. Does the scratch size have an upper limit? Is that limit related to the scratch being placed in local L2?
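
Both IN_OUT_SIZE values above equal the size of one YUV 4:2:0 frame at the corresponding resolution (width x height x 3/2 bytes). That IN_OUT_SIZE is meant to hold exactly one frame is inferred from the numbers, not stated in the thread, and the helper name below is hypothetical. A minimal sketch of the arithmetic:

```c
#include <stddef.h>

/* Bytes in one YUV 4:2:0 frame: a full-resolution Y plane plus U and V
 * planes subsampled by two in each direction (assumes even dimensions). */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    return width * height * 3 / 2;
}
```

For 1920 x 1088 this gives 3133440, and for 2432 x 2048 it gives 7471104, matching the old and new IN_OUT_SIZE values.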

The YUV data of the image is attached below:

3326.yuv.rar

I also have some further questions.

1. What is the role of tsuContext? Is it used only by the polyphase filter algorithm?

2. Why does the program need EDMA3 or memcpy? Is it used only by the polyphase filter algorithm?

Thank you again for your guidance over these days; it has been very helpful for us.

0 Hongmei Gou over 13 years ago in reply to tianxing hou

TI__Expert 4335 points

Hi Tianxing,

To use TSU for resolutions higher than 1080p, we need to make changes in TSU source code and then recompile TSU libs. The following defines in tsu\src\bicubic\tsuCubic.h need to be increased for higher resolutions:

#define MAX_SIZE_X 1920
#define MAX_SIZE_Y 1088

Steps to recompile the TSU libs:

1) In a command window, go to dsp\mkrel, and then run "setupenvMsys.bat bypass" (as for sv01 or sv04 described in http://processors.wiki.ti.com/index.php/MCSDK_VIDEO_2.0_Getting_Started_Guide#Set_up_environment_variables)

2) Go to the TSU directory: bash-3.1$ cd ../../components/ti/mas/tsu

3) Run xdc command to rebuild: bash-3.1$ xdc XDCARGS="c66le_elf src"

Your changes for GG_TSU_BLOCK_SIZE and IN_OUT_SIZE are good. For SIU_TSU_SCRATCH_SIZE, you can start with a big number, say "#define SIU_TSU_SCRATCH_SIZE 177824". Because there is a cross check in tsu\src\polyphase\tsuPolyphaseScaling.c, an exception will be reported if this size is still not large enough. If no exception is reported, the scratch size actually needed can be found by recording the maximal value of (store_index + prev_pos) as used in the cross check. You can use a global variable to record this maximal usage in tsuPolyphaseScaling.c, recompile the TSU lib and unit test, and then read its value in the watch window after transizing is completed.

/* Cross check on size of the TSU scratch */
if( (store_index + prev_pos) > tsuContext.scratchSize) {
    tsu_exception(instId, TSU_EXC_UNEXPECTED_ERROR);
}

With the above changes, I tried your 2432*2048 YUV input, and it was transized to 720p successfully.
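
The suggestion to record the maximal value of (store_index + prev_pos) in a global variable could be sketched as follows. The variable and function names are hypothetical, and the unsigned 32-bit types are an assumption, since the thread does not show how store_index and prev_pos are declared.

```c
#include <stdint.h>

/* Hypothetical global: highest (store_index + prev_pos) seen so far.
 * Inspect it in the CCS watch window once transizing has completed. */
volatile uint32_t g_tsuMaxScratchUsed = 0;

/* Call this next to the existing cross check, with the same two values.
 * The parameter types are assumed, not taken from the TSU source. */
void tsu_record_scratch_usage(uint32_t store_index, uint32_t prev_pos)
{
    uint32_t used = store_index + prev_pos;
    if (used > g_tsuMaxScratchUsed) {
        g_tsuMaxScratchUsed = used;
    }
}
```

After a full run, the recorded maximum is the smallest SIU_TSU_SCRATCH_SIZE that would have passed the cross check for that input and output resolution.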

Thanks,

Hongmei

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

Thank you for your help. I ran the project successfully and implemented 2432*2048 --> 1920*1088 scaling; it consumes 3185471 cycles on average. Thank you very much.

I also have some questions about the program.

1. What do tsuContext.alloc, tsuContext.free, tsuContext.availCoef, and tsuContext.coeffHandle mean? I don't understand how to use the tsuContext struct. I found tsuContext.dataCopy and tsuContext.dataWait used in tsuPolyphaseDat.h, where they replace DAT_copy and DAT_wait in tsuPolyphaseScaling.c. However, I cannot find coeffHandle, availCoeff, alloc, or free used anywhere in the TSU code; they are only initialized in main.c.

2. There are some differences between your code and the earlier code. For example, you modified the tsuContext_t struct and added DataCopy and DataWait to it. What is the purpose of that?

3. I tried setting SIU_TSU_SCRATCH_SIZE to a very large value, for example 2432*2048, but I get errors when I build the project, as follows:

#define SIU_TSU_SCRATCH_SIZE 77824 --> #define SIU_TSU_SCRATCH_SIZE (2432*2048)

4. Is the TSU compliant with the XDAIS standard? If I place the scratch in L2SRAM, can other XDAIS algorithms place their scratch in L2SRAM too?

5. What are the GMP and GMC modules?

6. When I use the cubic algorithm, it consumes 1798293871 cycles. Are further modifications needed to use the cubic algorithm?

0 Hongmei Gou over 13 years ago in reply to tianxing hou

TI__Expert 4335 points

Hi Tianxing,

Glad that we can help.

tsuContext allows the test application (instead of the TSU lib) to control such items as buffer assignment, memory allocation/free, how data copying is done, etc. This enables a more generic TSU. The structure tsuContext_t is defined in tsu.h. The content of tsuContext is supplied by the test application (e.g., main.c), including function pointers and the base addresses and sizes of buffers. The test application also implements the related functions and allocates the related buffers. Internally, TSU just uses the function pointers and buffers supplied by the test application. You can search for "tsuContext." inside the TSU lib to find out how the tsuContext entries are used.

For example, as you pointed out, dataCopy and dataWait are newly added as two entries of tsuContext. This allows application to choose how to do data copying and how to wait until data copying is completed. In test application (main.c), tsuContext.dataCopy is pointing to function siutsu_data_xfer, which implements data copying and application can choose either "EDMA" or "memcpy" for it. For tsuContext.dataWait (siutsu_data_wait()), no actions are needed for memcpy, while wait is needed to complete the data copying with EDMA before the output data is used and/or input data is modified.
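
The function-pointer pattern described above can be illustrated with a simplified stand-in. Everything below is hypothetical: the real tsuContext_t in tsu.h has more entries, and the actual signatures of dataCopy and dataWait are not shown in this thread. Only the memcpy variant is sketched, where the wait is a no-op because the copy has finished by the time memcpy returns.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the two copy-related entries.
 * The real tsuContext_t differs; see tsu.h. */
typedef struct {
    void (*dataCopy)(void *dst, const void *src, size_t nbytes);
    void (*dataWait)(void);
} copyContext_t;

/* Application side, memcpy variant: the data is in place on return. */
static void app_copy_memcpy(void *dst, const void *src, size_t nbytes)
{
    memcpy(dst, src, nbytes);
}

/* Application side: nothing to wait for after memcpy. An EDMA variant
 * would block here until the transfer has completed. */
static void app_wait_none(void)
{
}

/* Application side: fill in the context with the chosen implementation. */
copyContext_t make_memcpy_context(void)
{
    copyContext_t ctx = { app_copy_memcpy, app_wait_none };
    return ctx;
}

/* Library side: only the supplied pointers are used, so the library does
 * not need to know whether EDMA or memcpy is behind them. */
void library_move_block(const copyContext_t *ctx, void *dst,
                        const void *src, size_t nbytes)
{
    ctx->dataCopy(dst, src, nbytes);
    ctx->dataWait(); /* ensure the data has arrived before it is used */
}
```

Keeping the choice in the application means the library can be rebuilt once and reused with either transfer method.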

As for how to set SIU_TSU_SCRATCH_SIZE, please refer to our earlier post on 07/03 to find the maximal usage. There is no need to over-allocate. As the scratch buffer is allocated from local L2 (tsu\test\ccsProject\linker_c6678.cmd), a very large size which exceeds local L2 will result in linking error as you reported.
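
A rough feasibility check for a candidate scratch size is sketched below. The L2 figures are assumptions (512 KB of local L2 per core, with 64 KB of it configured as cache as in the benchmark setup earlier in the thread); they are not taken from linker_c6678.cmd, and the helper name is hypothetical.

```c
#include <stddef.h>

/* Assumed figures, not read from the project's linker command file. */
#define ASSUMED_L2_TOTAL_BYTES (512u * 1024u)
#define ASSUMED_L2_CACHE_BYTES (64u * 1024u)

/* Returns 1 if a scratch buffer of scratch_bytes fits in the assumed L2
 * SRAM alongside other_l2_bytes of other L2-resident data, else 0. */
int scratch_fits_in_l2(size_t scratch_bytes, size_t other_l2_bytes)
{
    size_t sram = ASSUMED_L2_TOTAL_BYTES - ASSUMED_L2_CACHE_BYTES;
    if (other_l2_bytes > sram) {
        return 0;
    }
    return scratch_bytes <= sram - other_l2_bytes;
}
```

Under these assumptions a scratch of 177824 bytes fits, while 2432*2048 bytes (4980736) does not, which is consistent with the linking error reported above.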

TSU is not compliant with XDAIS standard. If it is ensured that TSU and other XDAIS based algorithms will not access the scratch at the same time, you can use the same scratch. If not, you can allocate another scratch from local L2 for other XDAIS based algorithms to use, as long as it can fit in local L2.

GMP and GMC are global memory pool and global memory cell. Implementation details can be found from tsu\test\src\siuVigdkGmp.c and siuVigdkGmp.h.

For cycles with bicubic interpolation, is "1798293871 cycles" collected from your application or the TSU unit test we recently provided? If it's the former, please recollect with TSU unit test. Cache settings can largely affect the cycle performance.

Thanks,

Hongmei

0 Vivek Chengalvala over 13 years ago in reply to Hongmei Gou

TI__Expert 3715 points

Tianxing,

Bicubic interpolation is not hand-optimized for DSP, whereas the polyphase filter is optimized with scheduled assembly. Please use polyphase for your application.

Regards,

Vivek

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

Thank you for your reply, it is so useful for us.

Regards,

Tianxing

0 tianxing hou over 13 years ago in reply to Vivek Chengalvala

Expert 2435 points

Hi, Vivek.

Thank you for your reply.

Regards,

Tianxing

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

I have some questions about the code.

First, the functions siu_new_tsu() and siu_open_tsu() take an argument chnum (main.c, lines 178 and 224). What does it mean? I think it could be used to run several resize operations: for example, chnum = 1 could perform 1080p --> 720p, and chnum = 2 could perform 1080p --> D1.

Second, what should I do if I want to create *.lib used the TSU.

Thank you.

Tianxing.

0 Hongmei Gou over 13 years ago in reply to tianxing hou

TI__Expert 4335 points

Hi Tianxing,

The current TSU unit test supports one instance of TSU. In order to make it support multiple instances as you described, some code changes are needed. chnum when calling siu_new_tsu() and siu_open_tsu() is channel number. As you said, we can have multiple channels by specifying chnum 1, 2, 3, ...

In addition to chnum related changes, more changes are needed, such as:

1) define GG_TSU_NUM_CHANNELS N

2) Have multiple instances in tsuTask. Currently there is only one siuInst_t *inst = &siuInst[0];

3) Supply different TSU configuration for different TSU instances as needed;

4) Have multiple input/output buffers, one for each instance; If the same input is expected for all the instances, a single input buffer can be used;

5) With multiple channels, memory footprint will be increased. So, memory map changes may be needed in linker_c6678.cmd
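
As a rough illustration of steps 1, 3, and 4, per-channel configuration and output buffer sizing might look like the sketch below. The struct, the array, and the helper are hypothetical and do not reflect the actual siuInst_t layout, which is not shown in this thread; the value 2 for GG_TSU_NUM_CHANNELS is just an example.

```c
#include <stddef.h>

#define GG_TSU_NUM_CHANNELS 2   /* "N" in step 1; 2 is only an example */

/* Hypothetical per-channel settings. */
typedef struct {
    int inWidth, inHeight;
    int outWidth, outHeight;
} channelConfig_t;

/* One configuration per channel: the same 1080p input resized two ways,
 * so a single shared input buffer would be enough (step 4). */
static const channelConfig_t channelConfig[GG_TSU_NUM_CHANNELS] = {
    { 1920, 1088, 1280, 720 },  /* 1080p -> 720p */
    { 1920, 1088,  720, 480 },  /* 1080p -> D1   */
};

/* Total bytes needed if every channel gets its own YUV 4:2:0 output
 * buffer; useful when revisiting the memory map (step 5). */
size_t total_output_bytes(void)
{
    size_t total = 0;
    for (int ch = 0; ch < GG_TSU_NUM_CHANNELS; ch++) {
        total += (size_t)channelConfig[ch].outWidth *
                 (size_t)channelConfig[ch].outHeight * 3 / 2;
    }
    return total;
}
```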

The above gives the basic idea. Please give it a try.

For your second question: " what should I do if I want to create *.lib used the TSU", can you please provide more details?

Thanks,

Hongmei

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

Is my understanding below correct?

If I have more than one TSU instance, I can modify GG_TSU_NUM_CHANNELS and allocate data memory for the other instances. I do not need to modify the scratch size, because all instances share the scratch. Do they share the coeff memory too?

As for my second question, I meant: what should I do if I want to get the COFF library? The suffix of the COFF library is *.a66, but in CCS v3.3 I add *.lib files to the project as libraries. Can I use *.ae66 and *.a66 files in a CCS v3.3 project?

0 Hongmei Gou over 13 years ago in reply to tianxing hou

TI__Expert 4335 points

Hi Tianxing,

Yes, we only need one TSU scratch which can be shared by all the instances. The coeff memory is from GMP, and its multi-instance allocation will be taken care of when GG_TSU_NUM_CHANNELS is increased correspondingly.

Currently TSU component is not providing C66 COFF libraries. It would be straightforward to import a CCS3.3 project to CCSv5. Is it possible for you to do the importing?

Thanks,

Hongmei

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

Thank you for your reply. For now I don't want to port the TSU to a CCS3.3 project. I only want to know whether the TSU can run on predecessors of the C6678, for example the DM648. Could you tell me about that in detail? Thank you.

I have another question about libraries for the C6678. I see that installing the MCSDK also installs MATHLIB, DSPLIB, and IMGLIB. Are there other libraries that support the C6678, for example VLIB?

Thanks,

Tianxing

0 Hongmei Gou over 13 years ago in reply to tianxing hou

TI__Expert 4335 points

Hi Tianxing,

Besides C6678, we have used this TSU component for C6472. The TSU lib (with c64Ple as the target) can work with DM648, but the TSU unit test needs modifications to address device differences in memory map, EDMA resources (if you choose not to use memcpy), etc. We didn't run TSU on DM648. You can give it a try.

As for your second question, what does VLIB stand for? What specific library support are you looking for?

Thanks,

Hongmei

0 tianxing hou over 13 years ago in reply to Hongmei Gou

Expert 2435 points

Hi Hongmei,

VLIB is a video analytics and vision library; you can find it at http://processors.wiki.ti.com/index.php?title=Software_libraries. I have requested it from your colleagues.

Thanks,

Tianxing
