What are the basic CFG File and Modules Needed for NDK Stack and OpenMP projects?

I don't understand the CFG files and required modules well. What are the required components to build a project to do the following?

1) Create an NDK stack to send TCP/IP packets over Ethernet

2) Run a multi-core FFT on data in DDR3 using OpenMP

The processing task is relatively simple. Using an array stored in DDR3, each core is given its slice of the original data, runs an FFT on it, and writes the output to another buffer in DDR3.
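As a rough illustration of the slicing described above, here is a minimal sketch of how a per-core range of lines could be computed. The `LineSlice` type and the `sliceForThread` name are hypothetical and are not taken from the project. The sketch also assumes that, when the line count does not divide evenly, the leftover lines go to the lowest-numbered threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not from the original project): the block of
 * lines assigned to one thread. */
typedef struct {
    size_t firstLine; /* index of the first line owned by this thread */
    size_t lineCount; /* number of lines owned by this thread */
} LineSlice;

/* Split numberOfLines across nthreads as evenly as possible. The first
 * (numberOfLines % nthreads) threads each take one extra line, so no line
 * is dropped when the division is not exact. */
LineSlice sliceForThread(size_t numberOfLines, size_t nthreads, size_t tid)
{
    size_t base;
    size_t extra;
    LineSlice s;

    assert(nthreads > 0 && tid < nthreads);

    base  = numberOfLines / nthreads;
    extra = numberOfLines % nthreads;

    s.lineCount = base + (tid < extra ? 1 : 0);
    s.firstLine = tid * base + (tid < extra ? tid : extra);
    return s;
}
```

With 800 lines and 8 threads this reduces to the plain `tid * 100` offset, so the remainder handling only matters for line counts that are not a multiple of the thread count.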

I have gotten the processing to work correctly (data-wise) based on the MCSDK image processing demo, but it runs much slower than my stripped-down processing project (based on the OpenMP HelloWorld project). My original benchmark for an N=2048 FFT was ~30,000 cycles; in the image_processing project it jumps to ~130,000 cycles. I suspect the cause is a difference in cache or memory settings.

Thanks,

Ryan

  • Ryan,

    I can work with you on the FFT part.

    Could you explain more on what you are trying to do?  Are you trying to use multiple cores to compute 2048 FFT on C6678?  How many cores did you use to have the 30,000 cycles?  Is this for one single 2048 FFT?  What kind of scheme did you use to use multiple cores to compute a single 2048 FFT?

    Xiaohui

  • Hi Xiaohui,

    "Could you explain more on what you are trying to do?  Are you trying to use multiple cores to compute 2048 FFT on C6678?" Yes. Multiple cores run FFTs simultaneously, but they are not all computing the same FFT.

    "How many cores did you use to have the 30,000 cycles?" I used 8 cores. Each core ran its own FFT and completed it in 30,000 cycles.

    "Is this for one single 2048 FFT?" Yes, this is the time to calculate one single FFT. See my attached code (it does not compile; it is a stripped-down version of what my code does). I use the fft_forwardR2C1_16b_16b_32i() function.

    "What kind of scheme did you use to use multiple cores to compute a single 2048 FFT?" I used a scheme similar to the one outlined in SPRABB7. This is not 8 cores splitting the work of a single FFT; it is 8 cores each computing one FFT simultaneously.


    Thanks for your help!

    Ryan


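    /*
     * dsp.c
     */

    #include <stdint.h>  /* least-width and fixed-width integer types */
    #include <omp.h>     /* omp_set_num_threads, omp_get_thread_num */

    #define NTHREADS            8
    #define NUMBER_OF_LINES     800
    #define FFT_MAX_L           2048

    #pragma DATA_SECTION( inputData, "ddr3" )
    int_least16_t inputData[ FFT_MAX_L * NUMBER_OF_LINES ]; // stored in DDR3
    Fft_16b_16b_32i_Params_t param_s;
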

    /*
     * processFrame
     *
     * This is called from main.c within main().
     */
    void processFrame( void )
    {
        uint8_t  tid;
        uint8_t  nthreads = NTHREADS;
        uint32_t numberOfLines = NUMBER_OF_LINES;

        omp_set_num_threads( NTHREADS );

        // generate twiddle factors, etc. for the FFT; these are stored in the global param_s struct
        fftParametersInit();

        // fork into the OpenMP parallel region
        #pragma omp parallel default( none ) firstprivate( param_s ) shared( inputData, nthreads, numberOfLines ) private( tid )
        {
            tid = omp_get_thread_num();
            coreProcess( &inputData[ tid * ( numberOfLines / nthreads ) * FFT_MAX_L ], &param_s, tid );
        }
    }

     

    /*
     * coreProcess
     *
     * This function runs simultaneously on each core. NUMBER_OF_LINES is divided by NTHREADS to split inputData into batches.
     * With NUMBER_OF_LINES = 800 and NTHREADS = 8, batchSize = 100, so each core performs the FFT in a loop of
     * 100 iterations.
     *
     * The FFT takes ~30,000 cycles on each core, regardless of whether NTHREADS = 1 or NTHREADS = 8; this is the time
     * spent in the FFT on an individual core.
     */
    void coreProcess( int_least16_t *coreInputData, Fft_16b_16b_32i_Params_t *fftParams, uint8_t coreId )
    {
        int_least16_t  windowOutputData[ FFT_MAX_L ];
        cplx_least16_t fftOutputData[ FFT_MAX_L ];
        int32_t        batchSize = NUMBER_OF_LINES / NTHREADS;
        int32_t        i;

        // fill in fft parameters to working buffers
        populateFftParameters();

        for( i = 0; i < batchSize; i++ )
        {
            // Window data before FFT
            window( &coreInputData[ i * FFT_MAX_L ], windowOutputData );

            // FFT -- time to complete this scope should be 30,000 cycles
            fft_forwardR2C1_16b_16b_32i( windowOutputData, fftParams, fftOutputData );
        }
    }
  • Ryan,

    Let me try to understand this more clearly.  Both the 30,000-cycle and the 130,000-cycle results come from multi-core implementations.  The 30,000-cycle result is based on OpenMP.  The 130,000-cycle result is your implementation merged into the image processing demo from MCSDK.  Is this understanding correct?  Could you please forward the map files from both projects?

    Thanks,

    Xiaohui

  • Hi Xiaohui,

    OpenMP forks the code across 8 cores, and each core runs a separate 2048-point FFT at the same time. Each core's 2048-point FFT takes 130,000 cycles, when I know it should take 30,000 cycles. 30,000 cycles is the benchmark figure for the DSPLIB / STK-MED library, and I have previously confirmed it with all 8 cores running simultaneously.
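For anyone who wants to reproduce per-call numbers like these, here is a minimal sketch of a timing harness. It is not the code used for the figures above. `CycleCounterFn`, `WorkFn`, and `measureCycles` are hypothetical names, and the counter is passed in as a function pointer; on the C66x it would typically wrap a read of the TSCL time-stamp register. `fakeCounter` and `fakeWork` are host-side stand-ins so the harness can be tried off-target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hooks: a free-running cycle counter and the work to time. */
typedef uint32_t (*CycleCounterFn)(void);
typedef void (*WorkFn)(void *context);

/* Return the cycles spent in one call to work(), with the cost of the
 * counter read itself subtracted out. */
uint32_t measureCycles(CycleCounterFn readCycles, WorkFn work, void *context)
{
    uint32_t t0, t1, overhead, elapsed;

    assert(readCycles != 0 && work != 0);

    /* Two back-to-back reads give the measurement overhead. */
    t0 = readCycles();
    t1 = readCycles();
    overhead = t1 - t0;

    t0 = readCycles();
    work(context);
    t1 = readCycles();

    /* Unsigned subtraction stays correct across one counter wrap-around. */
    elapsed = t1 - t0;
    return elapsed > overhead ? elapsed - overhead : 0;
}

/* Host-side stand-ins (assumptions for illustration only): each counter
 * read costs 5 simulated cycles, and the work costs *context cycles. */
static uint32_t fakeNow = 0;

uint32_t fakeCounter(void)
{
    fakeNow += 5;
    return fakeNow;
}

void fakeWork(void *context)
{
    fakeNow += *(uint32_t *)context;
}
```

Timing a single call inside the loop, rather than the whole batch, keeps OpenMP fork and join costs out of the number, which makes the two projects easier to compare.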

    Thanks,

    Ryan

  • Hi Ryan,

    Do you have the exact configuration, such as cache and memory settings, for the two experiments: the one where you achieved 30,000 cycles and the one where you achieved 130,000 cycles?

    OpenMP adds some overhead, but not enough to explain the jump from 30,000 to 130,000 cycles.
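To make the cache question concrete, the buffer sizes in the posted code can be worked out as below. This is a back-of-the-envelope sketch, not a measurement. It assumes 2-byte real samples and 4-byte complex samples (the layout of `cplx_least16_t` is not shown in the thread), and it leaves out twiddle tables and other stack data. The function names are hypothetical, and the cache capacity is supplied by the caller rather than read from any project configuration.

```c
#include <assert.h>
#include <stddef.h>

#define FFT_MAX_L          2048u
#define NUMBER_OF_LINES    800u

/* Assumed element sizes; see the note above. */
#define REAL_SAMPLE_BYTES  2u
#define CPLX_SAMPLE_BYTES  4u

/* Bytes touched by one FFT iteration: the input line, the windowed copy,
 * and the complex output buffer. */
size_t perFftWorkingSetBytes(void)
{
    size_t inputLine = (size_t)FFT_MAX_L * REAL_SAMPLE_BYTES;
    size_t windowed  = (size_t)FFT_MAX_L * REAL_SAMPLE_BYTES;
    size_t output    = (size_t)FFT_MAX_L * CPLX_SAMPLE_BYTES;

    return inputLine + windowed + output;
}

/* Bytes in the whole DDR3 input array across all lines. */
size_t totalInputBytes(void)
{
    return (size_t)NUMBER_OF_LINES * FFT_MAX_L * REAL_SAMPLE_BYTES;
}

/* Non-zero when a buffer of the given size fits in the given capacity. */
int fitsIn(size_t bytes, size_t capacityBytes)
{
    return bytes <= capacityBytes;
}
```

Under these assumptions one FFT touches 16 KB, which would fit in a 32 KB L1D cache, while the full 800-line input is about 3.1 MiB and cannot stay resident in a 512 KB L2. Each line therefore has to come from DDR3 at least once, so how that DDR3 region is cached in each project is worth comparing.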

    Regards,

    Xiaohui