How to Increase Speech Recognition Accuracy Using TIesr API on C5535

Hi,

I am working on a speech recognition project on eZdsp C5535 right now. This project would use 1 trigger word and several (4~7) short commands, such as 'recording', 'complete'.

I started from the reference design project introduced here: Speech Recognition Reference Design on the C5535 eZdsp

Initially the accuracy was not satisfactory, so I made the following changes:

1. Increased sampling rate from 8kHz to 16kHz. 

In the "TIesr_C55_demo/inc/audio_data_collection.h" file, I changed the following:

#define SAMP_RATE                 ( SAMP_RATE_8KHZ )                    --->       ( SAMP_RATE_16KHZ )
#define NUM_SAMP_PER_MS ( SAMPS_PER_MSEC_8KHZ )        --->       ( SAMPS_PER_MSEC_16KHZ )  

In the "TIesr_C55_demo/C55/TIesrEngine/src/winlen.h" file, I changed the following:

#define FRAME_LEN 480                 --->         320
#define SAM_FREQ 24000               --->         16000

In "TIesr_C55_demo/C55/TIesrEngine/src/mfcc_f.h":

if (WINDOW_LEN == 512 && SAM_FREQ == 24000 )     --->    if (WINDOW_LEN == 512 && SAM_FREQ == 16000 )

2. Increased the Rx circular buffer size from the default value of 10 to 20 in the "TIesr_C55_demo/inc/audio_data_collection.h" file.

#define RX_CIRCBUF_NUM_FRAMES       ( 20 )  // 20 frames in Rx circular buffer (default was 10)

3. Increased the codec gain configuration.
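As a quick arithmetic check on the values listed above: 480 samples at 24000 Hz and 320 samples at 16000 Hz both correspond to a 20 ms frame, so the winlen.h edit keeps the frame duration unchanged. The sketch below is a host-side check only. The helper names are hypothetical and not part of the TIesr sources, and the buffer calculation assumes that each Rx circular-buffer frame holds FRAME_LEN samples, which the post does not confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a TIesr function): duration of one frame of
 * frame_len samples at sam_freq Hz, in microseconds. */
static uint32_t frame_duration_us(uint32_t frame_len, uint32_t sam_freq)
{
    return (uint32_t)(((uint64_t)frame_len * 1000000u) / sam_freq);
}

/* Hypothetical helper: audio time covered by a circular buffer of
 * num_frames frames, in milliseconds. ASSUMPTION: every buffer frame
 * holds exactly frame_len samples. */
static uint32_t circbuf_span_ms(uint32_t num_frames, uint32_t frame_len,
                                uint32_t sam_freq)
{
    return (uint32_t)(((uint64_t)num_frames * frame_len * 1000u) / sam_freq);
}

/* Checks the numbers quoted in the post against each other. */
static void check_reported_settings(void)
{
    assert(frame_duration_us(480, 24000) == 20000); /* original winlen.h  */
    assert(frame_duration_us(320, 16000) == 20000); /* modified winlen.h  */
    assert(circbuf_span_ms(10, 320, 16000) == 200); /* default buffer     */
    assert(circbuf_span_ms(20, 320, 16000) == 400); /* enlarged buffer    */
}
```

Under that assumption, doubling the buffer from 10 to 20 frames doubles the audio it can hold from 200 ms to 400 ms, at the cost of twice the sample memory.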

Although the accuracy improved after these modifications, I am not sure whether I made the changes in the right places. Could you help me verify that these changes are correct?

Right now, I am still working on increasing the recognition accuracy. 

1. I noticed that if I build fewer words/phrases into one grammar model file, the accuracy increases. I guess that if I use multiple grammar model files to store these 5~8 words/phrases, the overall accuracy would increase.

If this assumption is correct, may I ask how to switch grammar models at run time? Or is it better to create multiple TIesr Engine instances, each loaded with a different grammar model?

2. Does TI provide other reference designs for techniques such as echo cancellation and adaptive noise cancellation, so that TIesr can handle noisy environments?

3. Are there any other techniques to improve the recognition accuracy?

Thank you

Da

 

 

  • Da,

    We will need to take a look at this and get back to you on (1).

    For (2),  please see the C55x AER package at http://software-dl.ti.com/libs/aer/latest/index_FDS.html 

    There isn't an example that combines TIesr with AER at this time.

    Lali

  • Hi Lali,

    Could you please let me know if you have any information on the accuracy?
    I tried the speech recognition demo on the eZdsp C5535, but it hardly ever recognized the phrase "TI voice trigger".

    Best regards,
    j-breeze
  • j-breeze,

    Unfortunately, I do not have more details on improving accuracy. The techniques you have already tried are the best knobs to turn. Please note that acoustically rich keywords are the best for response reliability. These are words that have multiple syllables and unique sounds (e.g., "start recording" or "capture complete").

    I'm unaware of a way to switch grammar models during run-time.

    Audio gain certainly has an impact. Operation is not reliable at very high or very low gains, so you have to experiment to find a sweet spot. Enabling filters or integrating noise-cancelling algorithms will also help improve accuracy.

    Also note that TIesr does not work well when trigger phrases are spoken with a heavy accent. I don't know of a way around that.
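One way to look for that gain sweet spot is to measure the captured audio directly rather than judging by recognition results alone. The following is a minimal sketch, assuming 16-bit signed PCM frames. The names level_stats and measure_level and the clip_level threshold are hypothetical and not part of the reference design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-frame level summary (not a TIesr type). */
typedef struct {
    int32_t peak;     /* largest absolute sample value in the frame */
    size_t  clipped;  /* number of samples at or above clip_level   */
} level_stats;

/* Scans n 16-bit PCM samples and reports peak level and clip count. */
static level_stats measure_level(const int16_t *pcm, size_t n,
                                 int32_t clip_level)
{
    level_stats s = { 0, 0 };
    for (size_t i = 0; i < n; ++i) {
        int32_t v = pcm[i];             /* widen so -32768 negates safely */
        int32_t a = (v < 0) ? -v : v;
        if (a > s.peak) {
            s.peak = a;
        }
        if (a >= clip_level) {
            s.clipped++;
        }
    }
    return s;
}
```

A peak that stays small during speech suggests the gain is too low, while a nonzero clip count suggests it is too high. The thresholds that work in practice would have to be found by experiment.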

    Lali

  • The TIesr recognizer consists of two main parts: TIesr_Flex, which creates word pronunciation networks and model sets based on an input grammar, and TIesr_SI, which is the recognizer itself. The accuracy of TIesr depends on the acoustic models that TIesr_Flex uses. These acoustic models must be trained for the language and accents that will be encountered in an application. TIesr comes with acoustic models for general American English only. New models may need to be trained.


    You must be careful if you wish to change sampling rates. This requires that you also specify the proper frequency parameters of the front-end. 

    To operate well in a keyword-spotting capacity, consider creating a unique Filler model that covers the general sounds of the language and serves as an optional alternative explanation of the input speech in the recognition grammar.

    To obtain good recognition of single words, consider creating acoustic models that are word-based rather than the default TIesr phonetic-based models. The documentation and examples that come with the TIesr open source project explain how this can be done.

    TIesr includes the capability to change from one grammar to another at run time by switching between different grammar and model set files. I am not sure whether this was included in the C55 design. TIesr can also dynamically create grammars from input grammar text strings if TIesr_Flex is included in the executable, but that requires memory resources to store the acoustic models, dictionary and decision trees as well as the code.
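The idea of switching grammar and model set files could be organized as a two-stage loop: listen for the trigger word with one grammar, then reload with a small command grammar. The sketch below only illustrates that control flow. The recognizer_ops interface, its open and close callbacks, and the stub functions are all hypothetical stand-ins, not TIesr API names, and whether the C55 port supports reloading is not confirmed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical recognizer interface; these are NOT TIesr function names. */
typedef struct {
    int  (*open)(const char *grammar_dir);  /* returns 0 on success */
    void (*close)(void);
} recognizer_ops;

typedef enum { STAGE_TRIGGER, STAGE_COMMAND } stage_t;

typedef struct {
    const recognizer_ops *ops;
    const char *trigger_dir;   /* grammar/model files for the trigger word */
    const char *command_dir;   /* grammar/model files for the commands     */
    stage_t stage;
} two_stage;

/* Loads the trigger grammar and enters the trigger stage. */
static int two_stage_start(two_stage *ts)
{
    ts->stage = STAGE_TRIGGER;
    return ts->ops->open(ts->trigger_dir);
}

/* Call after each recognition result: closes the current grammar and opens
 * the other one. If open fails, the stage is left unchanged and the
 * recognizer remains closed. */
static int two_stage_advance(two_stage *ts)
{
    stage_t next = (ts->stage == STAGE_TRIGGER) ? STAGE_COMMAND : STAGE_TRIGGER;
    const char *dir = (next == STAGE_TRIGGER) ? ts->trigger_dir : ts->command_dir;
    int rc;

    ts->ops->close();
    rc = ts->ops->open(dir);
    if (rc == 0) {
        ts->stage = next;
    }
    return rc;
}

/* Stub recognizer for demonstration: remembers which directory is open. */
static const char *stub_open_dir = NULL;
static int  stub_open(const char *dir) { stub_open_dir = dir; return 0; }
static void stub_close(void)           { stub_open_dir = NULL; }
```

Keeping each grammar small in this way matches the observation in the original post that fewer phrases per grammar file gave better accuracy.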

    Lorin

  • Hi Lali, Hi Lorin,

    Thank you very much for your detailed information.

    Best regards,
    j-breeze