Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/335523
Title: | A multimodal approach to develop robust speech recognition systems |
Researcher: | Radha, N |
Guide(s): | Shahina, A |
Keywords: | Multimodal speech; Speech recognition; Signal Processing |
University: | Anna University |
Completed Date: | 2020 |
Abstract: | Building a robust Automatic Speech Recognition (ASR) system for adverse conditions is a challenging task. The performance of an ASR system degrades due to environmental noise, speaker variability, and transmission variability, among other factors. One way to improve the performance of such a system is a multimodal approach. This thesis focuses on building a robust ASR system by combining the complementary evidence present in the multiple modalities through which speech is expressed. These complementary streams of evidence, recorded using a Normal Microphone (NM), a Throat Microphone (TM), and a camera, respectively, are combined to build a robust multimodal speech recognition system for the highly confusable Consonant-Vowel (CV) units of the Hindi language. These CV sound units differ in their place and manner of articulation, yet have high acoustic similarity and hence are easily confused. A comparative study of the recognition performance of the trimodal ASR system (built on all three modalities) and the bimodal ASR systems built both with and without NM data (the latter using only TM and Visual Lip Reading (VLR)) is carried out to analyse the extent to which the audio and visual cues provide complementary evidence in recognizing the acoustically similar, highly confusable sound units. Such a study helps in designing robust multimodal ASR systems even for highly confusable datasets. First, unimodal systems are built for NM, TM, and VLR speech for the syllabic units. The syllabic units of the Hindi language are categorized into three groups: Vowel, Place of Articulation (POA), and Manner of Articulation (MOA). Mel Frequency Cepstral Coefficient (MFCC) features are extracted from the NM and TM signals.
Pixel-based transforms such as the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT), and motion-based features such as Motion History Image (MHI) based DCT, MHI-based DWT, and Zernike Moments (ZM), are extracted as features from the visual speech signal. A simple Left to Right (L-R) Hidden Markov Model - Ga |
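The abstract names MFCC features for the NM and TM channels. As an illustrative sketch only (not the thesis's implementation; the frame size, hop, and filterbank settings below are assumed, not taken from the thesis), a minimal MFCC pipeline in numpy looks like:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: pre-emphasis, framing, mel filterbank, log, DCT-II."""
    # Pre-emphasis boosts the high frequencies attenuated in speech production.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(sig) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    mpts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mpts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep the first n_ceps.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T  # shape: (n_frames, n_ceps)
```

In practice a library such as librosa or HTK would compute these features; the sketch only shows the standard stages the abstract refers to, and the resulting per-frame vectors are what a GMM-HMM recognizer would be trained on.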
Pagination: | xxix,179 p. |
URI: | http://hdl.handle.net/10603/335523 |
Appears in Departments: | Faculty of Information and Communication Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
01_title.pdf | Attached File | 21.38 kB | Adobe PDF | View/Open |
02_certificates.pdf | | 105.61 kB | Adobe PDF | View/Open |
03_vivaproceedings.pdf | | 381.17 kB | Adobe PDF | View/Open |
04_abstracts.pdf | | 13.37 kB | Adobe PDF | View/Open |
05_bonafidecertificate.pdf | | 324.74 kB | Adobe PDF | View/Open |
06_acknowledgements.pdf | | 409.78 kB | Adobe PDF | View/Open |
07_contents.pdf | | 12.79 kB | Adobe PDF | View/Open |
08_listoftables.pdf | | 19 kB | Adobe PDF | View/Open |
09_listoffigures.pdf | | 128.95 kB | Adobe PDF | View/Open |
10_listofabbreviations.pdf | | 54.85 kB | Adobe PDF | View/Open |
11_chapter1.pdf | | 456.78 kB | Adobe PDF | View/Open |
12_chapter2.pdf | | 284.61 kB | Adobe PDF | View/Open |
13_chapter3.pdf | | 887.08 kB | Adobe PDF | View/Open |
14_chapter4.pdf | | 3.55 MB | Adobe PDF | View/Open |
15_chapter5.pdf | | 878.42 kB | Adobe PDF | View/Open |
16_chapter6.pdf | | 1.02 MB | Adobe PDF | View/Open |
17_chapter7.pdf | | 883.85 kB | Adobe PDF | View/Open |
18_conclusion.pdf | | 57.36 kB | Adobe PDF | View/Open |
19_references.pdf | | 110.72 kB | Adobe PDF | View/Open |
20_listofpublications.pdf | | 15.24 kB | Adobe PDF | View/Open |
80_recommendation.pdf | | 97.36 kB | Adobe PDF | View/Open |
Items in Shodhganga are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) Licence.