Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/335523
Title: A multimodal approach to develop robust speech recognition systems
Researcher: Radha, N
Guide(s): Shahina, A
Keywords: Multimodal speech
Speech recognition
Signal Processing
University: Anna University
Completed Date: 2020
Abstract: Building a robust Automatic Speech Recognition (ASR) system for adverse conditions is a challenging task. The performance of an ASR system degrades due to environmental noise, speaker variability, and transmission variability, among other factors. One way to improve the performance of such a system is to use a multimodal approach. This thesis focuses on building a robust ASR system by combining the complementary evidence present in the multiple modalities through which speech is expressed. This complementary evidence of speech, recorded using a Normal Microphone (NM), a Throat Microphone (TM), and a camera, is combined to build a robust multimodal speech recognition system for the highly confusable Consonant-Vowel (CV) units of the Hindi language. These CV sound units differ in their place and manner of articulation, yet have high acoustic similarity and hence are highly confusable. A comparative study of the recognition performance of the trimodal ASR system (built from all three modalities) and the bimodal ASR systems built with and without NM data (the latter using only TM and Visual Lip Reading (VLR)) is carried out to analyze the extent to which the audio and visual cues provide complementary evidence in recognizing the acoustically similar, highly confusable sound units. Such a study would help in designing robust multimodal ASR systems even for highly confusable datasets. First, unimodal systems are built from NM, TM, and VLR speech for the syllabic units. The syllabic units of the Hindi language are categorized into three groups: Vowel, Place of Articulation (POA), and Manner of Articulation (MOA). Mel Frequency Cepstral Coefficient (MFCC) features are extracted from the NM and TM signals. Pixel-based features such as the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT), and motion-based features such as Motion History Image (MHI) based DCT, DWT, and Zernike Moments (ZM), are extracted from the visual speech signal. A simple Left to Right (L-R) Hidden Markov Model - Ga
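To make the audio front end described in the abstract concrete, the following is a minimal sketch (not taken from the thesis) of MFCC extraction and a left-to-right GMM-HMM classifier for a single CV unit, assuming the Python packages librosa and hmmlearn; the function names, file paths, state counts, and mixture counts are illustrative choices, not the thesis configuration.

    import numpy as np
    import librosa
    from hmmlearn.hmm import GMMHMM

    def extract_mfcc(wav_path, n_mfcc=13):
        # Return a (frames x n_mfcc) MFCC matrix for one NM or TM utterance.
        signal, sr = librosa.load(wav_path, sr=None)        # keep the native sampling rate
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T                                        # frames along the first axis

    def train_cv_model(train_mfccs, n_states=5, n_mix=2):
        # Train one left-to-right GMM-HMM on a list of MFCC matrices
        # (one matrix per training utterance of the same CV unit).
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag",
                       init_params="mcw", params="tmcw")     # start probabilities stay fixed
        model.startprob_ = np.eye(n_states)[0]               # always begin in the first state
        transmat = np.zeros((n_states, n_states))
        for i in range(n_states):                            # self-loop or move one state forward
            transmat[i, i] = 0.5
            transmat[i, min(i + 1, n_states - 1)] += 0.5
        model.transmat_ = transmat
        X = np.concatenate(train_mfccs)
        lengths = [len(m) for m in train_mfccs]
        model.fit(X, lengths)
        return model

    def recognise(test_mfcc, models):
        # models: dict mapping a CV label to its trained GMMHMM.
        # The label whose model scores the test utterance highest wins.
        return max(models, key=lambda label: models[label].score(test_mfcc))

In a full recognizer of this kind, one such model would be trained per CV class and a test utterance assigned to the class whose model gives the highest log-likelihood; the multimodal systems studied in the thesis additionally combine this acoustic evidence with the TM and visual (lip-reading) streams.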
Pagination: xxix,179 p.
URI: http://hdl.handle.net/10603/335523
Appears in Departments:Faculty of Information and Communication Engineering

Files in This Item:
File                            Size        Format
01_title.pdf                    21.38 kB    Adobe PDF
02_certificates.pdf             105.61 kB   Adobe PDF
03_vivaproceedings.pdf          381.17 kB   Adobe PDF
04_abstracts.pdf                13.37 kB    Adobe PDF
05_bonafidecertificate.pdf      324.74 kB   Adobe PDF
06_acknowledgements.pdf         409.78 kB   Adobe PDF
07_contents.pdf                 12.79 kB    Adobe PDF
08_listoftables.pdf             19 kB       Adobe PDF
09_listoffigures.pdf            128.95 kB   Adobe PDF
10_listofabbreviations.pdf      54.85 kB    Adobe PDF
11_chapter1.pdf                 456.78 kB   Adobe PDF
12_chapter2.pdf                 284.61 kB   Adobe PDF
13_chapter3.pdf                 887.08 kB   Adobe PDF
14_chapter4.pdf                 3.55 MB     Adobe PDF
15_chapter5.pdf                 878.42 kB   Adobe PDF
16_chapter6.pdf                 1.02 MB     Adobe PDF
17_chapter7.pdf                 883.85 kB   Adobe PDF
18_conclusion.pdf               57.36 kB    Adobe PDF
19_references.pdf               110.72 kB   Adobe PDF
20_listofpublications.pdf       15.24 kB    Adobe PDF
80_recommendation.pdf           97.36 kB    Adobe PDF


Items in Shodhganga are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA 4.0).