Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/335523
Title: | A multimodal approach to develop robust speech recognition systems |
Researcher: | Radha, N |
Guide(s): | Shahina, A |
Keywords: | Multimodal speech; Speech recognition; Signal Processing |
University: | Anna University |
Completed Date: | 2020 |
Abstract: | Building a robust Automatic Speech Recognition (ASR) system for adverse conditions is a challenging task. The performance of an ASR system degrades due to environmental noise, speaker variability, and transmission variability, among other factors. One way to improve the performance of such a system is a multimodal approach. This thesis focuses on building a robust ASR system by combining the complementary evidence present in the multiple modalities through which speech is expressed. These complementary streams of evidence, recorded using a Normal Microphone (NM), a Throat Microphone (TM), and a camera, respectively, are combined to build a robust multimodal speech recognition system for the highly confusable Consonant-Vowel (CV) units of the Hindi language. These CV sound units differ in their place and manner of articulation, yet have high acoustic similarity and hence are easily confused. A comparative study of the recognition performance of the trimodal ASR system (built on all three modalities) and the bimodal ASR systems built both with and without NM data (the latter using only TM and Visual Lip Reading (VLR)) is carried out to analyse the extent to which the audio and visual cues provide complementary evidence in recognizing the acoustically similar, highly confusable sound units. Such a study helps in designing robust multimodal ASR systems even for highly confusable datasets. First, unimodal systems are built for NM, TM, and VLR speech for the syllabic units. The syllabic units of the Hindi language are categorized into three groups: Vowel, Place of Articulation (POA), and Manner of Articulation (MOA). Mel Frequency Cepstral Coefficient (MFCC) features are extracted from the NM and TM signals.
Pixel-based transforms such as the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT), and motion-based features such as Motion History Image (MHI) based DCT, MHI-based DWT, and Zernike Moments (ZM), are extracted as features from the visual speech signal. A simple Left to Right (L-R) Hidden Markov Model - Ga |
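The abstract names MFCC features for the NM and TM channels. As an illustrative sketch only (not the thesis's implementation; the frame size, hop, and filterbank settings below are assumed, not taken from the thesis), a minimal MFCC pipeline in numpy looks like:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: pre-emphasis, framing, mel filterbank, log, DCT-II."""
    # Pre-emphasis boosts the high frequencies attenuated in speech production.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(sig) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    mpts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mpts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep the first n_ceps.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T  # shape: (n_frames, n_ceps)
```

In practice a library such as librosa or HTK would compute these features; the sketch only shows the standard stages the abstract refers to, and the resulting per-frame vectors are what a GMM-HMM recognizer would be trained on.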
Pagination: | xxix,179 p. |
URI: | http://hdl.handle.net/10603/335523 |
Appears in Departments: | Faculty of Information and Communication Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
01_title.pdf | Attached File | 21.38 kB | Adobe PDF | View/Open |
02_certificates.pdf | | 105.61 kB | Adobe PDF | View/Open |
03_vivaproceedings.pdf | | 381.17 kB | Adobe PDF | View/Open |
04_abstracts.pdf | | 13.37 kB | Adobe PDF | View/Open |
05_bonafidecertificate.pdf | | 324.74 kB | Adobe PDF | View/Open |
06_acknowledgements.pdf | | 409.78 kB | Adobe PDF | View/Open |
07_contents.pdf | | 12.79 kB | Adobe PDF | View/Open |
08_listoftables.pdf | | 19 kB | Adobe PDF | View/Open |
09_listoffigures.pdf | | 128.95 kB | Adobe PDF | View/Open |
10_listofabbreviations.pdf | | 54.85 kB | Adobe PDF | View/Open |
11_chapter1.pdf | | 456.78 kB | Adobe PDF | View/Open |
12_chapter2.pdf | | 284.61 kB | Adobe PDF | View/Open |
13_chapter3.pdf | | 887.08 kB | Adobe PDF | View/Open |
14_chapter4.pdf | | 3.55 MB | Adobe PDF | View/Open |
15_chapter5.pdf | | 878.42 kB | Adobe PDF | View/Open |
16_chapter6.pdf | | 1.02 MB | Adobe PDF | View/Open |
17_chapter7.pdf | | 883.85 kB | Adobe PDF | View/Open |
18_conclusion.pdf | | 57.36 kB | Adobe PDF | View/Open |
19_references.pdf | | 110.72 kB | Adobe PDF | View/Open |
20_listofpublications.pdf | | 15.24 kB | Adobe PDF | View/Open |
80_recommendation.pdf | | 97.36 kB | Adobe PDF | View/Open |
Items in Shodhganga are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) Licence.