Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/540614
Title: Multilingual Text to Speech Synthesis using Sequence to Sequence Neural Networks
Researcher: Sivanand, Achanta
Guide(s): Kishore, Prahallad
Keywords: Engineering; Engineering and Technology; Engineering Electrical and Electronic
University: International Institute of Information Technology, Hyderabad
Completed Date: 2018
Abstract: Keywords: Statistical Parametric Speech Synthesis, Recurrent Neural Networks, Polyglot Synthesis, Multilingual Synthesis, Sequence-to-Sequence Learning, End-to-End Synthesis.
Text-to-speech (TTS) synthesis is typically carried out in two ways: (1) by concatenating waveform segments of units (often dubbed unit selection synthesis (USS)) and (2) by predicting speech parameters from text using statistical models (also called statistical parametric speech synthesis (SPSS)). Most commercial TTS systems use the USS approach because it produces highly natural speech. However, the USS approach requires the recorded waveforms to be stored, which demands memory, whereas the statistical approach alleviates this by modeling the speech compactly in a parametric form. Using the waveform directly also offers little scope to alter its characteristics to produce variations such as different speakers, genders, voice qualities, and languages. The parameters of a statistical model, on the other hand, can be suitably transformed to produce the desired variations.
These advantages (compactness and flexibility) come at the cost of the speech sounding slightly more robotic than its unit-selection counterpart. A typical SPSS system has several components, namely text feature extraction, speech parameter extraction, alignment of text and speech features, a text feature-to-speech parameter regression model, and a duration prediction model. Each of these components is independently hand-engineered, making the SPSS system susceptible to errors in any one of them. The loss in naturalness of SPSS output has been largely attributed to the limited ability of the regression model (also dubbed the acoustic model) to capture the complexity of the mapping from text features to speech parameters, and to the representations used for text and speech data. In addition, the use of a separate alignment model leads to erroneous averaging in acoustic modeling.
In this thesis, we address the issues of acoustic modeling, textual representation, acoustic representation, multilingual multispea
Pagination: 109
URI: http://hdl.handle.net/10603/540614
Appears in Departments: Department of Electronic and Communication Engineering

Files in This Item:
File                    Description    Size       Format
80_recommendation.pdf   Attached File  88.6 kB    Adobe PDF
abstract.pdf                           69.38 kB   Adobe PDF
annexures.pdf                          98.14 kB   Adobe PDF
chapter 1.pdf                          946.84 kB  Adobe PDF
chapter 2.pdf                          382.97 kB  Adobe PDF
chapter 3.pdf                          523.37 kB  Adobe PDF
chapter 4.pdf                          1.56 MB    Adobe PDF
chapter 5.pdf                          186.55 kB  Adobe PDF
chapter 6.pdf                          373.61 kB  Adobe PDF
chapter 7.pdf                          194.91 kB  Adobe PDF
chapter 8.pdf                          71.77 kB   Adobe PDF
content.pdf                            77.46 kB   Adobe PDF
preliminary pages.pdf                  123.44 kB  Adobe PDF
title page.pdf                         73.73 kB   Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
