Key Techniques of Language Identification

Source: Commodity & Quality · Consumer Perspective | Citations: 0
  Department of Electronics and Communication Engineering, Beijing Institute of Technology, Haidian, Beijing, China
  Abstract: In speech recognition, language identification (LID) refers to the process of using a computer to automatically identify the language of a spoken utterance. It plays an increasingly important role in applications such as multilingual conversation systems, spoken language translation systems and multilingual information retrieval systems [1]. The main task of a language identifier is to design an efficient algorithm that helps a machine correctly identify a particular language from a given audio sample. Researchers have placed great emphasis on the task of language identification, and over the last two decades there has been significant progress in this area. A human can identify a particular language better than a machine if the language is familiar, but with machine learning a computer can be trained to identify as many languages as it is given as input, whereas a human being can identify at most 10–15 languages [2]. In this paper, I discuss language identification in MATLAB for three languages based on our standard database. After extracting a set of features using Mel-frequency cepstral coefficients (MFCC), I perform training using vector quantization (VQ), and finally, for better classification, I use a Gaussian mixture model (GMM).
  Keywords: MFCC, GMM, VQ, K-SVD
  1.Introduction
  Today, when we call most large companies, a person doesn't usually answer the phone. Instead, an automated voice recording answers and instructs you to press buttons to move through option menus. Many companies have moved beyond requiring you to press buttons, though. Often you can just speak certain words (again, as instructed by a recording) to get what you need. The system that makes this possible is a type of speech recognition program -- an automated phone system.
  People with disabilities that prevent them from typing have also adopted speech-recognition systems. If a user has lost the use of his hands, or for visually impaired users when it is not possible or convenient to use a Braille keyboard, the systems allow personal expression through dictation as well as control of many computer tasks. Some programs save users' speech data after every session, allowing people with progressive speech deterioration to continue to dictate to their computers [3].
  Current programs fall into two categories.
  Small-vocabulary/many-users: these systems are ideal for automated telephone answering. Users can speak with a great deal of variation in accent and speech patterns, and the system will still understand them most of the time. However, usage is limited to a small number of predetermined commands and inputs, such as basic menu options or numbers.
  Large-vocabulary/limited-users: these systems work best in a business environment where a small number of users will work with the program. While these systems work with a good degree of accuracy (85 percent or higher with an expert user) and have vocabularies in the tens of thousands, you must train them to work best with a small number of primary users. The accuracy rate will fall drastically with any other user.
  Speech recognition systems made more than 10 years ago also faced a choice between discrete and continuous speech. It is much easier for the program to understand words when we speak them separately, with a distinct pause between each one. However, most users prefer to speak at a normal, conversational speed. Almost all modern systems are capable of understanding continuous speech.
  How it Works
  To convert speech to on-screen text or a computer command, a computer has to go through several complex steps. When we speak, we create vibrations in the air. An analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand. To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes separates it into different frequency bands (frequency is the rate at which the sound wave vibrates, heard by humans as differences in pitch). It also normalizes the sound, or adjusts it to a constant volume level. It may also have to be temporally aligned: people don't always speak at the same speed, so the sound must be adjusted to match the speed of the template sound samples already stored in the system's memory.
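  The normalization and framing steps above can be sketched in a few lines. The snippet below is an illustrative Python/NumPy sketch rather than the paper's MATLAB code; the 24 kHz rate and 50 ms (1200-sample) frames are the values used in the feature-extraction discussion later in this paper, and the frame/hop sizes are otherwise arbitrary choices for the example.

```python
import numpy as np

def normalize_and_frame(signal, frame_len=1200, hop=600):
    """Scale a digitized waveform to a constant peak level, then
    slice it into fixed-length, half-overlapping frames."""
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal = signal / peak          # volume normalization
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return frames

# 1 second of a 24 kHz test tone -> 50 ms frames of 1200 samples each
sig = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)
frames = normalize_and_frame(sig)
print(frames.shape)  # (39, 1200)
```

  Each row of `frames` is then processed independently by the feature-extraction stage described below.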
  Speech Recognition and Statistical Modeling
  Early speech recognition systems tried to apply a set of grammatical and syntactical rules to speech. If the words spoken fit into a certain set of rules, the program could determine what the words were. However, human language has numerous exceptions to its own rules, even when it's spoken consistently. Accents, dialects and mannerisms can vastly change the way certain words or phrases are spoken. Imagine someone from Boston saying the word "barn." He wouldn't pronounce the "r" at all, and the word comes out rhyming with "John." Or consider the sentence, "I'm going to see the ocean." Most people don't enunciate their words very carefully. The result might come out as "I'm goin' da see tha ocean." They run several of the words together with no noticeable break, such as "I'm goin'" and "the ocean." Rules-based systems were unsuccessful because they couldn't handle these variations. This also explains why earlier systems could not handle continuous speech -- you had to speak each word separately, with a brief pause in between.
  How to extract a feature vector (MFCC)
  In general, a feature vector is a list of values (numbers) that represents the relevant features of a signal for some specific task (here, use as input to a speech recognition algorithm) in an efficient and expressive way.
  A concrete example: suppose that, as the first step of our procedure, we divide our audio signal (say, a 24 kHz mono signal) into frames (fragments of fixed length, say 50 ms). We are now going to build an appropriate feature vector for each of these frames.
  A frame here is composed of 1200 samples, which we store (say) in a row vector in MATLAB. We could already consider this vector a 'feature vector' (it certainly represents the audio signal: each number is the audio intensity as a function of time). But this trivial vector is not very appropriate, because it contains too many numbers and because they are not in themselves very expressive. We want to distinguish a vowel from a consonant, for example, and these 1200 numbers say little about that; the same speaker saying the same vowel will probably produce vectors that are very different. We don't want that.
  A first transformation that gives us a more useful feature vector is the Fourier transform (or rather, the spectrogram) of the audio, of which we probably have a basic idea from music graphic equalizers and the like. Instead of a vector of 1200 samples (one for each instant of time), we now have a vector of, say, 128 numbers that tell us how much energy the audio has in each frequency band (always inside the frame). This is more efficient (fewer numbers) and more expressive: perhaps we can start roughly distinguishing vowels from consonants, male from female voices, and so on, just by looking at these numbers.
  From this, other transformations follow (mel scaling, which changes the scale of the frequencies; the cepstrum, a log followed by an inverse Fourier transform, or rather a DCT, which is conceptually similar here). Finally we trim the least important elements from our feature vector. Each step gives us a different feature vector, hopefully more efficient and expressive than the previous one. The whole procedure can sound a little complex and esoteric to those not familiar with it, but conceptually, in terms of understanding what it means to compute a suitable feature vector, these last steps are analogous to the first one.
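  The pipeline just described (spectrum, mel filterbank, log, DCT, truncation) can be sketched for a single frame as follows. This is an illustrative Python/NumPy sketch, not the paper's MATLAB implementation; the filterbank size (26 bands) and the exact mel formula are common textbook choices, assumed here for concreteness.

```python
import numpy as np

def mfcc_frame(frame, sr=24000, n_mels=26, n_ceps=12):
    """Sketch of MFCC computation for one audio frame:
    power spectrum -> mel filterbank -> log -> DCT -> keep low coeffs."""
    # 1. Power spectrum of the Hamming-windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    n_bins = len(spec)
    # 2. Triangular mel filterbank between 0 Hz and the Nyquist frequency
    def hz_to_mel(f):
        return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m):
        return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(0, hz_to_mel(sr / 2), n_mels + 2)
    bin_pts = np.floor(mel_to_hz(mel_pts) / (sr / 2) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    # 3. Log filterbank energies, then a DCT to decorrelate; keep n_ceps
    log_e = np.log(fbank @ spec + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_mels)
    return dct @ log_e

frame = np.sin(2 * np.pi * 440 * np.arange(1200) / 24000)  # 50 ms test tone
coeffs = mfcc_frame(frame)
print(coeffs.shape)  # (12,)
```

  The 12 coefficients returned per frame are the "absolute MFCCs" referred to in the next section.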
  The next phase of this paper is our decision on which algorithm is suitable for language identification; we have chosen the Gaussian mixture model (GMM). NB: we use absolute energy (1), 12 MFCCs (often referred to as absolute MFCCs), and the first- and second-order derivatives of these absolute coefficients to get a basic 39-dimension MFCC front end:
  13  Absolute energy (1) and MFCCs (12)
  13  Delta: first-order derivatives of the thirteen absolute coefficients
  13  Delta-delta: second-order derivatives of the thirteen absolute coefficients
  39  Total: basic MFCC front end
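  Assembling the 39-dimension front end from the 13 static coefficients can be sketched as below. This is an illustrative Python/NumPy sketch; the derivative is approximated with a simple frame-to-frame difference (`np.gradient`), a deliberate simplification of the regression-window formula commonly used in practice, and the input matrix is random stand-in data rather than real MFCCs.

```python
import numpy as np

def add_deltas(feats):
    """Stack 13 static features (energy + 12 MFCCs) per frame with their
    first (delta) and second (delta-delta) time derivatives -> 39 dims."""
    delta = np.gradient(feats, axis=0)    # first-order derivative over time
    delta2 = np.gradient(delta, axis=0)   # second-order derivative
    return np.hstack([feats, delta, delta2])

static = np.random.randn(100, 13)         # 100 frames x 13 static coeffs
full = add_deltas(static)
print(full.shape)  # (100, 39)
```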
  Gaussian mixture model
  GMM classification relies on the EM optimization technique, a clustering-based learning process. While doing classification with a GMM, the EM (Expectation-Maximization) algorithm runs in the background to find a maximum-likelihood parameter estimate, performing a many-to-one mapping from an underlying distribution. The EM algorithm consists of two major steps: an E (Expectation) step followed by an M (Maximization) step [2]. The Expectation step is taken with respect to the unknown underlying variables, using the current estimate of the parameters and conditioned upon the observations. The Maximization step provides a new estimate of the parameters; starting from an initial parameter set, both steps are iterated until convergence is achieved. For d dimensions, the Gaussian distribution of a given vector x is defined by:

  N(x; μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))

  where μ is the mean vector and Σ is the covariance matrix of the Gaussian. The probability given a mixture of N Gaussians is:

  p(x) = Σᵢ₌₁ᴺ wᵢ N(x; μᵢ, Σᵢ)

  where N is the number of Gaussians and wᵢ is the weight of Gaussian i, with Σᵢ wᵢ = 1.
  Selection of the number of Gaussian mixtures is essential for designing a good GMM system. For example, in both the hybrid and the pure GMM experiments considered here, we have used Gaussian mixture models of up to 1024 components. GMM classification is used in image recognition, computer vision and speech recognition [2].
  During recognition, an unknown utterance is compared to each of the GMMs. The likelihood that the unknown utterance was spoken in the same language as the speech used to train each model is computed, and the most likely model determines the hypothesized language.
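  The train-one-model-per-language and pick-the-highest-likelihood scheme just described can be sketched as follows. This is an illustrative Python sketch using scikit-learn's `GaussianMixture` rather than the paper's MATLAB code; the two "languages" are hypothetical stand-in data (Gaussian noise at different means) rather than real MFCC features, and the mixture is kept small (4 components, versus up to 1024 in the paper's experiments) so the example runs quickly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in data: one matrix of 39-dim "MFCC frames" per language.
rng = np.random.default_rng(0)
train = {
    "lang_a": rng.normal(0.0, 1.0, size=(500, 39)),
    "lang_b": rng.normal(3.0, 1.0, size=(500, 39)),
}

# Train one GMM per language on that language's feature vectors.
models = {name: GaussianMixture(n_components=4, covariance_type="diag",
                                random_state=0).fit(X)
          for name, X in train.items()}

def identify(utterance_feats):
    """Score the unknown utterance against every language model and
    return the language whose GMM gives the highest log-likelihood."""
    scores = {name: gmm.score(utterance_feats)  # mean log-likelihood per frame
              for name, gmm in models.items()}
    return max(scores, key=scores.get)

test_utt = rng.normal(3.0, 1.0, size=(50, 39))  # drawn like lang_b
print(identify(test_utt))  # lang_b
```

  With real data, `train` would hold the MFCC matrices extracted from each language's recordings, and the test utterance would be a new recording's features.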
  Experiment Setup
  Architecture of our System
  Database: words taken for training and testing the three languages
  Conclusion
  Our audio data is recorded and saved in a folder.
  We then use the MFCC feature-extraction approach to extract the necessary information from the folder containing our audio data.
  After that we use our algorithm of choice, GMM, to train our models using the extracted and saved MFCC data. For the recognition step, the MFCC feature vectors and the trained GMM models are compared to find similarities in their properties. This is done using vector quantization, a process that searches for a similar feature vector to represent the input feature vector. A codebook is constructed in advance by collecting and processing a sufficient number of feature vectors. This step utilizes K-SVD's k-means clustering method to recognize the similarities.
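  The codebook construction and lookup just described can be sketched as below. This is an illustrative Python sketch using scikit-learn's `KMeans` in place of the paper's K-SVD/k-means step; the training vectors are random stand-in data (labeled hypothetical) rather than real MFCC frames, and the 64-entry codebook size is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for a pool of 12-dim MFCC feature vectors.
rng = np.random.default_rng(1)
train_feats = rng.normal(size=(1000, 12))

# Build a 64-entry codebook: each k-means cluster centre is one codeword.
codebook = KMeans(n_clusters=64, n_init=5, random_state=1).fit(train_feats)

def quantize(feats):
    """Replace each feature vector by the index of its nearest codeword;
    the mean distance to those codewords measures how well the codebook
    represents the input."""
    codes = codebook.predict(feats)
    dists = np.linalg.norm(feats - codebook.cluster_centers_[codes], axis=1)
    return codes, dists.mean()

codes, distortion = quantize(rng.normal(size=(50, 12)))
print(len(codes))  # 50
```

  A lower mean distortion indicates that the input features resemble the data the codebook was built from, which is the similarity signal the recognition step relies on.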
  In short, recognition is a simple step. We just look at the results in the command window; if they do not correspond with our audio data, we can simply conclude that the system could not recognize it. On the other hand, if they do correspond, we say it recognized it.
  We can improve the accuracy rate by re-doing our recordings more slowly and clearly. Retraining the models and accounting for background noise are also things we need to take into consideration if we want highly efficient and accurate results. Below is a list of things we should consider for good accuracy.
  Interface of experiment results, showing extraction, training, and recognition processing and results in MATLAB.
  Weaknesses and Flaws
  No speech recognition system is 100 percent perfect; several factors can reduce accuracy. Some of these factors are issues that continue to improve as the technology improves. Others can be lessened -- if not completely corrected -- by the user [3]. The flaws and weaknesses below are factors we need to take into consideration.
  Low signal-to-noise ratio
  Overlapping speech
  Intensive use of computer power
  Homonyms
  References:
  [1] M. A. Zissman, "Automatic Language Identification of Telephone Speech."
  [2] P. Roy and P. K. Das, "A Hybrid VQ-GMM Approach for Identifying Indian Languages."
  [3] "How Speech Recognition Works," http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition.htm