In speech signal processing systems, voice activity detection (VAD) based on frame energy is often degraded by the non-stationarity of speech energy and by noise. To improve the performance and robustness of speech endpoint detection, this paper introduces visual information. Visual features are generated by a data-driven linear transformation, and two single-modality VAD systems are built on a proposed general statistics-based VAD model; a two-step fusion method then combines them into a multimodal VAD system. Experiments show that the multimodal VAD, which uses both audio and visual information, achieves a 55.0% relative reduction in frame error rate and a 98.5% relative reduction in sentence-segmentation error rate over the frame-energy-based audio VAD. These results indicate that the multimodal VAD method essentially avoids sentence-segmentation errors and also significantly improves frame-level detection performance, making it a highly effective approach.
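The frame-energy baseline the abstract compares against can be sketched as follows. This is a minimal illustration, not the paper's actual system: the frame length, hop size, and the relative-maximum thresholding rule (`threshold_ratio`) are assumptions chosen for demonstration, since the paper's parameters are not given here.

```python
import numpy as np

def frame_energy_vad(signal, frame_len=256, hop=128, threshold_ratio=0.1):
    """Label each frame as speech (True) or non-speech (False) by
    comparing its short-time energy to a fraction of the maximum
    frame energy. A toy stand-in for an energy-based VAD; real
    systems use adaptive noise-floor estimation instead."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    threshold = threshold_ratio * energies.max()
    return energies > threshold

# Toy signal: near-silence, a loud 440 Hz burst, near-silence.
fs = 8000
t = np.arange(fs // 2) / fs
sig = np.concatenate([
    0.001 * np.random.randn(fs // 2),   # near-silent noise
    np.sin(2 * np.pi * 440 * t),        # stand-in for a speech segment
    0.001 * np.random.randn(fs // 2),   # near-silent noise
])
decisions = frame_energy_vad(sig)
```

Because the decision depends only on per-frame energy, a quiet speech frame or a loud noise burst flips the label, which is exactly the fragility that motivates adding the visual modality.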