科技>计算机>计算机科学
实时语音处理实践指南

实时语音处理实践指南"

作者:葛世超等
ISBN:9787121387593
定价:¥99.0
字数:560千字
页数:352
出版时间:2020-04
开本:16开
版次:01-01
装帧:
出版社:电子工业出版社
简介

本书主要介绍基于互联网场景的交互式实时语音处理流程,内容涉及智能语音助手、智能音箱、音/视频会议等,具体包括实时语音信号处理、数字音效、网络传输编/解码和语音唤醒识别四部分。在阐述各部分的内容时,本书从基本概念和原理入手,将理论和实践相结合,并细致分析了极具商业价值的实例,以帮助读者了解相关算法在工程上是如何实现的。另外,为便于有兴趣的读者快速进行算法验证并将其改进和应用到实际的项目中,作者也开源了书中算法的源码。

前言

前言 语音信号处理是利用数字信号处理技术对语音信号进行一系列处理,从而达到存储、传输、增强、识别和合成等目的的一门学科,早在一百多年前就被应用到了通信行业。近年来,传统的语音信号处理技术和人工智能技术的结合使得基于智能语音技术的应用呈现爆炸式增长。语音技术在电话、网络音/视频会议、音效、语音助手、实时翻译、智能音箱、智能接待、智能客服、安防监控、家庭休闲娱乐等方面都有着非常重要的应用,国内一些大型科技公司也相继组建了自己的语音开发团队,如阿里巴巴iDST实验室、腾讯音视频实验室等,同时,一些专业的语音技术服务公司,如科大讯飞、思必驰、云知声等也借着语音技术的风口迅猛发展。此外,从事语音研发的初创公司也如雨后春笋般破土而出,这就导致对语音技术人员的需求与日俱增。然而语音处理技术的门槛较高,它与信息处理科学、计算机科学、声学、语言学、生理学、认知科学、数学、统计学以及概率学等许多学科有着密切的联系。对从业者来说,要想在该领域达到高级工程师的级别,往往需要多年的行业经验积累,而市场上一些关于语音技术的书籍往往偏向于理论化,给人感觉晦涩难懂,注重实际应用、针对性强的相关技术书籍是这方面紧缺的资料,而针对交互式实时应用场景的语音处理书籍则更为稀缺。本书正是在这一背景下所写的。本书介绍了基于互联网传输的交互式实时语音处理技术栈,并开源了书中众多源码工程。 本书结构 根据交互式实时应用场景的语音处理流程,全书内容可划分为4 部分,第1 部分是语音增强处理,第2 部分是数字音效,第3 部分是网络传输编/解码,第4 部分是语音唤醒和识别。 第1~8章是第1部分,内容主要是语音增强处理,包括基于传统信号处理方法的语音检测、单通道降噪、自适应回声消除、声源定位、波束形成和盲源分离等。另外,随着深度学习技术的发展,深度学习模型的泛化能力已有大幅提升,因而特定场景下基于深度学习的语音增强方法已经达到商用要求,这部分也介绍了相关方法和实例。 第9 章是第2 部分,内容主要是数字音效。音效除了可以弥补音频器件的不足和声场的缺陷,还常用于实现各种混音效果,以提升听觉体验。手机、耳机、音箱、家庭影院等产品都会用到音效技术,当前绝大多数音频专用芯片中已经具备音效调节功能,但在一些场景中还需要使用软件算法以满足定制化音效需求。 第10 章和第11 章是第3 部分,主要是与网络相关的内容。这里不是介绍网络传输协议,如UDP实时语音处理实践指南和RTP等,而是介绍用于解决网络传输不利因素(带宽波动、丢包、抖动和延迟)而采用的技术,具体来说,是传输拥塞控制、编/解码和网络均衡,其目的是在充分利用网络带宽的前提下获得最佳的语音质量。这些技术在智能音箱、智能助手、音/视频会议等交互式实时应用场景中必不可少。 第12章 和第13 章是第4 部分,内容主要是唤醒和识别。有些交互实时应用场景需要唤醒词识别技术以实现设备唤醒和设备控制。本书基于深度学习技术实现了嵌入式设备的唤醒,并在GitHub官网上提供了可以直接编译运行的安卓应用程序源码。大规模自然语音识别常用于服务器端,本书基于Kaldi语音识别工具搭建了基于GMM-HMM的自然语言识别系统,并介绍了基于深度学习的端到端的语音识别开源工程DeepSpeech,还介绍了如何使用该工具定制特定场景的自然语音识别。 本书所述技术围绕交互式实时网络应用场景特点展开,如智能语音助手、智能翻译机、VoIP电话等,通常这类场景希望所有的语音响应在限定的时间内完成。如询问智能语音助手天气情况时,理想状态是语音增强、网络传输编/解码和识别结果回传在200 毫秒内完成,当延迟超过500毫秒时,用户便能明显感知这一延迟。而VoIP这类场景除了音频,通常还伴随着视频数据,因而留给音频的网络带宽更为有限,且这类场景语音的双向接收对象均为人耳,对实时性和音质要求更高。概括地说,本书所述语音技术就是在限定的条件下获得最短的传输延迟和最高的语音品质。 学习建议 对于零基础读者,建议按本书编排顺序逐步阅读,本书中很多语音概念和语音信号处理方法在前两章有介绍;对于语音相关专业的在校学生,本书提供了众多理论和实际相结合的示例,动手操作这些示例有助于提升对算法工程应用的理解;对于工程技术和研发人员,由于本书的一些示例已经达到了商用水准,可以参考其中的一些处理方法和思路改进和优化相关算法。 由于语音处理基于一些数学和信号处理基础知识,所以本书在阐述相关算法原理和实现时不得不配以必要的数学公式。此外,限于篇幅,在阐述相关算法实现时以代码片段为主,详细代码可参见本书开源的众多源码。对于零基础且只想了解可以对语音进行哪些处理以及如何实现这些处理的读者,在阅读本书时可注重基本概念和原理并跳过相关公式推导内容。 致谢 本书得以出版,要感谢曾经帮助过我的同事和朋友们。这里要特别感谢我的前领导朱磊博士给予的帮助,感谢我的同事卢和春对本书第1 章的审阅和修改,还要感谢和我一起完成写作的几位作者,他们是吕强、钱思冲、张博伦和张硕,也十分感谢Zoom 公司的诸位同事。另外,要特别感谢我的妻子张卫洁,她在写作过程中给予了我无私的支持和帮助,也感谢电子工业出版社的李利健编辑和她的同事们,正是在他们的帮助下,本书才得以顺利出版。 最后,十分感谢各位网友在阅读我的博客和电子书时给予的反馈,感谢你们的支持,希望你们能从本书中获益。 本书由几位合作者合作完成,其中第5 章由张博伦整理完成,第8 章由钱思冲完成,第9 章由吕强完成,第13章由张硕完成,其他章节由葛世超完成。限于篇幅,诸如啸叫抑制、去混响、声纹识别、语音合成以及语义解析等算法在本书中并未涉及。另外,由于深度学习并不是本书重点,它只是作为一种工具在本书中使用,因而在语音增强、唤醒以及识别时只阐述了深度学习在这方面的具体使用,并未对深度学习这一工具做过多介绍,感兴趣的读者可进一步参考我在GitHub 官网上相关例子的完整源码。 因作者水平有限,书中错漏之处难免,如果你对本书内容有什么建议,或者发现了错误,都希望你能联系我(shichaog@126.com),我将尽快回复你。本书的源码工程和后期勘误可以通过我的邮箱地址和微信号获得。 葛世超

目录

绪论······································································.1 第1章 信号处理··············································.7 1.1 数字和模拟频率··········································.7 1.2 离散傅里叶变换···········································8 1.2.1 实数DFT ·····································.9 1.2.2 复数DFT ···································.10 1.2.3 负频分量····································.10 1.2.4 DFT变换性质···························.10 1.3 FFT···························································.11 1.3.1 FFT 结果举例····························.12 1.3.2 实信号FFT································.13 1.3.3 短时傅里叶变换························.14 1.3.4 STFT语音窗函数选择··············.14 1.4 重叠相加法和重叠保留法·························.16 1.4.1 OLA············································.17 1.4.2 OLS ············································.19 1.5 加权重叠相加法········································.21 1.5.1 WOLA 计算过程·······················.22 1.5.2 WOLA 窗函数选择···················.22 1.6 滤波器组···················································.23 1.7 语音预加重····································.27 1.8 高斯分布···················································.27 1.8.1 单高斯分布································.27 1.8.2 多维高斯分布····························.29 1.9 HMM模型················································.31 1.10 卡尔曼滤波·············································.32 本章小结·····························································.33 参考文献·····························································.33 第2章 发音机理和器件·······························.34 2.1 语音的产生和接收········································.34 2.1.1 语音产生机理····························.34 2.1.2 发声模型····································.36 2.1.3 发音单位····································.36 2.1.4 发音分类····································.37 2.1.5 声音接收····································.37 2.1.6 声音传播····································.38 2.2 扬声器·······················································.38 2.2.1 电学性能····································.38 2.2.2 声学性能····································.39 2.2.3 底噪············································.40 2.2.4 频响特性····································.41 2.2.5 THD+N POUT···························.41 2.2.6 电压(功率)和失真················.42 2.3 麦克风·······················································.42 2.3.1 麦克风性能指标························.42 2.3.2 麦克风的选择····························.43 2.4 结构设计····················································45 2.4.1 扬声器相关音腔设计················.45 2.4.2 麦克风和扬声器························.45 2.5 音频设备···················································.46 2.5.1 听音设备····································.46 2.5.2 声场表现力································.47 2.5.3 发声设备····································.48 2.5.4 消声室测试································.48 2.6 声学测试···················································.49 2.6.1 声学音量····································.50 2.6.2 失真度THD·······························.50 2.6.3 频响混叠····································.51 2.6.4 麦克风阵列一致性····················.53 2.6.5 AEC参考通路···························.54 2.6.6 扬声器镜频································.56 2.6.7 扬声器最大幅度下的THD·······.57 本章小结·····························································.58 参考文献·····························································.58 第3章 语音端点检测····································.59 3.1 特征选取···················································.59 3.2 判决准则···················································.61 3.2.1 门限············································.61 3.2.2 统计模型法································.61 3.2.3 机器学习法································.62 3.3 VAD 实例·················································.63 3.3.1 高斯分布····································.63 3.3.2 算法流程····································.63 3.3.3 计算流程····································.68 3.4 语音/非语音帧的初始参数························.75 3.4.1 模型参数计算····························.75 3.4.2 高斯混合模型····························.76 3.4.3 EM算法·····································.76 本章小结·····························································.78 参考文献·····························································.78 第4章 单通道降噪········································.79 4.1 谱减法·······················································.79 4.1.1 谱减法原理································.79 4.1.2 谱减法实现································.81 4.1.3 音乐噪声控制····························.83 4.1.4 滤波法········································.83 4.2 维纳滤波···················································.84 4.3 子空间降噪···············································.86 4.4 WebRTC 单通道降噪实现······················.87 4.4.1 算法原理····································.87 4.4.2 算法初始化································.88 4.4.3 信噪比计算:ComputeSnr ·······.90 4.4.4 语音噪声概率计算····················.91 4.4.5 特征选取····································.94 4.4.6 平坦度计算································.96 4.4.7 噪声估计更新函数: UpdateNoiseEstimate···············.97 4.4.8 消除噪声····································.98 4.4.9 信号合成····································.99 4.4.10 仿真结果··································.99 4.5 深度学习降噪········································.101 本章小结···························································.104 参考文献···························································.105 第5章 声学回声消除·································.106 5.1 回声消除原理·········································.106 5.2 自适应滤波器·········································.108 5.2.1 维纳滤波器······························.108 5.2.2 LMS算法································.109 5.2.3 NLMS算法······························.110 5.2.4 PBFDAF 算法··························.111 5.3 WebRTC 回声消除算法·······················.113 5.3.1 延迟估计··································.113 5.3.2 自适应滤波······························.114 5.3.3 非线性处理(NLP)··············.117 5.3.4 MATLAB代码解读················.118 5.3.5 仿真实验··································.127 5.4 Speex 回声消除算法·····························.128 5.4.1 变步长计算······························.129 5.4.2 双线性滤波器及预处理··········.130 5.4.3 MATLAB代码解读················.132 5.4.4 算法流程示意图······················.141 5.4.5 仿真实验··································.144 本章小结···························································.146 参考文献···························································.146 第6章 声源定位··········································.147 6.1 GCC算法·····················.147 6.2 SRP-PHAT算法··································.149 6.3 MUSIC算法···········································.150 6.4 TOPS 算法·············································.152 6.5 FRIDA算法············································.154 6.6 后处理抗噪·············································.155 6.6.1 统计方法··································.155 6.6.2 卡尔曼方法······························.156 6.6.3 声源定位建模··························.158 6.6.4 粒子滤波法······························.160 本章小结···························································.160 参考文献···························································.161 第7章 波束形成技术··································.162 7.1 麦克风阵列·············································.163 7.1.1 麦克风数量和间距··················.163 7.1.2 空域混叠··································.165 7.1.3 波束形成指标··························.165 7.1.4 噪声场······································.166 7.1.5 声辐射······································.167 7.2 常见波束形成方法··································.168 7.2.1 延迟和波束形成方法··············.168 7.2.2 滤波和波束形成方法··············.169 7.2.3 恒定宽度波束形成方法··········.169 7.2.4 超分辨波束形成方法··············.170 7.2.5 广义旁瓣相消波束形成方法··.171 7.2.6 最小方差信号无畸变响应波束形成方法················.172 7.3 WebRTC 波束形成实例·······················.174 7.3.1 编译测试文件··························.174 7.3.2 测试文件处理流程··················.175 7.3.3 测试命令··································.176 7.3.4 算法的基本思想······················.176 7.3.5 测试源码··································.178 7.3.6 算法处理流程··························.181 7.3.7 权重计算函数··························.185 7.3.8 权重相乘操作··························.186 7.4 后置滤波(Post-filtering) ·················.187 7.4.1 MMSE后置滤波·····················.189 7.4.2 Zelinski 后置滤波····················.190 7.4.3 mccowan后置滤波·················.191 7.4.4 STSA后置滤波·······················.192 本章小结···························································.193 参考文献···························································.194 第8章 盲源分离··········································.196 8.1 基本概念及数学预备知识······················.196 8.1.1 ICA基本概念··························.196 8.1.2 梯度和最优化方法··················.197 8.2 盲语音分离预处理——PCA··················.199 8.3 频域独立成分分析法——FDICA··········.200 8.3.1 频域ICA··································.200 8.3.2 去相关估计方法······················.200 8.3.3 不确定性问题··························.201 8.4 后置滤波处理··········································.205 8.4.1 噪声估计··································.205 8.4.2 衰减因子计算··························.206 8.5 GSC 与ICA联合估计···························.209 8.5.1 峭度··········································.209 8.5.2 经典GSC·································.210 8.5.3 动态权重向量估计··················.210 本章小结···························································.212 参考文献···························································.213 第9章 音效处理··········································.214 9.1 声道的分类·············································.214 9.1.1 单声道······································.214 9.1.2 双声道······································.215 9.1.3 立体声······································.215 9.1.4 多声道······································.215 9.1.5 全景声······································.216 9.2 后端音效处理··········································.217 本章小结···························································.226 参考文献···························································.226 第10章 语音编/解码··································.227 10.1 LPC 编码·············································.230 10.2 SILK编/解码········································.231 10.2.1 编码参数································.232 10.2.2 编码器····································.234 10.2.3 解码器····································.239 10.3 opus 编/解码概览································.239 10.3.1 opus 解码·······························.242 10.3.2 opus 编码·······························.243 10.3.3 opus 语音/音乐检测·············.244 10.4 语音质量评估·······································.247 10.4.1 主观测试································.248 10.4.2 客观测试································.248 10.4.3 无参考质量评估····················.249 本章小结···························································.249 参考文献···························································.249 第11章 语音网络传输·······························.251 11.1 拥塞控制···············································.252 11.1.1 GoogleCC拥塞控制··············.255 11.1.2 基于PCC的拥塞控制··········.260 11.1.3 基于BBR 的拥塞控制··········.264 11.2 NetEQ ·················································.266 11.2.1 NetEQ原理····························.266 11.2.2 抖动和收包····························.268 11.2.3 NetEQ代码框架····················.269 11.2.4 延迟计算································.272 11.2.5 DSP 处理·······························.274 11.2.6 变速不变调····························.275 本章小结···························································.277 参考文献···························································.277 第12章 语音唤醒·······································.278 12.1 语音唤醒技术简介································.278 12.2 特征提取···············································.279 12.2.1 FBank ·····································.279 12.2.2 MFCC·····································.283 12.2.3 PCEN ·····································.284 12.3 模型结构···············································.284 12.3.1 DNN ·······································.284 12.3.2 CNN ·······································.286 12.3.3 CRNN·····································.287 12.3.4 DSCNN ··································.288 12.3.5 子带CNN ······························.289 12.3.6 Attention·································.290 12.4 计算加速···············································.292 12.4.1 硬件资源评估························.292 12.4.2 加速方向································.294 本章小结···························································.299 参考文献···························································.299 第13章 语音识别·······································.301 13.1 语音特征提取·······································.303 13.1.1 MFCC特征····························.304 13.1.2 PLP 特征································.305 13.1.3 归一化····································.306 13.2 声学模型···············································.306 13.2.1 高斯混合模型························.307 13.2.2 参数估计································.307 13.2.3 隐马尔科夫模型····················.308 13.2.4 Baum-Welch法······················.309 13.2.5 HMM识别器·························.309 13.3 语言模型···············································.310 13.3.1 N-gram语言模型··················.311 13.3.2 加权有限状态转换机············.312 13.4 YES 和NO识别实例···························312 13.4.1 数据准备································.312 13.4.2 数据预处理····························.313 13.4.3 词汇和发音词典····················.314 13.4.4 语言学模型····························.315 13.4.5 特征提取································.319 13.4.6 声学模型训练························.320 13.4.7 解码和测试····························.321 13.5 Kaldi 中文语音识别······························321 13.5.1 数据集准备····························.321 13.5.2 声学模型训练························.322 13.5.3 安装portaudio ·······················.322 13.5.4 在线识别································.323 13.6 DeepSpeech 语音识别······················.324 13.6.1 识别建模································.325 13.6.2 网络组成································.325 13.6.3 模型训练和部署····················.326 本章小结···························································.330 参考文献···························································.330 附录A 本书涉及的专业术语··························.331

作者简介

编辑推荐

作者寄语

电子资料

www.luweidong.cn

下一个