
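This book explains the basic principles and methods of machine learning and, through a large number of cases from the medical field, shows how to process and analyze healthcare data in ways that can substantially assist medical staff in clinical decision-making. Readers will learn data preparation before modeling, feature engineering for selecting and constructing the indicators used by machine learning algorithms, and the different categories of machine learning algorithms. They will also learn how to process multi-source heterogeneous data such as clinical diagnosis and treatment data, electronic medical record data, and imaging data, and how to read, preprocess, and visualize medical images, text, and other data. The book also introduces the open-source, code-free TipDM data mining and modeling platform, on which the whole data analysis workflow can be carried out through drag-and-drop graphical operations. The book can serve as a core course textbook for the Data Science and Big Data Technology major at medical colleges and universities, and as a textbook for core or elective courses in medical engineering programs. Beyond that, it can be used as a textbook for advanced restricted-elective or extension courses in medical majors such as clinical medicine, stomatology, medical technology, laboratory medicine, medical imaging, and public health.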
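Preface

With the arrival of the big data era, the mobile Internet and smartphones have spread rapidly, and many forms of mobile Internet applications are flourishing. E-commerce, cloud computing, Internet finance, the Internet of Things, virtual reality, robotics, and related technologies continue to penetrate and reshape traditional industries, and big data has deservedly become the core of a new industrial revolution.

The Beijing Consensus on Artificial Intelligence and Education, officially released by UNESCO in the six official languages of the United Nations, proposes that countries formulate policies to promote the systematic integration of artificial intelligence with education, teaching, and learning, and use artificial intelligence to accelerate the building of open and flexible education systems, so that everyone has access to equitable, high-quality lifelong learning opportunities suited to them. This indicates that artificial intelligence based on big data and education have entered a new stage. For data science, it is a "great change unseen in a century."

Higher education is an important part of the education system, and colleges and universities, as important vehicles for talent cultivation, carry the mission of training talent for society. However, majors related to big data and artificial intelligence were first approved only in 2016, and program development, faculty, and classroom teaching all face severe tests. How to cultivate students' practical ability to serve social and economic development has become an urgent problem. On June 21, 2018, Minister of Education Chen Baosheng first put forward the concept of "golden courses" at the National Conference on Undergraduate Education in Higher Education Institutions in the New Era. "Golden majors," "golden courses," and "golden teachers" quickly became buzzwords of the new era of Chinese higher education. How majors related to big data and artificial intelligence can build golden majors, golden courses, golden teachers, and golden textbooks with Chinese characteristics and world-class standards is a difficult and much-discussed issue in current education and teaching reform.

Practical teaching, meanwhile, is a teaching activity that, under the guidance of theory, guides learners' practical activities so as to pass on practical knowledge, form skills, develop practical ability, and improve overall competence. At present, the design of university teaching systems is subject to many constraints: it leans too heavily toward theory, curricula fit poorly with actual enterprise applications, and students cannot turn theory into applied skills. Course content appears abundant, yet courses operate in isolation, and curricula suffer from redundancy, omissions, and an incomplete system. For this reason, the "Teddy Cup" Organizing Committee and Publishing House of Electronics Industry (电子工业出版社) jointly planned the "Big Data Major Book Series." The series is written jointly by universities and enterprises, in the hope of effectively addressing the shortage of textbooks for big-data-related majors. This is fully consistent with the principles set out in the Implementation Opinions on the Construction of First-Class Undergraduate Courses (Jiao Gao [2019] No. 8), issued by the Ministry of Education on October 24, 2019: "adhere to classified construction, adhere to supporting the strong and the distinctive, raise the level of advanced content, highlight innovation, and increase the degree of challenge." The first major feature of the series is its emphasis on cultivating students' practical ability: based on the pain points of practical teaching in universities, it proposes for the first time the concept of the "fishbone teaching method." Oriented to real enterprise needs, students' skill learning centers closely on actual enterprise applications, and the theoretical knowledge they need is connected through enterprise cases, so that knowledge and action are unified and application drives learning.

A big data major should take the application of big data technology as its core and teach closely around the closed loop of big data applications, so that students understand, at a macro level, the specific application scenarios and methods of big data technology in industry. Existing university big data courses concentrate on how to process data, build and analyze models, and tune parameters to make model results more accurate, while the complete big data application is the part that is often overlooked. The second major feature of the series is that it follows the entire big data application workflow, from data collection, data migration, and data storage through data analysis and mining to data visualization. It covers the complete workflow and every link in enterprise big data applications, matching real enterprise scenarios.

Against the background of the Ministry of Education's full implementation of the "Six Excellences and One Top-Notch" Plan 2.0, how should we respond to the comprehensive reform of the systems and mechanisms for talent cultivation in China's higher education, and how should we reposition and comprehensively improve its quality? We hope this series can serve as a modest starting point that draws out better work from others. We hope it helps accelerate the "Double Ten-Thousand Plan" for first-class undergraduate courses, represented by new engineering, new medicine, new agriculture, and new liberal arts, and helps implement the call to "keep students busy, make management strict, and make teaching lively." We also hope it brings a qualitative improvement in talent cultivation at the undergraduate and higher-vocational levels across China's big data and artificial intelligence majors, courses, classrooms, and MOOCs, and that, guided by data science, it supports work across the liberal arts, science, agriculture, engineering, and medicine to cultivate outstanding talent for every industry and the leaders of the future.

Since the "Teddy Cup" was founded in 2013, its competition problems have come from real problems, appropriately simplified, supplied by enterprises, administrative agencies, and research institutes, and they stay close to current real-world needs. The data undergo only the necessary anonymization and are otherwise kept in their original state. The competition follows the whole big data mining workflow, from data collection, data migration, and data mining through topic-specific applications to data visualization. It covers the complete data mining workflow and every link in enterprise applications, and is highly consistent with current training goals for big data professionals. It has therefore met with an enthusiastic response from universities across the country and received strong support and assistance from experts and scholars nationwide. Its format, which does not rely on mathematical modeling or even on traditional models, has won high recognition from industry, and it has become an important academic competition for Chinese undergraduates and even graduate students. In 2018, the "Teddy Cup" added a data analysis skills sub-competition, providing theoretical, technical, and resource support for training skilled talent at the higher-vocational and secondary-vocational levels. After years of development, the "Teddy Cup" has become the main exchange platform for big data technology among university students nationwide. As of 2019, nearly 800 colleges and universities nationwide, with about 10,000 graduate students, 50,000 undergraduates, and 20,000 higher-vocational students, had taken part in "Teddy Cup" competitions.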
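Continually exploring curriculum systems, teaching reform, and ideological and political education in courses for data science majors, and actively bringing into teaching the achievements of, and the major issues still to be solved in, building socialism with Chinese characteristics in the new era, are precisely the teaching topics that majors related to big data and artificial intelligence need to study. This series is an attempt to think through and put into practice the fundamental task of "fostering virtue through education" by making it concrete, operational, and goal-oriented in big data majors, technologies, and courses, unfolding step by step. We also hope readers will promptly send us their comments and suggestions from using the books, so as to form a new model of textbook development for the big data era that improves in a "write, use, feedback" spiral.

Foreword

At present, both applications such as mobile assistants and physical products such as robot vacuum cleaners are making people's work and lives more convenient in ever more intelligent ways. The foundation of all this is massive data, and what makes applications and products intelligent is artificial intelligence technology. Massive data and artificial intelligence complement each other: without massive data, artificial intelligence cannot develop, and without artificial intelligence, massive data cannot deliver its value. Although artificial intelligence has made remarkable achievements, it has not yet truly penetrated individual specialized fields, and the market lacks professionals who are familiar with both artificial intelligence and domain knowledge. In healthcare, practitioners have strong domain expertise but lack the understanding of and ability to apply artificial intelligence, so they cannot realize the value of existing data, while artificial intelligence practitioners often lack healthcare domain knowledge. The main purpose of this book is to break down the barrier between artificial intelligence technology and the healthcare field and to promote their integration.

Features of This Book

The content of this book is arranged to progress from the simple to the more advanced. It explains the basic principles and methods of machine learning and, through a large number of cases from the medical field, shows how to process and analyze healthcare data in ways that can substantially assist medical staff in clinical decision-making. Readers will learn data preparation before modeling, feature engineering for selecting and constructing the indicators used by machine learning algorithms, and the different categories of machine learning algorithms. They will also learn how to process multi-source heterogeneous data such as clinical diagnosis and treatment data, electronic medical record data, and imaging data, and how to read, preprocess, and visualize medical images, text, and other data. The book also introduces the open-source, code-free TipDM data mining and modeling platform, on which the whole data analysis workflow can be carried out through drag-and-drop graphical operations. We hope this book will improve medical students' data processing ability, their capacity for innovation and entrepreneurship in the medical field, and their ability to solve practical medical problems with artificial intelligence. The book can serve as a core course textbook for the Data Science and Big Data Technology major at medical colleges and universities, and as a textbook for core or elective courses in medical engineering programs. Beyond that, it can be used as a textbook for advanced restricted-elective or extension courses in medical majors such as clinical medicine, stomatology, medical technology, laboratory medicine, medical imaging, and public health. At present, the course accompanying this book is a high-quality online course and a university-level key course at Shanghai University of Medicine and Health Sciences (上海健康医学院), and is also a funded project under the Shanghai universities' college computer course teaching reform program.

Intended Readers

(1) University students taking machine learning courses
Many universities in China have introduced machine learning into teaching and offer related courses in majors connected to the Internet, finance, healthcare, and other industries. However, such courses currently separate Python fundamentals from machine learning, so the knowledge is not systematic enough and the course load increases. This book integrates Python fundamentals with common machine learning programming in a concise way, helping readers with no background learn machine learning programming faster.

(2) Developers of machine learning applications
The main job of machine learning application developers is to apply machine learning algorithms to real business systems. This book provides detailed usage and descriptions of machine learning interfaces, which can help developers quickly and effectively set up the algorithmic framework of a data analysis application and complete development rapidly.

(3) Researchers working on machine learning applications
Researchers have a strong theoretical foundation but need a great deal of time to implement machine learning algorithms. This book offers a fast path to implementation so that theory can be validated in a short time. It can also provide machine-learning-related functional support for research systems.

Code Download and Feedback

To help readers make better use of this book, Teddy Cloud Classroom (泰迪云课堂) provides accompanying teaching videos. The original data files and Python code that accompany the book can be downloaded free of charge from the website of the "Teddy Cup" Data Mining Challenge. For instructors, the book also provides PPT slides and other teaching resources.

Chapter 1 was written by Liu Qiaohong (刘巧红), Chapter 2 by Zhang Liangjun (张良均), Chapter 3 by Li Ping (李萍), Chapter 4 by Chen Dong (陈栋), Chapter 5 by Zhang Min (张敏), Chapter 6 by Ren He (任和) and Li Jianhua (李建华), Chapter 7 by Ling Chen (凌晨), and Chapters 8 to 11 by Sun Liping (孙丽萍).

We have done our best to avoid errors in the text and code, but given our limited expertise and the tight writing schedule, some omissions and shortcomings are inevitable. If you have further comments, please reply "图书反馈" (book feedback) to the WeChat official account 泰迪学社. More information about this series can be found on the website of the "Teddy Cup" Data Mining Challenge.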
Chapter 1 Machine Learning
1.1 Introduction to Machine Learning
1.1.1 The Concept of Machine Learning
1.1.2 Application Areas of Machine Learning
1.2 General Machine Learning Workflow
1.2.1 Goal Analysis
1.2.2 Data Preparation
1.2.3 Feature Engineering
1.2.4 Model Training and Tuning
1.2.5 Performance Measurement and Model Application
1.3 Introduction to Python Machine Learning Libraries
1.3.1 Libraries for Data Preparation
1.3.2 Libraries for Data Visualization
1.3.3 Libraries for Model Training and Evaluation
Summary
Exercises

Chapter 2 Data Preparation
2.1 Data Quality Checks
2.1.1 Consistency Checks
2.1.2 Missing-Value Checks
2.1.3 Outlier Checks
2.2 Exploring Data Distributions and Trends
2.2.1 Distribution Analysis
2.2.2 Comparative Analysis
2.2.3 Descriptive Statistical Analysis
2.2.4 Periodicity Analysis
2.2.5 Contribution Analysis
2.2.6 Correlation Analysis
2.3 Data Cleaning
2.3.1 Handling Missing Values
2.3.2 Handling Outliers
2.4 Data Merging
2.4.1 Data Stacking
2.4.2 Primary-Key Merging
Summary
Exercises

Chapter 3 Feature Engineering
3.1 Feature Transformation
3.1.1 Standardization
3.1.2 One-Hot Encoding
3.1.3 Discretization
3.2 Feature Selection
3.2.1 Subset Search and Evaluation
3.2.2 Filter Selection
3.2.3 Wrapper Selection
3.2.4 Embedded Selection and L1-Norm Regularization
3.2.5 Sparse Representation and Dictionary Learning
Summary
Exercises

Chapter 4 Supervised Learning
4.1 Introduction to Supervised Learning
4.2 Performance Measures
4.2.1 Performance Measures for Classification Tasks
4.2.2 Performance Measures for Regression Tasks
4.3 Linear Models
4.3.1 Introduction to Linear Models
4.3.2 Linear Regression
4.3.3 Logistic Regression
4.4 k-Nearest Neighbors Classification
4.5 Decision Trees
4.5.1 Introduction to Decision Trees
4.5.2 The ID3 Algorithm
4.5.3 The C4.5 Algorithm
4.5.4 The CART Algorithm
4.6 Support Vector Machines
4.6.1 Introduction to Support Vector Machines
4.6.2 Linear Support Vector Machines
4.6.3 Nonlinear Support Vector Machines
4.7 Naive Bayes
4.8 Neural Networks
4.8.1 Introduction to Neural Networks
4.8.2 BP Neural Networks
4.9 Ensemble Learning
4.9.1 Bagging
4.9.2 Boosting
4.9.3 Stacking
Summary
Exercises

Chapter 5 Unsupervised Learning
5.1 Introduction to Unsupervised Learning
5.2 Dimensionality Reduction
5.2.1 PCA
5.2.2 Kernelized Linear Dimensionality Reduction
5.3 Clustering Tasks
5.3.1 Clustering Performance Metrics
5.3.2 Distance Computation
5.3.3 Prototype-Based Clustering
5.3.4 Density-Based Clustering
5.3.5 Hierarchical Clustering
Summary
Exercises

Chapter 6 Intelligent Recommendation
6.1 Introduction to Intelligent Recommendation
6.1.1 Recommender Systems
6.1.2 Applications of Intelligent Recommendation
6.2 Performance Measures for Recommender Systems
6.2.1 Offline Experiment Metrics
6.2.2 User Survey Metrics
6.2.3 Online Experiment Metrics
6.3 Recommendation Based on Association Rules
6.3.1 Association Rules and Frequent Itemsets
6.3.2 The Apriori Algorithm
6.3.3 The FP-Growth Algorithm
6.4 Recommendation Based on Collaborative Filtering
6.4.1 User-Based Collaborative Filtering
6.4.2 Item-Based Collaborative Filtering
Summary
Exercises

Chapter 7 Fraud Detection in Medical Insurance
7.1 Goal Analysis
7.1.1 Background
7.1.2 Data Description
7.1.3 Analysis Objectives
7.2 Data Preparation
7.2.1 Descriptive Statistical Analysis
7.2.2 Data Cleaning
7.2.3 Analyzing Policyholder and Medical Institution Information
7.3 Feature Engineering
7.3.1 Feature Selection
7.3.2 Feature Transformation
7.4 Model Training
7.5 Performance Measurement
7.5.1 Result Analysis
7.5.2 Clustering Performance Measures
Summary

Chapter 8 Association Rule Analysis of Traditional Chinese Medicine Syndrome Types
8.1 Goal Analysis
8.1.1 Background
8.1.2 Data Description
8.1.3 Analysis Objectives
8.2 Data Preparation
8.2.1 Data Acquisition
8.2.2 Data Cleaning
8.3 Feature Engineering
8.3.1 Feature Selection
8.3.2 Feature Transformation
8.4 Model Training
8.5 Performance Measurement
8.5.1 Result Analysis
8.5.2 Model Application
Summary

Chapter 9 Predicting the Genetic Risk of Diabetes
9.1 Goal Analysis
9.1.1 Background
9.1.2 Data Description
9.1.3 Analysis Objectives
9.2 Data Preparation
9.2.1 Data Exploration
9.2.2 Data Cleaning
9.3 Feature Engineering
9.4 Model Building
9.4.1 Cross-Validation
9.4.2 Model Training
9.5 Performance Measurement
9.5.1 Result Analysis
9.5.2 Model Evaluation
Summary

Chapter 10 Skin Cancer Detection Based on Deep Residual Neural Networks
10.1 Goal Analysis
10.1.1 Background
10.1.2 Image Data Description
10.1.3 Analysis Method and Process
10.2 Image Data Preprocessing
10.2.1 Image Preprocessing
10.2.2 Viewing the Processed Images
10.3 Model Building
10.3.1 Convolutional Neural Networks (CNN)
10.3.2 Residual Networks (Residual Network)
10.3.3 ImageDataGenerator Parameters
10.3.4 Training the Deep Residual Neural Network Model
10.4 Performance Measurement
10.4.1 Performance Analysis
10.4.2 Result Analysis
Summary

Chapter 11 Medical Insurance Fraud Detection on the TipDM Data Mining and Modeling Platform
11.1 The TipDM Data Mining and Modeling Platform
11.1.1 Home Page
11.1.2 Data Sources
11.1.3 Projects
11.1.4 System Components
11.1.5 Local Deployment of the TipDM Data Mining and Modeling Platform
11.2 Quickly Building the Medical Insurance Fraud Detection Project
11.2.1 Acquiring Data
11.2.2 Data Preparation
11.2.3 Feature Engineering
11.2.4 Model Training
Summary

References
http://www.hxedu.com.cn/hxedu/fg/book/bookinfo.html?code=G0400390