
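Unlike the relatively simple applications of the past, today's applications come in many forms, and the world is awash in all kinds of data, technologies, hardware, and gadgets. Data arrives from every direction in different forms, in volumes that can be overwhelming.
Data exists to be analyzed. For an enterprise or similar organization, there are three types of data to analyze. The first is classic structured data, the earliest and longest-lived type, generated by the running of the business. The second is textual data, which may come from emails, call center conversation records, business contracts, medical records, or other text. Text was once a "black box" for computers: it could be stored but not readily analyzed. Today, textual extract, transform, and load (ETL) technology opens the door to processing text for standardized analysis. The third is analog and Internet of Things (IoT) data, produced by machines of all kinds, such as drones, surveillance cameras, thermometers, and electronic watches. Analog and IoT data is much rawer than structured or textual data, and much of it is generated automatically; this kind of data mostly falls within the domain of data scientists.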
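At first, we threw all of this data into a pit called the "data lake". But we soon found that merely dumping data into it accomplished little. For data to be useful, it has to be analyzed, and analyzing data requires:
(1) relating the data to other data;
(2) an analytical infrastructure within the data lake itself that serves end users.
Unless both conditions are met, the data lake easily turns into a "data swamp", and after a while the swamp starts to smell.
In short, a data lake that does not meet the criteria for analysis is a waste of time and money.
The data lakehouse emerged to address exactly these needs and shortcomings. It adds elements to the data lake that make the data useful and productive. Put differently, if you are still building a data lake today without turning it into a data lakehouse, you are building nothing more than an expensive eyesore that will become a heavy liability over time.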
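The first of the new elements in the data lakehouse is the analytical infrastructure for data analysis and machine learning. The analytical infrastructure includes things that are widely familiar as well as some concepts that may still be unfamiliar, such as:
● Metadata;
● Data lineage;
● Measures of data volume;
● A historical record of data creation;
● Descriptions of data transformations.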
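One way to picture the five items above is as the fields of a single catalog entry per dataset. The following is only a minimal sketch in Python under that assumption, not something taken from the book: the `DatasetRecord` class, its field names, the `lineage` helper, and the sample datasets are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry describing one dataset in a lakehouse."""
    name: str
    schema: Dict[str, str]          # metadata: column name -> type
    row_count: int                  # a simple measure of data volume
    created_at: datetime            # creation history
    sources: List[str] = field(default_factory=list)  # lineage: direct upstream datasets
    transformation: str = ""        # description of how the data was derived


def lineage(catalog: Dict[str, DatasetRecord], name: str) -> List[str]:
    """Return every upstream dataset of `name`, nearest first."""
    seen: List[str] = []
    queue = list(catalog[name].sources)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        if current in catalog:
            queue.extend(catalog[current].sources)
    return seen


# Sample data, invented for this example.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
catalog = {
    "raw_orders": DatasetRecord("raw_orders", {"id": "string"}, 1000, now),
    "clean_orders": DatasetRecord(
        "clean_orders", {"id": "string"}, 950, now,
        sources=["raw_orders"], transformation="drop rows with missing id",
    ),
    "daily_sales": DatasetRecord(
        "daily_sales", {"day": "date", "total": "double"}, 30, now,
        sources=["clean_orders"], transformation="sum of order totals per day",
    ),
}

print(lineage(catalog, "daily_sales"))
```

Keeping lineage as a list of direct upstream names keeps each entry small; the full chain is recovered by walking the catalog when it is needed.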
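The second new element of the data lakehouse is recognizing and using the universal common connector. The universal common connector allows data from all the different sources to be combined and compared. Without it, relating the diverse data in the data lakehouse is very difficult (in practice, almost impossible). With it, any type of data can be related.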
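As a rough illustration of relating different kinds of data through a shared identifier, the sketch below joins structured rows with records derived from text. The field names (`customer_id`, `total`, `sentiment`), the `normalize_key` rule, and the sample data are assumptions made for this example only; the book does not prescribe them.

```python
from collections import defaultdict
from typing import Dict, List


def normalize_key(value: str) -> str:
    """Reduce an identifier to uppercase letters and digits so variants match."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


def join_on_connector(structured: List[Dict], text_records: List[Dict], key: str) -> List[Dict]:
    """Pair each structured row with every text record sharing the same key."""
    by_key = defaultdict(list)
    for record in text_records:
        by_key[normalize_key(record[key])].append(record)

    joined = []
    for row in structured:
        for match in by_key.get(normalize_key(row[key]), []):
            # Values from the structured row win when both sides share a field.
            joined.append({**match, **row})
    return joined


# Sample data, invented for this example.
structured = [
    {"customer_id": "C-001", "total": 120.0},
    {"customer_id": "C-002", "total": 75.5},
]
text_records = [
    {"customer_id": "c001", "sentiment": "negative"},
]

joined = join_on_connector(structured, text_records, "customer_id")
print(joined)
```

Normalizing the identifier before matching is a design choice of this sketch: text-derived records rarely spell a key exactly as the structured system does.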
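With the data lakehouse, it becomes possible to reach a level of data analysis and machine learning that is impractical or impossible by any other means. As with any architecture, however, we need to understand the data lakehouse architecture and its capabilities so that we can build an analytics blueprint and plan our analysis on top of it.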
Contents
Introduction
Chapter 1: Evolving to the Data Lakehouse
1. The evolution of technology
2. All the data in the organization
3. Where is the business value?
4. The data lake
5. Challenges of current data architectures
6. The emergence of the data lakehouse
Chapter 2: Data Scientists and End Users
1. The data lake
2. The analytical infrastructure
3. Different audiences
4. Different analysis tools
5. Different analysis purposes
6. Different analysis approaches
7. Different types of data
Chapter 3: Different Types of Data in the Data Lakehouse
1. Types of data
2. Volumes of the different types of data
3. Relating data across the different types of data
4. Segmenting data by probability of access
5. Relating data in the analog and IoT environment
6. The analytical infrastructure
Chapter 4: The Open Lakehouse Environment
1. The evolution of open systems
2. Innovation that keeps pace with the times
3. The unstructured lakehouse built on open, standard file formats
4. Open source data lakehouse software
5. The data lakehouse provides open APIs beyond SQL
6. The data lakehouse supports open data sharing
7. The data lakehouse supports open data exploration
8. The data lakehouse simplifies data discovery through open data
9. The data lakehouse leveraging cloud-native architecture
10. Evolving to the open data lakehouse
Chapter 5: Machine Learning and the Data Lakehouse
1. Machine learning
2. What does machine learning need from a lakehouse?
3. Mining new value from data
4. Solving the problem
5. The problem of unstructured data
6. The importance of open source
7. Taking advantage of cloud elasticity
8. Designing "MLOps" for the data platform
9. Example: using machine learning to classify chest X-rays
10. The evolution of the unstructured component of the data lakehouse
Chapter 6: The Analytical Infrastructure in the Data Lakehouse
1. Metadata
2. The data model
3. Data quality
4. ETL
5. Textual ETL
6. Taxonomies
7. Volume of data
8. Data lineage
9. KPIs
10. Granularity of data
11. Transactions
12. Keys
13. Processing schedule
14. Summarized data
15. Minimum requirements
Chapter 7: Blending Data in the Data Lakehouse
1. The lakehouse and the data lakehouse
2. The origins of data
3. Different types of analysis
4. Common identifiers
5. Structured identifiers
6. Repetitive data
7. Identifiers in the textual environment
8. Blending textual data and structured data
9. The importance of matching
Chapter 8: Types of Analysis Across the Data Lakehouse Architecture
1. Known queries
2. Heuristic analysis
Chapter 9: Data Lakehouse Housekeeping
1. Data integration and interoperability
2. Master data and reference data for the data lakehouse
3. Privacy, confidentiality, and data protection for the data lakehouse
4. Future-proof data in the data lakehouse
5. The five phases of future-proof data
6. Routine maintenance of the data lakehouse
Chapter 10: Visualization
1. Turning data into information
2. What is data visualization, and why is it important?
3. Differences among data visualization, data analysis, and data interpretation
4. Advantages of data visualization
Chapter 11: Data Lineage in the Data Lakehouse Architecture
1. The chain of calculations
2. Selection of data
3. Algorithmic differences
4. Lineage for textual data
5. Lineage for other unstructured environments
6. Data lineage
Chapter 12: Probability of Access in the Data Lakehouse Architecture
1. Efficient arrangement of data
2. Probability of access of data
3. Different types of data in the data lakehouse
4. Relative differences in data volume
5. Advantages of segmenting data
6. Using bulk storage
7. Additional indexes
Chapter 13: Crossing the Chasm
1. Merging data
2. Different kinds of data
3. Different business needs
4. Crossing the chasm
Chapter 14: Massive Volumes of Data in the Data Lakehouse
1. Distribution of the volumes of data
2. High-performance, bulk data storage
3. Additional indexes and summaries
4. Periodic filtering of data
5. Tokenization of data
6. Separating text and databases
7. Archival storage
8. Monitoring activity
9. Parallel processing
Chapter 15: Data Governance and the Data Lakehouse
1. The purpose of data governance
2. Data lifecycle management
3. Data quality management
4. The importance of metadata management
5. Data governance over time
6. Types of data governance
7. Data governance across the data lakehouse
8. Data governance considerations
Chapter 16: The Modern Data Warehouse
1. The proliferation of applications
2. Information silos
3. The complex network environment
4. The data warehouse
5. Definition of the data warehouse
6. Historical data
7. The relational model
8. The native form of data
9. The need for integrated data
10. Times change
11. Today's world
12. Different volumes of data
13. The relationship between data and the business
14. Bringing data into the data warehouse
15. The modern data warehouse
16. When do we no longer need a data warehouse?
17. The data lake
18. The data warehouse as a foundation
19. The data stack
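Databases and data lakes, a new model for next-generation data management: the data lakehouse brings a new transformation to data analysis and provides the foundation for analysis that is more effective, more convenient, more scientific, more reliable, and more flexible.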
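Hu Bo is a China council member of the Data Management Association International (DAMA) and a project leader under the National Key R&D Program. Hu Bo has published more than 20 academic papers and holds 12 granted national invention patents in areas such as cloud platforms and data middle platforms. Hu Bo is also a senior member of the China Computer Federation (中国计算机学会), an executive member of the Services Computing Technical Committee of the China Computer Association (中国计算机协会), executive editor-in-chief of the SCI-indexed journal IJWSR, and a master's supervisor at Huazhong Agricultural University, Shenzhen University, Wuhan University of Science and Technology, and Hainan Normal University.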