大数据 | Taking Smart Notes With Org-mode

IRT

tags: Algorithm,Bigdata,Educational Measurement

Streaming

tags: Bigdata

Why local state is a fundamental primitive in stream processing

tags: Bigdata,Streaming,Stateful Stream Processing source: Kreps, Jay. “Why Local State Is a Fundamental Primitive in Stream Processing - O’Reilly Radar.” Accessed January 5, 2022. http://radar.oreilly.com/2014/07/why-local-state-is-a-fundamental-primitive-in-stream-processing.html. Why local state is much faster than a distribut database. local state can easily restore by some middleware like Kafka: by writing changes to a Kafka topic.

Streaming 102: The world beyond batch

tags: Bigdata,Flink,Dataflow Model,Streaming source: “Streaming 102: The World beyond Batch – O’Reilly.” Accessed January 5, 2022. https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/. Three more concepts: Watermarks: Useful for event time windowing. All input data with event times less than watermark have been observed. Triggers: Signal for a window to produce output. Accumulation: The way to handle multiple results that are observed for the same window. Streaming 101 Redux What: Transformations Where: windowing Make a temporal boundary for a unbounded data source. ...

Dataflow Model

tags: Bigdata,Streaming source: Akidau, Tyler, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, et al. “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, out-of-Order Data Processing.” Proceedings of the VLDB Endowment 8, no. 12 (August 2015): 1792–1803. https://doi.org/10.14778/2824032.2824076.

Streaming 101: The world beyond batch

tags: Bigdata,Flink,Streaming source: Akidau, Tyler. “Streaming 101: The World beyond Batch.” O’Reilly Media, August 5, 2015. https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/. Streaming: a type of data processing engine that is designed with infinite data sets in mind. Other common uses of “streaming” that will be avoid in the rest of the post: Unbounded data: A type of ever-growing, essentially infinite data set. Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data. Low-latency, approximate, and/or speculative results: These types of results are most often associated with streaming engines. Limitations of streaming To beat batch at its own game, you really only need two things: ...

Beam

tags: Bigdata source: https://beam.apache.org/

Spark

tags: Bigdata Spark 编程语言选择毋庸置疑，Python 应该是最简单也是大部分的选择，但是如果有依赖那么将要付出额外的心智负担（Spark 管理 Python 依赖）。 JVM 语言的依赖组织方式则具有天然的优势，可以将依赖（排除 Spark 生态之后）都 bundle 进 Jar 包里。其中 Scala 兼具简单和 JVM 的优势，但是它「不流行」。 Spark Driver & Executor Driver 执行 spark-commit 客户端，创建 SparkContext 执行 main 函数。 Executor Spark Worker 上的线程 See also: Understanding the working of Spark Driver and Executor Cluster Mode Overview Spark 代码执行我在配置 Spark 的时候就在好奇，从观察上看部分代码应该是执行在 Driver 上部分代码会执行在 Executer，这让我很好奇。但是我通过学习 Spark RDD 学习到了一些知识。以下代码是在 Executor 上执行的： Transformations 和 Actions 是执行在 Spark 集群的。传递给 Transformations 和 Actions 的闭包函数也是执行在 Spark 集群上的。其他额外的代码都是执行在 Driver 上的，所以想要在 Driver 打印日志需要上使用 collect： ...

Hive

tags: Bigdata bucketed map join

Hadoop

tags: Bigdata Hadoop Distributed File System MapReduce MapReduce shuffle 按照 reducer 分区，排序和将数据分区从 mapper 复制到 reducer。（令人困惑的术语，并不完全与洗牌一样，在 MapReduce 中其实没有随机性）。 MapReduce 的分布式执行 Hadoop MapReduce 并行化基于数据分区实现：输入：通常是 HDFS 中的一个目录。分区：每个文件或文件块都被视为一个单独的分区。处理：每个分区由单独的 map 任务来处理。每个 mapper 都会尽量实现计算靠近数据。代码复制：JAR 文件。 Reduce 任务的计算也被分隔成块，可以不必与 mapper 任务数量相同，MapReduce 框架使用关键字的哈希值来确保具有相同关键字的键值对都在相同的 reduce 任务中处理。键值对必须进行排序，排序是分阶段进行的：每个 map 任务都基于关键字哈希值，按照 reducer 对输出进行分块。每个分区都被写入 mapper 程序所在的本地磁盘上的已排序文件，参见 SSTables 和 LSM-Tree。 reducer 与每个 mapper 相连接：MapReduce 调度器会在 mapper 写入经过排序的输出文件后，通知 reducer 开始从 mapper 中获取输出文件，框架进行 MapReduce shuffle。 reducer 任务从 mapper 中获取文件并将它们合并在一起，同时保持数据的排序。不同 mapper 使用相同的关键字生成记录，会在合并后的 reducer 输入中位于相邻的位置。 reducer 可以使用任意逻辑来处理这些记录，并且生成任意数量的输出记录。记录被写入分布式文件系统中的文件。 MapReduce 工作流调度器 Oozie Azkaban Luigi Airflow Pinball 对比分布式数据库 MapReduce 中的并行处理和并行 join 算法已经在十多年前所谓的大规模并行处理（MPP）数据库中实现了。 ...

Hadoop Distributed File System

tags: Bigdata,分布式文件系统与网络连接存储（NAS）和存储区域网络（SAN）架构相比，HDFS 基于无共享原则，无需定制硬件和特殊网络基础设施（光纤）。 HDFS 创建了一个庞大的文件系统，来充分利用每个守护进程机器上的磁盘资源。 HDFS 包含一个在每台机器上运行的守护进程，并会开放一个网络服务以允许其他节点访问存储在该机器上的文件。名为 NameNode 的中央服务器会跟踪哪个文件块存储在哪个服务器上。考虑容错，文件快块复制到多台机器上，或者像 Reed-Solomon 代码中这样的纠删码方案（类似 RAID，但无需特殊硬件）。提供很好的扩展性，配合商业硬件和开源软件，可以运行在上万台机器，容量达几百 PB。计算靠近数据只要有足够的空闲内存和 CPU 资源，MapReduce 调度器会尝试在输入文件的副本的某台机器上运行 mapper 任务。

《数据密集型应用系统设计》读书笔记

tags: 读书笔记,Bigdata,分布式,数据库数据系统基础可靠、可扩展与可维护的应用系统数据模型与查询语言数据存储与检索数据编码与演化分布式数据系统目的：扩展性、容错和高可用、延迟考虑（多机房）扩展：垂直扩展：提升单机性能水平扩展：无共享结构，由软件实现核心逻辑复制与分区：复制：多节点冗余分区：数据库拆分分片：分区分配给不同的节点数据复制数据分区事务分布式系统挑战一致性与共识派生数据记录系统：真实数据系统，拥有数据的权威版本。派生数据系统：从另一个数据系统获取，丢失可以根据数据源重建，如缓存等。批处理系统流处理系统

Flink

tags: Bigdata,Dataflow Model,Streaming

Kafka

tags: Bigdata 相关知识点概念组成 Producer 消息产生者，往指定 Topic 的指定 Partition 发送消息 Consumer Group 消费指定 Topic 的消息 Consumer 消费指定 Topic 下某一分区的消息 Topic 区分不同消息主题 Partition 保证同一分区的有序性 Connector 消息可被不同的 Consumer Group 重复消费（广播或订阅）。同一 Consumer Group 下的不同 Consumer 分别消费不同的 Partition，Consumer 数量不能超过 Partition 数量。数据被持久化并分片成功后发送 ACK 保证里数据不被丢失。设计持久化基于文件系统基于队列是顺序的和磁盘的顺序访问要比内存的随机访问要快（参见 The Pathologies of Big Data）， Kafka 采用在磁盘文件系统上尾部写头部读的方式。 Kafka 没有采用 BTree 存储数据因为 BTree 的操作是 O(log N) ，而且对磁盘的 seek 操作要慢，且同时只能进行一次限制了并行，所以实际操作比 O(log N) 要慢基于磁盘的顺序访问进行在尾部写和头部读，可以实现读写都是 O(1) 的时间复杂度，并且读写互不干扰基于以上实现，Kafka 可以不必在消息一经消费就删除，而是可以保留消息一段相对较长的时间（比如一周）高效 ...

Links to this note