Nov. 3–6, 2013 | Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica
This paper introduces D-Streams, a new processing model for fault-tolerant streaming computation at scale. D-Streams overcome the limitations of existing distributed stream processing models, which often require expensive fault recovery mechanisms like hot replication or suffer long recovery times, and which do not handle stragglers. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and that tolerates stragglers. They support a rich set of operators while attaining per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. The authors implement D-Streams in a system called Spark Streaming.
D-Streams structure streaming computations as a series of stateless, deterministic batch computations on small time intervals. This allows for efficient recovery and parallel processing. The system uses Resilient Distributed Datasets (RDDs) to keep data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. This enables sub-second end-to-end latencies.
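The discretization idea can be illustrated with a minimal Python sketch. This is a toy model, not Spark's API: `discretize` and `count_batch` are hypothetical names, and a word-count stands in for an arbitrary deterministic batch computation. The point is that each interval's result is a pure function of its input partition, so a lost result can be recomputed from retained input (its lineage) rather than restored from a replica.

```python
from collections import Counter

def discretize(records, interval_ms):
    """Group (timestamp_ms, word) records into per-interval input batches."""
    batches = {}
    for ts, word in records:
        batches.setdefault(ts // interval_ms, []).append(word)
    return batches

def count_batch(words):
    """A deterministic batch computation over one interval's input."""
    return Counter(words)

records = [(50, "a"), (120, "b"), (130, "a"), (260, "b")]
batches = discretize(records, 100)                      # intervals 0, 1, 2
counts = {i: count_batch(w) for i, w in batches.items()}

# "Recovery": if interval 1's result is lost, re-run the same deterministic
# computation on the surviving input partition; no replica of the result
# is needed, and independent partitions could be recomputed in parallel.
recovered = count_batch(batches[1])
assert recovered == counts[1]
```

Because the per-interval computations are deterministic and independent, recovery work for a failed node can be spread across the cluster, which is the basis of the parallel recovery mechanism described above.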
The system can process over 60 million records/second on 100 nodes at sub-second latency, and can recover from faults and stragglers in sub-second time. Spark Streaming's per-node throughput is comparable to commercial streaming databases while scaling linearly to 100 nodes, and it is 2–5× faster than the open-source Storm and S4 systems while offering fault recovery guarantees that they lack. The authors illustrate Spark Streaming's expressiveness through ports of two real applications: a video distribution monitoring system and an online machine learning system.
D-Streams use the same processing model and data structures (RDDs) as batch jobs, allowing streaming queries to be seamlessly combined with batch and interactive computation. This feature is leveraged in Spark Streaming to let users run ad-hoc queries on streams using Spark, or join streams with historical data computed as an RDD. This is a powerful feature in practice, giving users a single API to combine previously disparate computations.
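A toy Python sketch of this composition, with hypothetical names (`join_with_history` is not a Spark operator): because each micro-batch and each historical dataset take the same keyed-collection form, one generic join works on both, which is the essence of sharing RDDs across the streaming and batch models.

```python
# Stands in for an RDD computed earlier by a batch job (e.g. user -> country).
historical = {"u1": "US", "u2": "DE"}

# One micro-batch of the live stream: (user, event) pairs.
stream_batch = [("u1", "click"), ("u2", "view"), ("u3", "click")]

def join_with_history(batch, table):
    """Inner-join a micro-batch of (key, event) pairs with a lookup table."""
    return [(k, ev, table[k]) for k, ev in batch if k in table]

joined = join_with_history(stream_batch, historical)
# -> [("u1", "click", "US"), ("u2", "view", "DE")]
```

In Spark Streaming the same effect is achieved directly, since the RDDs underlying a D-Stream can be joined with any other RDD using the ordinary batch operators.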
The system architecture of Spark Streaming includes a master that tracks the D-Stream lineage graph and schedules tasks to compute new RDD partitions; worker nodes that receive data, store the partitions of input and computed RDDs, and execute tasks; and a client library used to send data into the system. Spark Streaming reuses many components of Spark, but it also modifies existing components and adds new ones to enable streaming.
The system has been evaluated using several benchmark applications and by porting two real applications to it: a commercial video distribution monitoring system and a machine learning algorithm for estimating traffic conditions from automobile GPS data. The results show that Spark Streaming scales nearly linearly and can process up to 6 GB/s (64M records/s) at sub-second latency on 100 nodes.