When data is too big for a single computer, we need a cluster. Apache Spark is the industry standard for distributed data processing.
1The Distributed Brain
Spark uses a Master/Slave architecture. The Driver Program (the Master) coordinates the work, while Executors (the Workers) perform the actual computations on the data. By splitting a 1TB file into 1,000 pieces across 100 executors, Spark can process data in seconds that would take a single PC days to complete.
Cluster: [MASTER_NODE] <-> [WORKER_1, WORKER_2]
Data: [IN_MEMORY_RDD]
Status: SPARK_SESSION_ACTIVE
Strategy: DISTRIBUTED_COMPUTE2RDDs and DataFrames
The original building block of Spark was the RDD (Resilient Distributed Dataset), a low-level immutable collection of objects. Modern Spark uses DataFrames, which are like SQL tables. DataFrames are optimized by the Catalyst Optimizer, which automatically rewrites your code to run as efficiently as possible across the cluster.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('AI_App').getOrCreate()
df = spark.read.csv('huge_dataset.csv')
df.show()
# Output: A table with 1 Billion Rows