Designing Data-Intensive Applications (2nd Edition) - Martin Kleppmann & Chris Riccomini


Chapter 1 - Trade-Offs in Data Systems Architecture

Data-intensive vs compute-intensive

Frontends and backends

Operational Systems (backend engineers) vs Analytical Systems (business analysts and data scientists)

Data engineers and Analytics engineers

A transaction, a point query, Online Transaction Processing (OLTP)

Online Analytical Processing (OLAP)

Operational System (OLTP) vs Analytical Systems (OLAP)

Fixed sql queries in oltp vs arbitrary sql queries in olap

Tableau, Looker, Microsoft Power BI

Product Analytics or Real-time analytics - Pinot, Druid, ClickHouse

A data warehouse

Undesirable for business analysts and data scientists to query otlp systems, for several reasons,
Data silos

Extract-transform-load (ETL),
Sometimes ELT, transformation is done in the data warehouse, after loading

ETL for SaaS APIs,
Data connector services such as Fivetran, Singer, Airbyte

Hybrid transactional/analytical processing (HTAP)

Hundreds of seperate operational databases, but usually one data warehouse for an enterprise

A data warehouse often uses relational data with SQL

Training an ML model - features, feature engineering

Python data analysis libraries - pandas, scikit-learn
Statistical analysis languages - R
Distributed analytics frameworks - Spark

A data lake

File formats - avro, parquet

Commoditized file storage - object stores

ETL processes - Data pipelines

Sushi principle - raw data is better

DataOps Manifesto

General Data Protection Regulation (GDPR)
California Consumer Privacy Act (CCPA)

Files and relational tables - stream of events, stream processing

Output of analytical systems made available to operational systems - Reverse ETL

Machine learning models can be deployed to operational systems - TFX, Kubeflow, MLFlow

Systems of record (source of truth),
Derived data systems

Pros and cons of cloud services

Cloud native system architecture - an architecture that is designed to take advantage of cloud services

Examples of self-hosted and cloud native database systems (OLTP and OLAP)

Remote direct memory access (RDMA) network interfaces

The key idea of cloud native services - to build upon lower-level cloud services to create higher-level services

Object storage services - Amazon S3, Azure Blob Service, Cloudflare R2
Cloud-based analytical database (data warehouse) - Snowflake

Redundant array of independent disks (RAID)

Virtual disk storage - Amazon EBS, Azure managed disks, persistent disks in Google Cloud

Storage (disk) and computation (CPU and RAM) have become somewhat seperated in cloud native services

Cloud native systems are often multitenant

database administrators (DBAs) or system administrators (sysadmins)

DevOps philosophy, Site reliability engineers (SREs)

Cloud storage replaces fixed-size disks with metered billing

Distributed vs single-node systems

A system that involves several machines communicating via a network is called a distributed system. Each of the processes participating in a distributed system is called a node.

Various reasons for using distributed system,

  • inherent distribution
  • requests between cloud services
  • fault tolerance/high availability
  • scalability
  • latency
  • elasticity
  • specialized hardware
  • legal compliance
  • sustainability

Problems with distributed systems

Tracing tools like OpenTelemetry, Zipkin, Jaeger

Single-node databases - DuckDB, SQLite, KùzuDB

Service-oriented architecture (SOA)

Microservices architecture

Pros and cons of many independent services

Orchestration frameworks such as kubernetes

API description standards such as OpenAPI and gRPC

Microservices are primarily a technical solution to a people problem

Serverless or function as a service (FaaS)

Serverless approach is bringing metered billing to code execution

BigQuery, Kafka

Cloud computing is not the only way of building large-scale computing systems, an alternative is high-performance computing (HPC), also known as supercomputing

Differences between HPC and cloud/enterprise datacenter

Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth

Supercomputers often use specialized network topologies, such as multidimensional meshes and toruses


Chapter 2 - Defining Nonfunctional Requirements


Chapter 3 - Data Models and Query Languages


Chapter 4 - Storage and Retrieval


Chapter 5 - Encoding and Evolution


Chapter 6 - Replication


Chapter 7 - Sharding


Chapter 8 - Transactions


Chapter 9 - The Trouble with Distributed Systems


Chapter 10 - Consistency and Consensus


Chapter 11 - Batch Processing


Chapter 12 - Stream Processing


Chapter 13 - A Philosophy of Streaming Systems


Chapter 14 - Doing the Right Thing