Designing Data-Intensive Applications (2nd Edition) - Martin Kleppmann & Chris Riccomini
Chapter 1 - Trade-Offs in Data Systems Architecture
Data-intensive vs compute-intensive
Frontends and backends
Operational Systems (backend engineers) vs Analytical Systems (business analysts and data scientists)
Data engineers and Analytics engineers
A transaction, a point query, Online Transaction Processing (OLTP)
Online Analytical Processing (OLAP)
Operational Systems (OLTP) vs Analytical Systems (OLAP)
Fixed SQL queries in OLTP vs arbitrary SQL queries in OLAP
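A minimal sketch of the contrast above, using Python's built-in sqlite3 module as a stand-in for both kinds of system. The `orders` table, its columns, and the function names are hypothetical and chosen only for illustration. The OLTP-style function runs a fixed, parameterized point query by key; the OLAP-style function scans and aggregates many rows.

```python
import sqlite3

# In-memory database standing in for a real system (illustrative assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, customer, region, amount) VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "EU", 30.0),
        (2, "bob", "US", 20.0),
        (3, "alice", "EU", 50.0),
        (4, "carol", "US", 10.0),
    ],
)


def get_order(order_id):
    """OLTP-style point query: fixed SQL, looks up one row by its key."""
    return conn.execute(
        "SELECT id, customer, region, amount FROM orders WHERE id = ?",
        (order_id,),
    ).fetchone()


def revenue_by_region():
    """OLAP-style query: reads many rows and returns an aggregate."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
```

In an application the point query is baked into the code and only its parameter changes, whereas an analyst might write the aggregate ad hoc and change its grouping or filters from one query to the next.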
Tableau, Looker, Microsoft Power BI
Product Analytics or Real-time analytics - Pinot, Druid, ClickHouse
A data warehouse
Undesirable for business analysts and data scientists to query OLTP systems directly, for several reasons
Data silos
Extract-transform-load (ETL)
Sometimes ELT (extract-load-transform): the transformation is done in the data warehouse, after loading
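A rough sketch of the difference in ordering, assuming an in-memory SQLite database as a stand-in for the data warehouse. The sample rows, table names, and helper functions are all hypothetical. In the ETL variant the cleaning happens in Python before loading; in the ELT variant the raw text is loaded first and cleaned with SQL inside the "warehouse".

```python
import sqlite3

# Hypothetical extracted rows: (customer, amount as messy text).
raw_rows = [("alice", " 30.5 "), ("bob", "20"), ("carol", "n/a")]


def parse_amount(text):
    """Return the amount as a float, or None if it is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def etl(rows):
    """Extract -> transform in application code -> load clean rows."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    parsed = [(customer, parse_amount(amount)) for customer, amount in rows]
    cleaned = [(c, a) for c, a in parsed if a is not None]
    warehouse.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return warehouse.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()


def elt(rows):
    """Extract -> load raw rows -> transform with SQL inside the warehouse."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    warehouse.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    # Crude numeric check for this sketch: keep values that start with a digit.
    warehouse.execute(
        "CREATE TABLE sales AS "
        "SELECT customer, CAST(TRIM(amount) AS REAL) AS amount "
        "FROM raw_sales WHERE TRIM(amount) GLOB '[0-9]*'"
    )
    return warehouse.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()
```

Both variants end with the same cleaned table here; the ELT variant additionally keeps `raw_sales`, so the transformation can be rewritten and rerun later without extracting the data again.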
ETL for SaaS APIs
Data connector services such as Fivetran, Singer, Airbyte
Hybrid transactional/analytical processing (HTAP)
Hundreds of separate operational databases, but usually one data warehouse for an enterprise
A data warehouse often uses relational data with SQL
Training an ML model - features, feature engineering
Python data analysis libraries - pandas, scikit-learn
Statistical analysis languages - R
Distributed analytics frameworks - Spark
A data lake
File formats - Avro, Parquet
Commoditized file storage - object stores
ETL processes - Data pipelines
Sushi principle - raw data is better
DataOps Manifesto
General Data Protection Regulation (GDPR)
California Consumer Privacy Act (CCPA)
Files and relational tables - stream of events, stream processing
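A small sketch of the batch vs stream distinction, assuming page-view events represented as plain dictionaries (the event shape and function names are hypothetical). The batch function needs the complete dataset before it produces a result; the streaming function updates its result after every event as it arrives.

```python
from collections import Counter


def batch_count(events):
    """Batch style: read the whole dataset, then produce one result."""
    return Counter(event["page"] for event in events)


def stream_count(events):
    """Stream style: consume events one at a time and yield the running
    counts after each event, without waiting for the input to end."""
    counts = Counter()
    for event in events:
        counts[event["page"]] += 1
        yield dict(counts)


# Hypothetical sample events.
events = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
```

Once the stream has delivered the same events that the batch job read, the latest streaming result matches the batch result; the difference is how soon intermediate results become available.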
Output of analytical systems made available to operational systems - Reverse ETL
Machine learning models can be deployed to operational systems - TFX, Kubeflow, MLflow
Systems of record (source of truth)
Derived data systems
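One way to picture the relationship, as a sketch under assumptions: a list of account events plays the system of record, and a balances dictionary is the derived data. The event shape and the `derive_balances` function are hypothetical. The point is that the derived view can be thrown away and recomputed from the source of truth at any time.

```python
from collections import defaultdict

# System of record (assumed shape): each event is written exactly once here.
events = [
    {"account": "a1", "delta": 100},
    {"account": "a2", "delta": 50},
    {"account": "a1", "delta": -30},
]


def derive_balances(log):
    """Derived data: balances computed purely from the event log."""
    balances = defaultdict(int)
    for event in log:
        balances[event["account"]] += event["delta"]
    return dict(balances)


# The derived view works like a cache: losing it loses no information,
# because it can be rebuilt from the system of record.
balances = derive_balances(events)
```

If the derived view and the system of record ever disagree, the system of record is taken as correct and the view is rebuilt.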
Pros and cons of cloud services
Cloud native system architecture - an architecture that is designed to take advantage of cloud services
Examples of self-hosted and cloud native database systems (OLTP and OLAP)
Remote direct memory access (RDMA) network interfaces
The key idea of cloud native services - to build upon lower-level cloud services to create higher-level services
Object storage services - Amazon S3, Azure Blob Storage, Cloudflare R2
Cloud-based analytical database (data warehouse) - Snowflake
Redundant array of independent disks (RAID)
Virtual disk storage - Amazon EBS, Azure managed disks, persistent disks in Google Cloud
Storage (disk) and computation (CPU and RAM) have become somewhat separated in cloud native services
Cloud native systems are often multitenant
Database administrators (DBAs) or system administrators (sysadmins)
DevOps philosophy, Site reliability engineers (SREs)
Cloud storage replaces fixed-size disks with metered billing
Distributed vs single-node systems
A system that involves several machines communicating via a network is called a distributed system. Each of the processes participating in a distributed system is called a node.
Various reasons for using a distributed system:
- inherent distribution
- requests between cloud services
- fault tolerance/high availability
- scalability
- latency
- elasticity
- specialized hardware
- legal compliance
- sustainability
Problems with distributed systems
Tracing tools like OpenTelemetry, Zipkin, Jaeger
Single-node databases - DuckDB, SQLite, KùzuDB
Service-oriented architecture (SOA)
Microservices architecture
Pros and cons of many independent services
Orchestration frameworks such as Kubernetes
API description standards such as OpenAPI and gRPC
Microservices are primarily a technical solution to a people problem
Serverless or function as a service (FaaS)
Serverless approach is bringing metered billing to code execution
BigQuery, Kafka
Cloud computing is not the only way of building large-scale computing systems; an alternative is high-performance computing (HPC), also known as supercomputing
Differences between HPC and cloud/enterprise datacenter
Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth
Supercomputers often use specialized network topologies, such as multidimensional meshes and toruses