Hadoop Distributed File System: Introduction

Goal

Design a distributed system that can store huge files (terabyte and larger). The system should be scalable, reliable, and highly available.

What is Hadoop Distributed File System (HDFS)?

HDFS is a distributed file system built to store unstructured data. It is designed to store huge files reliably and to stream them at high bandwidth to user applications.

HDFS is a simplified variant of the Google File System (GFS), and many of its architectural decisions are inspired by the GFS design. HDFS is built around the idea that the most efficient data-processing pattern is a write-once, read-many-times pattern.

Background

Apache Hadoop is a software framework that provides a distributed file storage system and distributed computing for analyzing and transforming very large data sets using the MapReduce programming model. HDFS is the default file storage system in Hadoop. It is designed to be a distributed, scalable, fault-tolerant file system that primarily caters to the needs of the MapReduce paradigm.

Both HDFS and GFS were built to store very large files and to scale to petabytes of storage. Both were designed for batch processing on huge data sets and cater to data-intensive applications rather than end users. Like GFS, HDFS is not POSIX-compliant and is not a mountable file system on its own. It is typically accessed via HDFS clients or through application programming interface (API) calls from the Hadoop libraries, as the sketch below illustrates.
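As a minimal illustration of the second access path, the following sketch reads a file through Hadoop's Java FileSystem API. The cluster address and file path are hypothetical placeholders; in a real deployment, fs.defaultFS is normally picked up from the cluster's core-site.xml rather than set in code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; normally supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // stream the file's contents line by line
            }
        }
    }
}
```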

Given its current design, the following types of applications are not a good fit for HDFS:

  1. Low-latency data access:
    HDFS is optimized for high throughput (which may come at the expense of latency). Therefore, applications that need low-latency data access will not work well with HDFS.

  2. Lots of small files:
    HDFS has a central server called the NameNode, which holds all of the filesystem metadata in memory. The amount of memory on the NameNode therefore caps the number of files the filesystem can hold. Storing millions of files is feasible, but billions are beyond the capability of current hardware (see the back-of-envelope estimate after this list).

  3. Multiple concurrent writers and arbitrary file modifications:
    Unlike GFS, HDFS does not allow multiple writers to write to a file concurrently. Furthermore, writes are always made at the end of the file, in an append-only fashion; there is no support for modifications at arbitrary offsets in a file.
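To see why NameNode memory becomes the bottleneck for small files (item 2 above), here is a rough back-of-envelope sketch. It assumes roughly 150 bytes of NameNode heap per namespace object (file, directory, or block), a commonly cited rule of thumb rather than a figure from this lesson:

```java
public class NameNodeHeapEstimate {
    // Rule-of-thumb cost per namespace object (file, directory, or block);
    // an assumption for illustration, not an exact HDFS figure.
    static final long BYTES_PER_OBJECT = 150;

    static long estimateHeapBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // file entries + block entries
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // One billion single-block files -> ~2 billion namespace objects.
        long bytes = estimateHeapBytes(1_000_000_000L, 1);
        System.out.printf("~%d GB of NameNode heap%n", bytes / 1_000_000_000L);
    }
}
```

At a billion single-block files, the estimate already lands around 300 GB of heap on a single server, which is why millions of files are feasible but billions are not.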

APIs

HDFS does not provide standard POSIX-like APIs; instead, it exposes user-level APIs. In HDFS, files are organized hierarchically in directories and identified by their pathnames. HDFS supports the usual file system operations: files and directories can be created, deleted, renamed, and moved, and symbolic links can be created. All write operations are made in an append-only fashion.
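The sketch below exercises those operations through Hadoop's Java FileSystem API. The paths are made up for illustration, and the append() call assumes the cluster has append support enabled (the default on modern Hadoop versions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOpsSketch {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at the target cluster.
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path dir = new Path("/demo");            // hypothetical paths
            Path file = new Path("/demo/log.txt");

            fs.mkdirs(dir);                          // create a directory
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("first record\n");    // initial write
            }
            try (FSDataOutputStream out = fs.append(file)) {
                out.writeBytes("appended record\n"); // writes only go at the end
            }
            fs.rename(file, new Path("/demo/log.old")); // rename/move
            fs.delete(dir, true);                    // recursive delete
        }
    }
}
```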
