Table Of Content

Gray Failure: The Achilles’ Heel of Cloud-Scale Systems Ryan Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, Randolph Yao Outline  Background & the gray failure problem  Real-world gray failure cases in Azure  A model and a definition for gray failure  differential observability  Potential future directions 26 Rapid Growth of Cloud System Infra » Software user shift  direct: e.g., office 365, Google Drive  indirect: e.g., Netflix on AWS » Workload diversity  website, workflow, big data, machine learning » Internal composition  more data centers, larger cluster, special h/w  containerization, micro-services 27 Demanding Requirement on Availability » Users intolerant of service downtime » Failure more costly  SLA violation, reputation hit, customer loss, engineering resource waste » New availability bar  3 nines to 5 or 6 nines 28 Key: Embrace Fault-tolerance! Rich history since 1950s Steps: Decomposition Redundancy Correctness 29 Key: Embrace Fault-tolerance! Rich history since 1950s Steps: Decomposition Redundancy Correctness process pair triple modular redundancy Primary/backup RAID state machine replication checkpoint chain replication Building block: transaction 2PC Gossip N-version erasure coding … virtual synchrony Paxos Zab PBFT Zyzzyva 30 Status Quo 31 Status Quo » By and large, the efforts paid off  many faults successfully detected, tolerated, and repaired every day  few global outages  99.9% is achievable » But moving forward…  simplistic assumptions start to break  reasoning about availability becomes hard  frequent bizarre phenomenon in production  99.999% and beyond is challenging 32 Status Quo » By and large, the efforts paid off  many faults successfully detected, tolerated, and repaired every day  few global outages  99.9% is achievable » But moving forward…  simplistic assumptions start to break  reasoning about availability becomes hard  frequent bizarre phenomenon in production  99.999% and beyond is challenging scale & complexity ? availability t 33 Status Quo » By and large, the efforts paid off  many faults successfully detected, tolerated, and repaired every day  few global outages  99.9% is achievable » But moving forward… A common theme:  simplistic assumptions start to break the overlooked gray  reasoning about availability becomes hard failure problem  frequent bizarre phenomenon in production  99.999% and beyond is challenging scale & complexity ? availability t 34

Description:

A model and a definition for gray failure. ❖ differential .. App. 1. • distributed storage system. • IaaS platform. • data center network. • search engine.

Gray Failure: The Achilles' Heel of Cloud-Scale Systems PDF

97 Pages·2017·2.71 MB·English

by Ryan Huang

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Gray Failure: The Achilles' Heel of Cloud-Scale Systems

Description:

A model and a definition for gray failure. ❖ differential .. App. 1. • distributed storage system. • IaaS platform. • data center network. • search engine.

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.