When designing large-scale systems, one of the most important considerations is availability patterns. how reliably your system can serve users without downtime. Availability patterns provide strategies to ensure that even if some components fail, the system remains accessible and functional.
Failover is the process of automatically switching to a backup system when the primary system fails. It ensures business continuity and minimizes downtime.

There are two main failover strategies:
In this setup, both systems (or nodes) are active at the same time.
Requests are distributed among all active nodes.
If one node fails, the others continue serving traffic without interruption.
Example:
Imagine you have two web servers behind a load balancer. Both servers handle user requests simultaneously. If one server goes down, the load balancer redirects all traffic to the remaining server. Users won’t notice any downtime.
Advantages:
Better utilization of resources (nothing sits idle).
High fault tolerance.
Disadvantages:
More complex to manage.
Requires synchronization between nodes.
Here, one node is active while the other remains on standby.
The passive node becomes active only if the active one fails.
Example:
A database cluster with a primary database (active) and a standby replica (passive). If the primary crashes, the standby takes over.
Advantages:
Easier to configure than Active-Active.
Good for databases where only one primary is needed.
Disadvantages:
Standby resources may remain underutilized.
Slight delay during switchover.
Replication ensures that data is copied and synchronized across multiple systems. This increases availability and fault tolerance.

There are two common replication models:
One master node handles writes (insert, update, delete).
Slave nodes handle reads.
Slaves replicate data from the master.
Example:
A blogging platform where all new posts are written to the master database, while read requests (viewing posts) are served by slave databases.
Advantages:
Improves read performance (reads can scale horizontally).
Provides a backup in case the master fails.
Disadvantages:
Single point of failure at the master node.
Data replication lag between master and slaves.
Multiple master nodes can handle both reads and writes.
Each master synchronizes changes with the others.
Example:
An e-commerce system deployed in multiple regions where customers can update their carts or place orders in any region, and all masters sync the changes.
Advantages:
No single point of failure.
Both read and write requests can be scaled.
Disadvantages:
Conflict resolution needed (e.g., what if two masters update the same record at the same time?).
More complex to implement.
Availability is often measured in terms of “nines”. The more nines, the less downtime per year.
Example:
A payment gateway needs at least four nines (99.99%) because downtime directly impacts money transactions.
A personal blog might only need two or three nines (99%–99.9%), since downtime has minimal business impact.

| Availability | Downtime per Year | Downtime per Month | Downtime per Week | Nickname |
|---|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | 1.68 hours | Two nines |
| 99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes | Three nines |
| 99.99% | 52.6 minutes | 4.38 minutes | 1.01 minutes | Four nines |
| 99.999% | 5.26 minutes | 26.3 seconds | 6.05 seconds | Five nines |
99% (Two Nines): System can be down for about 3.5 days a year.
99.9% (Three Nines): System can be down less than 9 hours per year.
99.99% (Four Nines): Downtime less than 1 hour per year.
99.999% (Five Nines): Downtime only about 5 minutes a year!
Let’s say your e-commerce website earns ₹100,000 per hour.
If your system has 99.9% availability, you’ll lose:
→ 8.76 hours × ₹100,000 = ₹876,000 per year in downtime.
But if you increase to 99.99% availability,
→ 52.6 minutes ≈ 0.88 hours × ₹100,000 = ₹88,000 per year.
💡 That’s a 10x reduction in downtime loss!
When designing distributed systems, we often combine multiple components — databases, servers, APIs, load balancers, etc.
But how these components are connected (in sequence or parallel) has a huge impact on overall system availability.
Components are connected one after another.
The system fails if any one component fails.
So, all components must be working at the same time for the system to function.
💡 Think of it like a chain — if any one link breaks, the whole chain fails.


Simplify architecture: Fewer components in the critical path.
Add redundancy: Use backups or replicas (this moves you toward parallel design).
Introduce graceful degradation: If one component fails, system continues partially (e.g., show cached data if DB is down).
Components are connected in parallel.
The system can function as long as at least one component is working.
So, if one fails, another takes over — this is redundancy.
💡 Think of it like having two lights connected in parallel — even if one bulb burns out, the room stays lit.


Because in a parallel setup, the probability of all components failing together becomes very low.
If one fails, others are still available to serve the request — fault tolerance is built in.
Add redundancy for critical components
Example: multiple web servers behind a load balancer.
Use replication
Example: database replicas or distributed storage nodes.
Combine with failover
Example: Active-Passive setup where the passive node takes over automatically.
In real-world systems, we rarely have purely serial or purely parallel structures.
Most systems are hybrids.
Example:
You have 2 web servers (parallel)
1 database (single point → series)

That’s why end-to-end availability depends on the weakest link.
| Type | Condition to Stay Available | Formula | Availability Effect |
|---|---|---|---|
| Series | All components must work | A₁ × A₂ × … × Aₙ | ↓ Decreases as you add more |
| Parallel | At least one must work | 1 - [(1 - A₁)(1 - A₂)...] | ↑ Increases with redundancy |
When designing a reliable system, always ask:
“If one part fails, can my system still serve users?”
If the answer is no, your components are in series — add redundancy to make them parallel.
By mixing parallel redundancy with efficient failover, you can move from 99% uptime to 99.999%, dramatically reducing downtime.
Availability patterns are the backbone of modern system design. Whether it’s failover strategies (Active-Active, Active-Passive), replication methods (Master-Slave, Master-Master), or calculating availability in numbers, each plays a role in ensuring that systems stay online and resilient. Happy coding ! ❤️
