Julian Gonzalez

Recap

In LN11, we armed ourselves with an arsenal of scheduling algorithms — from FCFS to CFS — to decide when processes run. Scheduling determines the order of execution. But today we face a darker question: what happens when processes fight over shared resources?

Imagine a 4-way intersection with no traffic lights. Four cars arrive simultaneously, each occupying one lane. Each car needs the perpendicular lane to clear before they can proceed. But every lane is blocked by another car waiting for the same thing. Nobody moves. This is a deadlock — and it can happen in your operating system, your database, and your concurrent Rust code.

Traffic lights are the "scheduler" that prevents this by imposing an ordering on who goes first. Today we formalize this problem and develop tools to detect, avoid, and prevent it.

Today's Agenda

Sequential vs Concurrent Headaches — New failure modes from concurrency
Resources — What they are, how they're shared, and how they cause problems
Coffman's Four Conditions — The formal foundation of deadlocking
Resource Allocation Graphs — Using graph theory to detect deadlocks
The Dining Philosophers — The canonical deadlock illustration
Deadlock Handling Techniques — Detection, avoidance, prevention, and the ostrich
Rust Deadlock Examples — From broken to fixed
Safety-Critical Stories — When deadlocks kill
Extensions — Distributed deadlocks, formal verification, and beyond

Throughout this lecture, we color-code concepts by category:

Color Legend:ResourceConditionDangerSafeTechnique

Sequential vs Concurrent Headaches

As sequential programmers, you're already familiar with some infuriating bugs:

Sequential Issue	What Goes Wrong
Infinite loops	Control flow never terminates
Stack overflows	Recursive calls exhaust the stack
Blocking I/O	Program halts waiting for a slow device
Circular function calls	A calls B calls A — infinite recursion
Incorrect FSMs	State machine reaches an unrecoverable state

These all share one thing: a single thread gets stuck. But in a concurrent or parallel context, we gain access to entirely new classes of horror:

Concurrent Issue	What Goes Wrong
Deadlock	Multiple processes frozen, each waiting for the other
Livelock	Processes actively running but making zero progress
Starvation	Some processes progress, but one is perpetually ignored
Resource locking	Contention for shared resources degrades performance
Lock-order inversion	Acquiring locks in inconsistent orders creates cycles
Communication locking	Message-passing channels become blocked in both directions
Distributed deadlocks	Deadlocks spanning multiple machines with no global view

Unlike sequential bugs where nothing moves, concurrent systems can exhibit lock-based inefficiencies that manifest only at runtime — starvation and exhaustion let some work progress, but not all. These bugs are notoriously hard to reproduce because they depend on specific timing and scheduling decisions.

The Hallway Problem: Livelock

Ethernet's CSMA/CD protocol was specifically designed to solve livelock: when two devices transmit simultaneously and collide, each waits a random backoff time before retrying. The randomness breaks the symmetry. Without it, both devices would retry at the same instant forever — a livelock.

Deadlock vs Starvation vs Livelock

Failure Mode	Processes Moving?	All Blocked?	Cause	Example
Deadlock	No	Yes (in cycle)	Circular resource dependency	Two threads each hold one mutex, wait for the other
Livelock	Yes (busy)	No	Reactive behavior without coordination	Two people dodging the same direction in a hallway
Starvation	Partially	No	Unfair scheduling policy	Low-priority process never gets CPU time

📚 Historical Note: Edsger Dijkstra first formalized deadlocks in 1965 alongside his invention of semaphores. In 1971, he introduced the Dining Philosophers problem as the canonical illustration. The term "deadly embrace" was used in early IBM documentation before "deadlock" became standard. Dijkstra's contributions to both the problem (deadlock formalism) and the solution (semaphores) make this lecture a direct continuation of his work.

What Are Resources?

Resources span hardware and software:

Category	Examples	Human Analogy
Hardware	CD drive, HDD, CPU core, GPU, printer	Restroom stall, car seat, parking spot
Software	Database record, variable, file, directory, network port	License plate, phone number, SSN

Preemptable vs Non-Preemptable

Because resource allocation grants exclusive access, and processes can be preemptive or cooperative, resources inherit the same categories:

Type	Description	Examples
Preemptable	Can be reassigned mid-operation without harm	Memory (pages can be swapped), CPU time
Non-preemptable	Cannot be interrupted mid-operation — will cause failure	Burning a CD, printing a page, database transaction

In human terms: doing dishes is preemptable (you can stop and start freely). Using the restroom is non-preemptable. Please don't think too hard about a preemptable porta-potty.

Resource Lifecycle

Every resource follows the same lifecycle:

Request / Acquire — The process asks for and receives the resource
In Use — The process holds and operates on the resource
Release — Voluntary or involuntary relinquishment

This cycle repeats. The problem arises in step 1: waiting. If a process can never acquire what it needs, computation stalls.

Resource Instances

Some resources have multiple identical instances — a system with 4 printers, 8 CPU cores, or 16 database connection slots. Deadlock analysis must account for this: a resource type with k instances requires k+1 simultaneous requests to guarantee that at least one process is blocked.

Rust Connection: `Mutex<T>` as a Resource

Rust's Mutex<T> is the textbook non-preemptable resource. Combined with Arc for shared ownership across threads, it maps directly to the resource lifecycle:

use std::sync::{Arc, Mutex};
use std::thread;

let resource = Arc::new(Mutex::new(0)); // Create resource
let r = Arc::clone(&resource);

thread::spawn(move || {
    let mut data = r.lock().unwrap(); // 1. Request & Acquire
    *data += 1;                        // 2. In Use
    // MutexGuard drops here           // 3. Release (RAII)
});

The MutexGuard returned by .lock() implements Rust's RAII pattern — the lock is automatically released when the guard goes out of scope. This makes "forgetting to release" structurally impossible, unlike C/C++ where manual unlock() calls can be missed.

Semaphores: From Dijkstra to Rust

Dijkstra invented both the deadlock formalism AND semaphores as a coordination solution. A binary semaphore (0 or 1) is similar to a mutex for teaching purposes, though mutexes usually add ownership rules that plain semaphores do not. A counting semaphore (initialized to k) naturally handles multi-instance resources — it allows up to k concurrent holders.

The classic semaphore operations are P (proberen — "to test", decrements) and V (verhogen — "to increment"). These names come from Dutch, Dijkstra's native language. They're error-prone: nothing prevents you from calling V without P, or P twice without V.

Rust's Mutex<T> with RAII MutexGuard is the modern evolution: the type system enforces correct acquire/release patterns at compile time. Language design has evolved to make deadlock-prone patterns harder to write.

Lock Granularity Spectrum

How much should a single lock protect? This creates a fundamental engineering tradeoff:

Granularity	Strategy	Deadlock Risk	Parallelism	Example
Coarse	One lock for everything	None (no cycles possible)	None (fully serial)	Python's GIL
Fine	One lock per resource	High (many locks = many cycle opportunities)	High	Linux kernel per-inode locks
Lock-free	No locks at all	None (no locks to deadlock on)	Maximum	`std::sync::atomic` in Rust

The scheduling problem gave us the quantum tradeoff (too large = convoy, too small = overhead). The deadlock problem gives us the granularity tradeoff (too coarse = no parallelism, too fine = deadlock risk).

Python's GIL — The Coarse Extreme in Practice

Python's Global Interpreter Lock is a real-world case study of the coarsest possible lock granularity. CPython uses a single mutex that protects all interpreter state. Any thread executing Python bytecode must hold the GIL, meaning only one thread runs Python code at a time — even on a 128-core machine.

This is deadlock prevention by construction: with one lock, circular wait is structurally impossible (you can't have a cycle with one node). But the cost is that Python threads can never achieve true parallelism on CPU-bound work — only concurrency via interleaving. This is exactly the distinction from LN7!

CPython's developers chose the simplest possible solution to protect the interpreter's reference-counting garbage collector from race conditions. It prevented deadlocks on internal structures but became the single most notorious limitation of the language for parallel workloads. This is why:

multiprocessing exists (separate processes with separate GILs)
NumPy releases the GIL during C-level computation
PEP 703 ("nogil") is an ongoing effort to remove the GIL entirely

Students who have written Python since day one now have the formal vocabulary to understand exactly why threading doesn't speed up CPU-bound code and why Rust doesn't have this problem — ownership + borrow checker replaces the GIL's role at compile time, with zero runtime cost.

📚 Language Theory Connection: The GIL is a language runtime design decision, not a language specification decision. Other Python implementations (Jython on the JVM, PyPy's STM experiments) prove that the GIL is not inherent to Python's semantics — it's an implementation tradeoff that traded parallelism for simplicity in CPython's memory management.

Coffman's Four Conditions

So is this problem like the halting problem? Are we doomed to never find a general solution? Must we always rest uneasy, knowing our code might deadlock?

Actually, no! Unlike the halting problem, deadlocks are solvable in general for a known resource model. Coffman's research (1971) led to an amazing discovery: deadlock occurs if and only if four conditions are met simultaneously. Break even one, and the system makes progress.

The key insight: Remove ANY ONE condition and deadlock is impossible. This is what makes deadlocks fundamentally different from the halting problem — we have a complete, decidable characterization of when they occur. All we need is linear algebra or graph theory!

"Not-Quite-Deadlock" Examples

To reinforce that all 4 are necessary, consider what happens with only 3:

Missing Condition	What Happens	Why No Deadlock
No Mutual Exclusion	Resources are shared freely	No contention — everyone reads simultaneously
No Hold and Wait	Processes request all resources atomically	Either get everything or nothing — no partial holding
No No Preemption	The system can forcibly reclaim the contended resource	Cycles can be broken because the waiting chain no longer has to remain intact
No Circular Wait	Resources are ordered; always acquired in sequence	Linear chains can't form cycles

Before you toggle anything, ask yourself which condition is usually easiest to break in a real design: ordering, atomic acquisition, or reclaimability.

Toggle the conditions below to see this in action:

Coffman's Four Conditions

Toggle conditions to see when deadlock is possible. All four must hold simultaneously.

DEADLOCK

All 4 conditions are met — deadlock is guaranteed. Toggle any condition off to break it.

MEANDHWANDNPANDCW=DEADLOCK

Therefore, the practical design game is often: pick the cheapest Coffman condition to break and engineer around that one.

Resource Allocation Graphs

We can formalize deadlock detection using graph theory — specifically, directed graphs. If you've taken discrete mathematics, the cycle-detection algorithms here are closely related to the spanning tree algorithms you already know.

⚠️ Multi-instance caveat: With multiple resource instances (e.g., 3 printers), a cycle is necessary but not sufficient for deadlock. A knot — a strongly connected component with no outgoing edges — is required. For single-instance resources, cycle = deadlock.

Step-by-Step: From Safe to Deadlock

Watch how a deadlock forms step by step. As you click, narrate each edge as either holds or requests; that makes the cycle much easier to see.

Resource Allocation Graph

Step through resource requests and assignments. Dashed arrows = request, solid arrows = assignment. Red = cycle detected (deadlock).

Step 1 / 4

Two threads and two mutexes exist. No requests yet.

let mutex_a = Arc::new(Mutex::new(0));
let mutex_b = Arc::new(Mutex::new(0));

Process

Resource

Request

Assignment

In cycle

Notice the third scenario: the same processes, same resources, same work — but a different acquisition order prevents the cycle entirely. This is exactly the insight behind resource ordering as a prevention technique.

Therefore, the graph is useful because it shows the exact structural change: ordering removes the possibility of a circular wait.

Build Your Own Deadlock

Now try it yourself. First create a deadlock on purpose, then fix it by breaking one Coffman condition:

Build Your Own Deadlock

Add processes and resources, draw edges (request/assignment), and watch for cycles. Directed edge from Process to Resource = request. From Resource to Process = assignment.

Add some processes and resources to get started!

Process

Resource

|Process → Resource = RequestResource → Process = Assignment

Therefore, the sandbox is not just a detector; it is a design lab for testing which prevention move actually breaks the cycle.

The Dining Philosophers Problem

Dijkstra's 1971 formulation is the canonical deadlock illustration:

Five philosophers sit at a round table. Between each pair is a single fork (5 forks total). To eat, a philosopher needs both adjacent forks. They alternate between thinking and eating.

This maps perfectly to a resource allocation graph: philosophers are processes, forks are resources. If all five simultaneously pick up their left fork — circular wait — deadlock!

The solutions preview the four handling techniques:

Prevention (ordering): Number the forks; always pick up the lower-numbered fork first. This breaks circular wait.
Prevention (atomic): Only pick up both forks simultaneously, or neither. This breaks hold-and-wait.
Detection: A monitor detects the cycle and forces one philosopher to put down their fork.

Before running the demo, compare two styles of solution: changing timing and changing guarantees. Which one actually removes the possibility of deadlock?

Dining Philosophers Problem

5 philosophers, 5 forks. Each needs 2 adjacent forks to eat. Choose a strategy and step through the scenario.

Step 1 / 4

All philosophers are thinking. All forks are on the table.

P0thinking

P1thinking

P2thinking

P3thinking

P4thinking

Thinking

Hungry

Eating

Waiting

Fork (free)

Fork (held)

Therefore, the strongest solutions are usually the ones that change the resource-allocation rules, not just the ones that hope the timing works out.

Deadlock Handling Techniques

Four major approaches, organized by a medical analogy:

Detection and Recovery — "The Emergency Room"

Detection: Run cycle detection on the RAG periodically or on every block event. This is standard DFS-based graph traversal — O(V+E) per check.

Recovery options (in order of increasing violence):

Preempt a resource — Roll back the process, checkpoint/restart. Least destructive but requires checkpointing support.
Kill a victim process — Select the "cheapest" process to terminate (lowest priority? newest? least work done?).
Kill ALL deadlocked processes — The nuclear option. Guaranteed to work.

Cost of detection: Running cycle detection on every resource request is expensive. Amortized approaches run detection only when a process has been blocked for suspiciously long — trading detection latency for efficiency.

Timeout-Based Detection

A pragmatic middle ground between formal graph analysis and ignoring the problem: if you cannot acquire a resource quickly enough, back off, release what you can, and retry. This is how distributed systems (gRPC, HTTP clients, database connections) often handle deadlock risk in practice.

In Rust, Mutex::try_lock() is the non-blocking building block for that style:

use std::sync::Mutex;

let lock = Mutex::new(42);
match lock.try_lock() {
    Ok(guard) => println!("Got the lock: {}", *guard),
    Err(_) => println!("Couldn't acquire — backing off!"),
}

Real-world database detection: PostgreSQL and MySQL InnoDB use wait-for graphs (a simplified RAG containing only processes, not resources) and abort a victim transaction when a cycle is detected. You can actually trigger this and see the ERROR: deadlock detected message yourself.

Deadlock Avoidance — "The Doctor's Diet"

Two key definitions:

State	Meaning
Safe state	There exists some ordering of processes that allows ALL to complete
Unsafe state	No guaranteed ordering exists — deadlock is possible (but not certain)

The Banker's Algorithm

The most famous avoidance algorithm, invented by Dijkstra (1965). Named because a banker must ensure they never loan out so much money that they can't satisfy all customers. The OS is the banker; resources are money; processes are customers with credit limits.

Let's see this with actual money first. You're the bank manager. Each customer has deposited some cash and has a credit limit — the maximum they could ever request. The bank vault holds whatever cash hasn't been loaned out. This analogy is only a warm-up, but it makes the safe/unsafe distinction intuitive. Try adjusting the deposits, credit limits, and vault cash to see when the bank is safe vs. at risk:

🏦Dijkstra's Bank — The Original Analogy

You are the bank manager. Each customer has deposited some money and has a credit limit (the max they'll ever need). Your job: never loan out so much cash that you can't guarantee every customer gets their full credit limit eventually.

Bank Vault (Cash on Hand)

Total deposited by customers: $4,500 · Total credit limits: $12,000

Alice

Deposited:

Credit Limit:

Still needs:$3,000

$0$4,000

Bob

Deposited:

Credit Limit:

Still needs:$3,000

$0$5,000

Charlie

Deposited:

Credit Limit:

Still needs:$1,500

$0$3,000

SAFE — Loans can be honored!

The bank can serve all customers by processing them in this order. Each customer finishes their business and returns their deposit, replenishing the vault for the next:

Charlie→Alice→Bob

Already deposited

Still needs (loan remaining)

Notice the key insight: the bank doesn't need enough cash to cover everyone at once — it just needs enough to let at least one customer finish, get their deposit back, then use that to serve the next customer, and so on. A safe state is one where such an ordering exists. An unsafe state means no such ordering is guaranteed.

Therefore, the analogy is really about existence of a safe sequence, not about money itself.

Generalizing beyond money. In real operating systems, there isn't just one resource — there are multiple: memory pages, disk sectors, network sockets, printer queues, and more. The algorithm generalizes by tracking a vector of resources instead of a single dollar amount. Each process has an allocation vector (what it holds), a max vector (what it could ever need), and available becomes a vector of all free resources. The same "can at least one process finish?" check runs across all resource types simultaneously — a process can only proceed if its need is met in every dimension.

The algorithm checks, at each request: "If I grant this, can I still guarantee a safe sequence where every process eventually gets what it needs?" If not, the request is denied.

Try the full multi-resource version yourself — but remember one crucial distinction before you start: unsafe does not mean deadlocked right now. It means the system has lost its guarantee of being able to finish everyone safely.

Banker's Algorithm — Safe State Calculator

Edit allocation, max need, and available resources. The algorithm determines if a safe execution sequence exists.

Available Resources

Process	Allocation			Max Need			Need (Max - Alloc)
	A	B	C	A	B	C	A	B	C
P0							7	4	3
P1							1	2	2
P2							6	0	0
P3							0	1	1
P4							4	3	1

SAFE STATE

A safe execution sequence exists. The banker can satisfy all customers.

Safe sequence:P1→P3→P4→P0→P2

Algorithm trace:

1.P1need=[1,2,2]≤work=[3,3,2]✓→ work=[5,3,2]

2.P3need=[0,1,1]≤work=[5,3,2]✓→ work=[7,4,3]

3.P4need=[4,3,1]≤work=[7,4,3]✓→ work=[7,4,5]

4.P0need=[7,4,3]≤work=[7,4,5]✓→ work=[7,5,5]

5.P2need=[6,0,0]≤work=[7,5,5]✓→ work=[10,5,7]

Therefore, Banker's Algorithm is a refusal rule: it denies requests that would move the system from "safe but tight" into "unsafe and no longer guaranteed."

Deadlock Prevention — "An Ounce of Prevention"

Condition to Attack	Prevention Strategy	Tradeoff
Mutual Exclusion	Use spooling/virtualization (e.g., print spooler)	Not always possible — some resources are inherently exclusive
Hold and Wait	Require all resources upfront (atomic acquisition)	Low utilization; process may not know all needs in advance
No Preemption	Allow resource preemption (checkpoint and rollback)	Expensive; not always safe for non-preemptable resources
Circular Wait	Impose a total ordering on resources	Restricts flexibility; must be enforced globally

Resource Ordering in Practice

Linux kernel's lockdep tool automatically detects lock ordering violations at runtime. It builds a lock dependency graph and warns if a cycle is possible, even before a deadlock actually occurs. This is prevention via tooling — the system learns the correct ordering from observation and flags deviations.

Rust's Compile-Time Prevention

Rust's ownership and borrowing rules prevent entire classes of data races at compile time. MutexGuard with RAII ensures locks are always released. While Rust can't prevent all deadlocks statically (that would require solving the halting problem for lock orderings), it eliminates the most common causes — forgotten unlocks, use-after-free on locked resources, and data races.

Lock-Free and Wait-Free Data Structures

The ultimate form of prevention: eliminate locks entirely using atomic compare-and-swap (CAS) operations. Rust's std::sync::atomic module provides these primitives. CAS is a CPU instruction (connecting back to LN9's hardware lecture), and lock-free data structures attack mutual exclusion at the algorithmic level.

⚠️ The ABA Problem: Lock-free algorithms have their own subtle pitfall. A value changes from A to B and back to A. A CAS operation sees "still A" and incorrectly assumes nothing changed. Solutions include tagged pointers and hazard pointers — a whole subfield of concurrent data structures.

The Ostrich Algorithm — "Head in the Sand"

This sounds absurd — but this is what Unix and Windows actually do for most user-space resources. The reasoning is pure cost-benefit analysis:

$$p \cdot C_r < C_p$$

Where:

$$p = \text{deadlock probability}, \quad C_r = \text{recovery cost}, \quad C_p = \text{continuous prevention cost}$$

If the expected cost of rare deadlocks is less than the ongoing cost of preventing them, ignore them.

Most interactive systems accept some deadlock risk because the expected incidence is low enough that full prevention is not worth the cost. When deadlocks do occur, restarting an application or rebooting the machine may be cheaper than paying continuous prevention overhead everywhere.

📚 Engineering Risk Management: This is the same principle used throughout engineering. Bridges are designed for 99.9% of expected loads, not 100%. Software reliability targets "five 9s" (99.999% uptime) — not 100%, because the cost of that last 0.001% is astronomical. The ostrich algorithm is mathematically justified risk management.

So all this formal theory, all these algorithms — and the practical answer for most interactive systems is "don't worry about it, and use a preemptive scheduler." The formal tools become critical in safety-critical and real-time systems, where the cost of a deadlock is not a reboot but a catastrophe.

Rust Deadlock Examples

Let's see deadlock in Rust and fix it with the techniques we've learned.

The Deadlock

Two threads acquiring two mutexes in opposite order:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let a = Arc::new(Mutex::new(0));
    let b = Arc::new(Mutex::new(0));

    let a1 = Arc::clone(&a);
    let b1 = Arc::clone(&b);
    let t1 = thread::spawn(move || {
        let _lock_a = a1.lock().unwrap(); // Acquires A
        thread::sleep(std::time::Duration::from_millis(10));
        let _lock_b = b1.lock().unwrap(); // Blocks waiting for B!
    });

    let a2 = Arc::clone(&a);
    let b2 = Arc::clone(&b);
    let t2 = thread::spawn(move || {
        let _lock_b = b2.lock().unwrap(); // Acquires B
        thread::sleep(std::time::Duration::from_millis(10));
        let _lock_a = a2.lock().unwrap(); // Blocks waiting for A!
    });

    t1.join().unwrap(); // Hangs forever!
    t2.join().unwrap();
}

This program will hang forever — both threads are stuck waiting for the other's lock. All four Coffman conditions are met.

Fix 1: Resource Ordering (Prevention)

Always acquire locks in the same order. This breaks circular wait:

// Both threads: always acquire A before B
let t1 = thread::spawn(move || {
    let _lock_a = a1.lock().unwrap(); // A first
    let _lock_b = b1.lock().unwrap(); // B second
});

let t2 = thread::spawn(move || {
    let _lock_a = a2.lock().unwrap(); // A first (same order!)
    let _lock_b = b2.lock().unwrap(); // B second
});

Fix 2: `try_lock` with Backoff (Detection/Recovery)

Use non-blocking lock attempts and retry on failure:

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

let t1 = thread::spawn(move || {
    loop {
        let lock_a = a1.lock().unwrap();
        match b1.try_lock() {
            Ok(lock_b) => {
                // Got both! Do work.
                break;
            }
            Err(_) => {
                drop(lock_a); // Release A and retry
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
});

This breaks hold and wait — if we can't get everything, we release what we have.

Fix 3: Message Passing (Eliminate Shared Resources)

Use channels instead of shared memory — attacking mutual exclusion by eliminating the shared resource entirely:

use std::sync::mpsc;
use std::thread;

let (tx, rx) = mpsc::channel();

// Worker thread owns the data exclusively
let worker = thread::spawn(move || {
    let mut value = 0;
    for msg in rx {
        match msg {
            "increment" => value += 1,
            "get" => println!("Value: {}", value),
            _ => break,
        }
    }
});

// Other threads send messages instead of locking
tx.send("increment").unwrap();
tx.send("get").unwrap();
tx.send("quit").unwrap();
worker.join().unwrap();

No mutexes, no locks, no deadlock. The data is owned by one thread, and others communicate via messages. This is the actor model — the same philosophy behind Erlang, Go's goroutines, and Rust's Tokio channels.

Scheduling and Deadlock

Tying back to LN11: scheduling affects timing, which can influence whether a buggy locking pattern deadlocks on a particular run. But timing luck is not a guarantee. The "Avoided via Ordering" scenario in the RAG stepper works because the locking rule changed, not because the scheduler happened to be clever.

Preemptive scheduling is still valuable for responsiveness and fairness, but reclaiming CPU time is not the same thing as reclaiming a mutex, file lock, or printer. Deadlock prevention has to target the contended resource itself.

Safety-Critical Stories

These techniques aren't academic exercises. When concurrent systems fail in safety-critical contexts, people get hurt.

The Therac-25 Incident (1985–1987)

The Therac-25 was a radiation therapy machine controlled by software. Race conditions between the operator interface and beam-control hardware — a resource contention problem without proper synchronization — caused the machine to deliver lethal radiation doses. Three patients died and three were seriously injured.

The root cause was exactly what we've been studying: shared resources (beam configuration state) accessed by multiple concurrent processes (operator interface and beam control) without proper mutual exclusion or synchronization. The developers had removed hardware safety interlocks that existed in previous models, trusting the software alone to manage the resources correctly. The software had a race condition that could set the beam to full power while the system believed it was in low-power mode.

If Coffman's conditions are "how deadlocks happen," Therac-25 is "why we must care." Concurrent programming bugs don't just hang your program — in safety-critical systems, they can kill.

The Mars Pathfinder (1997)

NASA's Mars Pathfinder rover kept randomly resetting on the Martian surface. The cause: priority inversion — a deadlock-adjacent problem.

A high-priority task (the bus manager) was blocked waiting for a mutex held by a low-priority task (the meteorological data collector). A medium-priority task (communications) preempted the low-priority task, preventing it from releasing the mutex. The high-priority task starved, the watchdog timer fired, and the system rebooted.

The fix: priority inheritance protocol — temporarily boost the low-priority holder to the high-priority waiter's level so it can finish and release the resource. This was patched remotely from Earth via a C parameter change transmitted to Mars.

📌 Connection: Priority inversion combines elements of both deadlock (resource contention) and starvation (a high-priority process can't proceed). It's a reminder that the problems in this lecture interact with the scheduling problems from LN11 in subtle ways.

Extensions

Distributed Deadlocks

In distributed systems, no single node has the complete resource allocation graph. Detection requires either:

A centralized coordinator that collects lock information from all nodes (single point of failure)
Distributed algorithms like Chandy-Misra-Haas (1983) that propagate "probe" messages through the wait-for graph

This is why microservice architectures overwhelmingly prefer timeouts over formal deadlock detection — it's impractical to maintain a global RAG across hundreds of services.

Convoy Effect Revisited

The convoy effect from LN11 is actually a form of resource-based starvation when viewed through this lecture's framework. A long process "deadlocks" the ready queue by holding the CPU resource via hold and wait. The formal vocabulary we built today lets you see FCFS convoy as a special case of resource contention where the resource is CPU time itself.

Two-Phase Locking (2PL)

A database technique where transactions acquire all locks in a "growing" phase, then release all in a "shrinking" phase. This attacks hold and wait — by the time you start releasing, you've already acquired everything you need. 2PL is the foundation of ACID transaction isolation in most relational databases. If you take a database course, you'll see this again.

Formal Verification

Tools like TLA+ (Leslie Lamport — also the creator of LaTeX!) and SPIN can exhaustively verify concurrent programs for deadlock before deployment. They enumerate all possible interleavings of concurrent operations and check for cycles in every reachable state. NASA uses these tools for spacecraft software — where the cost of a deadlock is not a reboot but mission failure.

The graph-theoretic approach in this lecture scales directly to industrial verification. TLA+ models are essentially formal RAGs with state transitions — the same concepts, deployed at industrial scale.

Spinlocks vs Blocking Locks

When a resource is contested, do you spin (busy-wait in a loop checking the lock) or block (context switch to another process)?

Spinlocks are efficient for very short critical sections (no context switch overhead) but waste CPU cycles and can cause livelock-like behavior on single-core systems
Blocking locks are efficient for long critical sections (CPU does useful work while waiting) but pay the context switch cost

This connects back to LN11's scheduling concerns: spinlocks are essentially cooperative ("I'll keep checking"), while blocking locks delegate to the scheduler ("wake me when it's free"). The choice depends on the expected wait time relative to the context switch cost — yet another engineering tradeoff in the granularity spectrum.

Summary

Technique	Medical Analogy	What It Does	When to Use
Detection & Recovery	Emergency Room	Find and break deadlocks after they occur	Databases, systems with checkpointing
Avoidance	Doctor's Diet	Deny requests that could lead to unsafe states	Known resource bounds (Banker's)
Prevention	Ounce of Prevention	Architect away Coffman conditions	Safety-critical systems, kernel design
Ostrich	Ignore It	Assume deadlocks are too rare to worry about	Unix, Windows, most interactive systems

📝 Lecture Notes

Key Definitions:

Term	Definition
Resource	Something a process may need to acquire, wait for, or share under limited availability
Preemptable	Resource that can be reassigned mid-operation without harm
Non-preemptable	Resource that cannot be interrupted mid-operation
Semaphore	Dijkstra's synchronization primitive (binary = mutex, counting = multi-instance)
Mutual Exclusion	Only one process can hold a resource at a time
Hold and Wait	Process holds resources while waiting for others
No Preemption	Resources cannot be forcibly taken
Circular Wait	Cycle in the wait-for graph
Deadlock	All four Coffman conditions met — processes frozen
Livelock	Processes running but making no progress
Starvation	Process perpetually denied resources due to unfair policy
Safe State	An ordering exists where all processes can complete
Banker's Algorithm	Avoidance algorithm checking for safe states before granting requests
RAG	Resource Allocation Graph — directed graph for deadlock detection
GIL	Global Interpreter Lock — CPython's coarse interpreter lock, which simplifies some internal synchronization at the cost of CPU-bound parallelism

Deadlock Handling Decision Guide:

Safety-critical system?     → Prevention (resource ordering, atomic acquisition)
Known resource bounds?      → Avoidance (Banker's Algorithm)
Checkpointing available?    → Detection & Recovery (cycle detection + rollback)
Interactive/general use?    → Often Ostrich or selective detection, depending on cost and failure tolerance
Distributed system?         → Timeouts (no global RAG available)
Maximum performance?        → Lock-free data structures (eliminate locks entirely)

📚 Additional Resources

Safety-Critical Case Studies

What Really Happened on Mars (Microsoft Research) — The Pathfinder priority inversion story
An Investigation of the Therac-25 Accidents (Leveson & Turner) — Detailed analysis of the Therac-25 radiation incidents
Software Engineering Code of Ethics — IEEE/ACM code for safety-critical software

Formal Verification

TLA+ Home Page (Leslie Lamport) — Formal specification language for concurrent systems
SPIN Model Checker — Exhaustive verification of concurrent programs

Rust Concurrency

The Rustonomicon: Concurrency — Unsafe concurrency in Rust
Rust Atomics and Locks (Mara Bos) — Deep dive into lock-free programming in Rust
Deadlock Detection in Rust (parking_lot) — A Rust mutex library with built-in deadlock detection

Historical References

Coffman et al. (1971): System Deadlocks — The original four conditions paper
Dijkstra's Semaphores (EWD 123) — Original semaphore paper
Python GIL Discussion (PEP 703) — The ongoing effort to remove the GIL

Loading content...

Key Definitions:

Term	Definition
Resource	Something a process may need to acquire, wait for, or share under limited availability
Preemptable	Resource that can be reassigned mid-operation without harm
Non-preemptable	Resource that cannot be interrupted mid-operation
Semaphore	Dijkstra's synchronization primitive (binary = mutex, counting = multi-instance)
Mutual Exclusion	Only one process can hold a resource at a time
Hold and Wait	Process holds resources while waiting for others
No Preemption	Resources cannot be forcibly taken
Circular Wait	Cycle in the wait-for graph
Deadlock	All four Coffman conditions met — processes frozen
Livelock	Processes running but making no progress
Starvation	Process perpetually denied resources due to unfair policy
Safe State	An ordering exists where all processes can complete
Banker's Algorithm	Avoidance algorithm checking for safe states before granting requests
RAG	Resource Allocation Graph — directed graph for deadlock detection
GIL	Global Interpreter Lock — CPython's coarse interpreter lock, which simplifies some internal synchronization at the cost of CPU-bound parallelism

Deadlock Handling Decision Guide:

Safety-critical system?     → Prevention (resource ordering, atomic acquisition)
Known resource bounds?      → Avoidance (Banker's Algorithm)
Checkpointing available?    → Detection & Recovery (cycle detection + rollback)
Interactive/general use?    → Often Ostrich or selective detection, depending on cost and failure tolerance
Distributed system?         → Timeouts (no global RAG available)
Maximum performance?        → Lock-free data structures (eliminate locks entirely)

Course Planner

Final Exam Release

HW 5: Hand-Tossed in Rust

Final Exam Due

LN 12: Don't Starve

Lecture Date

Standard

Topics Covered

📹 Lecture Recordings

Recap

Today's Agenda

Sequential vs Concurrent Headaches

The Hallway Problem: Livelock

Deadlock vs Starvation vs Livelock

What Are Resources?

Preemptable vs Non-Preemptable

Resource Lifecycle

Resource Instances

Rust Connection: Mutex<T> as a Resource

Semaphores: From Dijkstra to Rust

Lock Granularity Spectrum

Python's GIL — The Coarse Extreme in Practice

Coffman's Four Conditions

"Not-Quite-Deadlock" Examples

Coffman's Four Conditions

Resource Allocation Graphs

Step-by-Step: From Safe to Deadlock

Resource Allocation Graph

Build Your Own Deadlock

Build Your Own Deadlock

The Dining Philosophers Problem

Dining Philosophers Problem

Deadlock Handling Techniques

Detection and Recovery — "The Emergency Room"

Timeout-Based Detection

Deadlock Avoidance — "The Doctor's Diet"

The Banker's Algorithm

🏦Dijkstra's Bank — The Original Analogy

Banker's Algorithm — Safe State Calculator

Deadlock Prevention — "An Ounce of Prevention"

Resource Ordering in Practice

Rust's Compile-Time Prevention

Lock-Free and Wait-Free Data Structures

The Ostrich Algorithm — "Head in the Sand"

Rust Deadlock Examples

The Deadlock

Fix 1: Resource Ordering (Prevention)

Fix 2: try_lock with Backoff (Detection/Recovery)

Fix 3: Message Passing (Eliminate Shared Resources)

Scheduling and Deadlock

Safety-Critical Stories

The Therac-25 Incident (1985–1987)

The Mars Pathfinder (1997)

Extensions

Distributed Deadlocks

Convoy Effect Revisited

Two-Phase Locking (2PL)

Formal Verification

Spinlocks vs Blocking Locks

Summary

📝 Lecture Notes

📚 Additional Resources

Recommended Reading

Safety-Critical Case Studies

Formal Verification

Rust Concurrency

Historical References

All Lecture Notes

Recap

Today's Agenda

Sequential vs Concurrent Headaches

The Hallway Problem: Livelock

Deadlock vs Starvation vs Livelock

What Are Resources?

Preemptable vs Non-Preemptable

Resource Lifecycle

Resource Instances

Rust Connection: Mutex<T> as a Resource

Semaphores: From Dijkstra to Rust

Lock Granularity Spectrum

Rust Connection: `Mutex<T>` as a Resource

Fix 2: `try_lock` with Backoff (Detection/Recovery)

Rust Connection: `Mutex<T>` as a Resource

Fix 2: `try_lock` with Backoff (Detection/Recovery)