Ferkans — Interactive Telecom Tutor

From a System Problem to an Information-Theoretic One

Section 7.1 gave the systems-level view: shuffling is a per-epoch data-movement cost. This section formalizes the problem as an information-theoretic coding problem, with well-defined inputs (the dataset, the permutations, the caches), well-defined outputs (the post-shuffle data distribution), and a precise communication-load metric.

Once we have the formal problem, the tools of Chapter 2 (cut-set converses), Chapter 4 (finite-field IA), and Chapter 5 (polynomial codes) apply directly. The point is that the coded-shuffling tradeoff of §7.3 is not a stand-alone trick; it is a specialization of the computation-communication framework we have been building for five chapters.

Definition:
$(N, D, M)$ -Data-Shuffling Problem

Let the dataset be $\mathcal{W} = \{W_1, \ldots, W_D\}$ , each $W_d$ a fixed-size chunk. The $(N, D, M)$ -data-shuffling problem operates over $N$ workers as follows:

Placement phase (one-time, before training): Each worker $k$ stores a subset $\mathcal{T}_k \subseteq \mathcal{W}$ of size $|\mathcal{T}_k| = M$ . The placement is chosen centrally by the master and does not depend on the future permutations.
Delivery phase (one per epoch): At the start of epoch $t$ , a random permutation $\pi_t: [D] \to [D]$ is announced. The permutation induces per-worker assignments $\mathcal{A}_k^{(t)} = \pi_t^{-1}(\mathcal{B}_k)$ , where $\mathcal{B}_k = \{(k-1)D/N + 1, \ldots, kD/N\}$ is worker $k$ 's fixed slot in the epoch's processing order.
The server broadcasts a sequence of bit-messages over a noiseless link so that each worker obtains the new-epoch data it does not already hold: worker $k$ needs every $W_j \in \mathcal{A}_k^{(t)} \setminus \mathcal{T}_k$ .

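The per-epoch shuffling rate is $\Delta(\pi_t) = \text{total broadcast bits} / \text{bits in one worker's slice}$ , i.e. the broadcast traffic measured in units of one worker's per-epoch slice of $D/N$ chunks. This is the normalization under which the bound below reads $N(1-\mu)/(1+N\mu)$ with $\mu = M/D$ ; normalizing by the full dataset size $D$ instead divides every rate in this section by $N$ . The worst-case rate is $R^*(M) = \max_{\pi} \Delta(\pi)$ . The information-theoretic question is: what is the minimum achievable $R^*(M)$ ?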
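To make the delivery-phase bookkeeping concrete, here is a minimal Python sketch (not taken from the source) that computes the assignments $\mathcal{A}_k^{(t)}$ and the missing sets $\mathcal{A}_k^{(t)} \setminus \mathcal{T}_k$ for one permutation. It uses 0-based indices and assumes $N$ divides $D$; the function names `slots`, `assignments`, `missing` and the sample placement are hypothetical choices made for illustration.

```python
import random


def slots(N, D):
    """Fixed processing slots B_k: worker k owns D/N consecutive positions (0-based)."""
    assert D % N == 0, "this sketch assumes N divides D"
    size = D // N
    return [set(range(k * size, (k + 1) * size)) for k in range(N)]


def assignments(perm, N):
    """A_k = pi^{-1}(B_k): the chunks d whose permuted position perm[d] lies in B_k."""
    D = len(perm)
    return [{d for d in range(D) if perm[d] in B_k} for B_k in slots(N, D)]


def missing(perm, caches):
    """Chunks each worker is assigned this epoch but does not hold (A_k minus T_k)."""
    return [A_k - T_k for A_k, T_k in zip(assignments(perm, len(caches)), caches)]


if __name__ == "__main__":
    N, D, M = 3, 6, 2
    caches = [{0, 1}, {2, 3}, {4, 5}]  # hypothetical placement with |T_k| = M
    perm = list(range(D))
    random.Random(0).shuffle(perm)     # one epoch's permutation pi_t
    need = missing(perm, caches)
    print("missing per worker:", need)
    print("chunks an uncoded scheme would send:", sum(len(s) for s in need))
```

Summing the sizes of the missing sets gives the number of chunks an uncoded scheme would have to send for that permutation; a coded scheme aims to serve several of these missing chunks with each broadcast.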

The setup parallels coded caching (Chapter 4 §4.3): the placement phase knows nothing about future demands; the delivery phase answers the specific permutation via a single broadcast. The difference is that coded caching delivers files (one request per user), while data shuffling delivers groups of files (one mini-batch per worker).

Per-Epoch Shuffling Rate $\Delta$

The total network traffic required to re-shuffle the distributed dataset after a new random permutation is announced, normalized by the size of one worker's per-epoch slice ( $D/N$ chunks) so that the uncoded baseline reads $N(1-\mu)$ . The optimal $R^*(M)$ is characterized by the CommIT result of §7.3.

Data Shuffling vs. MapReduce Shuffle

The term "shuffle" is overloaded. Two different operations in distributed systems use it:

MapReduce shuffle (Chapter 2): every worker reads its own intermediate values and sends them to the worker responsible for the corresponding reducer key. One-time, per-job operation.
Data shuffling in ML (this chapter): a fresh random permutation of the input dataset is computed, and each worker receives its slice of the permuted dataset. Per-epoch operation (potentially hundreds of times per training run).

The coding techniques are similar (both use finite-field IA / XOR alignment), but the rate regions are different. The data-shuffling bound $R^*(M) = \frac{N(1 - M/D)}{1 + NM/D}$ of §7.3 has the same form as the Maddah-Ali / Niesen caching rate $\frac{K(1 - M/F)}{1 + KM/F}$ under the identification $K = N$ , $F = D$ (users = workers, files = data points).

Theorem: Lower Bound: $R^*(M) \geq R_{\text{cut}}(M)$

For any $(N, D, M)$ -data-shuffling scheme, $R^*(M) \;\geq\; \frac{N(1 - M/D)}{1 + NM/D}.$ The proof is a cut-set argument specialized to the shuffling problem: any permutation-agnostic placement must admit at least this much per-epoch broadcast traffic for the worst-case permutation.

With $N$ workers and per-worker memory $M = \mu D$ , a broadcast bit can satisfy at most $1 + NM/D = 1 + N\mu$ distinct per-worker missing slots simultaneously (the alignment factor of Chapter 4's coded caching). The aggregate demand is $D(1 - \mu)$ missing data points, i.e. $N(1 - \mu)$ in units of one worker's slice of $D/N$ points, so the minimum broadcast traffic is $N(1 - \mu)/(1 + N\mu)$ in those units. The argument is exactly the Maddah-Ali / Niesen cut-set from §4.3, specialized to the data-shuffling problem.

Proof

Count per-worker missing points

After placement, each worker has $M$ points cached; under a fresh permutation, each worker is assigned $D/N$ points. On average, a fraction $M/D = \mu$ of these, i.e. $(D/N)(M/D) = M/N$ points, is already cached. So each worker is missing $(D/N)(1 - \mu)$ points per epoch.

Multiply by N workers

Aggregate missing-per-epoch: $N \cdot (D/N)(1 - \mu) = D(1 - \mu)$ .

Apply the alignment bound

Each broadcast bit aligns $1 + N\mu$ missing slots into one transmission (the same factor as the coded-caching gain). Minimum broadcast traffic: $D(1 - \mu)/(1 + N\mu)$ points. Dividing by the per-worker slice size $D/N$ , the unit in which the theorem states the rate, gives $R^*(M) \geq N(1 - \mu)/(1 + N\mu)$ .

The cut-set argument (Chapter 2's output-entropy bound, specialized) makes the $1 + N\mu$ factor precise. $\blacksquare$
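The three counting steps of the proof can be traced with exact rational arithmetic. The following is a sketch under the assumptions of this section (average per-worker cache hit fraction $\mu$, alignment factor $1 + N\mu$, rate expressed in units of one worker's slice of $D/N$ points); the function name `counting_bound` and the dictionary keys are hypothetical.

```python
from fractions import Fraction


def counting_bound(N, D, M):
    """Trace the proof's counting steps for an (N, D, M) shuffling problem."""
    mu = Fraction(M, D)                 # per-worker memory fraction
    slice_size = Fraction(D, N)         # points assigned to each worker per epoch

    # Step 1: points a worker is assigned but does not have cached (on average).
    missing_per_worker = slice_size * (1 - mu)
    # Step 2: aggregate over all N workers.
    aggregate_missing = N * missing_per_worker
    # Step 3: each broadcast serves up to 1 + N*mu missing slots at once.
    alignment = 1 + N * mu
    broadcast_points = aggregate_missing / alignment

    return {
        "missing_per_worker": missing_per_worker,
        "aggregate_missing": aggregate_missing,
        "alignment": alignment,
        "broadcast_points": broadcast_points,
        # Rate in units of one worker's slice: N(1 - mu) / (1 + N*mu).
        "rate_in_slices": broadcast_points / slice_size,
    }


if __name__ == "__main__":
    for name, value in counting_bound(N=4, D=16, M=4).items():
        print(f"{name}: {value}")
```

Using `Fraction` keeps every intermediate quantity exact, which makes it easy to see where the factor $N$ in the final rate comes from: it appears only when the broadcast count is divided by the slice size $D/N$ rather than by $D$.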

Example: $N = 3$ Workers, $D = 6$ Data Points, $M = 2$

Set up the $(N, D, M) = (3, 6, 2)$ shuffling problem. Compute the minimum per-epoch shuffling rate, and verify the normalization.

Solution

Setup

Dataset $\mathcal{W} = \{W_1, \ldots, W_6\}$ . Each worker caches $M = 2$ points. Per-epoch, a random permutation $\pi$ is announced. Each worker gets $D/N = 2$ new points to process.

Uncoded cost

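Uncached fraction per worker: $1 - M/D = 1 - 1/3 = 2/3$ . Per-worker missing: $D/N \cdot 2/3 = 4/3$ points. Aggregate: $N \cdot 4/3 = 4$ points. In units of one worker's slice ( $D/N = 2$ points), $R_{\text{uncoded}} = 4/2 = 2 = N(1 - M/D)$ ; as a fraction of the whole dataset this is $4/6 = 2/3$ .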

Coded lower bound

$R^* \geq N(1 - M/D)/(1 + NM/D) = 3 \cdot (2/3)/(1 + 1) = 2/2 = 1$ . This must be compared with the uncoded cost in the same units: $R_{\text{uncoded}} = N(1 - M/D) = 3 \cdot 2/3 = 2$ , measured in worker slices of $D/N = 2$ points (equivalently, 4 points or $2/3$ of the dataset). So coded is $2\times$ better than uncoded at this point, matching the alignment factor $1 + NM/D = 2$ . The normalization choice matters: comparing the slice-normalized bound against a dataset-normalized uncoded cost would wrongly suggest that coding hurts. Stick with the standard $R^* = N(1 - \mu)/(1 + N\mu)$ convention.
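The numbers in this example can be checked by brute force. The sketch below enumerates all $6! = 720$ permutations for one hypothetical placement (each worker caching two distinct points, 0-based indices) and averages the number of assigned-but-uncached points; the variable names are illustrative, and the 4-point figure it reproduces is an average over permutations, not a per-permutation guarantee.

```python
from fractions import Fraction
from itertools import permutations

N, D, M = 3, 6, 2
SLICE = D // N                        # points per worker per epoch
CACHES = [{0, 1}, {2, 3}, {4, 5}]     # hypothetical placement, |T_k| = M


def aggregate_missing(perm):
    """Total number of assigned-but-uncached points over all workers."""
    total = 0
    for k in range(N):
        # Worker k processes the chunks whose permuted position falls in its slot.
        assigned = {d for d in range(D) if perm[d] // SLICE == k}
        total += len(assigned - CACHES[k])
    return total


counts = [aggregate_missing(p) for p in permutations(range(D))]
average_missing = Fraction(sum(counts), len(counts))

# Rates in units of one worker's slice (D/N points).
uncoded_rate = average_missing / SLICE
coded_bound = Fraction(N * (D - M), D + N * M)   # N(1 - M/D) / (1 + NM/D)

print("average missing points:", average_missing)
print("uncoded rate (slices):", uncoded_rate)
print("coded lower bound (slices):", coded_bound)
print("ratio:", uncoded_rate / coded_bound)
```

Running it should reproduce the example's 4 missing points on average, an uncoded rate of 2, a coded bound of 1, and therefore a ratio of 2.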

Per-Epoch Shuffling Load vs. Per-Worker Memory

Plot the coded shuffling rate $R^*(M) = N(1-\mu)/(1+N\mu)$ against the per-worker memory fraction $\mu = M/D$ , with comparison to the uncoded baseline $N(1-\mu)$ . The curves illustrate the multiplicative gain of $1 + N\mu$ from finite-field IA in the delivery phase. The gap grows as $\mu$ increases: at $\mu = 1/2$ with $N = 20$ workers, the coded scheme is $11\times$ more efficient than uncoded.

Parameters: $N$ , the number of workers (set to 20 in the interactive plot), and $\mu$ , the memory fraction at which the gap is annotated (highlighted point at 0.30).
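For readers without access to the interactive plot, the two curves can be tabulated directly from the formulas in this section. This is a minimal sketch; the function names and the grid of $\mu$ values are assumptions made for illustration, and rates are in units of one worker's slice.

```python
def uncoded_rate(N, mu):
    """Uncoded baseline N(1 - mu)."""
    return N * (1.0 - mu)


def coded_rate(N, mu):
    """Coded shuffling rate N(1 - mu) / (1 + N*mu)."""
    return N * (1.0 - mu) / (1.0 + N * mu)


def gain(N, mu):
    """Multiplicative gain of coded over uncoded, 1 + N*mu."""
    return 1.0 + N * mu


def sweep(N, steps=10):
    """Return (mu, uncoded, coded, gain) rows for mu = 0, 1/steps, ..., 1."""
    rows = []
    for i in range(steps + 1):
        mu = i / steps
        rows.append((mu, uncoded_rate(N, mu), coded_rate(N, mu), gain(N, mu)))
    return rows


if __name__ == "__main__":
    print(f"{'mu':>5} {'uncoded':>9} {'coded':>9} {'gain':>6}")
    for mu, u, c, g in sweep(N=20):
        print(f"{mu:5.2f} {u:9.3f} {c:9.3f} {g:6.1f}")
```

The gain column is computed as $1 + N\mu$ rather than as a ratio of the two rates so that the row at $\mu = 1$, where both rates are zero, remains well defined.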

Data Shuffling vs. Coded Caching vs. MapReduce Shuffle

Problem	What is delivered	Per-round cost	Information-theoretic rate
Coded caching (Ch. 4)	$K$ user-specific file requests	One broadcast	$K(1 - M/F) / (1 + KM/F)$
Data shuffling (this chapter)	Worker-specific slices of permuted dataset	One broadcast per epoch	$N(1 - M/D) / (1 + NM/D)$
MapReduce shuffle (Ch. 2)	Worker-specific intermediate-value partitions	One shuffle per job	$(1 - \mu) / (N\mu)$

Common Mistake: Memory $M$ vs. Memory Fraction $\mu = M/D$

Mistake:

Quote shuffling rates in terms of $M$ without specifying the dataset size $D$ .

Correction:

The per-worker memory must be stated as a fraction of the dataset, $\mu = M/D$ , to be meaningful. A memory $M$ that is "a lot" for ImageNet ( $D \approx 1.3$ million images) may be negligible for the pre-training datasets of trillion-parameter LLMs. The information-theoretic rates depend on $\mu$ , not on $M$ alone.

🔧Engineering Note

Shuffling in Federated Learning: A Special Case

In federated learning, each user's data is fixed and private — there is no cross-user data shuffling at all (this is one of FL's defining features). Each user repeatedly cycles through its own local dataset. The convergence-rate implications are subtle: with $n$ users, each drawing from their local distribution, the effective "shuffling" comes from the random user-selection per round (FedAvg selects a random subset of users each round, $C \cdot n$ of them).

So in FL the shuffling cost is replaced by a user-selection cost. Coded shuffling (this chapter) therefore does not apply to vanilla FL; it applies to standard distributed-SGD deployments (data-center training). Chapter 9 treats FL in detail; Chapter 11 (ByzSecAgg) re-introduces coded-computing primitives into FL for Byzantine robustness.
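The per-round user selection that plays the role of shuffling in FL can be sketched in a few lines. This is an illustrative sketch only: the function name `select_clients` is hypothetical, and flooring $C \cdot n$ to an integer (with a minimum of one user) is an assumption about how the fractional count is rounded.

```python
import random


def select_clients(n, C, rng):
    """Pick a random subset of about C*n users for one round (at least one)."""
    m = max(int(C * n), 1)              # assumed rounding: floor, minimum 1
    return sorted(rng.sample(range(n), m))


if __name__ == "__main__":
    rng = random.Random(0)              # seeded only to make the demo repeatable
    for round_idx in range(3):
        print(f"round {round_idx}: users {select_clients(n=10, C=0.5, rng=rng)}")
```

No data moves between users here: only the set of participating users changes from round to round, which is why the communication-load analysis of this chapter does not carry over.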

Practical Constraints

• FL: no cross-user shuffling; each user cycles its local data
• Distributed SGD (data-center): cross-worker shuffling via coded shuffling
• Hybrid: partial data shuffling across user subsets is an active research area

📋 Ref: McMahan et al. FedAvg 2017; PyTorch DataLoader

Key Takeaway

Data shuffling is an information-theoretic problem with a clean achievability-converse structure. The cut-set lower bound $R^*(M) \geq N(1-\mu)/(1+N\mu)$ of §7.2 mirrors Maddah-Ali / Niesen caching. Section 7.3 gives the achievability (the CommIT-group result by Wan, Tuninetti, and Caire) that matches this lower bound via finite-field IA, closing the rate-region characterization.

Quick Check

For $N = 20$ workers, per-worker memory fraction $\mu = 1/4$ , what is the minimum per-epoch shuffling rate $R^*(M)$ ?

$15$

$2.5$

$7.5$

$20$