The Data-Shuffling Problem

From a System Problem to an Information-Theoretic One

Section 7.1 gave the systems-level view: shuffling is a per-epoch data-movement cost. This section formalizes the problem as an information-theoretic coding problem, with well-defined inputs (the dataset, the permutations, the caches), well-defined outputs (the post-shuffle data distribution), and a precise communication-load metric.

Once we have the formal problem, the tools of Chapter 2 (cut-set converses), Chapter 4 (finite-field IA), and Chapter 5 (polynomial codes) apply directly. The point is that the coded-shuffling tradeoff of §7.3 is not a stand- alone trick; it is a specialization of the computation-communication framework we have been building for five chapters.

Definition:

(N,D,M)(N, D, M)-Data-Shuffling Problem

Let the dataset be W={W1,,WD}\mathcal{W} = \{W_1, \ldots, W_D\}, each WdW_d a fixed-size chunk. The (N,D,M)(N, D, M)-data-shuffling problem operates over NN workers as follows:

  1. Placement phase (one-time, before training): Each worker kk stores a subset TkW\mathcal{T}_k \subseteq \mathcal{W} of size Tk=M|\mathcal{T}_k| = M. The placement is chosen centrally by the master and does not depend on the future permutations.

  2. Delivery phase (one per epoch): At the start of epoch tt, a random permutation πt:[D][D]\pi_t: [D] \to [D] is announced. The permutation induces per-worker assignments Ak(t)=πt1(Bk)\mathcal{A}_k^{(t)} = \pi_t^{-1}(\mathcal{B}_k), where Bk={(k1)D/N+1,,kD/N}\mathcal{B}_k = \{(k-1)D/N + 1, \ldots, kD/N\} is worker kk's fixed slot in the epoch's processing order.

  3. Server broadcasts a sequence of bit-messages to (noiselessly) inform each worker of the new-epoch data it does not already have: worker kk needs every WjAk(t)TkW_j \in \mathcal{A}_k^{(t)} \setminus \mathcal{T}_k.

The per-epoch shuffling rate is Δ(πt)=total broadcast bits/D\Delta(\pi_t) = \text{total broadcast bits} / D (normalized by the dataset size). The worst-case rate is R(M)=maxπΔ(π)R^*(M) = \max_{\pi} \Delta(\pi). The information-theoretic question is: what is the minimum achievable R(M)R^*(M)?

The setup parallels coded caching (Chapter 4 §4.3): the placement phase knows nothing about future demands; the delivery phase answers the specific permutation via a single broadcast. The difference is that coded caching delivers files (one request per user), while data shuffling delivers groups of files (one mini-batch per worker).

Per-Epoch Shuffling Rate Δ\Delta

The total network traffic required to re-shuffle the distributed dataset after a new random permutation is announced, normalized by the dataset size. The optimal R(M)R^*(M) is characterized by the CommIT result of §7.3.

Data Shuffling vs. MapReduce Shuffle

The term "shuffle" is overloaded. Two different operations in distributed systems use it:

  • MapReduce shuffle (Chapter 2): every worker reads its own intermediate values and sends them to the worker responsible for the corresponding reducer key. One-time, per-job operation.
  • Data shuffling in ML (this chapter): a fresh random permutation of the input dataset is computed, and each worker receives its slice of the permuted dataset. Per- epoch operation (potentially hundreds of times per training run).

The coding techniques are similar (both use finite-field IA / XOR alignment), but the rate regions are different. The data-shuffling bound R(M)=1M/D1+NM/DR^*(M) = \frac{1 - M/D}{1 + NM/D} of §7.3 specializes to the same formula as Maddah-Ali / Niesen caching under the identification K=NK = N, F=DF = D (users = workers, files = data points).

Theorem: Lower Bound: R(M)Rcut(M)R^*(M) \geq R_{\text{cut}}(M)

For any (N,D,M)(N, D, M)-data-shuffling scheme, R(M)    N(1M/D)1+NM/D.R^*(M) \;\geq\; \frac{N(1 - M/D)}{1 + NM/D}. The proof is a cut-set argument specialized to the shuffling problem: any permutation-agnostic placement must admit at least this much per-epoch broadcast traffic for the worst-case permutation.

With NN workers and per-worker memory M=μDM = \mu D, a broadcast bit can satisfy at most 1+KM/D=1+Nμ1 + KM/D = 1 + N\mu distinct per-worker missing slots simultaneously (the alignment factor of Chapter 4's coded caching). The total number of missing-data-point deliveries is N(1μ)DN (1 - \mu) D normalized, so the minimum broadcast traffic is N(1μ)/(1+Nμ)N(1 - \mu)/(1 + N\mu). The argument is exactly the Maddah-Ali / Niesen cut-set from §4.3, specialized to the data-shuffling problem.

Example: N=3N = 3 Workers, D=6D = 6 Data Points, M=2M = 2

Set up the (N,D,M)=(3,6,2)(N, D, M) = (3, 6, 2) shuffling problem. Compute the minimum per-epoch shuffling rate, and verify the normalization.

Per-Epoch Shuffling Load vs. Per-Worker Memory

Plot the coded shuffling rate R(M)=N(1μ)/(1+Nμ)R^*(M) = N(1-\mu)/(1+N\mu) against the per-worker memory fraction μ=M/D\mu = M/D, with comparison to the uncoded baseline N(1μ)N(1-\mu). The curves illustrate the multiplicative gain of 1+Nμ1 + N\mu from finite-field IA in the delivery phase. The gap grows as μ\mu increases: at μ=1/2\mu = 1/2 with N=20N = 20 workers, the coded scheme is 11×11\times more efficient than uncoded.

Parameters
20
464

Number of workers

0.30
0.051

Memory fraction at which we annotate the gap

Data Shuffling vs. Coded Caching vs. MapReduce Shuffle

ProblemWhat is deliveredPer-round costInformation-theoretic rate
Coded caching (Ch. 4)KK user-specific file requestsOne broadcastK(1M/F)/(1+KM/F)K(1 - M/F) / (1 + KM/F)
Data shuffling (this chapter)Worker-specific slices of permuted datasetOne broadcast per epochN(1M/D)/(1+NM/D)N(1 - M/D) / (1 + NM/D)
MapReduce shuffle (Ch. 2)Worker-specific intermediate-value partitionsOne shuffle per job(1μ)/(Nμ)(1 - \mu) / (N\mu)

Common Mistake: Memory MM vs. Memory Fraction μ=M/D\mu = M/D

Mistake:

Quote shuffling rates in terms of MM without specifying the dataset size DD.

Correction:

The per-worker memory must be stated in fraction of dataset μ=M/D\mu = M/D to be meaningful. A memory MM that is "a lot" for ImageNet (D=1.3D = 1.3M) may be negligible for trillion-parameter LLM pre-training datasets. The information-theoretic rates depend on μ\mu, not on MM alone.

🔧Engineering Note

Shuffling in Federated Learning: A Special Case

In federated learning, each user's data is fixed and private — there is no cross-user data shuffling at all (this is one of FL's defining features). Each user repeatedly cycles through its own local dataset. The convergence-rate implications are subtle: with nn users, each drawing from their local distribution, the effective "shuffling" comes from the random user-selection per round (FedAvg selects a random subset of users each round, CnC \cdot n of them).

So in FL the shuffling cost is replaced by a user- selection cost. Coded shuffling (this chapter) therefore does not apply to vanilla FL; it applies to standard distributed-SGD deployments (data-center training). Chapter 9 treats FL in detail; Chapter 11 (ByzSecAgg) re-introduces coded-computing primitives into FL for Byzantine robustness.

Practical Constraints
  • FL: no cross-user shuffling; each user cycles its local data

  • Distributed SGD (data-center): cross-worker shuffling via coded shuffling

  • Hybrid: partial data shuffling across user subsets is an active research area

📋 Ref: McMahan et al. FedAvg 2017; PyTorch DataLoader

Key Takeaway

Data shuffling is an information-theoretic problem with a clean achievability-converse structure. The cut-set lower bound R(M)N(1μ)/(1+Nμ)R^*(M) \geq N(1-\mu)/(1+N\mu) of §7.2 mirrors Maddah-Ali / Niesen caching. Section 7.3 gives the achievability (the CommIT-group result by Wan, Tuninetti, and Caire) that matches this lower bound via finite-field IA, closing the rate-region characterization.

Quick Check

For N=20N = 20 workers, per-worker memory fraction μ=1/4\mu = 1/4, what is the minimum per-epoch shuffling rate R(M)R^*(M)?

1515

2.52.5

7.57.5

2020