You Only Prune Once: Designing Calibration-Free Model Compression with Policy Learning
ICLR 2025
Abstract
This work introduces PruneNet, a calibration-free structured pruning method for large language models that treats pruning as policy learning over intrinsic model properties. A lightweight policy network learns which FFN dimensions to keep by minimizing a spectral distribution shift between original and compressed weight matrices via a KS-distance penalty. PruneNet compresses LLaMA-2-7B in ~15 minutes, preserves 80–95% of zero-shot performance at 20–30% compression, and outperforms SliceGPT and other structured baselines both in accuracy and throughput—without relying on calibration datasets and with minimal benefit from recovery finetuning.
Large models are powerful but painfully expensive: memory-heavy, slow to run, and hard to deploy. Traditional structured pruning methods—SliceGPT, SVD-based approaches, and large-block channel pruning—depend heavily on calibration datasets, degrade sharply at high sparsity, and often require expensive recovery finetuning to undo the damage.
PruneNet takes a different route:
It treats pruning as policy learning over the model’s weights, guided by a simple but powerful principle:
"Keep the compressed spectrum close to the original spectrum, and the model will keep behaving the same."
By learning what to prune without ever touching calibration data, PruneNet becomes reusable, layer-aware, and surprisingly stable—even at aggressive compression ratios.
TL;DR
PruneNet turns pruning into a learned policy that studies a model’s own weight structure—no calibration data, no guesswork. By preserving the spectral shape of FFN layers, it achieves higher accuracy, faster inference, and robust compression across many LLMs. Think of it as pruning with intuition and discipline.
Why This Research?
LLM pruning suffers from two long-standing problems:
- Calibration dependency — Most techniques must evaluate many pruned variants on a dataset to pick which slices to keep.
- Spectral damage — Removing rows/columns collapses singular values, distorting information flow in deep layers.
PruneNet tackles both by:
- Using a weight-only policy network to choose rows of FFN layers.
- Penalizing Kolmogorov–Smirnov distance between original and compressed spectra.
- Making pruning decisions reusable across sparsity levels.
The result is a method that is fast, stable, and shockingly accurate with zero external data.
Main Insights
1. Pruning as Learned Policy (No Calibration)
Each FFN layer is treated as a state, and pruning rows is an action.
The policy network (a tiny MLP, roughly 0.6% the size of LLaMA-2-7B) outputs keep/drop probabilities based solely on the weights.
- No dataset needed
- No activation recording
- No layer-wise greedy scoring
Just pure structural reasoning over the matrix itself.
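A minimal sketch of what such a weight-only row-scoring policy could look like, in PyTorch. The per-row statistics used as features, the hidden size, and the class name `RowKeepPolicy` are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class RowKeepPolicy(nn.Module):
    """Tiny MLP that scores each FFN row from weight statistics alone.

    Illustrative sketch: the feature choice and layer sizes are assumptions,
    not the paper's exact policy network.
    """
    def __init__(self, n_features: int = 4, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, ffn_weight: torch.Tensor) -> torch.Tensor:
        # Per-row statistics of the weight matrix act as the "state";
        # no activations or calibration data are involved.
        feats = torch.stack([
            ffn_weight.abs().mean(dim=1),
            ffn_weight.std(dim=1),
            ffn_weight.norm(dim=1),
            ffn_weight.abs().max(dim=1).values,
        ], dim=1)                          # (n_rows, 4)
        logits = self.mlp(feats).squeeze(-1)
        return torch.sigmoid(logits)       # keep probability per row
```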
2. Spectral Preservation via KS Distance
Standard slicing shrinks and skews the singular values of the pruned weight matrices. PruneNet instead penalizes the Kolmogorov–Smirnov (KS) distance, i.e., the largest absolute gap between the empirical CDFs of the original and compressed singular-value spectra.
This keeps the shape of the singular-value distribution intact, which strongly correlates with downstream task stability.
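The penalty is simple to state: compute the singular values of the original and sliced weight matrices and take the largest gap between their empirical CDFs. A hedged sketch (the function name is illustrative; PruneNet's exact implementation may differ):

```python
import torch

def ks_spectral_distance(w_orig: torch.Tensor, w_pruned: torch.Tensor) -> torch.Tensor:
    """Kolmogorov-Smirnov distance between the empirical CDFs of the
    singular values of the original and compressed weight matrices."""
    s1 = torch.linalg.svdvals(w_orig)
    s2 = torch.linalg.svdvals(w_pruned)
    # Evaluate both empirical CDFs on a common grid of singular values.
    grid = torch.sort(torch.cat([s1, s2])).values
    cdf1 = (s1.unsqueeze(0) <= grid.unsqueeze(1)).float().mean(dim=1)
    cdf2 = (s2.unsqueeze(0) <= grid.unsqueeze(1)).float().mean(dim=1)
    return (cdf1 - cdf2).abs().max()
```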
3. Structured FFN Compression (Where It Matters)
FFN layers account for:
- 64% of parameters in LLaMA-2-7B
- Majority of nonlinearity
- Majority of compute
PruneNet selectively removes entire rows of FFN1 and the corresponding columns of FFN2, preserving functional consistency while delivering real FLOP reductions.
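Once the keep set is chosen, the slicing is mechanical: drop row i of FFN1 and column i of FFN2 so the intermediate activations stay aligned. A minimal sketch with generic `fc1`/`fc2` names (LLaMA-style blocks also have a gate projection, which would be sliced like `fc1`):

```python
import torch
import torch.nn as nn

def slice_ffn(fc1: nn.Linear, fc2: nn.Linear, keep_idx: torch.Tensor) -> None:
    """Remove intermediate FFN dimensions in a structurally consistent way:
    rows of FFN1 and the matching columns of FFN2."""
    fc1.weight.data = fc1.weight.data[keep_idx, :]    # rows of FFN1
    if fc1.bias is not None:
        fc1.bias.data = fc1.bias.data[keep_idx]
    fc2.weight.data = fc2.weight.data[:, keep_idx]    # columns of FFN2
    fc1.out_features = len(keep_idx)
    fc2.in_features = len(keep_idx)
```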
4. You Only Prune Once: Policy Transfer
Train a policy at one sparsity level—reuse it across many compression ratios.
- Train at 40% → reuse at 10%, 20%, 30% → <1% average drop.
- Train at 10% → reuse at 20–40% → still beats SliceGPT by ≥3 points.
A single policy effectively acts as a compression oracle for the whole model family.
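Because the policy emits a keep score per row, reusing it at a different compression ratio reduces to re-thresholding the same scores. A hedged sketch, assuming the `RowKeepPolicy` and `slice_ffn` helpers from the earlier snippets:

```python
import torch

def keep_indices(policy, ffn_weight: torch.Tensor, compression_ratio: float) -> torch.Tensor:
    """Reuse one trained policy at any compression ratio by keeping the
    top-scoring rows ("you only prune once")."""
    scores = policy(ffn_weight)                        # keep score per row
    n_keep = int(round(ffn_weight.shape[0] * (1.0 - compression_ratio)))
    return torch.topk(scores, n_keep).indices.sort().values

# e.g. a policy trained at 40% compression reused at 20%:
# idx = keep_indices(policy, layer.fc1.weight, compression_ratio=0.20)
# slice_ffn(layer.fc1, layer.fc2, idx)
```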
5. Reinforcement Learning Over Layers
The pruning problem is non-differentiable.
PruneNet uses REINFORCE with:
- Per-layer rewards
- Future-discounted spectral penalties
- Layer depth–sensitive prioritization
Later layers get stronger incentives (they hold more semantic content).
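A hedged sketch of one REINFORCE-style update, assuming Bernoulli keep/drop actions per row, the negated KS penalty as the reward, and a simple exponential depth weighting so deeper layers count more. It reuses `RowKeepPolicy` and `ks_spectral_distance` from the snippets above; PruneNet's exact reward shaping and discounting may differ:

```python
import torch

def reinforce_step(policy, optimizer, ffn_weights, gamma: float = 0.9) -> None:
    """One policy-gradient update over all FFN layers (simplified sketch).

    Reward per layer = -KS distance between original and compressed spectra.
    Later layers receive larger weights, reflecting their stronger incentives.
    """
    n = len(ffn_weights)
    optimizer.zero_grad()
    loss = torch.zeros(())
    for i, w in enumerate(ffn_weights):              # one FFN1 matrix per layer
        probs = policy(w)                            # keep probability per row
        dist = torch.distributions.Bernoulli(probs)
        mask = dist.sample()                         # 1 = keep row, 0 = drop row
        # Treat the spectral reward as a constant with respect to the policy.
        reward = (-ks_spectral_distance(w, w[mask.bool()])).detach()
        depth_weight = gamma ** (n - 1 - i)          # deeper layers weighted higher
        loss = loss - depth_weight * reward * dist.log_prob(mask).sum()
    loss.backward()
    optimizer.step()
```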
Results Overview
Zero-Shot Performance (LLaMA-2-7B)
Average over PIQA, WinoGrande, HellaSwag, ARC-e, ARC-c:
| Compression | Dense | SliceGPT | PruneNet |
|---|---|---|---|
| 20% | 69.0 | 58.2 | 61.7 |
| 25% | 69.0 | 55.5 | 58.6 |
| 30% | 69.0 | 51.5 | 55.5 |
PruneNet retains ~89% performance at moderate sparsity and stays significantly more stable than SliceGPT.
Phi-2 (2.7B)
| Compression | Dense | SliceGPT | PruneNet |
|---|---|---|---|
| 30% | 72.24 | 51.99 | 61.05 |
A massive 9-point advantage at the same sparsity.
Throughput Gains
LLaMA-2-7B @ 30% compression:
| Model | Tokens/sec |
|---|---|
| Dense | 11.96 |
| SliceGPT | 12.82 |
| PruneNet | 20.74 |
Roughly 1.7× faster than the dense model and about 1.6× faster than SliceGPT.