Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation

TMLR 2025

Vaibhav Seth
IIT Delhi, India
Arinjay Pathak
IIT Delhi, India
Ayan Sengupta
IIT Delhi, India
Aastha Verma
IIT Delhi, India
Natraj Raman
JPMorgan AI Research
Sriram Gopalakrishnan
JPMorgan AI Research
Niladri Chatterjee
IIT Delhi, India
Tanmoy Chakraborty
IIT Delhi, India
Parameter-Efficient Fine-tuning, Bayesian Methods, Robustness, Large Language Models, Low-Rank Adaptation

Abstract

MonteCLoRA is a Bayesian variant of LoRA for parameter-efficient fine-tuning of LLMs. It models low-rank adapters as mixtures of Gaussians with Wishart and Dirichlet hyperpriors, and uses Monte Carlo estimation to obtain low-variance, unbiased updates of LoRA parameters. This reduces sensitivity to hyperparameters and stabilizes fine-tuning, improving both robustness and accuracy on NLU and NLG benchmarks with only O(r) extra parameters for rank r.

TL;DR

LoRA is lightweight but surprisingly brittle: small hyperparameter changes can swing performance by 10–20 points. MonteCLoRA turns LoRA into a Bayesian, Monte Carlo–estimated adapter, keeping the same mean behavior but smoothing the loss landscape and dramatically improving robustness across learning rates and batch sizes.

Why this research?

LoRA is the go-to PEFT method, but in practice it is surprisingly temperamental:

  • Tiny tweaks in learning rate or batch size can cause large swings in validation accuracy.
  • Full fine-tuning is often even less stable and much more expensive.
  • Exhaustive hyperparameter sweeps on large LLMs are simply not affordable.

Existing “fixes” (MC Dropout, Laplace-LoRA, ensembles, temperature scaling) are mostly post-hoc calibration tricks: you train a brittle model and then try to bandage its uncertainty.
This paper asks a different question:

Can we make LoRA intrinsically robust by changing how the low-rank update itself is parameterized?

MonteCLoRA in one picture

Classic LoRA learns a deterministic low-rank matrix A.
MonteCLoRA replaces each column of A with a stochastic mixture of Gaussians:

  1. Treat each LoRA column as:
    • A mean vector, plus
    • A mixture of Gaussian samples drawn from a shared covariance.
  2. Mixture weights follow a Dirichlet prior; the shared covariance comes from a Wishart prior.
  3. Draw N samples, combine them with the mixture weights, scale the result by a factor ε, and add it to the mean.
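
A minimal PyTorch-style sketch of this sampling step, with illustrative names and shapes (sample_column, cov_factor, n_components, epsilon are all hypothetical); the actual MonteCLoRA parameterization learns its Wishart and Dirichlet posteriors rather than fixing them as below:

```python
import torch

def sample_column(mean, cov_factor, dirichlet_conc, n_components=4, epsilon=0.1):
    """Illustrative stochastic perturbation of one LoRA column.

    mean:           (d,) the usual deterministic LoRA column
    cov_factor:     (d, d) factor of a covariance shared across samples
    dirichlet_conc: (n_components,) Dirichlet concentration over mixture weights
    """
    # Mixture weights drawn from a Dirichlet prior.
    weights = torch.distributions.Dirichlet(dirichlet_conc).sample()        # (N,)
    # N zero-mean Gaussian samples sharing one covariance (via its factor).
    noise = torch.randn(n_components, mean.shape[0]) @ cov_factor.T         # (N, d)
    # Weighted combination of the samples, scaled by epsilon, added to the mean.
    return mean + epsilon * (weights @ noise)

# Usage with assumed sizes: column dimension d, N mixture components.
d, N = 64, 4
mean_col = torch.randn(d)
cov_factor = 0.05 * torch.eye(d)    # stand-in for a learned / Wishart-sampled covariance
conc = torch.ones(N)                # symmetric Dirichlet
stochastic_col = sample_column(mean_col, cov_factor, conc, n_components=N)
```

Because the Gaussian samples are zero-mean, the expected output of this sketch is exactly the mean column, which is the unbiasedness property described next.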

This yields a stochastic low-rank update whose expected value is still the original LoRA weight, but whose randomness:

  • Regularizes the model,
  • Smooths sharp minima,
  • And makes the training trajectory much less sensitive to hyperparameters.

Training objective = task loss + KL regularizers:

  • KL for the Gaussian (towards a standard normal),
  • KL for the Wishart covariance,
  • KL for the Dirichlet weights,
  • Plus a cooperative loss to keep all mixture components active (avoid collapse).
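
Schematically, and with hypothetical weighting coefficients β and λ (the exact weighting is not stated above), the full objective combines these terms roughly as:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{task}}
  \;+\; \beta \Big( \mathrm{KL}\big[q(\mathbf{a}) \,\|\, \mathcal{N}(\mathbf{0},\mathbf{I})\big]
  \;+\; \mathrm{KL}\big[q(\boldsymbol{\Sigma}) \,\|\, \mathrm{Wishart}\big]
  \;+\; \mathrm{KL}\big[q(\boldsymbol{\pi}) \,\|\, \mathrm{Dir}(\boldsymbol{\alpha})\big] \Big)
  \;+\; \lambda\, \mathcal{L}_{\text{coop}}
```

Here q(a), q(Σ), and q(π) denote the posteriors over adapter columns, the shared covariance, and the mixture weights, and L_coop is the cooperative term that keeps all components active.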

Overhead is modest: roughly O(r + N) extra parameters per LoRA layer (r = rank, N = mixture components).
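
As a back-of-the-envelope illustration, treating the O(r + N) estimate literally and assuming a 4096-dimensional layer with rank 16 and 4 components (all assumed values, not from the paper):

```python
# Illustrative parameter accounting for one adapted weight matrix (assumed sizes).
d_in, d_out = 4096, 4096           # hidden dimensions of the adapted layer
r, N = 16, 4                       # LoRA rank and number of mixture components

lora_params = r * (d_in + d_out)   # standard LoRA: A is (r x d_in), B is (d_out x r)
extra_params = r + N               # the O(r + N) extra parameters, taken literally

print(lora_params, extra_params, extra_params / lora_params)
# 131072 20 0.00015... -> roughly 0.015% on top of LoRA's own adapter parameters
```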

[Teaser figure]

Main insights

  • Unbiased but more stable LoRA
    The expected adapter equals standard LoRA, but stochastic sampling provides built-in regularization and wider, flatter minima, leading to more stable validation performance across hyperparameters (a small numerical sanity check appears after this list).

  • End-to-end Bayesian LoRA, not post-hoc
    Unlike MC Dropout or Laplace-LoRA, MonteCLoRA is trained from scratch with its Bayesian structure baked into the optimization, not slapped on afterward.

  • Robustness across tasks and models
    From RoBERTa on GLUE to LLaMA-7B on commonsense reasoning and LLaMA-3.2-3B on GSM8k / HumanEval, MonteCLoRA consistently:
    • Matches or improves best-case accuracy,
    • Shrinks the spread (worst–best gap) in validation metrics by up to ~50–60%.

  • Better calibration for free
    Negative log-likelihood (NLL) and calibration curves improve over LoRA and Bayesian post-hoc methods without extra calibration stages.

  • Low extra cost
    With buffered sampling and a careful implementation, training cost is roughly 1.2–1.7× that of LoRA and memory use about 1.06–1.25×, far cheaper than large hyperparameter sweeps or ensembles.
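
The unbiasedness claim in the first insight can be sanity-checked numerically: averaging many stochastic draws of the kind sketched earlier should recover the deterministic mean column (all names and sizes below are illustrative, not the paper's implementation):

```python
import torch

torch.manual_seed(0)
d, N, epsilon, n_draws = 64, 4, 0.1, 10000

mean_col = torch.randn(d)              # deterministic LoRA column
cov_factor = 0.05 * torch.eye(d)       # stand-in shared covariance factor
conc = torch.ones(N)                   # symmetric Dirichlet over mixture weights

total = torch.zeros(d)
for _ in range(n_draws):
    weights = torch.distributions.Dirichlet(conc).sample()      # (N,)
    noise = torch.randn(N, d) @ cov_factor.T                    # zero-mean Gaussian samples
    total += mean_col + epsilon * (weights @ noise)

empirical_mean = total / n_draws
# The Gaussian samples are zero-mean, so the average draw converges to mean_col.
print(torch.allclose(empirical_mean, mean_col, atol=1e-2))      # True (up to Monte Carlo error)
```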

MonteCLoRA Overview