Signal uncertainty quantification for OpenTSLM checkpoints, built on Monte Carlo Signal Perturbation Uncertainty (MCSPU).
MCSPU measures how much a model's output distribution shifts when the input time-series is perturbed.
Three perturbation modes are currently supported: Gaussian noise (gaussian), missing timepoints (missing_zeros), and missing channels (missing_channels).
The main entry point is opentslm_uncertainty_test.py. It runs a full Gaussian-noise MCSPU sweep over four σ values on your model and decides whether it is production-ready, that is, whether it truly relies on the signal.
# 1. Install dependencies (can be done using uv also)
pip install -r requirements.txt
# 2. Run the uncertainty test on your checkpoint
#    (--perturbation_type is optional; the default is gaussian)
python opentslm_uncertainty_test.py \
--checkpoint models/<your_model>.pt \
--dataset <tsqa|har|sleep|ecg_qa> \
--model_type <sp|flamingo> \
--perturbation_type <gaussian|missing_zeros|missing_channels> \
--llm_id meta-llama/Llama-3.2-1B \
--out_dir plots \
--n_noise 50 \
--max_samples 200 \
--device cuda \
--class_batch 16 # (adapt to your GPU capabilities)
Use at least --n_noise 50 and --max_samples 200 (the defaults). These values provide sufficient statistical power (Cohen's $d \ge 0.3$ detectable at $> 95\%$ power). See UNCERTAINTY_TEST_GUIDE.txt for the full justification.
Exit code 0 = PRODUCTION READY, 1 = NOT PRODUCTION READY.
All modes print a summary table in the terminal and save plots to --out_dir.
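Because the verdict is carried by the exit code, the test can be wrapped in an automated gate. The following is a minimal sketch, assuming the script is called from the repository root; the helper names (`build_command`, `verdict`, `run_gate`) are hypothetical and only flags shown in the usage above are passed.

```python
import subprocess
import sys


def build_command(checkpoint, dataset, model_type,
                  llm_id="meta-llama/Llama-3.2-1B", out_dir="plots"):
    """Assemble the CLI call using only flags that appear in the usage above."""
    return [
        sys.executable, "opentslm_uncertainty_test.py",
        "--checkpoint", checkpoint,
        "--dataset", dataset,
        "--model_type", model_type,
        "--llm_id", llm_id,
        "--out_dir", out_dir,
        "--n_noise", "50",
        "--max_samples", "200",
    ]


def verdict(returncode):
    """Map the documented exit codes to a human-readable label."""
    labels = {0: "PRODUCTION READY", 1: "NOT PRODUCTION READY"}
    return labels.get(returncode, f"UNEXPECTED EXIT CODE {returncode}")


def run_gate(checkpoint, dataset, model_type):
    """Run the test as a subprocess and return the verdict label."""
    result = subprocess.run(build_command(checkpoint, dataset, model_type))
    return verdict(result.returncode)
```

Calling `run_gate("models/<your_model>.pt", "har", "sp")` would launch the test and return one of the two labels; any other exit code is reported as unexpected rather than silently treated as a pass.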
Monte Carlo Signal Perturbation Uncertainty is a per-sample score defined as:
U_signal(x) = (1/N) Σᵢ KL( p_clean ‖ p_perturbed_i )
where:
- p_clean = model output distribution on the real signal
- p_perturbed_i = model output distribution on the perturbed signal (noise or missing data)
- N = number of perturbation draws (default 50)
The MCSPU scorer (src/opentslm/uncertainty/mcspu.py) operates by replacing the real signal with a perturbed copy and comparing the resulting output distributions via KL divergence. The text description in the prompt is intentionally left unchanged so only signal content is perturbed.
High MCSPU → model output distribution shifts when the signal is perturbed → the model is using the signal.
Low MCSPU → model output distribution barely changes → the model is ignoring the signal and answering from text/prior only.
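The definition can be illustrated with a small standard-library sketch. This is not the code in src/opentslm/uncertainty/mcspu.py: the two toy "models", the `mcspu_score` helper, and the three-class output are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def mcspu_score(predict, signal, perturb, n_draws=50, rng=None):
    """U_signal(x) = (1/N) * sum_i KL(p_clean || p_perturbed_i)."""
    rng = rng or random.Random(0)
    p_clean = predict(signal)
    kls = [kl_divergence(p_clean, predict(perturb(signal, rng)))
           for _ in range(n_draws)]
    return sum(kls) / n_draws


# Hypothetical toy models over three answer classes.
def signal_user(signal):
    mean = sum(signal) / len(signal)
    return softmax([mean, 0.0, -mean])  # output depends on the signal


def signal_ignorer(signal):
    return softmax([1.0, 0.0, -1.0])    # output is a fixed prior


def gaussian_perturb(signal, rng, sigma=1.0):
    return [v + rng.gauss(0.0, sigma) for v in signal]


signal = [0.5, 1.0, 1.5, 2.0]
u_used = mcspu_score(signal_user, signal, gaussian_perturb)
u_ignored = mcspu_score(signal_ignorer, signal, gaussian_perturb)
print(f"uses signal: {u_used:.4f} nats, ignores signal: {u_ignored:.4f} nats")
```

The model that ignores its input scores exactly zero because every perturbed distribution equals the clean one, while the model that reads the signal scores strictly above zero.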
The perturbation is additive: signal + εᵢ, εᵢ ~ N(0, σ²). The test sweeps σ ∈ {0.1, 0.5, 1.0, 2.0} and checks that uncertainty increases with noise magnitude. This sweep is the production gate: it confirms that a model is ready for deployment.
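The sweep logic can be sketched as follows, assuming a hypothetical two-class toy model in place of a real checkpoint. Reusing the same seed for every σ (common random numbers) is a choice made for this sketch so that the comparison across σ is not dominated by sampling noise; the actual script may draw its noise differently.

```python
import math
import random

SIGMAS = [0.1, 0.5, 1.0, 2.0]  # sweep values from the text


def toy_predict(signal):
    """Hypothetical two-class model whose logit depends on the signal mean."""
    s = sum(signal) / len(signal)
    p = 1.0 / (1.0 + math.exp(-2.0 * s))
    return [p, 1.0 - p]


def kl(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mcspu_gaussian(signal, sigma, n_draws=50, seed=0):
    """Mean KL between clean and noisy outputs for additive N(0, sigma^2) noise."""
    rng = random.Random(seed)  # same seed for each sigma: common random numbers
    p_clean = toy_predict(signal)
    total = 0.0
    for _ in range(n_draws):
        noisy = [v + rng.gauss(0.0, sigma) for v in signal]
        total += kl(p_clean, toy_predict(noisy))
    return total / n_draws


signal = [0.2, -0.1, 0.4, 0.1]
scores = [mcspu_gaussian(signal, sigma) for sigma in SIGMAS]
increasing = all(a < b for a, b in zip(scores, scores[1:]))

for sigma, score in zip(SIGMAS, scores):
    print(f"sigma={sigma}: MCSPU={score:.5f} nats")
print("uncertainty increases with noise:", increasing)
```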
Two missing-data modes are available. Both always exit 0: they are exploratory, have no thresholds, and produce plots to guide encoder design.
missing_zeros: a random fraction of timepoints per channel is set to 0.0. Each of the N draws uses a different random mask. Sweeps fractions ∈ {0.1, 0.25, 0.5, 0.75, 1.0}. Tests how quickly the model degrades as signal data goes missing over time.
missing_channels: entire channels are blacked out completely. Presets are defined for the har and ecg datasets, and custom configurations can be passed via --drop_channels. This mode shows which input channels the encoder relies on most.
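The two masking schemes can be sketched as below. The function names and the list-of-channels layout are hypothetical, and treating a "blacked out" channel as all zeros is an assumption; the repository's own implementation may differ.

```python
import random


def mask_missing_zeros(signal, fraction, rng):
    """Set a random `fraction` of timepoints in each channel to 0.0."""
    masked = []
    for channel in signal:
        n_drop = round(fraction * len(channel))
        dropped = set(rng.sample(range(len(channel)), n_drop))
        masked.append([0.0 if t in dropped else v
                       for t, v in enumerate(channel)])
    return masked


def mask_missing_channels(signal, drop_channels):
    """Replace every listed channel index with zeros (assumed blackout value)."""
    drop = set(drop_channels)
    return [[0.0] * len(ch) if c in drop else list(ch)
            for c, ch in enumerate(signal)]


# Toy signal: 3 channels x 8 timepoints, all values nonzero.
signal = [[float(c + 1)] * 8 for c in range(3)]
rng = random.Random(0)

quarter_missing = mask_missing_zeros(signal, 0.25, rng)  # 2 of 8 zeroed per channel
all_missing = mask_missing_zeros(signal, 1.0, rng)       # every timepoint zeroed
no_middle = mask_missing_channels(signal, [1])           # channel 1 blacked out
```

Calling `mask_missing_zeros` once per draw with the same `rng` yields a different random mask each time, which matches the description of the N draws above.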
Every checkpoint must pass all eight tests to be production-ready. The first four are sanity checks on the computation itself. The last four are the signal sensitivity tests, which verify that the model truly relies on the signal.
A failure here means broken computation, not a bad model:
- no_nan_inf_scores: 0 non-finite MCSPU scores
- kl_nonneg: min KL ≥ −1 × 10⁻⁶ nats
- score_consistency: max |score − mean(KL)| ≤ 1 × 10⁻⁵ nats
- probs_normalized: max |Σ probs − 1| ≤ 1 × 10⁻³
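A minimal sketch of these four checks, using the tolerances listed above. The function name and the input layout (per-sample lists of KL draws, per-sample scores, and a list of output distributions) are assumptions for illustration.

```python
import math


def sanity_checks(kl_draws, scores, probs):
    """Return a pass/fail flag for each of the four computation checks.

    kl_draws: one list of per-draw KL values (nats) per sample
    scores:   one MCSPU score per sample
    probs:    list of model output distributions
    """
    all_kl = [k for draws in kl_draws for k in draws]
    max_gap = max(abs(s - sum(d) / len(d)) for s, d in zip(scores, kl_draws))
    max_norm_err = max(abs(sum(p) - 1.0) for p in probs)
    return {
        "no_nan_inf_scores": all(math.isfinite(s) for s in scores),
        "kl_nonneg": min(all_kl) >= -1e-6,
        "score_consistency": max_gap <= 1e-5,
        "probs_normalized": max_norm_err <= 1e-3,
    }


good = sanity_checks(
    kl_draws=[[0.1, 0.3], [0.0, 0.2]],
    scores=[0.2, 0.1],
    probs=[[0.7, 0.3], [0.5, 0.5]],
)
bad = sanity_checks(
    kl_draws=[[-0.5, 0.3], [0.0, 0.2]],
    scores=[float("nan"), 0.1],
    probs=[[0.7, 0.2], [0.5, 0.5]],
)
```

In the second call a negative KL value, a NaN score, and an unnormalized distribution each trip their own check, which points to a computation bug rather than a weak model.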
Signal sensitivity tests:
- sensitivity_magnitude: mean MCSPU(σ=2.0) > 0.05 nats (sp) and 0.008 nats (flamingo)
- sensitivity_range: MCSPU(σ=2.0) − MCSPU(σ=0.1) > 0.02 nats (sp) and 0.008 nats (flamingo)
- statistical_significance: Mann-Whitney U p < 0.05
- effect_size: Cohen's d > 0.30
Thresholds were derived from the HuggingFace pretrained checkpoints. Flamingo thresholds are lower because the cross-attention gates remain near-zero early in training, structurally suppressing absolute KL by ~10-20×. See UNCERTAINTY_TEST_GUIDE.txt for full justification of both sets.
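The sensitivity gate can be sketched as follows. Two assumptions are made here that the text does not confirm: the significance and effect-size tests are taken to compare the per-sample scores at σ=0.1 against those at σ=2.0, and Cohen's d is computed with a pooled standard deviation. The Mann-Whitney p-value is passed in rather than computed; it could be obtained with, for example, `scipy.stats.mannwhitneyu`.

```python
import math
import statistics

# Thresholds from the text, in nats, keyed by --model_type.
THRESHOLDS = {
    "sp": {"magnitude": 0.05, "range": 0.02},
    "flamingo": {"magnitude": 0.008, "range": 0.008},
}


def cohens_d(low, high):
    """Cohen's d using a pooled standard deviation (assumed variant)."""
    n1, n2 = len(low), len(high)
    pooled_var = ((n1 - 1) * statistics.variance(low)
                  + (n2 - 1) * statistics.variance(high)) / (n1 + n2 - 2)
    return (statistics.mean(high) - statistics.mean(low)) / math.sqrt(pooled_var)


def sensitivity_checks(scores_low, scores_high, model_type, p_value):
    """scores_low / scores_high: per-sample MCSPU at sigma=0.1 / sigma=2.0."""
    t = THRESHOLDS[model_type]
    mean_low = statistics.mean(scores_low)
    mean_high = statistics.mean(scores_high)
    return {
        "sensitivity_magnitude": mean_high > t["magnitude"],
        "sensitivity_range": mean_high - mean_low > t["range"],
        "statistical_significance": p_value < 0.05,
        "effect_size": cohens_d(scores_low, scores_high) > 0.30,
    }


# Made-up scores for illustration only.
scores_low = [0.001, 0.002, 0.003, 0.002]
scores_high = [0.03, 0.04, 0.05, 0.04]

as_sp = sensitivity_checks(scores_low, scores_high, "sp", p_value=0.01)
as_flamingo = sensitivity_checks(scores_low, scores_high, "flamingo", p_value=0.01)
```

With these made-up scores the mean at σ=2.0 is 0.04 nats, which clears the flamingo magnitude threshold but not the sp one, so the same numbers pass as flamingo and fail as sp.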
The missing-data modes have no thresholds: their results are exploratory and are meant only to inform encoder design decisions.
You do not need to run any of this. The pipeline below is how the thresholds and the production gate were originally established against the upstream pretrained checkpoints. It is documented here for reproducibility and transparency.
The derivation ran in four steps:
- Download the official OpenTSLM checkpoints from HuggingFace via scripts/download_pretrained_models.py.
- Sweep σ ∈ {0.1, 0.5, 1.0, 2.0} across all datasets using scripts/run_tsqa_har_sleep_gpu.sh (TSQA / HAR / Sleep) and scripts/run_ecg_gpu_batched.sh (ECG-QA). Each run writes a JSONL file of per-sample KL scores via scripts/compute_mcspu.py.
- Visualise with scripts/plot_mcspu.py (reads all JSONL files, outputs figures including the mcspu_vs_sigma.png shown above).
- Derive thresholds from the resulting distributions. See UNCERTAINTY_TEST_GUIDE.txt for the full statistical justification.
OpenTSLM-Uncertainty/
├── opentslm_uncertainty_test.py # MAIN SCRIPT — run this on any new checkpoint
├── UNCERTAINTY_TEST_GUIDE.txt # Full test documentation and threshold derivations
├── requirements.txt # Python dependencies
├── pyproject.toml
├── scripts/
│ ├── compute_mcspu.py # Low-level MCSPU scorer (one sigma at a time)
│ ├── plot_mcspu.py # Plotting tool for pre-computed JSONL results
│ ├── download_pretrained_models.py # Download official OpenTSLM checkpoints
│ ├── run_tsqa_har_sleep_gpu.sh # GPU sigma sweep for TSQA / HAR / Sleep EDF
│ └── run_ecg_gpu_batched.sh # GPU sigma sweep for ECG-QA (OOM-safe batching)
└── src/
    ├── opentslm/ # OpenTSLM library (model, datasets, uncertainty)
    └── data/har_cot/ # HAR chain-of-thought CSV splits
- OpenTSLM: github.com/StanfordBDHG/OpenTSLM
- MCSPU methodology: see UNCERTAINTY_TEST_GUIDE.txt in this repo
