DEV Community: Devanshu Biswas

Caching That Survives Real Traffic: TTL Jitter and Single-Flight in Spring Boot

Devanshu Biswas — Wed, 01 Jul 2026 22:36:36 +0000

Day 11 of building OrderHub added Redis caching with @Cacheable/@CacheEvict — the same read served ~60× faster from memory. But a naïve cache has two failure modes that only show up under load. Day 12 is about making caching safe.

Failure 1: the thundering herd

If a burst of keys all get the same TTL (say 60s), they all expire at the same instant one TTL later — and every request misses at once and stampedes the database. The fix is expiry jitter: add a small random offset (±10%) to every TTL so expiries spread into a smooth trickle instead of one cliff.

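// needs: java.time.Duration, java.util.concurrent.ThreadLocalRandom
Duration ttlWithJitter(Duration base) {
  long ms = base.toMillis();
  long span = (long) (ms * 0.10);   // ±10% of the base TTL
  // nextLong's upper bound is exclusive, so +1 makes the range [-span, +span]
  long delta = ThreadLocalRandom.current().nextLong(-span, span + 1);
  return Duration.ofMillis(ms + delta);
}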
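To see the effect in numbers, here is a small Python sketch (not OrderHub code; the function name `ttl_with_jitter`, the seed, and the 1,000-key burst are illustrative assumptions). It applies the same ±10% rule to a burst of keys cached at the same instant and reports how far apart their expiries land.

```python
import random

def ttl_with_jitter(base_seconds: float, fraction: float = 0.10, rng=random) -> float:
    """Return base_seconds shifted by a uniform random offset within ±fraction."""
    span = base_seconds * fraction
    return base_seconds + rng.uniform(-span, span)

# A burst of 1,000 keys cached at t=0 with a 60s base TTL (illustrative numbers).
rng = random.Random(42)  # seeded only so the sketch is repeatable
expiries = [ttl_with_jitter(60.0, rng=rng) for _ in range(1000)]

earliest, latest = min(expiries), max(expiries)
print(f"expiries span {earliest:.1f}s to {latest:.1f}s instead of all landing at 60.0s")
```

Without jitter every one of those keys would expire at exactly 60s; with it, the misses are spread over a window roughly 12 seconds wide.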

Failure 2: the dogpile

When one hot key expires, dozens of concurrent requests all miss and all recompute the same value together. Single-flight lets exactly one request recompute while the rest wait for its result. In Spring, the simplest per-node form is one flag:

@Cacheable(cacheNames="order", key="#id", sync=true)
public Order getOrder(Long id) { return repo.findById(id).orElseThrow(); }

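sync=true makes one thread load a missing key while the others block until the value is cached. That guarantee holds per JVM only; across nodes you'd take a short-lived Redis lock (SET key token NX PX <ttl>, the modern form of SETNX) before recomputing.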
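The per-node idea can be sketched outside Spring as well. The following Python is a minimal illustration, not how Spring implements `sync=true`: the class `SingleFlightCache`, the `slow_load` stand-in for the database, and the 20-thread demo are all assumptions, and expiry is left out to keep the single-flight part visible.

```python
import threading
import time

class SingleFlightCache:
    """Per-process cache where only one thread loads a missing key; the rest wait."""

    def __init__(self):
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key, loader):
        with self._guard:
            if key in self._values:
                return self._values[key]
        with self._lock_for(key):            # one loader per key at a time
            with self._guard:
                if key in self._values:      # filled by another thread while we waited
                    return self._values[key]
            value = loader(key)              # runs outside the guard so other keys aren't blocked
            with self._guard:
                self._values[key] = value
            return value

# Demo: 20 threads ask for the same cold key at once.
loads = []   # one entry per real load

def slow_load(order_id):
    loads.append(order_id)
    time.sleep(0.05)                         # stand-in for the database query
    return {"id": order_id, "status": "NEW"}

cache = SingleFlightCache()
results = []
threads = [threading.Thread(target=lambda: results.append(cache.get(42, slow_load)))
           for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(results)} callers served by {len(loads)} load")
```

The second check inside the per-key lock is what turns nineteen waiting threads into cache hits rather than nineteen more queries.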

Configurable TTLs per cache

Different data goes stale at different rates. Drive per-cache TTLs from configuration so you can tune them per environment without a redeploy:

app:
  cache:
    default-ttl: 10m
    ttls:
      order:  5m
      orders: 2m
      product: 1h

Evict precisely, never flush

Caching is only safe if writes invalidate the right keys. Evict the specific entry that changed and any list/aggregate that includes it — but avoid flushing the whole cache, which just re-triggers the herd you fixed.

Choose an eviction policy

Redis has finite memory; maxmemory-policy decides what to drop when it fills. allkeys-lru (evict least-recently-used) is a solid default for a general cache; volatile-ttl targets keys nearest expiry. Set it deliberately so memory pressure degrades gracefully instead of erroring.

Then prove it: a Testcontainers Redis test that does a cached read and asserts the key exists with a TTL in the expected jittered band.

Live cache-strategy playground (herd vs jitter, dogpile vs single-flight) + the full Spring Boot walkthrough:
https://dev48v.infy.uk/orderhub.php

Repo: https://github.com/dev48v/order-hub-from-zero

Why Your LLM Doesn't Re-Read the Prompt: The KV-Cache

Devanshu Biswas — Wed, 01 Jul 2026 22:35:53 +0000

The KV-cache is the single most important optimisation in LLM inference — and the reason real-time chat with a model is even feasible. Here's what it is and why it matters.

Generation is autoregressive

An LLM produces text one token at a time: emit a token, append it, run the whole model again for the next. Inside each attention layer, every token becomes a Query, a Key, and a Value. To produce the newest token, its Query is scored against the Keys of all previous tokens, and those weights blend their Values. So generating token t needs the K and V of tokens 1…t.

The naïve approach is quadratic

Without a cache, each step re-encodes the entire prefix to rebuild K/V for tokens 1…t. Step 1 processes 1 token, step 2 processes 2, …, step N processes N. Total work ≈ 1+2+…+N = N(N+1)/2 — quadratic. Token 1's K/V gets recomputed on every single step even though it never changes.

The key insight: past K/V never change

LLMs use causal masking — a token attends only to earlier tokens. So adding a new token at the end can't change the Keys and Values of earlier tokens. They're constant. Recomputing them is pure waste.

Cache them → linear generation

Store each token's K/V the first time. Each step computes K/V for just the one new token, appends it, and attends over the whole cache:

K_cache, V_cache = [], []
for t in range(N):
    k, v = kv(token_t)              # ONE token's work
    K_cache.append(k); V_cache.append(v)
    out = attend(Q_t, K_cache, V_cache)   # reuse all prior K/V

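Per-step K/V and feed-forward work is now constant, so that part drops from O(N²) to O(N) in total. The attention read still scans all t cached entries at step t, but it reuses stored vectors instead of re-encoding the prefix.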
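To put numbers on the saving, here is a small counting sketch. It only tallies how many per-token K/V computations each strategy performs; the function names are made up for illustration, and it deliberately ignores the attention read over the cache.

```python
def kv_computations_without_cache(n_tokens: int) -> int:
    """Step t re-encodes all t prefix tokens, so the total is 1 + 2 + ... + N."""
    return sum(range(1, n_tokens + 1))

def kv_computations_with_cache(n_tokens: int) -> int:
    """Each token's K/V is computed once and then reused."""
    return n_tokens

for n in (10, 100, 1000):
    naive = kv_computations_without_cache(n)
    cached = kv_computations_with_cache(n)
    print(f"N={n}: {naive} vs {cached} ({naive / cached:.1f}x fewer)")
```

At N=1000 that is 500,500 computations against 1,000, and the gap keeps widening as the sequence grows.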

Prefill vs decode

This splits inference into two phases. Prefill: ingest the whole prompt in one parallel pass, filling the cache — compute-heavy, and why the first token can take a moment on a long prompt. Decode: generate output tokens one at a time, each a cheap cache-append. That's why "time to first token" and "time per output token" are different numbers.

The memory price of long context

The cache stores K and V for every token, every layer, every head. Its size grows linearly with context length — which is exactly why a 128k-token context is expensive: the cache can eat many gigabytes of GPU memory, often becoming the limit on how many users a GPU can serve. Tricks like paged attention (vLLM), grouped-query attention, quantised caches, and prompt caching all exist to tame it.
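A back-of-envelope estimate makes "many gigabytes" concrete. The sketch below assumes a hypothetical model shape (32 layers, head dimension 128, 16-bit values) chosen only for illustration; the helper `kv_cache_bytes` is not from any library. It also shows why grouped-query attention helps: fewer KV heads means a proportionally smaller cache.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes for one sequence: K and V (hence the 2) per token, layer and KV head."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

GIB = 1024 ** 3
# Hypothetical model shape, chosen only for illustration.
full = kv_cache_bytes(seq_len=128 * 1024, n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(seq_len=128 * 1024, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"32 KV heads: {full / GIB:.0f} GiB, 8 KV heads (grouped-query): {gqa / GIB:.0f} GiB")
```

Under these assumptions a single 128k-token sequence needs 64 GiB with one KV head per query head and 16 GiB with 8 KV heads, before counting the model weights.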

Watch a "no cache" vs "cached" generation diverge, op by op:
https://dev48v.infy.uk/ai/days/day22-kv-cache.html

One "+x" That Made 100-Layer Networks Trainable: ResNet Skip Connections

Devanshu Biswas — Wed, 01 Jul 2026 22:35:11 +0000

Deep networks have a cruel paradox. In theory, more layers should never hurt — the extra ones could just learn to pass their input through unchanged. In practice, before 2015, stacking more plain layers made networks worse: a 56-layer net had higher training error than a 20-layer one. The gradient vanished on its way back to the early layers, and optimisation couldn't even find that "do nothing" identity mapping. ResNet fixed it with almost absurdly little.

The residual reformulation

Instead of asking a block to learn a full mapping H(x), ask it to learn the residual F(x) = H(x) − x, and add the input back:

import torch.nn.functional as F   # torch's functional API, not the residual F(x)

def forward(self, x):
    return F.relu(x + self.f(x))   # y = relu(x + f(x)); the "x +" is the skip connection

If the ideal mapping is close to identity, F(x) just needs to be near zero — trivial to learn (push the weights toward 0). The block only learns the correction on top of passing the input through.

Why the +1 saves the gradient

Differentiate the block: d(x + F(x))/dx = 1 + F'(x). Backprop multiplies these across blocks. Even when F'(x) is tiny, the factor stays near 1 instead of near 0 — so the product doesn't collapse:

plain:    dL/dx1 = product of  F'(z)      -> 0    (each F' <= ~0.25 for sigmoid)
residual: dL/dx1 = product of (1 + F'(z)) -> ~O(1)
                                 ^ the identity path never vanishes

The identity path is a gradient highway straight back to the earliest layers.
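A scalar toy calculation shows how different the two products are. This is only a sketch: real networks multiply Jacobian matrices, not single numbers, and the depth of 50 and the local derivative of 0.01 are illustrative assumptions.

```python
import math

def gradient_scale(local_derivatives, residual):
    """Multiply the per-block factors backprop would chain together (scalar toy model)."""
    factors = [(1.0 + d) if residual else d for d in local_derivatives]
    return math.prod(factors)

# 50 blocks whose own derivative F'(z) is a small 0.01 (illustrative value).
derivs = [0.01] * 50
plain = gradient_scale(derivs, residual=False)
skip = gradient_scale(derivs, residual=True)
print(f"plain: {plain:.3e}, residual: {skip:.3f}")
```

The plain product is around 1e-100, far too small to train on, while the residual product stays between 1 and 2.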

Projection shortcuts

When a block changes the feature dimensions (a conv that halves spatial size, doubles channels), x and F(x) no longer match, so you can't add them. Put a 1×1 conv on the skip to project x into the new shape first — the "projection shortcut" from the paper. Most shortcuts are plain identity; only dimension-changing ones need this.

The impact

With residual blocks, the 2015 ResNet paper trained 152-layer networks (about 8× deeper than the 19-layer VGG nets before it) and won the ILSVRC 2015 ImageNet classification task. Deeper finally meant better again. And skip connections are now everywhere: ResNets, U-Nets, and every Transformer block (x + Sublayer(x)). The same +1 quietly keeps gradients healthy inside modern LLMs.

See a plain net vs a ResNet at the same depth, gradient-by-gradient:
https://dev48v.infy.uk/dl/day22-resnet-skip-connections.html

Gaussian Mixture Models: Soft Clustering with the EM Algorithm

Devanshu Biswas — Wed, 01 Jul 2026 22:34:27 +0000

K-Means is the clustering algorithm everyone learns first. But it makes two strong assumptions: every point belongs fully to exactly one cluster (hard assignment), and clusters are round blobs. Real data breaks both. Gaussian Mixture Models (GMMs) relax them — elliptical clusters, and soft probabilistic membership — and they're fit with the elegant EM algorithm.

Data as a mixture of Gaussians

A GMM assumes your data was generated by several Gaussian "bell curves" mixed together. Each component has a mean (centre), a covariance matrix (its shape — how wide, tall, and tilted), and a mixing weight (what fraction of the data it explains). Fitting the model means finding those.

The chicken-and-egg problem

To place the Gaussians you'd need to know which points belong to which cluster — but to assign points you'd need the Gaussians. EM breaks this loop: guess, softly assign, re-estimate, repeat.

E-step: responsibilities

For each point and cluster, compute weight × Gaussian density, then normalise so a point's numbers sum to 1. Those are the responsibilities — soft assignments. A point deep in one cluster gets ~[1,0,0]; a point between two gets [0.5, 0.5, 0].

const r = model.map(k => k.weight * gauss(p, k.mean, k.cov));
const s = r.reduce((a,b)=>a+b, 0);
const resp = r.map(v => v / s);   // sums to 1

M-step: re-fit each Gaussian

Treat the responsibilities as soft counts and re-estimate each Gaussian: the new mean is a responsibility-weighted average of the points; the new covariance is the weighted spread around it (this is what lets the ellipse stretch and tilt); the new weight is its share of total responsibility.

Why the log-likelihood always climbs

EM has a beautiful guarantee: each full E+M iteration never decreases the data's log-likelihood. It rises and plateaus — so watch it to detect convergence. It can settle into a local optimum depending on the random start, which is why people run it a few times and keep the best.
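The guarantee is easy to check on a toy problem. Below is a minimal 1D sketch in Python (the article's demo is 2D with full covariance matrices; the data, the starting guesses and the names `gauss` and `em_step` are illustrative assumptions). It runs the E-step and M-step described above and records the log-likelihood at every iteration.

```python
import math

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(data, weights, means, variances):
    """One E-step + M-step for a 1D mixture. Returns the new parameters and the
    log-likelihood of the data under the parameters that were passed in."""
    k = len(weights)
    resp, log_lik = [], 0.0
    for x in data:                                   # E-step: responsibilities
        dens = [weights[j] * gauss(x, means[j], variances[j]) for j in range(k)]
        total = sum(dens)
        log_lik += math.log(total)
        resp.append([d / total for d in dens])
    new_w, new_m, new_v = [], [], []
    for j in range(k):                               # M-step: weighted re-fit
        nj = sum(r[j] for r in resp)                 # soft count for component j
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(var)
    return new_w, new_m, new_v, log_lik

data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]   # two obvious clumps
weights, means, variances = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
history = []
for _ in range(15):
    weights, means, variances, ll = em_step(data, weights, means, variances)
    history.append(ll)
print([round(m, 2) for m in means], [round(ll, 2) for ll in history[:4]])
```

The means settle on the two clumps and the recorded log-likelihood never goes down, which is the plateau you would watch for to stop iterating.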

GMM vs K-Means

K-Means is actually a special case of GMM (force spherical, equal covariances and hard assignments). GMM is strictly more expressive — elliptical, differently-sized, overlapping clusters, plus a measure of uncertainty. You still choose K; use BIC or AIC to compare.

Watch the ellipses grow to fit 2D data and points blend colours where clusters overlap:
https://dev48v.infy.uk/ml/day22-gmm-em.html

The Unix Timestamp, Demystified (and the ×1000 Bug That Bites Everyone)

Devanshu Biswas — Wed, 01 Jul 2026 22:33:44 +0000

A Unix timestamp is one of the most common things in software and one of the most quietly misunderstood. Let's fix that.

What it actually is

It's a single integer: the number of seconds since midnight UTC on 1 January 1970 — "the epoch". No timezone, no formatting, just a count from one fixed point. That's why it's the ideal way to store and compare moments: two events order by comparing integers, with zero ambiguity about which timezone was meant.

The bug everyone hits: seconds vs milliseconds

Unix tools and most APIs use seconds. JavaScript's Date uses milliseconds. Mix them up and you're off by a factor of 1000: a date in January 1970, or one tens of thousands of years in the future. You can tell them apart by size: a "now" timestamp is ~10 digits in seconds, ~13 in milliseconds. So detect and normalise:

const digits = String(Math.trunc(Math.abs(n))).length;  // abs: a minus sign isn't a digit
const ms = digits <= 11 ? n * 1000     // seconds
         : digits <= 14 ? n            // milliseconds
         : Math.round(n / 1000);       // microseconds

A Date is an instant, not a string

A Date wraps a single millisecond count — an absolute instant. Everything you see (UTC text, local time, ISO) is just a rendering of that one instant. You never "convert" the instant between timezones; you only choose how to display it.

const d = new Date(ms);
d.toISOString();  // 2025-07-01T12:00:00.000Z  <- store THIS
d.toUTCString();  // Tue, 01 Jul 2025 12:00:00 GMT

Local and relative time via Intl

The browser's Intl handles the messy parts for free:

Intl.DateTimeFormat().resolvedOptions().timeZone;   // "Asia/Kolkata"
new Intl.RelativeTimeFormat(undefined, {numeric:"auto"})
  .format(-3, "day");                                // "3 days ago"

Two famous gotchas

JavaScript months are 0-based. new Date(2025, 0, 1) is January. Off-by-one bugs live here.
The Year 2038 problem. Systems storing the timestamp in a signed 32-bit integer overflow at 2,147,483,647 seconds — 03:14:07 UTC on 19 January 2038 — and wrap to a negative date. The fix everywhere is 64-bit timestamps.
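The 2038 rollover can be reproduced with plain arithmetic. This sketch uses Python rather than JavaScript, and the helper `to_int32` is an illustrative stand-in for how a signed 32-bit counter wraps.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1

def to_int32(n: int) -> int:
    """Wrap an integer the way a signed 32-bit counter would."""
    return (n + 2**31) % 2**32 - 2**31

last_good = EPOCH + timedelta(seconds=INT32_MAX)
wrapped = EPOCH + timedelta(seconds=to_int32(INT32_MAX + 1))
print(last_good.isoformat())   # 2038-01-19T03:14:07+00:00
print(wrapped.isoformat())     # 1901-12-13T20:45:52+00:00
```

One second past the limit, the counter lands in December 1901, which is why the failure shows up as wildly wrong dates rather than a clean error.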

The golden rules

Store UTC (or epoch). Convert to local only for display. Never hand-roll date maths — leap seconds, DST, and timezone history will get you; lean on Date, Intl, and the modern Temporal API.

Try the live converter (auto-detects the unit, shows every format, ticks in real time):
https://dev48v.infy.uk/solve/day22-timestamp-converter.html

Skeleton of Thought: Make an LLM Answer 2–3× Faster

Devanshu Biswas — Wed, 01 Jul 2026 22:33:01 +0000

LLMs write answers one token at a time, strictly left to right. Token 500 can't start until token 499 exists, so a thorough answer feels slow no matter how fast your hardware is. Skeleton of Thought (SoT) attacks exactly that — the length of the sequential critical path.

The idea

Most answers are really a list of semi-independent parts: tips, sections, aspects of a comparison. SoT exploits this in three moves:

Ask for a skeleton — just short point titles, no prose.
Expand every point in parallel — one request each, all at once.
Stitch them back together in order.

Because the expansions don't depend on each other, they can run concurrently.

The skeleton call

const skeleton = await llm(
  `Answer "${q}" as ONLY 3-8 numbered point titles, <=6 words each.`);
const points = skeleton.split("\n").filter(Boolean);

Tiny and fast. It also doubles as a plan, which tends to make the final answer better-organised than free-form writing.

Parallel expansion — where the speed comes from

const parts = await Promise.all(points.map((p, i) =>
  llm(`Q: ${q}\nOutline: ${skeleton}\nExpand point ${i+1} in 1-2 sentences.`)
));

Instead of waiting for point 1, then 2, then 3… you wait once for whichever point takes longest.

The maths

Sequential time ≈ skeleton + sum of all points.
SoT time ≈ skeleton + the single longest point.

With five similar-length points, that's roughly five point-times versus one — the reported speedups land around 2× and up.
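The two formulas are simple enough to plug numbers into. The timings in this Python sketch are illustrative, not measurements, and the function names are made up for the example.

```python
def sequential_latency(skeleton_s, point_times_s):
    """Skeleton, then every point expanded one after another."""
    return skeleton_s + sum(point_times_s)

def sot_latency(skeleton_s, point_times_s):
    """Skeleton, then all points expanded concurrently: wait for the slowest."""
    return skeleton_s + max(point_times_s)

# Illustrative timings in seconds, not measurements.
skeleton, points = 1.0, [2.0, 2.0, 2.0, 2.0, 2.0]
seq = sequential_latency(skeleton, points)
sot = sot_latency(skeleton, points)
print(f"sequential {seq:.0f}s, SoT {sot:.0f}s, speedup {seq / sot:.2f}x")
```

With these numbers the result is 11s against 3s. The skeleton call is the part that never parallelises, so the shorter it is relative to the expansions, the closer the speedup gets to the number of points.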

When NOT to use it

Parallelism is only safe when points are independent. Chained reasoning — a maths proof, a step-by-step derivation where point 3 needs point 2 — breaks it. Gate on that and fall back to normal generation.

The trade-off

SoT isn't free: you spend more total tokens (each expansion repeats the question and outline as context) and make more requests. What you buy is latency — the user sees a complete, structured answer much sooner. It's the classic distributed-systems bargain: more total work to shorten the critical path. It pairs beautifully with streaming UIs, too.

Watch a sequential vs SoT race with real timing here:
https://dev48v.infy.uk/prompt/day22-skeleton-of-thought.html

The Tooltip Problem: A Little Box That Never Falls Off Screen

Devanshu Biswas — Wed, 01 Jul 2026 22:32:19 +0000

A tooltip sounds like the simplest UI component there is. Then you put a button in the top-right corner, hover it, and your tooltip gets clipped by the edge of the screen. Welcome to collision-aware positioning — the real problem that libraries like Floating UI exist to solve. Here's how to build it by hand.

Measure, don't guess

Everything starts with getBoundingClientRect(), which gives an element's position and size in viewport pixels. Read the trigger's rect (where it is) and the tooltip's rect (how big it is), plus innerWidth/innerHeight.

const r = trigger.getBoundingClientRect();
const t = tooltip.getBoundingClientRect();
const vw = innerWidth, vh = innerHeight;

Measure the tooltip after it's rendered: an element with display: none reports a zero-size rect.

Flip

Pick a preferred side — say, top. Compute where the tooltip would go there. If its top edge would go above 0 (off-screen), flip it below the trigger:

let placement = "top";
let top = r.top - t.height - GAP;
if (top < 0) { placement = "bottom"; top = r.bottom + GAP; }

This is the most noticeable smart behaviour — it's why a tooltip on a button at the very top of the page appears below it instead of being cut off.

Shift and clamp

Flipping fixes the main axis; shifting fixes the other one. Centre the tooltip on the trigger, then clamp it inside the viewport. The distance you moved is the "shift":

let left = r.left + r.width/2 - t.width/2;
const clamped = Math.max(GAP, Math.min(left, vw - t.width - GAP));
const shift = left - clamped;
left = clamped;

The arrow that keeps pointing

Because the box moved by shift pixels, nudge the arrow back by the same amount so it still points at the trigger's centre. Without it, a shifted tooltip looks disconnected from what it describes.
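Flip, shift and the arrow correction are pure geometry, so they can be checked without a browser. The sketch below restates them as one Python function; `Rect`, `place_tooltip` and the example numbers are assumptions for illustration, and only the top/bottom flip from the text is handled.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    left: float
    top: float
    width: float
    height: float

def place_tooltip(trigger: Rect, tip_w: float, tip_h: float,
                  viewport_w: float, gap: float = 8.0):
    """Return (placement, top, left, arrow_x) for a tooltip that prefers the top side.
    arrow_x is the arrow's offset from the tooltip's left edge."""
    # Flip: prefer above the trigger, fall back to below if that would leave the screen.
    placement, top = "top", trigger.top - tip_h - gap
    if top < 0:
        placement, top = "bottom", trigger.top + trigger.height + gap
    # Shift: centre on the trigger, then clamp inside the viewport.
    ideal_left = trigger.left + trigger.width / 2 - tip_w / 2
    left = max(gap, min(ideal_left, viewport_w - tip_w - gap))
    shift = ideal_left - left
    # Arrow: start at the tooltip's centre and undo the shift.
    arrow_x = tip_w / 2 + shift
    return placement, top, left, arrow_x

# A 40x20 button jammed into the top-right corner of a 1000px-wide viewport.
corner = place_tooltip(Rect(left=950, top=5, width=40, height=20),
                       tip_w=200, tip_h=50, viewport_w=1000)
print(corner)
```

For the corner button the tooltip flips to the bottom, is pulled 78px to the left, and the arrow moves 78px to the right of the tooltip's centre so it still sits under the trigger's midpoint at x = 970.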

Behaviour, not just position

A tooltip is only accessible if it works for everyone: open on mouseenter (after ~80ms so a passing pointer doesn't flash it) and on keyboard focus; close on mouseleave, blur, and Escape; give it role="tooltip" and link it with aria-describedby; and set pointer-events: none so it never steals the hover it depends on.

That's the whole engine — and the same one powers popovers, dropdowns and context menus. Try it (including buttons jammed in every corner) here:
https://dev48v.infy.uk/design/day22-tooltip.html

I Built a Tic-Tac-Toe AI That Literally Cannot Lose

Devanshu Biswas — Wed, 01 Jul 2026 22:31:36 +0000

Tic-tac-toe feels trivial — until you try to write an opponent that never loses. The trick is an algorithm called minimax, and it's the same idea underneath the engines that beat chess grandmasters.

Games are trees

Any turn-based game is a tree. The current board is the root, each legal move is a branch to a new board, and those branch again for the opponent's replies. A full game is one path from the root down to a leaf where someone has won or the board is full. Tic-tac-toe's tree is tiny — at most nine moves deep — so a computer can walk the entire thing instantly. That completeness is why perfect play is possible.

Scoring the endings

We only truly know the value of finished games. Score them from the AI's point of view: a win for the AI is +10, a win for you is −10, a draw is 0. Every other position's value is figured out by assuming perfect play from there.

Minimax: assume both sides are perfect

The two players want opposite things. On the AI's turn it takes the maximum child score; on your turn it assumes you'll take the minimum (best for you). Recurse to the leaves and bubble those choices back up:

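// Assumed contract: winner(board) returns AI, HUMAN, a truthy draw marker
// for a full board with no winner, or null while the game is still open.
function minimax(board, isAI) {
  const w = winner(board);
  if (w) return w === AI ? 10 : w === HUMAN ? -10 : 0;   // terminal: win, loss or draw
  const scores = empties(board).map(i => {
    board[i] = isAI ? AI : HUMAN;       // try the move
    const s = minimax(board, !isAI);    // score the opponent's best reply
    board[i] = "";                      // undo it
    return s;
  });
  return isAI ? Math.max(...scores) : Math.min(...scores);
}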

Win sooner, lose later

Plain minimax treats all wins as equal, so the AI might toy with you, and all losses as equal, so it might walk into an instant one. Fold the search depth into the score (track it as an extra argument that grows by one per recursive call): subtract it from a win, add it to a loss, and the AI prefers fast wins and stalls losses:

if (w === AI)    return  10 - depth;
if (w === HUMAN) return -10 + depth;

Why it's unbeatable

Tic-tac-toe is a solved game: perfect play by both sides always ends in a draw. Minimax plays perfectly, so it takes any win you hand it and steers everything else to at worst a tie. The best you can ever get is a draw.
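The "solved game" claim can be verified by running the search from an empty board. This is a Python re-sketch of the same minimax, not the article's JavaScript: boards are immutable tuples so results can be memoised, and depth is taken as the number of marks already on the board, which differs from plies-from-the-root only by a constant. The names `winner`, `minimax` and `best_move` are this sketch's own.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
AI, HUMAN, EMPTY = "O", "X", " "

def winner(board):
    """Return AI, HUMAN, "draw" for a full board, or None while play continues."""
    for a, b, c in LINES:
        if board[a] != EMPTY and board[a] == board[b] == board[c]:
            return board[a]
    return "draw" if EMPTY not in board else None

@lru_cache(maxsize=None)
def minimax(board, is_ai):
    w = winner(board)
    depth = 9 - board.count(EMPTY)          # marks played so far
    if w == AI:
        return 10 - depth                   # earlier wins score higher
    if w == HUMAN:
        return depth - 10                   # later losses score higher
    if w == "draw":
        return 0
    mark = AI if is_ai else HUMAN
    scores = [minimax(board[:i] + (mark,) + board[i + 1:], not is_ai)
              for i, cell in enumerate(board) if cell == EMPTY]
    return max(scores) if is_ai else min(scores)

def best_move(board):
    """Index of the AI's best move on the given board."""
    moves = [i for i, cell in enumerate(board) if cell == EMPTY]
    return max(moves, key=lambda i: minimax(board[:i] + (AI,) + board[i + 1:], False))

empty = (EMPTY,) * 9
print(minimax(empty, True))    # 0: perfect play from an empty board is a draw
print(best_move(("X", "X", " ", " ", "O", " ", " ", " ", " ")))   # 2: block the top row
```

A value of 0 at the root, whoever moves first, is exactly the statement that neither side can force a win.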

For bigger games like chess you can't search to the end — you add alpha-beta pruning to skip hopeless branches and a heuristic to estimate positions at a depth limit. But the min/max skeleton is identical.

Play the unbeatable version (and an "easy" random mode) here, with the full walkthrough:
https://dev48v.infy.uk/game/day22-tic-tac-toe.html

Same request. Same answer. One is ~120ms, the other is ~2ms. The only difference is whether it came from Postgres or from Redis.

Devanshu Biswas — Wed, 01 Jul 2026 15:44:29 +0000

This is Day 11 of building OrderHub — one production-grade Spring Boot + React app, one feature a day. Phase 1 built a rock-solid monolith (REST, JPA, Flyway, validation, error handling, pagination, config, tests, OpenAPI, Docker). Phase 2 is about making it fast and resilient, and it starts with the highest-leverage performance win there is: caching hot reads in Redis with @Cacheable and @CacheEvict.

🌐 Interactive learning hub (click through the hit/miss demo): https://dev48v.infy.uk/orderhub.php
👉 Repo (read the commits in order): https://github.com/dev48v/order-hub-from-zero

The problem: reading the same row a thousand times

An order gets read constantly — every detail-page open, every refresh, every downstream status check — but it barely ever changes. Yet every one of those reads is currently a full SQL round-trip:

public Order getOrder(String id) {
    return repository.findById(id)      // SQL query, EVERY call
        .orElseThrow(() -> new OrderNotFoundException(id));
}

Identical result, full query cost, over and over. The database becomes the bottleneck long before your app does. The fix is the oldest trick in computing: if you just computed an answer and it hasn't changed, keep a copy somewhere fast.

Redis, in one paragraph

Redis is an in-memory key/value store — data in RAM, so reads are sub-millisecond. You SET a key, GET it back, DEL it, optionally with a TTL. That maps onto a cache perfectly: the key is what you looked up (an order id), the value is the answer (the order), the TTL bounds staleness. And because it's a separate networked process, every instance of your app shares one cache.

The cache-aside pattern, declared not plumbed

You could hand-write "check the key, on a miss run the query, store the result, remember to delete on writes" all over your service. Don't. Spring's caching abstraction lets you declare it. Turn it on once:

@Configuration
@EnableCaching
public class CacheConfig { /* ... */ }

Add the starter (spring-boot-starter-data-redis, which brings the Lettuce client), and then you only ever annotate methods.

@Cacheable(cacheNames = "order", key = "#id", unless = "#result == null")
public Order getOrder(String id) {
    return repository.findById(id)      // runs ONLY on a cache miss
        .orElseThrow(() -> new OrderNotFoundException(id));
}

On a hit, Spring returns the stored value and the body never runs — the database is untouched. On a miss, it runs the body, stores the returned Order under the key, and returns it. That's cache-aside, for free.

The key strategy is the design decision. key = "#id" caches each order under its own entry (order::42, order::43), so they're evicted independently. (Gotcha: caching works through a proxy, so a self-call this.getOrder(id) bypasses it — only cross-bean calls are cached.)

Configuring the manager: JSON values and a TTL

@EnableCaching needs a CacheManager that says how to store things. Two decisions matter.

Serialization. Redis stores bytes. GenericJackson2JsonRedisSerializer stores JSON — readable in redis-cli, language-neutral, and it embeds the Java type so it deserializes back to the right class. (Java's built-in serialization produces opaque, version-brittle blobs. Use JSON.)

A default TTL. Give every entry an expiry — a safety net so a stale value can only live so long even if an eviction were missed, and dead keys clean themselves up.

@Bean
RedisCacheManager cacheManager(RedisConnectionFactory cf,
                               GenericJackson2JsonRedisSerializer json) {
    RedisCacheConfiguration defaults = RedisCacheConfiguration.defaultCacheConfig()
        .entryTtl(Duration.ofMinutes(10))
        .disableCachingNullValues()
        .serializeValuesWith(fromSerializer(json));

    return RedisCacheManager.builder(cf)
        .cacheDefaults(defaults)
        .withInitialCacheConfigurations(Map.of(
            "orders", defaults.entryTtl(Duration.ofMinutes(30))))  // per-cache override
        .build();
}

Connection details come from config — spring.data.redis.host/port, localhost in dev, ${REDIS_HOST}/${REDIS_PORT} in prod so no address is baked into the jar.

The immutable-object snag

OrderHub's Order is immutable: all fields final, no setters, no no-arg constructor. Great design — and exactly what naive JSON deserialization chokes on, since it wants an empty object plus setters. The fix doesn't weaken the domain; you just tell Jackson how to rebuild it:

@JsonCreator
private Order(@JsonProperty("id") String id,
              @JsonProperty("customer") String customer,
              @JsonProperty("item") String item,
              @JsonProperty("quantity") int quantity,
              @JsonProperty("status") OrderStatus status,
              @JsonProperty("createdAt") Instant createdAt) { ... }

Register JavaTimeModule (so Instant becomes an ISO-8601 string) and ParameterNamesModule, and it round-trips cleanly. The general lesson: whatever you cache must survive your serializer.

Keeping it honest: evict on every write

A cache that's never invalidated serves stale data forever, and a stale order status is a real bug. So every write evicts what it invalidates. Placing a new order can't touch any existing per-id entry (its id was just minted) — but it changes the list, so it drops the list cache:

@CacheEvict(cacheNames = "orders", allEntries = true)
public Order placeOrder(String customer, String item, int quantity) { ... }

Confirming an order changes an existing row, so both caches can be stale — evict both, stacked with @Caching:

@Caching(evict = {
    @CacheEvict(cacheNames = "order",  key = "#id"),
    @CacheEvict(cacheNames = "orders", allEntries = true)
})
public Order confirmOrder(String id) { ... }

Why evict instead of @CachePut? A lingering stale read is worse than one extra DB round-trip. Evict is simple and always correct; the next read repopulates. Rule of thumb: for every @Cacheable, ask "which writes make this stale?" and evict there.
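Stripped of Spring and Redis, the read path and the evict-on-write rule fit in a few lines. This Python sketch is only an illustration: plain dicts stand in for Postgres and for the "order" cache, and the class and method names are assumptions rather than OrderHub code.

```python
class OrderService:
    """Cache-aside in miniature: dicts stand in for Postgres and Redis."""

    def __init__(self, database):
        self.database = database      # pretend table: id -> order dict
        self.cache = {}               # pretend Redis "order" cache
        self.db_reads = 0

    def get_order(self, order_id):
        if order_id in self.cache:                # hit: the database is untouched
            return self.cache[order_id]
        self.db_reads += 1                        # miss: run the query...
        order = dict(self.database[order_id])
        self.cache[order_id] = order              # ...and keep a copy for next time
        return order

    def confirm_order(self, order_id):
        self.database[order_id]["status"] = "CONFIRMED"
        self.cache.pop(order_id, None)            # evict exactly the entry that went stale

service = OrderService({"42": {"id": "42", "status": "NEW"}})
first = service.get_order("42")       # miss
second = service.get_order("42")      # hit
service.confirm_order("42")           # write + evict
third = service.get_order("42")       # miss again, fresh value
print(service.db_reads, second["status"], third["status"])   # 2 NEW CONFIRMED
```

Remove the `pop` in `confirm_order` and the third read returns NEW forever, which is the stale-status bug the eviction exists to prevent.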

Prove it against real Redis

Mocking the cache proves nothing about serialization, keys, or eviction — the exact things that break. So the integration test boots a throwaway Redis container next to Postgres (same Testcontainers approach as Day 9):

@Container static final GenericContainer<?> REDIS =
    new GenericContainer<>(DockerImageName.parse("redis:7-alpine")).withExposedPorts(6379);

@DynamicPropertySource
static void props(DynamicPropertyRegistry r) {
    r.add("spring.data.redis.host", REDIS::getHost);
    r.add("spring.data.redis.port", () -> REDIS.getMappedPort(6379));
}

Now the full-stack test proves it end to end: POST → GET (miss, cached) → GET (hit, no query) → confirm (evict) → GET (fresh CONFIRMED). All 26 tests green, same engine as prod, only Docker required.

The frontend gets a cache too

The server cache is the foundation; the browser compounds it. React Query is cache-aside one layer up — a client cache keyed by a query key mirroring order::id:

useQuery({ queryKey: ['order', id], queryFn: () => getOrder(id), staleTime: 30_000 })

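staleTime sets how long a cached result counts as fresh: for 30 seconds, revisiting an order paints the cached copy with no request at all. After that you get stale-while-revalidate: React Query still paints the cached copy instantly, refetches in the background, and swaps in fresh data, so there's no spinner. And the same eviction discipline applies: after a write, invalidate the matching keys.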

onSuccess: () => {
  qc.invalidateQueries({ queryKey: ['order', id] })   // FE "@CacheEvict"
  qc.invalidateQueries({ queryKey: ['orders'] })
}

Two caches, one discipline: read from the fast copy, evict it on every write.

Operating a cache without getting burned

Only cache read-heavy, staleness-tolerant data. A low hit rate is just an extra network hop.
Beware the stampede: when a hot key expires, a flood of misses can hammer the DB at once — use short, jittered TTLs.
Bound memory with TTLs plus a Redis eviction policy (allkeys-lru).
Degrade gracefully: Lettuce connects lazily, so the app boots without Redis; a cache outage should fall back to the DB.

Fast when healthy, correct always, graceful under failure. That mindset carries into the rest of Phase 2: cache strategies (Day 12), rate limiting on Redis (Day 13), and circuit breakers, retries and timeouts (Days 14–15).

Play with the live hit/miss/evict demo and read the annotated backend + frontend steps at the learning hub, and follow the whole build commit by commit in the repo — both linked up top. 🚀

How to Make an LLM 2-3x Faster Without Changing a Single Word It Says

Devanshu Biswas — Wed, 01 Jul 2026 15:43:47 +0000

Large language models are slow for one stubborn reason: they write one token at a time. To produce a 200-token answer, the model runs its full stack of billions of parameters 200 separate times, and each run has to finish before the next can start. You can't compute token 5 until you know token 4. It's a strictly sequential grind.

Worse, each run barely uses your hardware. A forward pass spends most of its time hauling weights out of memory, not doing math, so your expensive GPU sits mostly idle, one token at a time. That single fact — one token per slow, memory-bound pass — is the wall every fast-inference trick is trying to knock down.

Speculative decoding knocks it down with a trick that sounds too cheap to work: guess ahead with a small model, then have the big model check all the guesses at once. And the output comes out exactly the same as if you'd never used the trick. Same words, same order, just faster.

The insight: checking is cheaper than writing

Here's the asymmetry everything hinges on. Generating five tokens the normal way costs five slow passes. But checking five already-written tokens costs almost the same as checking one — because you pay the memory cost once and get all five verdicts in a single pass. Writing is sequential and slow. Verifying is parallel and nearly free.

So if some faster process could propose the next few tokens, the big model could confirm a whole batch of them in one shot instead of grinding them out individually. That "faster process" is a second, much smaller model.

Two models: a draft and a target

You run two models that share the same vocabulary.

The target is the big, accurate model whose output you actually want. Its answers must not change.
The draft is a small, fast model whose only job is to guess ahead cheaply.

The draft doesn't need to be smart. It just needs to be right often enough that its guesses usually survive. You keep the quality of the big model and borrow the speed of the small one.

The loop, one round at a time

1. Propose. The draft decodes K tokens ahead on its own — say 4. Because the draft is tiny, those 4 sequential guesses are quick and cheap. You now have a little chain of speculative tokens.

2. Verify. The target runs once over your current text plus all 4 guesses. Thanks to how attention works, that single pass produces the target's own predicted token at every position in parallel — as if you'd asked "what would you have written here?" at each step, simultaneously. One expensive pass, five predictions.

3. Accept. Now walk the guesses left to right and keep each one as long as it matches what the target wanted at that spot. The instant a guess disagrees, you stop. That mismatched token is rejected, and everything after it gets thrown away — those later guesses were built on a token the target won't keep. A round might accept all 4, or just 1, or 0.

4. Correct. Even when a guess is rejected, that same pass already computed the target's own token for that position, so you take it as a free correction. This guarantees progress: every round writes at least one genuine target token, even when the draft got everything wrong. And if all K guesses are accepted, the pass also gives you a bonus token for the position just past them.

So each round commits between 1 and K+1 tokens — always including at least one the target itself chose — for the cost of a single target pass.
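Steps 3 and 4 come down to a short comparison loop. The Python sketch below covers the greedy case only; the function name `verify_round` and the word-level "tokens" are illustrative assumptions, and the target's predictions are taken as given rather than computed by a model.

```python
def verify_round(draft_tokens, target_tokens):
    """Greedy acceptance for one round.

    draft_tokens:  the K tokens the draft proposed.
    target_tokens: the K+1 tokens the target predicted in its single pass
                   (one per draft position, plus one for the position after them).
    Returns the tokens committed this round.
    """
    committed = []
    for guess, wanted in zip(draft_tokens, target_tokens):
        if guess != wanted:
            committed.append(wanted)      # free correction; later guesses are discarded
            return committed
        committed.append(guess)           # target agrees, keep it
    committed.append(target_tokens[len(draft_tokens)])   # all accepted: bonus token
    return committed

print(verify_round(["the", "cat", "sat", "on"], ["the", "cat", "lay", "down", "."]))
print(verify_round(["the", "cat", "sat", "on"], ["the", "cat", "sat", "on", "the"]))
print(verify_round(["a", "b", "c", "d"], ["x", "y", "z", "w", "v"]))
```

The three calls commit 3, 5 and 1 tokens respectively: a partial match with a correction, a full match with the bonus token, and a total miss that still makes one token of progress.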

Why the output never changes

This is the part that makes it more than a hack. The final text is identical to what the target would have produced alone, because the target has veto power at every position. A draft token only survives if the target agrees; any disagreement is overwritten by the target's own choice. For sampling (not just greedy decoding) there's a cleverer probabilistic accept/reject rule that provably reproduces the exact same output distribution as sampling from the target directly. The draft never injects its opinions — it only proposes candidates the target is free to confirm or reject.

Lossless. That word matters. You are not trading quality for speed here.

What actually drives the speedup

Everything rides on how often the draft guesses right — the acceptance rate.

Draft usually correct → most rounds accept the whole run of K tokens → many tokens committed per target pass → speedup approaches K+1.
Draft often wrong → rounds accept a token or two → you barely beat the plain baseline.

In practice, a decent draft on predictable text gets you roughly 2–3× fewer target passes for the same output. That's why it's a staple of production serving stacks.
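To see how acceptance rate maps to tokens per target pass, here is a back-of-the-envelope sketch. It assumes each draft token is accepted independently with the same probability `accept_prob`, which real text does not guarantee, and it ignores the draft's own cost. The function name is hypothetical.

```python
def expected_tokens_per_round(accept_prob: float, k: int) -> float:
    """Expected tokens committed per target pass: 1 + a + a^2 + ... + a^k.

    Term i is the probability that the first i guesses were all accepted;
    the leading 1 is the correction or bonus token every round produces.
    """
    return sum(accept_prob ** i for i in range(k + 1))


for a in (0.0, 0.5, 0.8, 1.0):
    print(a, round(expected_tokens_per_round(a, k=4), 3))
```

At `accept_prob = 1.0` the result is K+1, and at `0.0` it is exactly 1, matching the two extremes described above.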

When it helps, when it hurts

It shines on predictable, low-entropy text — code, structured formats, obvious continuations — where guesses land often and accepted runs are long. It helps most when the target is large and memory-bound, so parallel verification is a big relative win.

It helps less, or can even hurt, when the draft is a poor match for the target, when the text is highly creative or random, or when K is set so large that most proposed tokens are wasted. The craft is picking a fast-but-decent draft and a K that fits your workload.

You rarely hand-roll the loop in production. transformers exposes it as assisted generation; vLLM and TensorRT-LLM enable it with a flag, using a draft model, n-gram lookups, or Medusa heads. Same output, fewer passes.

I built an interactive version where you drag a "draft accuracy" slider and watch the accept rate — and the speedup — climb in real time:

https://dev48v.infy.uk/ai/days/day21-speculative-decoding.html

Your gradient dies on the way to layer 1 (and how to save it)

Devanshu Biswas — Wed, 01 Jul 2026 15:43:05 +0000

Stack enough layers and something strange happens: the network trains, the last few layers learn fine, and the first layers barely move at all. Not slowly — barely at all. For years this quietly capped how deep a network anyone could actually train. The culprit is one line of arithmetic hiding inside backpropagation, and once you see it you can't unsee it. Here it is, running on a real chain of layers in your browser.

📉 Slide the depth, pick an activation, watch the gradient vanish or explode: https://dev48v.infy.uk/dl/day21-vanishing-gradients.html

Backprop is a product, not a sum

When a network learns, backpropagation figures out how the loss changes with respect to every weight, working backwards from the output to the input. The important structural fact is how the gradient travels: at every layer it gets multiplied by that layer's local factor, roughly the weight magnitude times the derivative of the activation, |w| · f'(x).

Multiplied. Not added. So the gradient that finally reaches the first layer is a long product of these per-layer factors — one for every layer in between. And products of many numbers are fragile in a way sums never are.

Below 1, it vanishes

Suppose each factor is a little under 1, say 0.9. Sounds harmless. But 0.9 to the 50th power is about 0.005, and by 100 layers it's about 0.00003. The shrinkage is exponential in depth, so it sneaks up fast: a factor that looks perfectly reasonable at one layer becomes catastrophic when you compound it dozens of times.
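The arithmetic is easy to check yourself. This sketch assumes every layer contributes the same scalar factor, which is a simplification of a real network; `gradient_after` is a hypothetical helper name.

```python
def gradient_after(factor: float, depth: int) -> float:
    """Gradient magnitude after passing back through `depth` layers,
    each multiplying it by the same `factor` (starting from 1.0)."""
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad


for depth in (1, 10, 50, 100):
    print(depth, gradient_after(0.9, depth))
```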

When the gradient reaching the earliest layers is essentially zero, those layers get almost no update signal and effectively stop learning. Only the layers near the output train at all. That's the vanishing gradient problem, and in the demo you can watch it directly: with sigmoid and depth 16, the bar for layer 1 is flush with the floor of the chart at around 1e-9.

Above 1, it explodes

The mirror image is just as deadly. If each factor is greater than 1, the product grows exponentially instead of shrinking. 1.5^20 is already over 3,000; 2^20 is over a million. An exploding gradient produces enormous weight updates that overshoot wildly and can send your parameters to NaN within a few steps. This is especially common in recurrent networks, where the same weight matrix is applied at every timestep, so a long sequence is effectively a very deep chain multiplying the same factor over and over. Drag the weight-scale slider up in the demo and the bars turn amber as the gradient rockets into the thousands.

Sigmoid and tanh make it worse on purpose

The classic activations actively push the factors below 1. The sigmoid's derivative is s·(1−s), which maxes out at just 0.25 at the center and is far smaller in the flat tails where big inputs land. So before you even consider the weights, a single sigmoid layer can multiply the gradient by at most a quarter. Stack 19 of them and even the best-case product is minuscule: 0.25^19 is about 3.6e-12.
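A quick sketch to confirm the 0.25 ceiling and what it does to a chain. The helper names are hypothetical, and the chain ignores the weight term, so it shows the best case for the activation alone.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)          # s * (1 - s), largest at x = 0


print(sigmoid_grad(0.0))          # the peak of the derivative
print(sigmoid_grad(5.0))          # deep in the flat tail
print(0.25 ** 19)                 # best-case product through 19 sigmoid layers
```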

Tanh is a bit kinder — its derivative peaks at 1 — but it too saturates toward 0 for large inputs. Squashing activations in deep stacks all but guarantee vanishing. That's exactly why the demo defaults to sigmoid to show the effect.

Why deep nets stalled

For a long stretch this single phenomenon was the ceiling. People stacked many sigmoid or tanh layers, the early layers refused to learn, and "deep" networks performed no better than shallow ones. It made depth look like a dead end. The workarounds were fiddly — greedy layer-by-layer pretraining, hand-tuned learning rates, staying shallow. None of them fixed the underlying multiplication problem.

The breakthrough wasn't one magic trick. It was a cluster of fixes that each do the same job: keep the per-layer factor near 1.

The fixes

ReLU. Its derivative is either 0 (negative inputs) or exactly 1 (positive inputs) — no shrinking 0.25 cap. Every active neuron passes the gradient through undamped, so a chain of active ReLUs multiplies by 1 at each step and the product doesn't decay. This is the single biggest reason ReLU replaced sigmoid as the default hidden activation.
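The same chain, with ReLU's derivative in place of sigmoid's, looks like this. It is a scalar sketch that ignores the weights; `relu_grad` and `chain_factor` are hypothetical names.

```python
from typing import Iterable


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0   # 1 for active units, 0 otherwise


def chain_factor(pre_activations: Iterable[float]) -> float:
    """Product of ReLU derivatives along one path through the network."""
    product = 1.0
    for x in pre_activations:
        product *= relu_grad(x)
    return product


print(chain_factor([0.5] * 50))        # 50 active units: no decay at all
print(chain_factor([0.5, -0.1, 2.0]))  # one inactive unit blocks this path
```

The second call shows the flip side: a unit with negative input passes nothing back, so the gradient has to reach earlier layers through other, active paths.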

Weight initialization. Even with a good activation, the |w| part matters. Xavier (Glorot) init sets the weight variance to about 1/fan_in (the original formula, 2/(fan_in + fan_out), reduces to this when the two are equal), keeping variance roughly constant across layers for tanh-like activations. He init uses 2/fan_in, where the extra factor of two compensates for ReLU zeroing half its inputs, and is the standard partner for ReLU. Both pick the starting scale so the per-layer factor lands close to 1 on average. In the demo, the He + ReLU preset drops every factor onto the green "stable = 1" line.
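Here is a small sketch of why the factor of two matters. It uses the standard expected-value argument for a linear layer followed by ReLU, assuming zero-mean independent weights and pre-activations that are symmetric around zero. The function names are hypothetical, and `xavier_std` uses the simplified 1/fan_in form.

```python
import math


def xavier_std(fan_in: int) -> float:
    return math.sqrt(1.0 / fan_in)     # simplified Xavier: variance 1/fan_in


def he_std(fan_in: int) -> float:
    return math.sqrt(2.0 / fan_in)     # He: variance 2/fan_in


def relu_layer_gain(weight_std: float, fan_in: int) -> float:
    """Expected ratio of output to input mean-square for linear + ReLU.

    The linear part scales it by fan_in * variance; ReLU then zeroes
    the negative half, which halves it.
    """
    return 0.5 * fan_in * weight_std ** 2


fan_in = 256
print(relu_layer_gain(he_std(fan_in), fan_in))             # about 1 per layer
print(relu_layer_gain(xavier_std(fan_in), fan_in))         # about 0.5 per layer
print(relu_layer_gain(xavier_std(fan_in), fan_in) ** 20)   # compounded over 20 layers
```

Under these assumptions, pairing the 1/fan_in scale with ReLU halves the signal at every layer, which is the same compounding problem in another form.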

Gradient clipping. For the exploding case, especially in RNNs, you measure the gradient's norm and, if it exceeds a threshold, rescale the whole vector down. Same direction, capped length. Cheap and reliable.
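As a framework-free sketch, clipping by norm on a flat list of gradient values looks like this. `clip_by_norm` is a hypothetical helper; deep learning frameworks ship their own versions that operate on all parameter tensors at once.

```python
import math
from typing import List


def clip_by_norm(grad: List[float], max_norm: float) -> List[float]:
    """Rescale `grad` so its L2 norm is at most `max_norm`, keeping its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)              # already short enough, leave it alone
    scale = max_norm / norm
    return [g * scale for g in grad]


print(clip_by_norm([3.0, 4.0], max_norm=1.0))   # norm 5 scaled down to norm 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))   # norm 0.5, unchanged
```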

Batch norm and residual connections. Batch norm re-centers each layer's pre-activations into the healthy region where derivatives aren't tiny. Residual connections add the input back — y = x + F(x) — so the gradient gets a straight-through +1 path: dy/dx = 1 + dF/dx. Even if F's own gradient is small, the gradient flows around the block. That single trick is what let ResNets train hundreds of layers deep.
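The +1 path is easy to see numerically. This scalar sketch compares a plain stack, where each block multiplies the gradient by dF/dx, with a residual stack, where each block multiplies by 1 + dF/dx. The value 0.01 and the name `chain_gradient` are made up for illustration.

```python
from typing import List


def chain_gradient(local_grads: List[float], residual: bool) -> float:
    """Product of per-block factors: dF/dx for a plain stack,
    1 + dF/dx when each block has a skip connection."""
    product = 1.0
    for g in local_grads:
        product *= (1.0 + g) if residual else g
    return product


weak_blocks = [0.01] * 50   # every block's own gradient dF/dx is tiny
plain = chain_gradient(weak_blocks, residual=False)
skip = chain_gradient(weak_blocks, residual=True)
print(plain)   # vanishes
print(skip)    # stays on the order of 1
```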

One idea to remember

Because backprop multiplies a factor at every layer, keeping that factor near 1 is the whole game. Below 1 it vanishes, above 1 it explodes. ReLU, He init, clipping, batch norm, residual connections — and the gates inside LSTMs for sequences — are all just different ways of pinning that factor to roughly 1 so the gradient survives the trip from output to input.

🔨 Built from a real forward pass and chain-rule product on the page — no framework: https://dev48v.infy.uk/dl/day21-vanishing-gradients.html

Part of DeepLearningFromZero. 🌐 https://dev48v.infy.uk

AdaBoost from Scratch: How a Pile of Dumb Rules Becomes a Smart Classifier

Devanshu Biswas — Wed, 01 Jul 2026 15:42:23 +0000

Here is a question that sounds like a trick: can you build an accurate classifier out of models that are barely better than flipping a coin?

Surprisingly, yes. That is the whole idea behind boosting, and AdaBoost is the algorithm that made it famous. I built it from scratch and dropped it into an interactive demo — here's how it actually works, real math, no hand-waving.

Play with the live version: https://dev48v.infy.uk/ml/day21-adaboost.html

The weak learner: a decision stump

AdaBoost's building block is the simplest classifier you can imagine: a decision stump. It is a decision tree with exactly one split. Look at one feature, compare it to one threshold, and call everything on one side "+1" and everything on the other side "−1". That's it. One line, one cut.

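import numpy as np

def stump_predict(X, dim, thresh, polarity):
    # One split on feature `dim`; returns an array of +1 / -1 labels.
    pred = np.ones(len(X))
    if polarity == 1:
        pred[X[:, dim] <= thresh] = -1   # at or below the threshold -> -1
    else:
        pred[X[:, dim] >  thresh] = -1   # above the threshold -> -1
    return pred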

On anything that isn't trivially separable, a single stump is hopeless: on a checkerboard layout it tops out around 55-60% accuracy. That is exactly why it's a "weak learner": a model that only beats random guessing by a hair. The magic is in how we combine hundreds of them.

Sample weights: a moving spotlight

The engine of AdaBoost is a weight on every training point that says "how much does getting this one right matter?" Everything starts equal:

n = len(X)
w = np.full(n, 1.0 / n)   # uniform: every point weighs 1/n

These weights are a probability distribution, so they sum to 1. After each round they change: points we got right get lighter, points we missed get heavier. Since we always pick the next stump to minimise weighted error, the heavy points end up dominating the search. The next stump is effectively forced to stare at whatever the committee keeps getting wrong.

Weighted error, not a plain count

When we hunt for the best stump each round, we don't count mistakes — we add up the weight of the mistakes:

def weighted_error(pred, y, w):
    return w[pred != y].sum()   # weight of the misses, not the count

Early on, with uniform weights, this is just the usual error rate. But once some points are heavy, a stump that nails those heavy points scores a low weighted error even if it fumbles a few light ones. So "best stump" quietly shifts every round toward the current hard cases — and we never had to tell it which points are hard. The weights say it for us.
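A tiny made-up example shows the shift. It uses plain Python lists instead of NumPy so it runs with no dependencies. The labels, weights, and the two sets of predictions are invented for illustration.

```python
def weighted_error(pred, y, w):
    """Total weight of the misclassified points (list version of the function above)."""
    return sum(wi for p, yi, wi in zip(pred, y, w) if p != yi)


y = [1, 1, -1, -1, 1]
w = [0.6, 0.1, 0.1, 0.1, 0.1]        # point 0 has become heavy

pred_a = [-1, 1, -1, -1, 1]          # stump A: one miss, but on the heavy point
pred_b = [1, -1, 1, -1, 1]           # stump B: two misses, both on light points

print(weighted_error(pred_a, y, w))  # about 0.6
print(weighted_error(pred_b, y, w))  # about 0.2
```

Stump B makes more mistakes by count yet wins on weighted error, so it is the one AdaBoost would pick.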

The alpha formula: why the logarithm?

Once we know a stump's weighted error, we decide how loud its vote will be in the final ensemble:

eps = 1e-10
err = min(max(err, eps), 1 - eps)      # guard the log
alpha = 0.5 * np.log((1 - err) / err)

Stare at the shape of that formula, because every piece earns its place:

When err → 0, the ratio (1-err)/err explodes and alpha → +∞. A near-perfect stump dominates.
When err = 0.5, the ratio is 1, ln(1) = 0, so a coin-flip stump gets alpha = 0 — no say at all.
When err > 0.5, the log goes negative, so a worse-than-random stump gets a negative alpha and its vote is simply flipped.

The logarithm isn't decoration. It's the exact value that minimises the exponential loss for the stump just added, and AdaBoost is greedily minimising that loss one stump at a time. That is why, as long as each stump beats chance, boosting provably drives training error down.
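For readers who want the step behind that claim, here is a short derivation sketch. It assumes labels and stump outputs in {−1, +1} and current weights w_i that sum to 1. ε is the stump's weighted error (`err` in the code above) and h is the stump. The symbol Z is introduced here and does not appear in the original.

```latex
\begin{aligned}
Z(\alpha) &= \sum_i w_i \, e^{-\alpha\, y_i h(x_i)}
           = (1-\varepsilon)\,e^{-\alpha} + \varepsilon\,e^{\alpha} \\
\frac{dZ}{d\alpha} &= -(1-\varepsilon)\,e^{-\alpha} + \varepsilon\,e^{\alpha} = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\varepsilon}{\varepsilon}
  \;\Longrightarrow\; \alpha = \tfrac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon} \\
Z_{\min} &= 2\sqrt{\varepsilon\,(1-\varepsilon)} \;<\; 1
  \quad \text{whenever } \varepsilon \neq \tfrac{1}{2}
\end{aligned}
```

The training error of the final vote is bounded above by the product of the per-round Z values, so every round with ε different from 1/2 shrinks that bound.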

Reweighting: grow the misses, renormalise

Now we reshape the weights so the next stump faces a harder problem:

pred = stump_predict(X, dim, thresh, polarity)
w = w * np.exp(-alpha * y * pred)   # right shrinks, wrong grows by exp(alpha)
w = w / w.sum()                     # renormalise so sum(w) == 1 again

When the stump is right, y * pred = +1, the exponent is negative, and the weight shrinks. When it's wrong the weight grows by exactly exp(alpha) — a confident stump reweights harder. Then we divide by the total so the weights sum back to 1, a valid distribution again.
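One consequence worth checking by hand: with alpha chosen by the formula above, the stump that was just fitted ends up with a weighted error of exactly one half under the new weights, so the next round cannot simply pick it again. The sketch below uses plain Python lists and a made-up ten-point example. `reweight` is a hypothetical helper that mirrors the two NumPy lines above.

```python
import math


def reweight(w, y, pred, alpha):
    """Grow the misses, shrink the hits, then renormalise to sum to 1."""
    raw = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
    total = sum(raw)
    return [r / total for r in raw]


y    = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
pred = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]   # this stump misses points 6, 7, 8
w    = [0.1] * 10

err = sum(wi for p, yi, wi in zip(pred, y, w) if p != yi)      # about 0.3
alpha = 0.5 * math.log((1 - err) / err)
new_w = reweight(w, y, pred, alpha)
new_err = sum(wi for p, yi, wi in zip(pred, y, new_w) if p != yi)

print(sum(new_w))   # back to 1, up to rounding
print(new_err)      # one half, up to rounding
```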

I verified the chain numerically: after every round the renormalised weights sum to 1.0 to ten decimals, and alpha tracks the formula exactly (0.0 at err=0.5, 1.099 at err=0.1, 2.298 at err=0.01). In the demo this is why the misclassified points visibly swell round after round.

The strong classifier: a weighted vote

The final model isn't a plain majority vote. It's a weighted one:

def predict(ensemble, X):
    total = np.zeros(len(X))
    for alpha, dim, thr, pol in ensemble:
        total += alpha * stump_predict(X, dim, thr, pol)
    return np.sign(total)

Ask every stump for its ±1 answer, scale each by its alpha, add them up, take the sign. Confident stumps swing the sum hard; weak ones barely nudge it. Formally, F(x) = sign(Σ αₜ·hₜ(x)) — an additive model. In my demo, the blocky shaded background is the sign of exactly this sum evaluated across the whole plane. On XOR-style data I watched it climb from 60% train accuracy to 85% over 25 rounds, with each individual stump still stuck near 40% error the entire time. That is the payoff: no single learner improved, but the committee did.
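To see the pieces working end to end, here is a dependency-free sketch of the whole loop on a small one-dimensional toy set. It is not the demo's code. It follows the same stump convention and formulas as above, tries every data value as a threshold, and breaks a zero-sum tie toward +1 where `np.sign` would return 0. The dataset is a ten-point toy problem that no single stump can fit.

```python
import math


def stump_predict(xs, thresh, polarity):
    """Same convention as the stump above, for 1-D inputs."""
    if polarity == 1:
        return [-1 if x <= thresh else 1 for x in xs]
    return [1 if x <= thresh else -1 for x in xs]


def best_stump(xs, ys, w):
    """Exhaustive search for the (thresh, polarity) with the lowest weighted error."""
    best = None
    for thresh in sorted(set(xs)):
        for polarity in (1, -1):
            pred = stump_predict(xs, thresh, polarity)
            err = sum(wi for p, y, wi in zip(pred, ys, w) if p != y)
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best


def fit(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                                   # uniform start
    ensemble = []
    for _ in range(rounds):
        err, thresh, polarity = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)           # guard the log
        alpha = 0.5 * math.log((1 - err) / err)
        pred = stump_predict(xs, thresh, polarity)
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, pred)]
        total = sum(w)
        w = [wi / total for wi in w]                    # renormalise
        ensemble.append((alpha, thresh, polarity))
    return ensemble


def predict(ensemble, xs):
    totals = [0.0] * len(xs)
    for alpha, thresh, polarity in ensemble:
        for i, p in enumerate(stump_predict(xs, thresh, polarity)):
            totals[i] += alpha * p
    return [1 if t >= 0 else -1 for t in totals]        # ties go to +1


def accuracy(ensemble, xs, ys):
    return sum(p == y for p, y in zip(predict(ensemble, xs), ys)) / len(xs)


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 2, 3):
    print(rounds, accuracy(fit(xs, ys, rounds), xs, ys))
```

On this toy set the best single stump gets 7 of 10 right, and three weighted stumps classify all ten correctly.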

Boosting cuts bias

Contrast it with random forests. Bagging averages many strong, low-bias trees to cut variance. Boosting does the opposite: it starts with high-bias stumps that badly underfit and adds them one at a time, each one focused on the points the ensemble so far still gets wrong. So the ensemble's bias falls steadily and the boundary grows more expressive every round. Boosting turns underfitting models into a flexible one, and that's its signature.

When to stop

Boosting can overshoot. Enough rounds will drive training error to zero, but past a point AdaBoost starts fitting the noise and test error creeps back up. Because it weights hard points heavily, it's especially touchy about mislabelled examples and outliers — it keeps doubling down on points it can never win. The cures are the usual: cap the number of rounds, shrink each alpha with a learning rate, and pick both with cross-validation.

In practice

You'd never hand-roll this in production. Scikit-learn hands it to you in one object:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a stump
    n_estimators=50,     # T rounds
    learning_rate=1.0,   # scales each alpha; values below 1.0 shrink it
)
clf.fit(X_train, y_train)

max_depth=1 is exactly our stump. n_estimators is the number of rounds. learning_rate is the alpha-shrinkage that fights overfitting. Everything maps straight onto the loop we just built by hand.

AdaBoost with stumps is what powered the classic Viola-Jones face detector that made real-time face detection possible. Gradient boosting (XGBoost, LightGBM) has largely taken over since, but AdaBoost is still the clearest way to see boosting: reweight, refit, revote.

Drag the rounds slider and watch it happen live: https://dev48v.infy.uk/ml/day21-adaboost.html