Week 2 of my series "System Design in Action", where I build meaningful demos to explain real-world distributed systems concepts.

🔹 Consistent Hashing

Ever wondered what really happens when a server is added to a distributed system like a cache or database? I built a tiny Go demo to see exactly that, comparing Consistent Hashing against the basic Naive Modulo Hashing. It’s a hands-on look at why modern distributed systems like DynamoDB, Cassandra, and Redis Cluster all rely on hash rings.

What I Built:
I created a command-line application in Go that implements two hashing algorithms:
- A Consistent Hashing ring with support for virtual nodes to ensure balanced distribution.
- A Naive Modulo Hasher to serve as a baseline for comparison.

How I Tested It:
I ran a script that simulated common real-world scenarios:
📈 Scaling Up: adding a new server to an existing 3-node cluster.
💥 Node Failure: removing a server to simulate an outage or decommissioning.
For each scenario, the tool mapped 50,000 keys and counted exactly how many had to be moved under each hashing method.

My Findings:
When scaling the cluster from 3 nodes to 4:
✅ Consistent Hashing: only ~25% of keys were remapped. The disruption was minimal and localized, affecting only the keys that needed to move to the new node.
❌ Naive Hashing: a staggering ~75% of keys were remapped, triggering a catastrophic, system-wide data shuffle.

This is why consistent hashing is critical for stability. In a real system, the massive data movement caused by naive hashing would mean widespread cache misses, a flooded network, and a huge performance hit. Consistent hashing keeps the system calm and predictable during changes.

The code is on GitHub if you want to run the demo yourself! https://lnkd.in/enEeidj5

#SystemDesign #DistributedSystems #Go #Golang #Backend #ConsistentHashing #Scalability
Building a Consistent Hashing Demo in Go for Distributed Systems
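For readers who want to see the mechanics without cloning the repo, here is a minimal Go sketch of a consistent-hashing ring with virtual nodes, plus a key-movement measurement in the spirit of the demo above. The names (Ring, AddNode, Get) and the virtual-node count are illustrative assumptions, not necessarily what the linked repository uses.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
	"strconv"
)

// Ring is a minimal consistent-hashing ring with virtual nodes.
type Ring struct {
	vnodes int               // virtual nodes per physical node
	hashes []uint32          // sorted hashes of all virtual nodes
	owner  map[uint32]string // virtual-node hash -> physical node name
}

func NewRing(vnodes int) *Ring {
	return &Ring{vnodes: vnodes, owner: make(map[uint32]string)}
}

// AddNode places vnodes replicas of the node on the ring.
func (r *Ring) AddNode(name string) {
	for i := 0; i < r.vnodes; i++ {
		h := crc32.ChecksumIEEE([]byte(name + "#" + strconv.Itoa(i)))
		r.owner[h] = name
		r.hashes = append(r.hashes, h)
	}
	sort.Slice(r.hashes, func(a, b int) bool { return r.hashes[a] < r.hashes[b] })
}

// Get returns the first node clockwise from the key's hash.
func (r *Ring) Get(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.hashes[i]]
}

func main() {
	ring := NewRing(100)
	for _, n := range []string{"node-1", "node-2", "node-3"} {
		ring.AddNode(n)
	}

	// Record where 50,000 keys live on the 3-node cluster.
	before := map[string]string{}
	for k := 0; k < 50000; k++ {
		key := "key-" + strconv.Itoa(k)
		before[key] = ring.Get(key)
	}

	ring.AddNode("node-4") // scale up

	moved := 0
	for key, old := range before {
		if ring.Get(key) != old {
			moved++
		}
	}
	fmt.Printf("moved %d of %d keys (%.1f%%)\n", moved, len(before), 100*float64(moved)/float64(len(before)))
}
```

Running this with 100 virtual nodes per physical node should show roughly a quarter of the 50,000 keys moving when the fourth node joins, in line with the ~25% figure reported above.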
More Relevant Posts
-
Step-by-Step Guide to Capture Heap Snapshots in Node.js on Kubernetes

Memory leaks are a common issue in Node.js applications, especially when running inside containerized environments like Kubernetes. A memory leak occurs when an application keeps allocating memory without releasing it, leading to increased memory usage and eventually to container restarts or crashes. Here’s a typical graph of what a memory leak looks like: memory usage constantly increasing over time until the pod is restarted.

Common culprits include:
- Unclosed Redis or database connections
- Global variables accumulating data
- Event listeners that are never removed

In short: memory leaks happen more often than you think, especially in complex services. And the tricky part? You won’t always see them in your local environment, since production traffic and timeouts behave differently. In our case, for example, the leak was caused by too many requests to an external API with a very long default timeout, resulting in pending requests piling up in memory.

Pro tip: Always define your own timeout! Never rely on external… https://lnkd.in/gqN83hq7
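The pro tip is language-agnostic. Since the code elsewhere on this page is in Go, here is a minimal Go sketch of the "set your own timeout" rule; the URL and the 3-second value are placeholders, and Go's http.DefaultClient (which has no timeout) stands in for the post's "very long default timeout".

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Never rely on a library's default: http.DefaultClient has no timeout,
	// so slow upstreams can leave requests pending and holding memory.
	client := &http.Client{Timeout: 3 * time.Second} // explicit upper bound

	resp, err := client.Get("https://api.example.com/health") // placeholder URL
	if err != nil {
		fmt.Println("request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```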
-
The CAP theorem isn't theoretical at 3 AM.

PagerDuty went off like a fire alarm for cascading read failures in our EU cluster. We'd assumed our multi-AZ Redis setup was bulletproof; a rookie mistake, honestly. Here's the hard truth we learned about what distributed consistency actually costs.

1. Availability vs. Consistency isn't a debate. It's a dial you're forced to turn during a network partition. Our system chose Availability, serving stale cache data for 12 agonizing minutes until the partition healed. We sacrificed C for A without even realizing we'd made the choice.

2. Consensus algorithms are slow for a reason. We relied on etcd for service discovery, and its Raft implementation is solid. But when the leader node got isolated, the election process to reach quorum (needing 2 of 3 nodes to agree) took a full 32 seconds. That’s an eternity when thousands of requests are timing out.

3. Split-brain is the real monster under the bed. For about 90 seconds, we had two Kafka brokers that both thought they were the partition leader. Result? Divergent data streams and a painful, manual reconciliation process that took engineers offline for hours. Fencing wasn't just a good idea; it was the only idea that would have saved us (see the sketch below).

4. "Eventually consistent" can mean "wrong for a while." Many teams hear "eventual" and think "fast enough." But when our read replicas lagged by 8 seconds during a traffic spike on AWS Aurora, users were seeing outdated inventory, leading to oversold items. That's a direct revenue hit.

5. Your observability stack is part of the system. Our Prometheus server was on one side of the partition. It saw a perfectly healthy world, while the other half of our Kubernetes cluster was on fire. We were flying blind. Distributed tracing isn't a luxury; it's your only source of truth when the network itself is lying to you.

These principles aren't just academic. They have teeth. What’s the hairiest distributed systems bug you've ever chased? Drop your war stories below.

#DevOps #DistributedSystems #CAPtheorem #SiteReliabilityEngineering
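Point 3 names fencing as the missing safeguard. Purely as an illustration, here is a minimal Go sketch of the fencing-token idea: a store that rejects writes carrying a token older than the newest one it has seen, so a deposed leader cannot clobber data. The types and numbers are hypothetical, not taken from the incident above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Store accepts a write only if its fencing token is at least as new as the
// highest token seen so far, so writes from a stale leader are rejected.
type Store struct {
	mu      sync.Mutex
	highest uint64
	data    map[string]string
}

var ErrFenced = errors.New("stale fencing token: writer was deposed")

func (s *Store) Write(token uint64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highest {
		return ErrFenced // old leader, reject the write
	}
	s.highest = token
	s.data[key] = value
	return nil
}

func main() {
	s := &Store{data: map[string]string{}}
	fmt.Println(s.Write(7, "order-42", "shipped")) // current leader: accepted (<nil>)
	fmt.Println(s.Write(6, "order-42", "pending")) // deposed leader: rejected
}
```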
-
New Blog Post: Elasticsearch on Docker Swarm — Part 3

This article wraps up my series on deploying Elasticsearch in Docker Swarm. We cover:
- Swarm node labeling
- NFS persistence
- HAProxy integration

Read the full article on Medium: https://lnkd.in/edmyEehw
Deploying Elasticsearch on Docker Swarm Part 3: Node Distribution & HAProxy Configuration (medium.com)
-
5 Feed Design Patterns That Cut P99 Latency By 70%.

It’s 10 AM, product launch day. Database CPU hits 95%. Feed generation slows to a crawl. You optimized the backend service. But did you optimize the read path for millions?

After implementing high-scale feed architectures 7 times, here is the proven blueprint:

1. Fanout-on-Write (Push Model). Write user content to a Kafka topic. Use worker services (running in Kubernetes) to asynchronously push the content ID to follower inboxes stored in Redis. Context/Metric: This shifts computational load from read time to write time, sustaining 500K feed requests per second with stable 10ms latency. (A minimal sketch of this pattern follows below.)

2. Read/Write Split with Aggregation. Isolate write requests to the main PostgreSQL cluster. Use dedicated read replicas for profile data and user metadata lookups during feed assembly. Context/Metric: Decoupling reduces contention on the main DB cluster, improving write throughput by 65 percent during ingestion spikes.

3. Content Caching Hierarchy. Implement a multi-layer cache: a TTL on the content ID list in Redis, and a shorter TTL (5 mins) on the fully rendered feed block stored in Memcached. Context/Metric: Crucial for Black Friday-level traffic. This strategy reduced average feed generation time from 800ms to 150ms.

4. Timeline Service Isolation. Dedicate a small set of microservices (e.g., written in Golang) solely to the personalized ranking and merging of content sources. Context/Metric: This prevents cascading failures. If the ranking algorithm fails, the core content delivery pipeline remains intact, minimizing Mean Time To Recovery (MTTR).

5. Backpressure via Rate Limiting. Implement API Gateway throttling and use Kubernetes HPA based on queue depth (e.g., Kafka lag) instead of CPU load alone. Context/Metric: Prevents resource exhaustion on the content ingestion API during sudden bursts, stabilizing latency in the timeline services.

Your next traffic spike will test these decisions. Prepare now or debug the database at 3am.

Which pattern saved you during your last high-volume traffic surge? What critical component did I miss? Save this for your next architecture review.

#SystemDesign #Microservices #Kubernetes #PlatformEngineering
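Here is a deliberately stripped-down Go sketch of pattern 1, fanout-on-write. A buffered channel stands in for the Kafka topic and an in-memory map for the Redis inboxes, so the real infrastructure (and the quoted numbers) are assumptions outside this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

type Post struct {
	ID     string
	Author string
}

// followers maps an author to the users who should see their posts.
var followers = map[string][]string{
	"alice": {"bob", "carol"},
}

// inboxes stands in for per-user Redis lists of content IDs.
var (
	mu      sync.Mutex
	inboxes = map[string][]string{}
)

// fanoutWorker drains the "topic" and pushes content IDs into follower inboxes,
// shifting the cost to write time so a read is a single inbox lookup.
func fanoutWorker(topic <-chan Post, done chan<- struct{}) {
	for p := range topic {
		for _, f := range followers[p.Author] {
			mu.Lock()
			inboxes[f] = append(inboxes[f], p.ID)
			mu.Unlock()
		}
	}
	done <- struct{}{}
}

func main() {
	topic := make(chan Post, 16) // stand-in for a Kafka topic
	done := make(chan struct{})
	go fanoutWorker(topic, done)

	topic <- Post{ID: "post-1", Author: "alice"}
	topic <- Post{ID: "post-2", Author: "alice"}
	close(topic)
	<-done

	fmt.Println("bob's feed:", inboxes["bob"]) // read path: just load the inbox
}
```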
-
#Day6 of #HighLevelDesignJourney: #Hashing & #ConsistentHashing

Today I explored one of the most important concepts used in system design, hashing, and its advanced version, the consistent hashing algorithm 💡

🔹 What is Hashing?
Hashing is a technique to map data of any size to a fixed-size value called a hash (or hash code). It’s widely used in:
- Load balancing
- Caching
- Database sharding
- Data retrieval (like in hash maps)

🔹 #ConsistentHashingAlgorithm
Consistent hashing is a special technique used to minimize data movement when nodes (servers) are added or removed from a system.

✅ Key Idea:
- Imagine a circle (called a hash ring).
- Both data and servers are mapped onto this ring using a hash function.
- Each data item is stored on the next server clockwise on the ring.
- When a server joins or leaves, only a small portion of data needs to be redistributed — not all!

🌀 Why it’s used:
- Reduces rehashing
- Helps with scalability
- Used in distributed systems like Cassandra and Amazon DynamoDB.

#100DaysOfHLD #SystemDesign #Hashing #ConsistentHashing #BackendEngineering #ScalableSystems #HighLevelDesign #DistributedSystems #TechLearning #DeveloperJourney #EngineeringMindset #SoftwareArchitecture #CodingCommunity #TechContent #LearningInPublic #DevJourney #TechForEveryone #SystemDesignDaily #ProgrammingLife #CloudComputing
-
This blog post delves into a dichotomy within the tech industry, where one camp prioritizes buzzwords and cutting-edge, complex technologies like Kafka, while the other favors practical, straightforward solutions grounded in common sense. The first camp often embraces overengineered, cloud-scale tools influenced by resume-driven design and vendor hype, whereas the second camp opts for minimalistic, practical implementations. The author emphasizes minimum viable infrastructure (MVI): build the smallest system that still provides value. The author observes a gradual shift towards the latter camp, citing trends such as the Small Data movement and the increasing popularity of using Postgres for diverse tasks (the “Just Use Postgres” approach) due to its simplicity and reliability.
-
How do you improve the behaviour of open source tools without forking them? Here is an example of how we do it at Palantir:

Making Elasticsearch Bulletproof Without Forking It

Palantir runs 300+ Elasticsearch clusters. The problem? Unpredictable API usage from upstream services was causing outages.

The root cause
- Concurrent sync refreshes exhausting thread pools
- wait_for policies blocking indefinitely with long refresh intervals

The fix
Built a custom ES plugin that automatically rewrites dangerous refresh policies on the fly, with no code changes needed across hundreds of services.

The insight
Make the infrastructure itself defensive. We use ES plugins to enforce safety without forking the codebase. Smart telemetry alerts teams when rewrites happen so they can fix the root cause over time.

Build guardrails once, protect everywhere.

#Infrastructure #Elasticsearch #Reliability #SoftwareEngineering
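Palantir's actual fix is an Elasticsearch plugin built against the ES plugin API (Java). Purely to illustrate the shape of a "defensive infrastructure" guardrail, in the language used elsewhere on this page, here is a hypothetical Go reverse-proxy middleware that downgrades refresh=wait_for on incoming requests and logs when it does so. This is not how the plugin described above works, only the general idea of rewriting dangerous parameters and emitting telemetry.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// rewriteRefresh downgrades refresh=wait_for to refresh=false before the
// request reaches the cluster, and logs so the calling team can fix the root cause.
func rewriteRefresh(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query()
		if q.Get("refresh") == "wait_for" {
			q.Set("refresh", "false")
			r.URL.RawQuery = q.Encode()
			log.Printf("guardrail: rewrote refresh policy for %s %s", r.Method, r.URL.Path)
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	upstream, err := url.Parse("http://localhost:9200") // hypothetical ES endpoint
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	log.Fatal(http.ListenAndServe(":8080", rewriteRefresh(proxy)))
}
```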
-
Lock-Free Concurrency with Immutable Vector + Atomic Swap

High-performance systems need millions of reads per second with occasional updates — think consistent hashing rings, load balancer maps, or cache tables. The trick? Immutable data structures + atomic pointer swap.

How it works:
1. Keep your data structure immutable — once created, it never changes.
2. Readers just load the current pointer — no locks needed.
3. Writers build a new version of the structure and swap it atomically.
4. Old readers continue safely using the old version. Memory is cleaned up automatically via shared_ptr.

Why it’s awesome:
- Lock-free reads → super low latency
- Safe concurrent updates
- Easy reasoning — no tricky locks or races

Real-world examples: Envoy endpoint tables, Redis Cluster partition maps, consistent hashing rings in distributed systems.

In short: build immutable snapshots, swap atomically, read freely.
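The post describes the pattern in C++ terms (shared_ptr). Here is a minimal sketch of the same idea in Go using atomic.Pointer (Go 1.19+): readers do one atomic load, a writer copies, modifies, and atomically swaps in the new snapshot, and the garbage collector reclaims old snapshots once no reader holds them.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot is an immutable routing table: never modified after creation.
type snapshot struct {
	backends []string
}

var current atomic.Pointer[snapshot]

// lookup is the hot read path: one atomic load, no locks.
func lookup(i int) string {
	s := current.Load()
	return s.backends[i%len(s.backends)]
}

// addBackend is the rare write path: copy, modify, atomically swap.
func addBackend(b string) {
	old := current.Load()
	next := &snapshot{backends: append(append([]string{}, old.backends...), b)}
	current.Store(next) // readers holding the old snapshot keep using it safely
}

func main() {
	current.Store(&snapshot{backends: []string{"10.0.0.1", "10.0.0.2"}})
	fmt.Println(lookup(3)) // 10.0.0.2
	addBackend("10.0.0.3")
	fmt.Println(lookup(5)) // 10.0.0.3
}
```

With more than one writer you would serialize the copy-and-swap (a mutex around addBackend, or CompareAndSwap in a retry loop); readers never block either way.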
-
Solved the Redeemer machine on Hack The Box through thorough port and service scanning and by studying the behavior of the Redis database running on the target. The process relied on careful enumeration and banner/version discovery to identify exposed services, followed by focused analysis of the Redis configuration to gain further access and escalate privileges. Key takeaways: methodical enumeration, validating assumptions at each step, and documenting findings for reproducibility and learning.