Step Functions Gone Wild: Lessons from an Event-Driven Biodiversity Platform

Today I attended an AWS session in Perth, WA, where Alison Lynton shared the brutally honest story behind CSIRO’s Nature IQ biodiversity platform - and what really happens when you go “all in” on serverless.

Nature IQ started as a clean, elegant, event-driven architecture, but production reality had other ideas. Instead of “everything scales magically,” they hit:
• Lambda throttles
• Runaway concurrency
• DynamoDB partition pain
• AVP throttling & auth-level friction

And yes - more than one Lambda-induced headache.

What made the session valuable wasn’t the final diagram, but the scars that shaped it. The architecture still uses Step Functions, Lambda, DynamoDB, and S3, but now with orchestrated control, pre-computation, and multi-tenant safety baked in.

Key technical insights:
✅ Job state belongs in DynamoDB
✅ Avoid env vars for runtime behaviour - use AWS AppConfig (see the sketch below)
✅ Guardrails matter more than “pretty diagrams”
✅ When upstream fails, everything fails → resilience must be intentional

The big takeaway: the hard part of serverless isn’t compute - it’s control.

#AWS #StepFunction #DynamoDB #AWSLambda #S3
AWS Session: Nature IQ's Serverless Biodiversity Platform
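The AppConfig point above is easy to sketch. Here is a minimal Python example using the boto3 appconfigdata API; the application, environment, and profile names are placeholders, and a real deployment would more likely use the AppConfig Lambda extension and poll on an interval rather than every invocation.

```python
import json
import boto3

# Placeholder identifiers - replace with your AppConfig application,
# environment, and configuration profile names or IDs.
APP = "nature-iq"
ENV = "prod"
PROFILE = "feature-flags"

appconfig = boto3.client("appconfigdata")

# Module scope so the session and last config survive warm invocations.
_token = None
_cached = {}


def _load_config():
    """Fetch the latest configuration from AWS AppConfig."""
    global _token, _cached
    if _token is None:
        session = appconfig.start_configuration_session(
            ApplicationIdentifier=APP,
            EnvironmentIdentifier=ENV,
            ConfigurationProfileIdentifier=PROFILE,
        )
        _token = session["InitialConfigurationToken"]

    response = appconfig.get_latest_configuration(ConfigurationToken=_token)
    _token = response["NextPollConfigurationToken"]
    body = response["Configuration"].read()
    # An empty body means the configuration is unchanged since the last poll.
    if body:
        _cached = json.loads(body)
    return _cached


def handler(event, context):
    config = _load_config()
    if config.get("precompute_enabled", False):
        pass  # run the pre-computation path
    return {"status": "ok"}
```

Unlike an environment variable, flipping a value here takes effect without redeploying the function.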
More Relevant Posts
🚨 The Hidden Danger of AWS Lambda Retries (And How to Fix It)

Your Lambda function processes a payment:
✅ Step 1: Charge the customer's card - SUCCESS
✅ Step 2: Update the database - SUCCESS
❌ Step 3: Send confirmation email - TIMEOUT

AWS Lambda automatically retries the function. Great, right? Wrong. On retry, ALL THREE STEPS execute again. Your customer just got charged twice. 💸💸

Why does this happen? Many developers assume Lambda "remembers" what succeeded in the previous attempt. It doesn't. Between retry attempts:
- Variables are garbage collected
- Memory state is completely lost
- Lambda may even spin up a fresh container
- Your boolean flags (like isProcessed = true) reset to false

Every retry = a fresh start from line 1.

💫💫💫 The Solution: Idempotency 💫💫💫

The fix isn't hoping Lambda remembers - it's making your function idempotent by persisting execution state externally. Store in DynamoDB (sketch below):
- request_id (primary key)
- status: "processing" | "completed" | "failed"
- TTL for automatic cleanup

The Result:
✅ Retry #1: Processes normally
✅ Retry #2: Sees "already completed" → skips everything
✅ Customer charged exactly once

🚨 Key Takeaway: In distributed systems, never rely on in-memory state. Always externalize your execution status. 🚨

#AWS #Lambda #CloudArchitecture #Serverless #SoftwareEngineering #DynamoDB
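A minimal sketch of that DynamoDB-backed check, assuming a table named "idempotency" with request_id as its partition key (both names are placeholders, as are the three business-step functions); the conditional put is what makes the claim race-safe across concurrent retries:

```python
import time
import boto3
from botocore.exceptions import ClientError

# Assumed table: partition key "request_id" (string), TTL enabled on "expires_at".
table = boto3.resource("dynamodb").Table("idempotency")


def handler(event, context):
    request_id = event["request_id"]

    try:
        # Claim the request atomically; fails if another attempt already claimed it.
        table.put_item(
            Item={
                "request_id": request_id,
                "status": "processing",
                "expires_at": int(time.time()) + 24 * 3600,
            },
            ConditionExpression="attribute_not_exists(request_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            record = table.get_item(Key={"request_id": request_id})["Item"]
            if record["status"] == "completed":
                return {"result": "already processed"}  # skip everything
        raise  # still "processing" or "failed" - let the retry policy decide

    # Placeholders for the three steps from the post.
    charge_card(event)
    update_database(event)
    send_confirmation_email(event)

    table.update_item(
        Key={"request_id": request_id},
        UpdateExpression="SET #s = :done",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":done": "completed"},
    )
    return {"result": "processed"}
```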
We almost killed our serverless project. Over $500.

It was a quiet Tuesday when the AWS cost alert screamed. Our brand new data pipeline, built on Lambda, was the prime suspect. Here's the humbling truth we learned about our own assumptions.

1. We treated Lambda like a tiny, cheap VM. We focused on memory and CPU, but the real cost driver was invocation duration - our p99 latency was hitting 1.2 seconds on a function that should've been sub-300ms. We were literally paying for our own inefficient code.

2. We obsessed over cold starts. A total red herring for our async workload. We spent days trying to trim 100ms with custom runtimes when a 400ms cold start had zero business impact. A classic case of premature optimization.

3. Our first function was a monolith. It handled everything from S3 event parsing to database writes. The "aha" moment was realizing this wasn't just messy - it was expensive. Breaking it into five single-purpose functions orchestrated by Step Functions actually simplified our Terraform config and cut the execution cost by 40%.

4. IAM permissions were a wildcard (lambda:*). Honestly, it was just faster during dev. But this lack of discipline made debugging impossible. The breakthrough was locking a function down to the *one* DynamoDB table it needed (see the sketch after this post). Suddenly, the blast radius was tiny and the architecture, readable.

5. We were flying completely blind. Logs were just… there. The game changed when we properly instrumented with OpenTelemetry. Seeing the full trace - from API Gateway to the final RDS write - was like turning the lights on. Our MTTR dropped by nearly 70% in the first month.

Serverless isn't about having no servers. It's about having more discipline.

What was the one serverless pattern that felt like a total game-changer for you? Save this for your team's next cost optimization review.

#DevOps #AWSLambda #Serverless #CloudCost
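The post used Terraform; purely as an illustration of the same least-privilege idea, here is what scoping a function to one table looks like in an AWS CDK (Python) sketch with made-up resource names:

```python
from aws_cdk import Stack, aws_dynamodb as dynamodb, aws_lambda as _lambda
from constructs import Construct


class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        orders = dynamodb.Table(
            self, "Orders",
            partition_key=dynamodb.Attribute(
                name="order_id", type=dynamodb.AttributeType.STRING
            ),
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
        )

        writer = _lambda.Function(
            self, "OrderWriter",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="app.handler",
            code=_lambda.Code.from_asset("src/order_writer"),
        )

        # Instead of a wildcard policy, grant exactly the access this
        # function needs: read/write on this one table, nothing else.
        orders.grant_read_write_data(writer)
```

The same scoping can be expressed in Terraform with an aws_iam_role_policy restricted to the table's ARN; the point is that the blast radius is one resource, not the account.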
I became a researcher - of my own cloud needs. Not in a lab, but inside AWS, hunting for the smallest, simplest architecture that's fast to build, easy to run, and friendly to the wallet. Here's the "starter kit" I now use for side-projects, MVPs, and PoCs.

Goal: Ship a web app + API + light analytics with near-zero ops.

Architecture (bite-sized)
• Front-end: S3 (static site) + CloudFront (CDN, HTTPS, caching)
• API: API Gateway → Lambda (Python/FastAPI via Lambda adapter - see the sketch after this post)
• State: DynamoDB (on-demand capacity) + SSM Parameter Store (secrets/config)
• Auth (optional): Amazon Cognito (hosted UI, JWTs)
• Events & ETL (optional): S3 ingest → Lambda → Step Functions → Athena
• Observability: CloudWatch Logs + Metrics, alarms via SNS
• Networking: one VPC only when needed (private Lambda subnets for VPC deps); otherwise keep it serverless/no-VPC for speed

Why this works
• Tiny blast radius: each piece is independent and replaceable
• Pay-per-use: almost nothing when idle; scales when traffic shows up
• Secure by default: managed TLS, IAM, no servers to patch
• Fast to ship: minutes to first deploy, hours to first feature

Deploy checklist
1. Create the S3 bucket (static site), attach CloudFront, route the domain in Route 53
2. Define the API in API Gateway (HTTP API), connect it to Lambda
3. Model tables in DynamoDB (on-demand capacity)
4. Store secrets in SSM Parameter Store (no hardcoded creds)
5. Add CloudWatch alarms (5XX on the API, throttles/write capacity on DynamoDB)
6. CI/CD: GitHub Actions → AWS SAM/Serverless Framework/Terraform for repeatable deploys

Cost-savvy tips
• Start with on-demand everywhere; add reserved capacity only after steady load
• Put CloudFront in front of S3 to slash egress & speed up global reads
• Batch analytics with Athena instead of long-running clusters

#AWS #Cloud #CloudArchitecture #Serverless #Lambda #APIGateway #DynamoDB #S3 #CloudFront #Cognito #DevOps #DataEngineering #Python #CostOptimization #MVP #Startup
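For the "Python/FastAPI via Lambda adapter" piece, a minimal sketch could look like this, assuming the Mangum ASGI adapter as the bridge (other adapters exist); the table name, env var, and route are placeholders:

```python
import os

import boto3
from fastapi import FastAPI, HTTPException
from mangum import Mangum

app = FastAPI()

# Placeholder table name, injected at deploy time (e.g. by SAM or Terraform).
TABLE_NAME = os.environ.get("ITEMS_TABLE", "items")
table = boto3.resource("dynamodb").Table(TABLE_NAME)


@app.get("/items/{item_id}")
def get_item(item_id: str):
    response = table.get_item(Key={"item_id": item_id})
    if "Item" not in response:
        raise HTTPException(status_code=404, detail="not found")
    return response["Item"]


# Mangum translates API Gateway (HTTP API) events into ASGI requests,
# so the same FastAPI app runs locally with uvicorn or inside Lambda.
handler = Mangum(app)
```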
The CAP theorem isn't theoretical at 3 AM.

PagerDuty was a fire alarm for cascading read failures in our EU cluster. We'd assumed our multi-AZ Redis setup was bulletproof; a rookie mistake, honestly. Here's the hard truth we learned about what distributed consistency actually costs.

1. Availability vs. Consistency isn't a debate. It's a dial you're forced to turn during a network partition. Our system chose Availability, serving stale cache data for 12 agonizing minutes until the partition healed. We sacrificed C for A without even realizing we'd made the choice.

2. Consensus algorithms are slow for a reason. We relied on etcd for service discovery, and its Raft implementation is solid. But when the leader node got isolated, the election process to reach quorum (needing 2 of 3 nodes to agree) took a full 32 seconds. That's an eternity when thousands of requests are timing out.

3. Split-brain is the real monster under the bed. For about 90 seconds, we had two Kafka brokers that both thought they were the partition leader. Result? Divergent data streams and a painful, manual reconciliation process that took engineers offline for hours. Fencing wasn't just a good idea; it was the only idea that would have saved us (see the sketch after this post).

4. "Eventually consistent" can mean "wrong for a while." Many teams hear "eventual" and think "fast enough." But when our read replicas lagged by 8 seconds during a traffic spike on AWS Aurora, users were seeing outdated inventory, leading to oversold items. That's a direct revenue hit.

5. Your observability stack is part of the system. Our Prometheus server was on one side of the partition. It saw a perfectly healthy world, while the other half of our Kubernetes cluster was on fire. We were flying blind. Distributed tracing isn't a luxury; it's your only source of truth when the network itself is lying to you.

These principles aren't just academic. They have teeth.

What's the hairiest distributed systems bug you've ever chased? Drop your war stories below.

#DevOps #DistributedSystems #CAPtheorem #SiteReliabilityEngineering
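The fencing idea from point 3 deserves a concrete sketch. This is a toy, in-memory illustration in Python (names and storage are invented for the example): the store remembers the highest fencing token it has accepted, so a deposed leader that wakes up after a partition cannot silently diverge the data.

```python
class FencedStore:
    """Toy storage node that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        # A token lower than one already seen means the writer was deposed
        # (its lock or leadership expired) - refuse the write.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.write("inventory:42", 7, token=1)      # old leader, token 1
store.write("inventory:42", 5, token=2)      # new leader elected, token 2
try:
    store.write("inventory:42", 9, token=1)  # old leader wakes up after the partition
except PermissionError as err:
    print("rejected:", err)
```

In practice the token comes from the coordination service (e.g. a lock generation in etcd or ZooKeeper) and the check lives in the storage layer, not the client.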
🤯 The Real Reason The Internet Stayed Down: The AWS Backlog Domino Effect

The Big Question: AWS fixed the core DNS error in DynamoDB quickly. So why were apps like Duolingo and Snapchat still broken hours later? 🤔 The answer lies in the messy reality of system recovery vs. system fix.

The Fix vs. The Recovery Bottleneck 🚦
When a core service fails, the immediate technical problem is solved (the "fix"), but the chaos it caused creates a massive recovery delay. This is where the magic (and misery) of asynchronous architecture comes in:

The SQS Queue Clog: Many critical app functions don't run instantly; they send a message to an SQS (Simple Queue Service) queue. When the DynamoDB database failed, all the payment, notification, and data update messages piled up in SQS.

The Lambda Overload: The Lambda functions that are supposed to process these SQS messages failed instantly because they couldn't write to the broken DynamoDB. The messages stayed put (see the sketch after this post for one way to handle that gracefully).

The Recovery Nightmare: When AWS finally fixed the DynamoDB address, the floodgates opened! 🌊 SQS queues suddenly held an entire day's worth of failed messages. The Lambda functions, now working, were instantly overwhelmed trying to clear this colossal backlog. To prevent a second crash, AWS had to throttle (deliberately slow down) the queues. This meant that even though the service was technically "back," your simple transaction was stuck waiting behind millions of others for hours.

The Lesson for Tech Leaders: A failure in one highly concentrated region (US-EAST-1) instantly paralyzes the entire downstream flow (SQS/Lambda). Your architecture must be designed with Multi-Region Redundancy to shift workloads the second the queue starts building up. 🚀
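One pattern that softens exactly this failure mode is letting an SQS-triggered Lambda report partial batch failures, so only the messages that could not be written to DynamoDB return to the queue instead of the whole batch being retried. A hedged Python sketch, assuming ReportBatchItemFailures is enabled on the event source mapping and a placeholder table name:

```python
import json

import boto3
from botocore.exceptions import ClientError

# Placeholder table - the downstream dependency that was unavailable.
table = boto3.resource("dynamodb").Table("transactions")


def handler(event, context):
    """SQS-triggered handler that reports per-message failures."""
    failed = []

    for record in event["Records"]:
        try:
            item = json.loads(record["body"])
            table.put_item(Item=item)
        except (ClientError, json.JSONDecodeError):
            # Only this message goes back to the queue; the rest of the
            # batch counts as processed and won't be replayed.
            failed.append({"itemIdentifier": record["messageId"]})

    # Requires ReportBatchItemFailures on the SQS event source mapping.
    return {"batchItemFailures": failed}
```

Combined with reserved concurrency (or the event source mapping's maximum concurrency setting), this also lets you throttle your own backlog drain on your terms rather than having the platform do it for you.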
What happens when DynamoDB goes down?

"I think last week's oncall must have felt it, because I was one of them. 🙂"

AWS just shared details about their recent outage, and honestly, their engineering response was impressive! Here's what we can learn from how they handled it.

What happened?
• A rare race condition in DynamoDB's DNS system caused the database endpoint to disappear; this cascaded to affect EC2, Lambda, Load Balancers, and even the AWS Console.

First, let me tell you how DynamoDB works:

🗄️ What is DynamoDB?
DynamoDB is AWS's fully managed NoSQL database - think of it as a super-fast digital filing cabinet that can handle millions of requests per second.

🔑 Here are some key features of DynamoDB:
• Serverless - no servers to manage, AWS handles everything.
• Lightning fast - single-digit millisecond response times.
• Auto-scaling - grows and shrinks based on your needs.
• Global Tables - your data syncs across multiple regions automatically.

🏗️ Why it's everywhere: DynamoDB isn't just for storing your app data - AWS uses it internally for:
• Managing server inventories (EC2)
• Storing function metadata (Lambda)
• Tracking user sessions (Console logins)
• Coordinating load balancer health checks

Hats off to the AWS engineers: their team found a complex DNS race condition in less than an hour. They fixed the problem, and what truly impressed me was how they shared every technical detail publicly. This turned out to be a great learning opportunity for the entire tech community. They're already building better safeguards to prevent similar issues in the future.

The real lesson? Even the best systems have edge cases. What truly matters is how fast you diagnose, how fast you fix, and how openly you learn! 🚀

Next post: I am currently learning the top 30 important system design concepts that every engineer should know. I'll share them once I complete my learning. Stay tuned and happy learning!

#AWS #DynamoDB #CloudComputing #Engineering #IncidentResponse #TechLeadership #SystemDesign
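A small, practical takeaway for application code caught in an incident like this: don't let SDK defaults decide how long you hang on an unreachable endpoint. A hedged Python sketch with boto3 (the timeout and retry values are illustrative, not recommendations, and the table name is a placeholder):

```python
import boto3
from botocore.config import Config

# Explicit timeouts and adaptive retries keep callers from piling up
# behind a dependency that is slow or unreachable.
resilient = Config(
    connect_timeout=2,
    read_timeout=2,
    retries={"mode": "adaptive", "max_attempts": 5},
)

dynamodb = boto3.client("dynamodb", config=resilient)

response = dynamodb.get_item(
    TableName="sessions",  # placeholder table name
    Key={"session_id": {"S": "abc-123"}},
)
print(response.get("Item"))
```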
Building Reliable Backends Isn't Just About Code - It's About Architecture.

In today's cloud-driven world, writing APIs is easy - but designing scalable, fault-tolerant systems is an art. Here's what I focus on when building modern backend services:
✅ Microservices with Spring Boot & Spring Cloud - each service independent, deployable, and resilient.
✅ AWS Glue + S3 + Lambda pipelines - ensuring data flows seamlessly and efficiently.
✅ Cassandra & DynamoDB - for high-availability, low-latency data storage.
✅ API optimization - caching, async calls, and load balancing for peak performance.
✅ Monitoring with CloudWatch & Prometheus - because reliability begins with visibility.

Every enhancement, every line of code, is a step toward better system design - not just working software.

💬 What's your favorite AWS service for backend scalability?

#JavaDeveloper #SpringBoot #BackendDevelopment #Microservices #AWS #CloudEngineering #ScalableArchitecture #SoftwareEngineering
Each Lambda invocation generates a CloudWatch report (duration, billed duration, memory, start/end times) - even if your code logs nothing. And you'll be charged for it!

Doesn't sound like much, but let's take a look at a simple calculation: the log lines created by a single Lambda execution come to about 262 bytes. If your function executes an average of 1 million times a day, you'll end up with 262 B x 1M executions = 262 MB per day, which makes ~8 GB per month. At $0.50 per GB of log ingestion, that's about $4 a month.

Even if that doesn't sound like much, if you're not actually using these logs, e.g. for right-sizing memory, it definitely is. For high-traffic environments, this will get expensive quite quickly.

If you don't use them, you can suppress these reports by restricting the system log level, e.g. setting it to WARN or higher. For CDK, just set "systemLogLevelV2" to "SystemLogLevel.WARN" (see the sketch after this post). Don't forget that JSON logs are required for this to work! For the CLI, you can update your function by providing this logging-config parameter: LogFormat=JSON,ApplicationLogLevel=ERROR,SystemLogLevel=WARN.

It's a simple change that can save you a lot of unnecessary costs! 💸

btw. I'm Tobi, a full-stack engineer who's broken plenty of things on AWS before getting it right. I put together some free animations that break down how AWS services work under the hood. Worth a look if you want the practical details. https://lnkd.in/e8sgvRk9
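A minimal CDK sketch of that setting, written here in Python rather than TypeScript and assuming a recent aws-cdk-lib (construct names and asset paths are placeholders):

```python
from aws_cdk import Stack, aws_lambda as _lambda
from constructs import Construct


class QuietLoggingStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        _lambda.Function(
            self, "Worker",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="app.handler",
            code=_lambda.Code.from_asset("src/worker"),
            # JSON log format is a prerequisite for the level filters below.
            logging_format=_lambda.LoggingFormat.JSON,
            application_log_level_v2=_lambda.ApplicationLogLevel.ERROR,
            # Per the post, WARN suppresses the per-invocation system logs.
            system_log_level_v2=_lambda.SystemLogLevel.WARN,
        )
```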
Managing legacy databases feels like a full-time firefight! Normalization, scaling issues, and high operational costs make relational systems hard to manage - especially when AI workloads demand flexibility. MongoDB helps admins simplify ops with a more efficient, scalable model that runs reliably in any environment. Learn more: https://lnkd.in/ghXUqx4K