feat: bot mitigation proxy & health endpoint#248

Open
thedaviddias wants to merge 2 commits into main from feat/bot-mitigation-health-endpoint

Conversation

Owner

@thedaviddias thedaviddias commented Feb 10, 2026

Summary

  • Next.js 16 proxy (proxy.ts) blocks abusive bots (DotBot, GPTBot, AI scrapers, headless browsers, vulnerability scanners) and rate-limits requests per IP + route category
  • Health endpoint (/api/health) runs on Edge runtime for zero-cold-start uptime monitoring — replaces polling GET / which triggered full SSR
  • Sentry fix: hoisted @sentry/* packages in .npmrc to resolve @sentry/node-core module-not-found errors that caused 500s on all SSR pages

Rate limits

Route                          Limit
/proxy/api/event, /api/event   10 req/min
/api/*                         20 req/min
All pages                      30 req/min
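The category lookup behind this table could look roughly like the following sketch (the limits come from the table above, but the `RouteCategory` shape, the function name, and the matching rules are illustrative assumptions, not the exact code in apps/web/lib/bot-detection.ts):

```typescript
// Illustrative per-route rate-limit categories matching the table above.
// Names and matching rules are assumptions for illustration only.
type RouteCategory = { name: string; limit: number }; // limit = requests/min

function getRouteCategory(pathname: string): RouteCategory {
  if (pathname === "/proxy/api/event" || pathname === "/api/event") {
    return { name: "event", limit: 10 };
  }
  if (pathname.startsWith("/api/")) {
    return { name: "api", limit: 20 };
  }
  return { name: "page", limit: 30 }; // everything else: regular pages
}
```

The rate-limit key would then combine the client IP with `name`, e.g. `` `${ip}:${category.name}` ``, so each IP gets a separate quota per category.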

Bot handling

Type                                         Action
Good bots (Googlebot, Bingbot, etc.)         Pass through
Bad bots (DotBot, GPTBot, scrapers)          403 Forbidden
Empty user-agent                             403 Forbidden
Vulnerability scan paths (/wp-admin, /.env)  403 Forbidden
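The decision table above boils down to something like this sketch (the regexes are abbreviated stand-ins for the full pattern lists in bot-detection.ts, and `decide` is an illustrative name, not the PR's API):

```typescript
// Illustrative handling of the four cases in the table above.
// Pattern lists are abbreviated examples, not the PR's full lists.
const BAD_BOTS = /dotbot|gptbot|headlesschrome/i;
const SUSPICIOUS_PATHS = /^\/(wp-admin|\.env)/i;

function decide(userAgent: string | null, pathname: string): 200 | 403 {
  if (!userAgent || userAgent.trim() === "") return 403; // empty UA -> 403
  if (SUSPICIOUS_PATHS.test(pathname)) return 403;       // vuln scan path -> 403
  if (BAD_BOTS.test(userAgent)) return 403;              // bad bot -> 403
  return 200; // good bots (Googlebot, Bingbot, ...) and humans pass through
}
```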

Test plan

  • curl /api/health → 200 {"status":"ok"}
  • curl -H "User-Agent: DotBot" / → 403
  • curl -H "User-Agent: GPTBot" / → 403
  • curl -H "User-Agent: Googlebot" / → 200
  • curl -H "User-Agent: " / → 403 (empty UA)
  • curl /wp-admin → 403 (vuln scan path)
  • 30 rapid requests → 200, request 31+ → 429
  • Homepage renders 200 (Sentry fix verified)
  • Post-deploy: verify no legitimate 403/429 in Sentry
  • Post-deploy: Google Search Console shows no crawl errors

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added service health monitoring endpoint.
    • Implemented bot detection and blocking to protect against malicious traffic.
    • Introduced rate limiting per user/route to prevent abuse and ensure service stability.
thedaviddias and others added 2 commits February 10, 2026 18:32
Block abusive crawlers hitting GET requests (DotBot, GPTBot, etc.) and
rate-limit per IP to prevent Plausible 429s. Add /api/health edge endpoint
for uptime monitoring without SSR overhead. Fix Sentry module resolution
by hoisting @sentry/* packages in .npmrc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

vercel bot commented Feb 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project                     Deployment  Actions           Updated (UTC)
ux-patterns-for-developers  Ready       Preview, Comment  Feb 10, 2026 11:36pm

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a health check API endpoint, a bot detection and rate limiting utility module, and a proxy middleware that integrates these components to filter requests based on bot classification and rate limit thresholds by IP and route category.

Changes

Cohort / File(s) and summaries:

  • Health Check Endpoint: apps/web/app/api/health/route.ts
    New edge-runtime GET handler that responds with status and timestamp, including cache-busting and indexing-prevention headers.
  • Bot Detection & Rate Limiting: apps/web/lib/bot-detection.ts
    New utility module providing bot classification (good/bad/suspicious/human) via user-agent pattern matching, IP extraction from headers, and in-memory rate limiting with periodic cleanup and route-based rate limit categories.
  • Proxy Middleware: apps/web/proxy.ts
    New middleware that detects bots, blocks bad bots with 403, rate-limits requests per IP and route category (returning 429 when exceeded), and passes prefetch and good-bot requests through to the application.
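A route handler matching the health-endpoint description could look like this sketch. The `export const runtime = "edge"` segment config is the standard Next.js way to opt a route into the Edge runtime; the exact response body and header choices here are assumptions based on the summary above, not the PR's code:

```typescript
// Sketch of apps/web/app/api/health/route.ts per the summary above.
// Body shape and header choices are assumptions, not the PR's exact code.
export const runtime = "edge"; // run on the Edge runtime (no SSR cold start)

export function GET(): Response {
  return new Response(
    JSON.stringify({ status: "ok", timestamp: Date.now() }),
    {
      status: 200,
      headers: {
        "content-type": "application/json",
        "cache-control": "no-store", // cache-busting: never serve stale health
        "x-robots-tag": "noindex",   // indexing prevention
      },
    },
  );
}
```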

Sequence Diagram

sequenceDiagram
    participant Client
    participant Proxy
    participant BotDetection
    participant RateLimiter
    participant NextApp as Next.js App
    
    Client->>Proxy: HTTP Request
    Proxy->>BotDetection: detectBot(userAgent, pathname)
    BotDetection-->>Proxy: BotDetectionResult
    
    alt Bad Bot Detected
        Proxy-->>Client: 403 Forbidden
    else Good Bot or Human
        Proxy->>Proxy: Check if prefetch request
        alt Prefetch Request
            Proxy-->>NextApp: NextResponse.next()
            NextApp-->>Client: Response
        else Non-prefetch
            Proxy->>BotDetection: getClientIP(request)
            BotDetection-->>Proxy: IP Address
            Proxy->>BotDetection: getRouteCategory(pathname)
            BotDetection-->>Proxy: Category & Limit
            Proxy->>RateLimiter: checkRateLimit(key, limit)
            alt Rate Limit Exceeded
                RateLimiter-->>Proxy: false
                Proxy-->>Client: 429 Too Many Requests
            else Rate Limit OK
                RateLimiter-->>Proxy: true
                Proxy-->>NextApp: NextResponse.next()
                NextApp-->>Client: Response
            end
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'feat: bot mitigation proxy & health endpoint' accurately summarizes the main changes: adding a bot mitigation proxy and a health endpoint, in line with the core objectives of the PR.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00558730eb


Comment on lines +31 to +33

```ts
const goodMatch = userAgent.match(GOOD_BOTS);
if (goodMatch) {
	return { isBot: true, botType: "good", botName: goodMatch[0] };
```


P1 Badge Check bad patterns before returning good-bot classification

detectBot returns as soon as GOOD_BOTS matches, so a user-agent like Googlebot HeadlessChrome is classified as good and then bypasses all enforcement via the proxy fast-path. Because BAD_BOTS is evaluated later, any mixed UA containing an allowlisted token can evade the intended bot/scanner blocking; evaluate bad signatures first (or treat mixed matches as bad) to avoid this bypass.
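A minimal reordering along the lines suggested (abbreviated regexes stand in for the module's real pattern lists, and `classifyUA` is an illustrative name — the real `detectBot` also takes the pathname and returns a richer result):

```typescript
// Illustrative fix: evaluate BAD_BOTS before GOOD_BOTS so a mixed UA
// such as "Googlebot HeadlessChrome" is classified as bad, not good.
const BAD_BOTS = /headlesschrome|dotbot|gptbot/i;
const GOOD_BOTS = /googlebot|bingbot/i;

type BotType = "good" | "bad" | "human";

function classifyUA(userAgent: string): BotType {
  if (BAD_BOTS.test(userAgent)) return "bad"; // bad signatures win ties
  if (GOOD_BOTS.test(userAgent)) return "good";
  return "human";
}
```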


Comment on lines +53 to +55

```ts
	return (
		forwardedFor?.split(",")[0]?.trim() || realIp || cfConnectingIp || "unknown"
	);
```


P2 Badge Derive rate-limit IP from trusted proxy headers

The rate-limit key prioritizes the first value in x-forwarded-for, which is often client-controllable or prependable in many proxy setups; an attacker can vary that value per request to bypass per-IP throttling. Use a trusted ingress-provided IP source (or only parse x-forwarded-for at a known trusted hop boundary) before building the limiter key.
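One way to apply this suggestion is to prefer an ingress-set header and, when falling back to x-forwarded-for, take the last hop rather than the first. Which header is actually trustworthy depends on the platform; treating x-real-ip as trusted here is an assumption for illustration, and `getTrustedClientIP` is a hypothetical name:

```typescript
// Sketch of IP extraction from trusted headers only, per the suggestion
// above. The trusted-header choice (x-real-ip) is platform-dependent and
// assumed here for illustration.
function getTrustedClientIP(headers: { get(name: string): string | null }): string {
  const realIp = headers.get("x-real-ip"); // typically set by the ingress itself
  if (realIp) return realIp.trim();
  // The first XFF entry is client-prependable; the last entry was added by
  // the nearest (trusted) proxy hop, so prefer it over the first.
  const xff = headers.get("x-forwarded-for");
  if (xff) {
    const hops = xff.split(",");
    return (hops[hops.length - 1] ?? "").trim() || "unknown";
  }
  return "unknown";
}
```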



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@apps/web/lib/bot-detection.ts`:
- Around line 12-13: The SUSPICIOUS_PATHS regex currently includes
/.well-known/security.txt which blocks a standardized disclosure location;
update the SUSPICIOUS_PATHS constant (the regex assigned to SUSPICIOUS_PATHS) to
remove the \.well-known\/security\.txt alternative, or add an explicit
allow-list check for the path before applying SUSPICIOUS_PATHS (e.g., explicitly
permit the exact path "/.well-known/security.txt" in the request-path handling
code) so security.txt is not treated as suspicious.
🧹 Nitpick comments (4)
apps/web/lib/bot-detection.ts (3)

23-46: Good-bot classification takes priority over suspicious-path check — spoofed UAs bypass path blocking.

If a request carries a spoofed Googlebot UA and hits /wp-admin, it will be classified as "good" and pass through without the suspicious-path check ever running. This is a common trade-off, but worth noting. To harden this, you could check suspicious paths before (or independently of) the good-bot match, or add reverse-DNS verification for search engine bots at a later stage.


48-56: Fallback IP "unknown" collapses all unidentified clients into one rate-limit bucket.

If none of the IP headers are present, every request maps to the same rate-limit key (unknown:<category>), causing all such clients to share a single quota. On platforms like Vercel or Cloudflare this is unlikely, but if this ever runs behind a different reverse proxy that doesn't set these headers, legitimate users will be collectively throttled.

Consider logging a warning or using request.ip (available in Next.js middleware) as an additional fallback.


58-78: In-memory rate limiter and setInterval are ineffective in serverless/edge environments.

Two concerns here:

  1. Ephemeral state: The Map is local to each invocation's memory. In serverless environments (e.g., Vercel), each cold start creates a fresh Map, so rate limits won't persist across invocations. The rate limiter will only be effective for warm instances that handle multiple requests.

  2. Top-level setInterval: In a long-running Node.js server this works fine, but in serverless/edge runtimes, the interval either never fires (short-lived isolates) or keeps the process reference alive unnecessarily. The comment on Line 58 says "Node.js runtime" but the proxy may run in edge context depending on deployment.

If the goal is best-effort rate limiting for warm instances, this is acceptable as-is. For stricter enforcement, consider an external store (e.g., Vercel KV, Upstash Redis) or at minimum document the limitation. You could also replace the setInterval with a lazy cleanup approach inside checkRateLimit itself (e.g., prune stale entries every N calls).
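The lazy-cleanup variant suggested above could look like this fixed-window sketch (names, the 60-second window, and the every-100-calls prune cadence are all assumptions, not the PR's code):

```typescript
// Fixed-window in-memory limiter with lazy cleanup: stale buckets are
// pruned inside checkRateLimit itself (every 100th call) instead of via
// a top-level setInterval, so nothing keeps a serverless isolate alive.
const WINDOW_MS = 60_000;
const buckets = new Map<string, { count: number; resetAt: number }>();
let callCount = 0;

function checkRateLimit(key: string, limit: number, now = Date.now()): boolean {
  if (++callCount % 100 === 0) {
    for (const [k, b] of buckets) {
      if (b.resetAt <= now) buckets.delete(k); // prune expired windows
    }
  }
  const bucket = buckets.get(key);
  if (!bucket || bucket.resetAt <= now) {
    buckets.set(key, { count: 1, resetAt: now + WINDOW_MS });
    return true; // first request in a fresh window
  }
  bucket.count += 1;
  return bucket.count <= limit;
}
```

Note this remains best-effort per warm instance; the ephemeral-state caveat above still applies without an external store.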

apps/web/proxy.ts (1)

24-37: X-Blocked-Reason header discloses blocking rationale to the client.

This is helpful for debugging but also informs attackers why they were blocked, making it easier to adjust their approach (e.g., switching UA, avoiding probe paths). Consider removing these headers in production or restricting them to internal/debug builds.

Comment on lines +12 to +13

```ts
const SUSPICIOUS_PATHS =
	/^\/(wp-admin|wp-login|wp-content|wp-includes|\.env|\.git|phpmyadmin|phpinfo|administrator|cgi-bin|\.aws|\.well-known\/security\.txt)/i;
```

⚠️ Potential issue | 🟡 Minor

Blocking /.well-known/security.txt may be counterproductive.

/.well-known/security.txt is a standardized path (RFC 9116) for responsible vulnerability disclosure. Blocking it prevents security researchers from finding your security contact information. Consider removing it from the suspicious-paths pattern.

Suggested fix

```diff
 const SUSPICIOUS_PATHS =
-	/^\/(wp-admin|wp-login|wp-content|wp-includes|\.env|\.git|phpmyadmin|phpinfo|administrator|cgi-bin|\.aws|\.well-known\/security\.txt)/i;
+	/^\/(wp-admin|wp-login|wp-content|wp-includes|\.env|\.git|phpmyadmin|phpinfo|administrator|cgi-bin|\.aws)/i;
```