feat: bot mitigation proxy & health endpoint #248
Block abusive crawlers hitting GET requests (DotBot, GPTBot, etc.) and rate-limit per IP to prevent Plausible 429s. Add a `/api/health` edge endpoint for uptime monitoring without SSR overhead. Fix Sentry module resolution by hoisting `@sentry/*` packages in `.npmrc`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

This pull request introduces a health check API endpoint, a bot detection and rate limiting utility module, and a proxy middleware that integrates these components to filter requests based on bot classification and rate-limit thresholds by IP and route category.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Proxy
    participant BotDetection
    participant RateLimiter
    participant NextApp as Next.js App

    Client->>Proxy: HTTP Request
    Proxy->>BotDetection: detectBot(userAgent, pathname)
    BotDetection-->>Proxy: BotDetectionResult
    alt Bad Bot Detected
        Proxy-->>Client: 403 Forbidden
    else Good Bot or Human
        Proxy->>Proxy: Check if prefetch request
        alt Prefetch Request
            Proxy-->>NextApp: NextResponse.next()
            NextApp-->>Client: Response
        else Non-prefetch
            Proxy->>BotDetection: getClientIP(request)
            BotDetection-->>Proxy: IP Address
            Proxy->>BotDetection: getRouteCategory(pathname)
            BotDetection-->>Proxy: Category & Limit
            Proxy->>RateLimiter: checkRateLimit(key, limit)
            alt Rate Limit Exceeded
                RateLimiter-->>Proxy: false
                Proxy-->>Client: 429 Too Many Requests
            else Rate Limit OK
                RateLimiter-->>Proxy: true
                Proxy-->>NextApp: NextResponse.next()
                NextApp-->>Client: Response
            end
        end
    end
```
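The decision flow in the diagram can be sketched as self-contained TypeScript. The helpers below (`detectBot`, `getRouteCategory`, `checkRateLimit`, and the per-category limits) are simplified stand-ins for the PR's actual modules, not their real implementations; only the control flow mirrors the diagram.

```typescript
type Decision = { status: 200 | 403 | 429 };

// Per-key request counters (stand-in for the real fixed-window limiter).
const counters = new Map<string, number>();

// Stand-in classifier: a few signatures only, for illustration.
function detectBot(ua: string): "bad" | "good" | "none" {
  if (/dotbot|gptbot|headlesschrome/i.test(ua)) return "bad";
  if (/googlebot|bingbot/i.test(ua)) return "good";
  return "none";
}

// Assumed illustrative limits; the PR's real thresholds differ.
function getRouteCategory(pathname: string): { category: string; limit: number } {
  if (pathname.startsWith("/api/")) return { category: "api", limit: 2 };
  return { category: "page", limit: 5 };
}

function checkRateLimit(key: string, limit: number): boolean {
  const count = (counters.get(key) ?? 0) + 1;
  counters.set(key, count);
  return count <= limit;
}

function handle(ua: string, ip: string, pathname: string, prefetch = false): Decision {
  if (detectBot(ua) === "bad") return { status: 403 }; // 403 Forbidden
  if (prefetch) return { status: 200 };                // prefetches skip limiting
  const { category, limit } = getRouteCategory(pathname);
  if (!checkRateLimit(`${ip}:${category}`, limit)) return { status: 429 };
  return { status: 200 };
}
```

Good bots and humans take the same path here; only bad-bot classification short-circuits before the rate limiter, matching the diagram's fast 403 branch.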
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00558730eb
```ts
const goodMatch = userAgent.match(GOOD_BOTS);
if (goodMatch) {
  return { isBot: true, botType: "good", botName: goodMatch[0] };
}
```
There was a problem hiding this comment.
Check bad patterns before returning good-bot classification
`detectBot` returns as soon as `GOOD_BOTS` matches, so a user-agent like `Googlebot HeadlessChrome` is classified as good and then bypasses all enforcement via the proxy fast-path. Because `BAD_BOTS` is evaluated later, any mixed UA containing an allowlisted token can evade the intended bot/scanner blocking; evaluate bad signatures first (or treat mixed matches as bad) to avoid this bypass.
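A minimal sketch of the suggested ordering, with placeholder regexes standing in for the module's real `GOOD_BOTS` / `BAD_BOTS` lists:

```typescript
// Placeholder signature lists; the PR's actual patterns are larger.
const BAD_BOTS = /headlesschrome|python-requests|dotbot|gptbot/i;
const GOOD_BOTS = /googlebot|bingbot|duckduckbot/i;

type BotResult = { isBot: boolean; botType?: "good" | "bad"; botName?: string };

function detectBot(userAgent: string): BotResult {
  // Evaluate bad signatures first so a mixed UA such as
  // "Googlebot HeadlessChrome" cannot ride the allowlist fast-path.
  const badMatch = userAgent.match(BAD_BOTS);
  if (badMatch) {
    return { isBot: true, botType: "bad", botName: badMatch[0] };
  }
  const goodMatch = userAgent.match(GOOD_BOTS);
  if (goodMatch) {
    return { isBot: true, botType: "good", botName: goodMatch[0] };
  }
  return { isBot: false };
}
```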
```ts
return (
  forwardedFor?.split(",")[0]?.trim() || realIp || cfConnectingIp || "unknown"
);
```
Derive rate-limit IP from trusted proxy headers
The rate-limit key prioritizes the first value in `x-forwarded-for`, which is often client-controllable or prependable in many proxy setups; an attacker can vary that value per request to bypass per-IP throttling. Use a trusted ingress-provided IP source (or only parse `x-forwarded-for` at a known trusted hop boundary) before building the limiter key.
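One way to apply this: walk `x-forwarded-for` from the right, skipping the hops your own infrastructure appended. This is a sketch under assumptions; `TRUSTED_PROXY_COUNT` is a hypothetical deployment-specific constant, and a plain `Map` stands in for the framework's header object.

```typescript
// Number of proxies under our control that append to x-forwarded-for.
// Assumed value for illustration; set per deployment.
const TRUSTED_PROXY_COUNT = 1;

function getClientIP(headers: Map<string, string>): string {
  // Ingress-set headers (e.g. cf-connecting-ip on Cloudflare) are written
  // by the edge itself and are preferable when the platform provides them.
  const cfIp = headers.get("cf-connecting-ip");
  if (cfIp) return cfIp;

  const forwarded = headers.get("x-forwarded-for");
  if (!forwarded) return "unknown";

  const hops = forwarded.split(",").map((h) => h.trim()).filter(Boolean);
  // The last TRUSTED_PROXY_COUNT entries were appended by trusted hops;
  // the entry just before them is the closest address we can rely on.
  const idx = Math.max(hops.length - 1 - TRUSTED_PROXY_COUNT, 0);
  return hops[idx] ?? "unknown";
}
```

Anything the client prepends ends up further left and is never selected, so varying it no longer rotates the limiter key.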
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@apps/web/lib/bot-detection.ts`:
- Around line 12-13: The SUSPICIOUS_PATHS regex currently includes
/.well-known/security.txt which blocks a standardized disclosure location;
update the SUSPICIOUS_PATHS constant (the regex assigned to SUSPICIOUS_PATHS) to
remove the \.well-known\/security\.txt alternative, or add an explicit
allow-list check for the path before applying SUSPICIOUS_PATHS (e.g., explicitly
permit the exact path "/.well-known/security.txt" in the request-path handling
code) so security.txt is not treated as suspicious.
🧹 Nitpick comments (4)
apps/web/lib/bot-detection.ts (3)
23-46: Good-bot classification takes priority over suspicious-path check — spoofed UAs bypass path blocking.

If a request carries a spoofed `Googlebot` UA and hits `/wp-admin`, it will be classified as `"good"` and pass through without the suspicious-path check ever running. This is a common trade-off, but worth noting. To harden this, you could check suspicious paths before (or independently of) the good-bot match, or add reverse-DNS verification for search engine bots at a later stage.
48-56: Fallback IP `"unknown"` collapses all unidentified clients into one rate-limit bucket.

If none of the IP headers are present, every request maps to the same rate-limit key (`unknown:<category>`), causing all such clients to share a single quota. On platforms like Vercel or Cloudflare this is unlikely, but if this ever runs behind a different reverse proxy that doesn't set these headers, legitimate users will be collectively throttled. Consider logging a warning or using `request.ip` (available in Next.js middleware) as an additional fallback.
58-78: In-memory rate limiter and `setInterval` are ineffective in serverless/edge environments.

Two concerns here:

1. Ephemeral state: the `Map` is local to each invocation's memory. In serverless environments (e.g., Vercel), each cold start creates a fresh `Map`, so rate limits won't persist across invocations. The rate limiter will only be effective for warm instances that handle multiple requests.
2. Top-level `setInterval`: in a long-running Node.js server this works fine, but in serverless/edge runtimes the interval either never fires (short-lived isolates) or keeps the process reference alive unnecessarily. The comment on line 58 says "Node.js runtime" but the proxy may run in edge context depending on deployment.

If the goal is best-effort rate limiting for warm instances, this is acceptable as-is. For stricter enforcement, consider an external store (e.g., Vercel KV, Upstash Redis) or at minimum document the limitation. You could also replace the `setInterval` with a lazy cleanup approach inside `checkRateLimit` itself (e.g., prune stale entries every N calls).

apps/web/proxy.ts (1)

24-37: `X-Blocked-Reason` header discloses blocking rationale to the client.

This is helpful for debugging but also informs attackers why they were blocked, making it easier to adjust their approach (e.g., switching UA, avoiding probe paths). Consider removing these headers in production or restricting them to internal/debug builds.
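The lazy-cleanup variant suggested above could look like the following sketch. The names (`checkRateLimit`, the window/cleanup constants) are assumptions mirroring the review's description, not the PR's actual code.

```typescript
// Fixed-window limiter that prunes stale buckets lazily inside the hot
// path instead of via a top-level setInterval.
const WINDOW_MS = 60_000;     // assumed window length
const CLEANUP_EVERY = 100;    // prune stale entries every N calls

const buckets = new Map<string, { count: number; resetAt: number }>();
let calls = 0;

function checkRateLimit(key: string, limit: number, now = Date.now()): boolean {
  // Amortized cleanup: every CLEANUP_EVERY calls, drop expired buckets so
  // the Map cannot grow unboundedly on a warm instance.
  if (++calls % CLEANUP_EVERY === 0) {
    for (const [k, b] of buckets) {
      if (b.resetAt <= now) buckets.delete(k);
    }
  }
  const bucket = buckets.get(key);
  if (!bucket || bucket.resetAt <= now) {
    buckets.set(key, { count: 1, resetAt: now + WINDOW_MS });
    return true;
  }
  bucket.count++;
  return bucket.count <= limit;
}
```

This keeps no timers alive, so it behaves identically in long-running Node and in short-lived isolates; the caveat about state not surviving cold starts still applies.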
```ts
const SUSPICIOUS_PATHS =
  /^\/(wp-admin|wp-login|wp-content|wp-includes|\.env|\.git|phpmyadmin|phpinfo|administrator|cgi-bin|\.aws|\.well-known\/security\.txt)/i;
```
Blocking /.well-known/security.txt may be counterproductive.
/.well-known/security.txt is a standardized path (RFC 9116) for responsible vulnerability disclosure. Blocking it prevents security researchers from finding your security contact information. Consider removing it from the suspicious-paths pattern.
Suggested fix

```diff
 const SUSPICIOUS_PATHS =
-  /^\/(wp-admin|wp-login|wp-content|wp-includes|\.env|\.git|phpmyadmin|phpinfo|administrator|cgi-bin|\.aws|\.well-known\/security\.txt)/i;
+  /^\/(wp-admin|wp-login|wp-content|wp-includes|\.env|\.git|phpmyadmin|phpinfo|administrator|cgi-bin|\.aws)/i;
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
const SUSPICIOUS_PATHS =
  /^\/(wp-admin|wp-login|wp-content|wp-includes|\.env|\.git|phpmyadmin|phpinfo|administrator|cgi-bin|\.aws)/i;
```
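The review's alternative (keep the pattern, but allow the RFC 9116 disclosure path explicitly before applying it) could be sketched like this; `isSuspiciousPath` is a hypothetical helper name, not the PR's:

```typescript
const SUSPICIOUS_PATHS =
  /^\/(wp-admin|wp-login|wp-content|wp-includes|\.env|\.git|phpmyadmin|phpinfo|administrator|cgi-bin|\.aws|\.well-known\/security\.txt)/i;

function isSuspiciousPath(pathname: string): boolean {
  // Explicit allow for the standardized security-contact location
  // (RFC 9116), checked before the blocklist regex runs.
  if (pathname === "/.well-known/security.txt") return false;
  return SUSPICIOUS_PATHS.test(pathname);
}
```

This keeps the blocklist intact while guaranteeing the exact allow-listed path is never classified as a probe.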
Summary
- Bot mitigation proxy (`proxy.ts`) blocks abusive bots (DotBot, GPTBot, AI scrapers, headless browsers, vulnerability scanners) and rate-limits requests per IP + route category
- Health endpoint (`/api/health`) runs on Edge runtime for zero-cold-start uptime monitoring — replaces polling `GET /` which triggered full SSR
- Hoisted `@sentry/*` packages in `.npmrc` to resolve `@sentry/node-core` module-not-found errors that caused 500s on all SSR pages

Rate limits
Per-IP limits are applied by route category, including `/proxy/api/event`, `/api/event`, and `/api/*`.

Bot handling
Bad bots receive 403; known vulnerability-scan paths (`/wp-admin`, `/.env`) are blocked.

Test plan
- `curl /api/health` → 200 `{"status":"ok"}`
- `curl -H "User-Agent: DotBot" /` → 403
- `curl -H "User-Agent: GPTBot" /` → 403
- `curl -H "User-Agent: Googlebot" /` → 200
- `curl -H "User-Agent: " /` → 403 (empty UA)
- `curl /wp-admin` → 403 (vuln scan path)

🤖 Generated with Claude Code
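The `/api/health` behavior exercised by the first curl above can be sketched as an Edge route handler. The file location (e.g. `app/api/health/route.ts` in an App Router layout) is an assumption; `export const runtime = "edge"` is the standard Next.js opt-in for the Edge runtime.

```typescript
// Minimal Edge health route: no SSR, no dependencies, JSON body only.
export const runtime = "edge";

export function GET(): Response {
  // Response.json is the standard Fetch API helper; it sets a 200 status
  // and an application/json content type.
  return Response.json({ status: "ok" });
}
```

Because the handler touches no app code, an uptime monitor polling it never warms the SSR path, which is the point of replacing `GET /` polling.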