Web-DL mirrors a website with wget, compresses it with archiver, and streams live progress back to your browser over a Socket.IO channel — then hands you a ready-to-download .zip.
- Safe by design: `wget` is launched with `spawn()` and an argument array (never a shell string), so a URL can't be interpreted as a command.
- SSRF protection: the server DNS-resolves each host and refuses private, loopback, link-local and cloud-metadata addresses.
- Tunable downloads: set crawl depth, include/exclude file types, a size quota, wait between requests, page requisites, and whether to follow external links.
- Cancel anytime: stop a running job; the `wget` process is killed and partial files are removed.
- Concurrency control: a configurable cap with a queue keeps the server from being overwhelmed.
- Live progress: real-time progress bar, file count, current file and downloaded size.
- Download history: list, re-download or delete previously generated zips.
- Auto-cleanup: old zips are swept on an interval so disk usage stays bounded.
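
The concurrency cap with a queue mentioned in the list above can be pictured with a small promise-based sketch. This is not the project's `lib/jobQueue.js`: the `JobQueue` class and its `add` method are hypothetical names chosen for illustration, and the cap is assumed to come from a setting such as `MAX_CONCURRENT_DOWNLOADS`.

```javascript
// Illustrative concurrency-limited queue (hypothetical; not Web-DL's actual code).
class JobQueue {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.running = 0; // jobs currently executing
    this.waiting = []; // jobs that arrived while the cap was reached
  }

  // `job` is a function returning a promise. The promise returned by add()
  // settles with the job's own result or error.
  add(job) {
    return new Promise((resolve, reject) => {
      this.waiting.push({ job, resolve, reject });
      this._drain();
    });
  }

  // Start waiting jobs until the cap is reached or the queue is empty.
  _drain() {
    while (this.running < this.maxConcurrent && this.waiting.length > 0) {
      const { job, resolve, reject } = this.waiting.shift();
      this.running += 1;
      Promise.resolve()
        .then(job)
        .then(resolve, reject)
        .finally(() => {
          this.running -= 1;
          this._drain(); // a slot freed up: start the next waiting job
        });
    }
  }
}
```

With a cap of 2, a third job submitted while two are still running stays in `waiting` until one of them settles.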
Important: Web-DL shells out to wget. Make sure Node.js 18+ and wget are installed and on your PATH.
git clone https://github.com/nooblk-98/Web-DL.git
cd Web-DL
npm install
npm start

Then open http://localhost:3000, paste a URL, tweak the options, and download.
Every download is built from a fixed set of base wget flags:
wget --mirror --convert-links --adjust-extension --page-requisites --no-if-modified-since <url>

| Flag | Why |
|---|---|
| `--mirror` | recursive download of the whole site |
| `--convert-links` | rewrite links (incl. CSS) to relative paths for offline viewing |
| `--adjust-extension` | add .html / .css extensions based on content-type |
| `--page-requisites` | fetch the CSS, JS and images needed to render each page |
| `--no-if-modified-since` | always fetch resources instead of relying on conditional requests |

User-supplied options (depth, filters, --no-parent, etc.) are layered on top through a strict allowlist — raw flags from the client are never accepted.
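
As a rough sketch of how an allowlist can sit on top of the base flags, the function below accepts only a few known option keys and maps each to one real wget flag (`--level`, `--wait`, `--no-parent`). The option names `depth`, `waitSeconds` and `noParent`, and the helpers `buildWgetArgs` and `startMirror`, are assumptions for illustration; the project's `lib/wgetArgs.js` may be organised differently.

```javascript
const { spawn } = require('node:child_process');

// Base flags quoted in the table above.
const BASE_FLAGS = [
  '--mirror',
  '--convert-links',
  '--adjust-extension',
  '--page-requisites',
  '--no-if-modified-since',
];

// Build the argument array from a URL and a plain options object.
// Only the keys checked below are honoured; anything else is ignored,
// so a client cannot smuggle in a raw flag.
function buildWgetArgs(url, options = {}) {
  const parsed = new URL(url); // throws on malformed input
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  const args = [...BASE_FLAGS];
  if (Number.isInteger(options.depth) && options.depth >= 0) {
    args.push(`--level=${options.depth}`);
  }
  if (Number.isInteger(options.waitSeconds) && options.waitSeconds >= 0) {
    args.push(`--wait=${options.waitSeconds}`);
  }
  if (options.noParent === true) {
    args.push('--no-parent');
  }
  args.push(parsed.href);
  return args;
}

// Launch wget with an argument array; no shell is involved,
// so the URL is passed to wget as a single argument.
function startMirror(url, options, cwd) {
  return spawn('wget', buildWgetArgs(url, options), { cwd });
}
```

Because each array element reaches wget as one argument, shell metacharacters in the URL have no special meaning.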
The request flows through the server like this:
Browser ──Socket.IO──▶ socket/socket.js ──▶ lib/jobQueue.js ──▶ wget/index.js (spawn wget)
▲ │
│ live progress, file counts, status ▼
└──────────────────── archiver/index.js ◀──── mirrored site folder (downloads/)
│
▼
public/sites/<host>.zip ──▶ served via express.static + /api/history
- The client submits a URL and options over Socket.IO.
- `lib/urlGuard.js` validates the URL and blocks SSRF targets; `lib/wgetArgs.js` builds a safe argument array.
- `lib/jobQueue.js` enforces the concurrency cap; `wget/index.js` spawns `wget` and streams progress.
- `archiver/index.js` zips the mirrored folder into `public/sites/`, the temp mirror is removed, and the download link is sent back.
- `lib/cleanup.js` periodically deletes zips older than the configured TTL.
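
The SSRF step in the flow above amounts to resolving the host and rejecting addresses in reserved ranges. The sketch below shows one way to do that with Node's `dns` and `net` modules. The function names are hypothetical and the range list is deliberately short, so treat it as an outline rather than the contents of `lib/urlGuard.js`.

```javascript
const dns = require('node:dns').promises;
const net = require('node:net');

// IPv4: "this network", private, loopback and link-local ranges.
// 169.254.169.254, a common cloud-metadata address, is inside link-local.
function isBlockedIPv4(ip) {
  const [a, b] = ip.split('.').map(Number);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// IPv6: loopback, unspecified, unique-local (fc00::/7), link-local (fe80::/10)
// and IPv4-mapped addresses, which are checked against the IPv4 rules.
function isBlockedIPv6(ip) {
  const lower = ip.toLowerCase();
  if (lower === '::1' || lower === '::') return true;
  if (lower.startsWith('::ffff:')) {
    const v4 = lower.slice(7);
    return net.isIPv4(v4) ? isBlockedIPv4(v4) : true; // hex form: refuse
  }
  return /^f[cd][0-9a-f]{2}:/.test(lower) || /^fe[89ab][0-9a-f]:/.test(lower);
}

function isBlockedAddress(ip) {
  if (net.isIPv4(ip)) return isBlockedIPv4(ip);
  if (net.isIPv6(ip)) return isBlockedIPv6(ip);
  return true; // not an IP address at all: refuse
}

// Resolve every address for the host and reject if any of them is blocked.
async function assertPublicHost(hostname) {
  const records = await dns.lookup(hostname, { all: true });
  for (const { address } of records) {
    if (isBlockedAddress(address)) {
      throw new Error(`Refusing ${hostname}: resolves to ${address}`);
    }
  }
}
```

Checking every resolved address, not only the first, avoids accepting a host that publishes both a public and an internal record.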
Every setting is optional and falls back to a sensible default. Set them via environment variables:
| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | HTTP port |
| `MAX_CONCURRENT_DOWNLOADS` | `3` | Max simultaneous wget jobs |
| `ZIP_TTL_MS` | `86400000` (24h) | Age after which generated zips are deleted |
| `CLEANUP_INTERVAL_MS` | `3600000` (1h) | How often the cleanup sweep runs |
| `DOWNLOAD_ROOT` | `./downloads` | Working directory for site mirrors |
| `MAX_DEPTH` | `10` | Upper bound for user-supplied crawl depth |
| `MAX_QUOTA_MB` | `2048` | Upper bound for user-supplied max download size |
| `MAX_WAIT_SECONDS` | `30` | Upper bound for the wait-between-requests option |
| `ALLOW_PRIVATE_HOSTS` | `false` | Set `true` to permit localhost/private hosts (local testing only) |

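
To make the "default plus upper bound" behaviour concrete, here is a minimal sketch of reading these variables and clamping a user-supplied value against a limit. The helpers `loadConfig`, `readInt` and `clamp` are hypothetical names; the real `config/` module may parse and validate differently.

```javascript
// Numeric defaults copied from the table above.
const DEFAULTS = {
  PORT: 3000,
  MAX_CONCURRENT_DOWNLOADS: 3,
  ZIP_TTL_MS: 86400000,
  CLEANUP_INTERVAL_MS: 3600000,
  MAX_DEPTH: 10,
  MAX_QUOTA_MB: 2048,
  MAX_WAIT_SECONDS: 30,
};

// Parse a non-negative integer, falling back to the default when the
// variable is unset or not a valid number.
function readInt(env, name) {
  const parsed = Number.parseInt(env[name], 10);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : DEFAULTS[name];
}

function loadConfig(env = process.env) {
  const config = {};
  for (const name of Object.keys(DEFAULTS)) {
    config[name] = readInt(env, name);
  }
  config.DOWNLOAD_ROOT = env.DOWNLOAD_ROOT || './downloads';
  config.ALLOW_PRIVATE_HOSTS = env.ALLOW_PRIVATE_HOSTS === 'true';
  return config;
}

// Clamp a user-supplied value into [0, max]; non-numeric input yields max.
function clamp(value, max) {
  const n = Number(value);
  return Number.isFinite(n) ? Math.min(Math.max(Math.trunc(n), 0), max) : max;
}
```

For example, with the default `MAX_DEPTH` of 10, a client asking for depth 50 would be limited to 10 by `clamp(50, config.MAX_DEPTH)`.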
Beyond the Socket.IO download channel, a small REST API manages generated zips:
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/history` | List generated zips (name, size, modified, url) |
| `DELETE` | `/api/history/:name` | Delete one zip (path-traversal guarded) |
| `GET` | `/sites/:name.zip` | Re-download a zip (served via express.static) |

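
A path-traversal guard for the `DELETE` endpoint could look roughly like the following. It is a sketch under assumptions: `resolveZipPath` is a hypothetical helper, and the actual route may apply different or additional checks.

```javascript
const path = require('node:path');

// Resolve a client-supplied zip name inside `sitesDir`.
// Returns the absolute path, or null when the name is unsafe.
function resolveZipPath(sitesDir, name) {
  if (typeof name !== 'string' || !name.endsWith('.zip')) return null;
  // The name must be a bare file name: no directory parts, no NUL bytes.
  if (name !== path.basename(name) || name.includes('\\') || name.includes('\0')) {
    return null;
  }
  const root = path.resolve(sitesDir);
  const full = path.resolve(root, name);
  // Final containment check: the result must still live under the root.
  return full.startsWith(root + path.sep) ? full : null;
}
```

A route handler would respond with a 4xx status when this returns `null` and only call `fs.unlink` on a non-null result.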
| Command | Description |
|---|---|
| `npm start` | Run the server |
| `npm run dev` | Run with `NODE_ENV=development` |
| `npm test` | Run the Jest unit tests |
| `npm run lint` / `npm run lint:fix` | ESLint |
| `npm run format` | Prettier |

app.js Express app wiring (routes, views, error handling)
bin/www Server entry point (HTTP + Socket.IO + cleanup)
routes/ HTTP routes (index, users, history API)
socket/ Socket.IO download orchestration
lib/ Core logic: urlGuard, wgetArgs, jobQueue, activeJobs, sites, cleanup
wget/ wget process spawning + progress parsing
archiver/ Zips a mirrored site folder
config/ Central config, limits and constants
views/ Handlebars templates
public/ Static assets + generated zips (public/sites)
__tests__/ Jest unit tests
Warning: wget re-resolves DNS and follows redirects itself, so a public host that redirects to an internal address could still be reached. Redirects are capped via --max-redirect; for hardened deployments, also run the server in a network-restricted environment.
Web-DL is built on top of the original Website-downloader by Ahmad Ibrahim — many thanks for the original work this project is based on.
