faeller/pagespider

pagespider

browser extension inspired by httrack that crawls websites and saves pages as self-contained html files for offline viewing.


features

  • saves pages with assets inlined as base64 (images, css, fonts) or as separate files
  • by default rewrites links between pages for offline navigation (like httrack)
  • static mode strips javascript for clean archives of js-heavy sites (sveltekit, next.js, etc.)
  • url filtering with contains/path-starts/regex
  • screenshots via native scroll-stitch or html2canvas
  • two storage backends: browser downloads or filesystem (via dashboard)
  • crawl queue is persisted in indexeddb, so interrupted crawls survive browser crashes
  • cross-browser: works in chrome and firefox
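
the inline-assets mode boils down to embedding each fetched asset as a base64 `data:` uri inside the saved html. a minimal sketch of that idea (function names and the regex approach are illustrative, not pagespider's actual code, which presumably works on the parsed dom):

```javascript
// Turn raw asset bytes into a data: URI so the saved html is self-contained.
// Buffer is the node.js way; in a browser extension you'd use btoa/FileReader.
function toDataUri(mimeType, bytes) {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Rewrite every <img src> in an html string to an inlined data: URI.
// `fetchBytes(src)` is a hypothetical stand-in returning { mime, bytes }.
function inlineImages(html, fetchBytes) {
  return html.replace(/<img([^>]*?)src="([^"]+)"/g, (match, attrs, src) => {
    const { mime, bytes } = fetchBytes(src);
    return `<img${attrs}src="${toDataUri(mime, bytes)}"`;
  });
}
```

the same transform applies to css `url(...)` references and font files; only the regex differs.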

install

  1. clone repo
  2. open chrome://extensions (or about:addons in firefox)
  3. enable developer mode
  4. click "load unpacked" and select the pagespider/ folder

usage

  1. click extension icon
  2. enter url or click "use current tab"
  3. set depth (how many links deep to crawl)
  4. optionally set url filter to limit scope
  5. choose a storage method (the filesystem backend requires opening the dashboard first)
  6. start crawl
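
conceptually, the depth setting bounds a breadth-first walk over discovered links. a rough sketch of that loop, where `extractLinks` is a hypothetical stand-in for fetching a page and pulling out its `<a href>` targets:

```javascript
// Breadth-first crawl bounded by depth: depth 0 saves only the start page,
// depth -1 means unlimited. `extractLinks(url)` is a stand-in for the
// extension's real fetch-and-parse step.
function crawl(startUrl, maxDepth, extractLinks) {
  const visited = new Set();
  const queue = [{ url: startUrl, depth: 0 }];
  const saved = [];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    saved.push(url); // in the extension: fetch, inline assets, store
    if (maxDepth !== -1 && depth >= maxDepth) continue; // depth budget spent
    for (const link of extractLinks(url)) {
      if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
    }
  }
  return saved;
}
```

the persistent indexeddb queue mentioned under features would replace the in-memory array here.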

storage methods

downloads. saves to the browser's downloads folder. simple, but without a real folder structure.

filesystem. saves to a folder you choose, with a proper directory structure. requires keeping the dashboard tab open during the crawl.
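
the "proper structure" amounts to mapping each url onto a relative file path, httrack-style. one plausible mapping (an assumption for illustration, not necessarily pagespider's exact scheme):

```javascript
// Map a url to a relative file path: host/path/..., treating directory
// urls as index.html and extensionless pages as .html. Illustrative only.
function urlToLocalPath(url) {
  const u = new URL(url);
  let path = u.pathname;
  if (path.endsWith("/")) path += "index.html";
  else if (!/\.[a-z0-9]+$/i.test(path)) path += ".html";
  return u.hostname + path;
}
```

with a scheme like this, rewriting a link between two saved pages is just computing the relative path between their two mapped locations.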

options

| option | description |
| --- | --- |
| url filter | only crawl urls matching the pattern |
| depth | how many links deep (0 = single page, -1 = unlimited) |
| delay | milliseconds between requests |
| assets | inline (base64) or separate files |
| screenshots | native (scroll-stitch) and/or html2canvas |
| static mode | strip scripts for clean static pages |
| rewrite links | convert links to local files |
| same origin | stay on the same domain |
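
the url filter and same-origin options together decide whether a discovered link gets crawled. a sketch of that check, where the three filter modes mirror the contains/path-starts/regex choices above (the option names and shapes here are assumptions):

```javascript
// Test a url against the configured filter. A missing filter matches everything.
function matchesFilter(url, filter) {
  if (!filter) return true;
  switch (filter.mode) {
    case "contains":    return url.includes(filter.pattern);
    case "path-starts": return new URL(url).pathname.startsWith(filter.pattern);
    case "regex":       return new RegExp(filter.pattern).test(url);
    default:            return false;
  }
}

// Combine the same-origin restriction with the url filter.
function shouldCrawl(url, startUrl, { sameOrigin, filter }) {
  if (sameOrigin && new URL(url).origin !== new URL(startUrl).origin) return false;
  return matchesFilter(url, filter);
}
```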

limitations

  • buttons that navigate via js (e.g. `router.push()`) won't work in static mode - only `<a href>` links get rewritten
  • html2canvas may miss some css effects
  • native screenshots may duplicate fixed/sticky elements
