An open-source toolkit for stealthy web data extraction, boilerplate cleaning, structured storage, transformer embedding, and more, powered by the desync-search API.
- Crawling & Bulk Search: Grab web data using `DesyncClient`, with support for stealthy headful scraping.
- Boilerplate Removal: Strip repeated pages, headers, footers, and navbars from raw HTML or `.text_content`.
- Data Extraction (WIP): Find emails, phones, LinkedIns, and named entities from raw text or HTML.
- Embedding Pipelines: Chunk, tokenize, and run transformer inference (BERT, S-BERT).
- Link Graph Tools: Construct internal/external page link graphs for sitemaps or network analysis.
- Text Stats + Heuristics: Score content quality (word count, link ratio, etc.)
- Structured Output: Store clean results in CSV, JSON, or SQLite.
- Sentiment Analysis and Summarisation: Generate a concise summary and a sentiment score for each page you scrape.
- Link Graph Generation: Construct a graph of how the pages you scraped link to one another.
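To make the link-graph feature above more concrete, here is a minimal sketch using only the Python standard library. It is not the code in `parsers/link_graph.py`; the `build_link_graph` function and the `{url: raw_html}` input shape are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_link_graph(pages):
    """Map each page URL to its internal and external outgoing links.

    `pages` is a hypothetical dict of {url: raw_html}.
    """
    graph = {}
    for url, html in pages.items():
        collector = _LinkCollector()
        collector.feed(html)
        host = urlparse(url).netloc
        internal, external = set(), set()
        for href in collector.hrefs:
            target = urljoin(url, href)  # resolve relative links
            parsed = urlparse(target)
            if parsed.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            (internal if parsed.netloc == host else external).add(target)
        graph[url] = {"internal": sorted(internal), "external": sorted(external)}
    return graph


# Illustrative input only; real pages would come from your scrape results.
pages = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.org/x">X</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="mailto:hi@example.com">Mail</a>',
}
graph = build_link_graph(pages)
```

Links are classified by comparing hostnames, so subdomains count as external in this sketch; adjust the comparison if your crawl treats them as one site.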
DESYNC.AI_TOOLS/
│
├── basic_implementation/          # Core DesyncClient tools
│   ├── bulk_search.py
│   ├── crawl_search.py
│   ├── stealth_search.py
│   └── test_search.py
│
├── data_extraction/               # Contact info extraction (email, phone, LinkedIn)
│   ├── extract_contacts.py
│   ├── text_summarization.py
│   ├── sentiment_analyzer.py
│   └── named_entity_extractor.py
│
├── examples/                      # Example scripts
│   └── bulk_clean_and_save_csv.py
│
├── model_prep/                    # Transformer-based modeling pipeline
│   ├── chunk_text_blocks.py
│   ├── dataset_builder.py
│   ├── tokenizer_loader.py
│   ├── torch_loader.py
│   └── transformer_runner.py
│
├── parsers/                       # General-purpose HTML and graph tools
│   ├── html_parser.py
│   ├── language_detector.py
│   ├── link_graph.py
│   └── text_stats.py
│
├── result_cleaning/               # Cleaning and deduplication
│   ├── html_cleaning/
│   │   └── remove_boilerplate_html.py
│   ├── text_content_cleaning/
│   │   ├── remove_boilerplate_prefix.py
│   │   └── remove_boilerplate_suffix.py
│   ├── duplicate_page_remover.py
│   └── filter_by_url_substring.py
│
├── storage/                       # Save your search results
│   ├── csv_storage.py
│   ├── json_storage.py
│   └── sqlite_storage.py
│
├── output/                        # Your saved outputs (ignored by git)
│
├── .env                           # Credentials (not committed)
├── .gitignore
└── README.md
pip install desync_search

pip install desync-data

Create a .env file in the root folder:

DESYNC_API_KEY=your_api_key_here
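
As a rough illustration of how a script might read that key, the sketch below parses a `.env` file with the standard library. The helper names `load_env_file` and `get_api_key` are hypothetical, and how the desync-search client itself consumes the key is not specified here.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("DESYNC_API_KEY")
    if not key and Path(env_path).exists():
        key = load_env_file(env_path).get("DESYNC_API_KEY")
    if not key:
        raise RuntimeError("DESYNC_API_KEY is not set")
    return key


# Demo against a throwaway file so the real .env is never touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_env = Path(tmp) / ".env"
    demo_env.write_text("# demo\nDESYNC_API_KEY=your_api_key_here\n", encoding="utf-8")
    demo_values = load_env_file(demo_env)
```

In a real project, a library such as python-dotenv can replace the hand-rolled parser; the version above just avoids an extra dependency.
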
python examples/bulk_clean_and_save_csv.py

| Format | Path | Notes |
|---|---|---|
| CSV | storage/csv/ | For spreadsheets, Pandas, Excel |
| JSON | storage/json/ | Human-readable + flexible |
| SQLite | storage/sqlite/ | Lightweight relational database |
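
The three formats in the table can be sketched with the standard library alone. This is not the code in `storage/`; the `save_csv`, `save_json`, and `save_sqlite` helpers, the `url`/`text` record fields, and the `pages` table name are all assumptions for illustration.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical cleaned results; real records would come from your scrape.
records = [
    {"url": "https://example.com/", "text": "Home page text"},
    {"url": "https://example.com/about", "text": "About page text"},
]


def save_csv(rows, path):
    """Write rows as CSV with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["url", "text"])
        writer.writeheader()
        writer.writerows(rows)


def save_json(rows, path):
    """Write rows as a pretty-printed JSON array."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, indent=2, ensure_ascii=False)


def save_sqlite(rows, path):
    """Upsert rows into a `pages` table keyed by URL."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, text TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO pages (url, text) VALUES (:url, :text)", rows
            )
    finally:
        conn.close()


# Round-trip in a temporary directory to show each format reads back intact.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    save_csv(records, out / "pages.csv")
    save_json(records, out / "pages.json")
    save_sqlite(records, out / "pages.db")

    with open(out / "pages.csv", newline="", encoding="utf-8") as fh:
        csv_rows = list(csv.DictReader(fh))
    json_rows = json.loads((out / "pages.json").read_text(encoding="utf-8"))
    conn = sqlite3.connect(out / "pages.db")
    db_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
    conn.close()
```

Keying the SQLite table on `url` means re-running a scrape overwrites earlier rows instead of duplicating them, which is one reasonable choice among several.
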
- Jackson Ballow
- Mark Evgenev
- Maks Kubicki
MIT: use freely, improve freely, and credit where due.