Desync.AI Tools

An open-source toolkit for stealthy web data extraction, boilerplate cleaning, structured storage, transformer embedding, and more, powered by the desync-search API.


🔧 Features

  • Crawling & Bulk Search: Fetch web data with DesyncClient, including stealthy headful scraping (see the sketch after this list).
  • Boilerplate Removal: Strip duplicate pages and repeated headers, footers, and navbars from raw HTML or .text_content.
  • Data Extraction (WIP): Pull emails, phone numbers, LinkedIn profiles, and named entities from raw text or HTML.
  • Embedding Pipelines: Chunk, tokenize, and run transformer inference (BERT, S-BERT).
  • Link Graph Tools: Build graphs of how your scraped pages link to one another (internal and external) for sitemaps or network analysis.
  • Text Stats + Heuristics: Score content quality (word count, link ratio, etc.).
  • Structured Output: Store clean results in CSV, JSON, or SQLite.
  • Sentiment Analysis and Summarization: Produce concise summaries and sentiment scores for the pages you scrape.
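
For the crawling feature, a minimal sketch of a single stealth search is shown below. It assumes DESYNC_API_KEY is set in your environment and that DesyncClient exposes a search(url) method returning a page object with url and text_content attributes; check the scripts in basic_implementation/ for the exact calls.

# Minimal sketch of a single search with DesyncClient (see basic_implementation/
# for the maintained scripts). Assumes DESYNC_API_KEY is set in the environment
# and that search() returns a page object exposing url and text_content.
from desync_search import DesyncClient

client = DesyncClient()                      # assumption: API key is picked up from the environment
page = client.search("https://example.com")  # stealthy single-page search

print(page.url)
print(page.text_content[:500])               # first 500 characters of extracted text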

🗂 Directory Structure

DESYNC.AI_TOOLS/
│
├── basic_implementation/       # Core DesyncClient tools
│   ├── bulk_search.py
│   ├── crawl_search.py
│   ├── stealth_search.py
│   └── test_search.py
│
├── data_extraction/            # Contact info extraction (email, phone, LinkedIn)
│   ├── extract_contacts.py
│   ├── text_summarization.py
│   ├── sentiment_analyzer.py
│   └── named_entity_extractor.py
│
├── examples/                   # Example scripts
│   └── bulk_clean_and_save_csv.py
│
├── model_prep/                 # Transformer-based modeling pipeline
│   ├── chunk_text_blocks.py
│   ├── dataset_builder.py
│   ├── tokenizer_loader.py
│   ├── torch_loader.py
│   └── transformer_runner.py
│
├── parsers/                    # General-purpose HTML and graph tools
│   ├── html_parser.py
│   ├── language_detector.py
│   ├── link_graph.py
│   └── text_stats.py
│
├── result_cleaning/            # Cleaning and deduplication
│   ├── html_cleaning/
│   │   └── remove_boilerplate_html.py
│   ├── text_content_cleaning/
│   │   ├── remove_boilerplate_prefix.py
│   │   └── remove_boilerplate_suffix.py
│   ├── duplicate_page_remover.py
│   └── filter_by_url_substring.py
│
├── storage/                    # Save your search results
│   ├── csv_storage.py
│   ├── json_storage.py
│   └── sqlite_storage.py
│
├── output/                     # Your saved outputs (ignored by git)
│
├── .env                        # Credentials (not committed)
├── .gitignore
└── README.md

🚀 Quick Start

1. Install dependencies

pip install desync_search
pip install desync-data

2. Add your API key

Create a .env file in the root folder:

DESYNC_API_KEY=your_api_key_here
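
If you prefer to load the key in Python rather than exporting it in your shell, a small sketch using python-dotenv (an extra dependency, not part of the install step above) could look like this:

# Sketch only: read DESYNC_API_KEY from .env with python-dotenv
# (pip install python-dotenv) and hand it to the client.
import os
from dotenv import load_dotenv
from desync_search import DesyncClient

load_dotenv()                           # loads variables from .env into the process environment
api_key = os.environ["DESYNC_API_KEY"]
client = DesyncClient(api_key)          # assumption: the client also accepts the key directly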

3. Run an example

python examples/bulk_clean_and_save_csv.py
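
The example script is self-contained; the rough shape of a bulk-clean-and-save pipeline, sketched here with the standard csv module rather than the exact helpers in result_cleaning/ and storage/, is:

# Rough sketch of a search-then-save pipeline, not the actual
# examples/bulk_clean_and_save_csv.py. It reuses the search() call assumed
# above and writes rows with the standard csv module.
import csv
import os
from desync_search import DesyncClient

client = DesyncClient()
urls = ["https://example.com", "https://example.org"]

rows = []
for url in urls:
    page = client.search(url)                            # one stealth search per URL
    rows.append({"url": url, "text": page.text_content})

os.makedirs("output", exist_ok=True)
with open("output/pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "text"])
    writer.writeheader()
    writer.writerows(rows)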

📦 Storage Formats

Format   Path              Notes
CSV      storage/csv/      For spreadsheets, Pandas, Excel
JSON     storage/json/     Human-readable and flexible
SQLite   storage/sqlite/   Lightweight relational database
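
As an illustration of the SQLite option, a minimal sketch using only the standard library (not the schema used by storage/sqlite_storage.py) might be:

# Minimal sketch of the SQLite format using the standard library only;
# the real storage/sqlite_storage.py module may use a different schema.
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, text_content TEXT)"
)
conn.execute(
    "INSERT OR REPLACE INTO pages (url, text_content) VALUES (?, ?)",
    ("https://example.com", "cleaned page text here"),
)
conn.commit()
conn.close()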

👨‍💻 Authors

  • Jackson Ballow
  • Mark Evgenev
  • Maks Kubicki

🪪 License

MIT. Use freely, improve freely, and credit where due.
