content similarity detection / deduplication #915
base: dev
"Dynamic Scope" integration
Add dynamic scope engine.
minor nit
@geeknik very interesting approach, I'm going to review this soon and compare it with BM25.
- Refactored the code to integrate with the existing engine as an optional feature instead of creating a new engine.
- Added documentation to assist with understanding and testing the feature:
- Renamed options to better reflect the feature and added support for customizing the similarity threshold.
Options:
-sdd, -similarity-deduplication Enable content similarity detection to avoid crawling similar pages
-st, -similarity-threshold Set similarity threshold for content deduplication (range: 0.0–1.0, default: 0.1)
When will this feature be available online?
The "Dynamic Scope" integration cuts back on data usage while crawling by using a TF-IDF model to discard pages that are too similar to pages already crawled. 👍🏻
For example, running `katana -d 1 -u https://www.ibm.com/ -j -o ibm.json` produces an `ibm.json` of about 11MB. Running the same command with Dynamic Scope enabled, by adding `-uds` to the command line, drops `ibm.json` to about 3.4MB.
Crawl Fast, Crawl Smart. 🚀
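
To make the idea concrete, here is a minimal sketch of TF-IDF-based deduplication in Python. It is not the code from this PR: the class `SimilarityFilter`, the method `check_and_add`, the tokenizer, the smoothed IDF formula, and the rule "skip a page when cosine similarity to any accepted page is at or above the threshold" are all assumptions made for illustration. The threshold semantics of the actual option may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weight(counts, idf):
    """Turn raw term counts into a sparse TF-IDF vector."""
    return {term: tf * idf[term] for term, tf in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SimilarityFilter:
    """Hypothetical filter: remembers accepted pages, flags near-duplicates."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed meaning: similarity cut-off
        self.docs = []              # term counts of accepted pages
        self.df = Counter()         # document frequency per term

    def check_and_add(self, text):
        """Return True if the page is too similar to an accepted page.

        Pages that are not flagged are stored for future comparisons.
        """
        counts = Counter(tokenize(text))
        n_docs = len(self.docs) + 1
        # Document frequencies including the candidate page itself.
        df = self.df + Counter(counts.keys())
        # Smoothed IDF, so that no term gets a zero or negative weight.
        idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
        candidate = weight(counts, idf)
        for seen in self.docs:
            if cosine(candidate, weight(seen, idf)) >= self.threshold:
                return True
        self.docs.append(counts)
        self.df = df
        return False


if __name__ == "__main__":
    pages = [
        "IBM cloud pricing plans for business",
        "IBM cloud pricing plans for business",
        "contact our sales team today",
    ]
    page_filter = SimilarityFilter(threshold=0.8)
    for page in pages:
        print(page_filter.check_and_add(page), page)
```

In a crawler, a page flagged this way would be dropped from the output and its links would not be followed, which is where the reduction in data usage would come from. This sketch compares the candidate against every accepted page, so it scales linearly with crawl size; a real implementation would likely need something cheaper.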