This script is designed for scenarios where you want to deduplicate incoming data against previously-seen data, but where that previously-seen data is not available to be hashed against (or you just don't want to recalculate all those hashes again). If the source data is available to you, or you want fine-grained control of the deduplication process, tools like jdupes or DupeGuru would better serve you.
This is a prototype script. Do not run it in production, or against any data (incoming or archival) that you're afraid to lose.
Always back up your data. I assume no responsibility for failure to plan on your part.
By design, any file which this script has "seen" twice will always be considered a duplicate, because the script makes no distinction between "old" and "new" files, or between their locations.
When run without the --delete flag, the script ingests and hashes all files, whether new or old. Since --delete wasn't passed, your files won't be touched - just hashed. This is always a safe operation.
When run with the --delete flag, the script will also delete any files it has previously seen - even if those are your "old" files. So be careful! This is an inherently risky operation.
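To make the two modes concrete, here is a minimal, hypothetical sketch of the behaviour described above. It is not the actual script: it assumes the seen-hash set is persisted to a JSON file named seen_hashes.json and that files are compared purely by SHA-256 digest; the real implementation may differ in both respects.

```python
#!/usr/bin/env python3
"""Toy sketch of the ingest / --delete behaviour (illustration only)."""
import argparse
import hashlib
import json
import os

HASH_STORE = "seen_hashes.json"  # hypothetical location of the persistent hash set


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_seen():
    """Load the set of previously seen hashes, if any."""
    try:
        with open(HASH_STORE) as handle:
            return set(json.load(handle))
    except FileNotFoundError:
        return set()


def main():
    parser = argparse.ArgumentParser(description="Toy hash-and-dedupe sketch")
    parser.add_argument("root", help="directory to ingest")
    parser.add_argument("--delete", action="store_true",
                        help="delete files whose hashes have been seen before")
    args = parser.parse_args()

    seen = load_seen()
    for dirpath, _dirnames, filenames in os.walk(args.root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = file_sha256(path)
            if digest in seen:
                # The script cannot tell "old" from "new": anything seen
                # twice is a duplicate, wherever it lives.
                if args.delete:
                    print(f"deleting duplicate: {path}")
                    os.remove(path)  # destructive - there is no undo
                else:
                    print(f"duplicate (kept, no --delete): {path}")
            else:
                seen.add(digest)

    with open(HASH_STORE, "w") as handle:
        json.dump(sorted(seen), handle)


if __name__ == "__main__":
    main()
```

Note how the hash store only ever grows: that is exactly why any file "seen" twice is always treated as a duplicate, regardless of which run first recorded it.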
By design, the --delete flag is intended to be used only on incoming data which you want to deduplicate against a larger dataset. You should never run --delete against pre-existing data which has any importance to you, or which you don't want to lose.
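In terms of the hypothetical sketch above (using the name dedupe.py purely for illustration), the intended workflow is: first run "python dedupe.py /path/to/archive" without --delete so the archive's hashes are recorded but nothing is touched, then run "python dedupe.py --delete /path/to/incoming" so that incoming files whose hashes were already recorded are removed. Pointing --delete back at the archive itself would delete any archive files the script has already seen.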
You have been warned!