
Conversation

davidberenstein1957 (Contributor) commented Jan 11, 2025

Inspired by the approach described here: https://arxiv.org/abs/2312.15685

  • Added a `filter_by_score` method that takes records and scores them by computing the average distance to the k nearest neighbours in an index. This should probably be extended with a threshold to avoid biasing towards high-density regions pre-deduplication. Afterwards, it filters the records based on a budget, which can be a percentage or an exact number of records to keep. A sketch of the scoring idea follows the usage example below.
  • Added a `self_filter_by_score` method that does the same, but within the existing index.
from datasets import Dataset, load_dataset

from semhash import SemHash

# Load the dataset
dataset = load_dataset("imdb")

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=dataset["train"].to_list(), columns=["text"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate(threshold=0.8).deduplicated

# Filter the records down to 80% of the deduplicated set
filtered_records = semhash.filter_by_score(records=deduplicated_records, budget=0.8).selected

# Save the filtered records
train_dataset = Dataset.from_list(filtered_records)
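
For reference, here is a minimal sketch of the scoring-and-budget idea behind `filter_by_score`, assuming precomputed embeddings and a brute-force nearest-neighbour search in place of SemHash's internal index (the function names are illustrative, not the library's API):

import numpy as np

def knn_distance_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each record by the average distance to its k nearest neighbours."""
    # Pairwise Euclidean distances (brute force; fine for a sketch)
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-matches
    # Average distance to the k nearest neighbours per record
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)

def select_by_budget(records: list, scores: np.ndarray, budget) -> list:
    """Keep the highest-scoring records up to the budget (fraction or absolute count)."""
    n_keep = int(len(records) * budget) if isinstance(budget, float) else budget
    order = np.argsort(scores)[::-1]  # descending: most isolated (diverse) records first
    return [records[i] for i in order[:n_keep]]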

Benchmarking: I could not find existing benchmarks for this, but I think we could benchmark common tasks like text classification and QA pre- and post-filtering at various percentages.

from datasets import Dataset, load_dataset

from semhash import SemHash

# Load the dataset
dataset = load_dataset("imdb")


for threshold in [0.2, 0.4, 0.6, 0.8]:
    for budget in [0.2, 0.4, 0.6, 0.8]:
        # Initialize SemHash with the columns to deduplicate
        semhash = SemHash.from_records(records=dataset["train"].to_list(), columns=["text"])

        # Deduplicate the records at the current threshold
        deduplicated_records = semhash.self_deduplicate(threshold=threshold).deduplicated

        # Filter the records at the current budget
        filtered_records = semhash.filter_by_score(records=deduplicated_records, budget=budget).selected

        # Save the filtered records
        train_dataset = Dataset.from_list(filtered_records)

        # train and evaluate squad
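
As an illustration of the "train and evaluate" step, here is a sketch of a simple text-classification benchmark on the filtered IMDB records. It assumes the records keep their label field, and the TF-IDF + logistic-regression pipeline is just a placeholder for whatever model the benchmark would actually use:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def evaluate(filtered_records: list, test_split) -> float:
    """Train on the filtered records and report accuracy on the test split."""
    texts = [record["text"] for record in filtered_records]
    labels = [record["label"] for record in filtered_records]
    clf = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    predictions = clf.predict(test_split["text"])
    return accuracy_score(test_split["label"], predictions)

# e.g. accuracy = evaluate(filtered_records, dataset["test"])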
…for deduplication

- Added `score_columns` parameter to `SemHash` for scoring records based on specified columns.
- Implemented `_score` method to scale and compute scores for records.
- Updated `deduplicate` and `self_deduplicate` methods to incorporate scoring and budget constraints.
- Refactored initialization and documentation to reflect new parameters and functionality.
- Removed unnecessary score cleanup from the records after sorting.
- Moved the scoring logic to ensure records are sorted by score before deduplication.
- Improved code clarity by consolidating score handling within the deduplication process.
- Renamed `_score` method to `_sort_and_scale_scores` to better reflect its functionality.
- Updated references to the scoring method in the deduplication process for improved readability and accuracy.
- Added a new method `compute_nearest_neighbor_alignment_scores` to calculate embedding similarity based on nearest-neighbour alignment (see the sketch after this list).
- Updated the SemHash initialization to optionally compute alignment scores when creating an instance.
- Removed the unused `score_columns` parameter from the SemHash constructor for clarity.
- Enhanced the deduplication process by integrating alignment scoring, improving overall functionality and performance.
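
A minimal sketch of what such a nearest-neighbour alignment score could look like, assuming cosine similarity over normalized embeddings; the actual method on the index may differ:

import numpy as np

def nearest_neighbor_alignment_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each record by the average cosine similarity to its k nearest neighbours."""
    # Normalize so that dot products are cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
    # Average over the k most similar neighbours per record
    return np.sort(sims, axis=1)[:, -k:].mean(axis=1)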
…lt dataclass

- Added a new `FilterResult` dataclass to encapsulate the results of filtering operations, including the selected and filtered records along with their scores (sketched below).
- Implemented `filter_by_score` and `self_filter_by_score` methods in the `SemHash` class to filter records based on their scores, allowing for budget constraints and sorting options.
- Updated the `Index` class to replace the deprecated `compute_nearest_neighbor_alignment_scores` method with a more streamlined `query_top_k` method for querying top-k records.
- Removed unused parameters and methods to enhance code clarity and maintainability.
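
A sketch of what the `FilterResult` dataclass might look like; the exact field names here are assumptions based on the description above:

from dataclasses import dataclass, field

@dataclass
class FilterResult:
    """Result of a filtering operation (field names assumed from the description)."""
    selected: list  # records kept after filtering
    filtered: list  # records removed by the budget
    selected_scores: list = field(default_factory=list)  # scores for the kept records
    filtered_scores: list = field(default_factory=list)  # scores for the removed records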
davidberenstein1957 marked this pull request as ready for review January 22, 2025 07:19
- Replaced `filter_by_score` with `filter_by_entropy` to compute record diversity using scipy's entropy function
- Modified `query_top_k` method to handle vector queries more robustly
- Added scipy>=1.13.1 as a dependency in pyproject.toml
- Updated README with new entropy filtering examples and documentation
- Renamed methods to use `entropy` instead of `score` for clarity
- Improved filtering functionality with more intuitive parameters like `descending` and `k`; a usage sketch follows below.
- Implemented tests for `self_filter_by_entropy` method with various scenarios
- Added test cases for absolute and percentage-based budget filtering
- Verified sorting order with ascending and descending entropy options
- Included validation tests for invalid budget inputs
- Tested string and dictionary input compatibility for entropy filtering
- Replaced `filter_by_score` with `filter_by_entropy` in README documentation
- Clarified description of entropy-based filtering method
- Maintained existing explanation of filtering functionality
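
Based on the parameters named above, a usage sketch of the entropy-based filtering; the exact signature and defaults are assumptions:

# Keep the 60% highest-entropy (most diverse) records; entropy is computed over
# the distances to the k nearest neighbours, and descending=True keeps the
# highest-entropy records first (parameter semantics assumed)
result = semhash.self_filter_by_entropy(budget=0.6, k=5, descending=True)
diverse_records = result.selected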
…ation

- Removed scipy>=1.13.1 from project dependencies
- Created a custom `entropy_from_distances` function in utils.py to replace scipy's entropy function (sketched below)
- Updated `semhash.py` to use the new entropy calculation method
- Reshaped vector inputs to ensure compatibility with `query_top_k` method
- Simplified entropy calculation with a more explicit implementation
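
A minimal sketch of what an `entropy_from_distances` helper could look like, mirroring scipy's entropy applied to a normalized distance vector (the actual implementation in utils.py may differ):

import numpy as np

def entropy_from_distances(distances: np.ndarray) -> float:
    """Shannon entropy of the distribution obtained by normalizing a distance vector."""
    distances = np.asarray(distances, dtype=float)
    total = distances.sum()
    if total == 0:
        return 0.0  # all distances zero: no diversity among neighbours
    p = distances / total
    p = p[p > 0]  # skip zero entries, as scipy's entropy does
    return float(-np.sum(p * np.log(p)))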
davidberenstein1957 changed the title from "feat: Enhance SemHash with scoring functionality" to "feat: Enhance SemHash with entropy scoring functionality" Jan 26, 2025
- Updated the `_validate_filter_budget` method to include a parameter for top-k records and improved the budget validation logic (see the sketch below).
- Introduced `_filter_by_entropy` method to encapsulate entropy-based filtering functionality, streamlining the `filter_by_entropy` and `self_filter_by_entropy` methods.
- Simplified the filtering process by leveraging the new `_filter_by_entropy` method for better code organization and readability.
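
A sketch of the kind of budget validation described above, assuming a budget can be either an absolute count (int) or a fraction (float); the actual checks in `_validate_filter_budget` may differ:

def validate_filter_budget(budget, n_records: int) -> int:
    """Resolve a fractional or absolute budget into an absolute number of records."""
    if isinstance(budget, bool) or budget <= 0:  # bool is a subclass of int, so reject it explicitly
        raise ValueError(f"Budget must be a positive number, got {budget!r}")
    if isinstance(budget, float):
        if budget > 1:
            raise ValueError("A fractional budget must be in (0, 1]")
        return max(1, int(n_records * budget))
    if budget > n_records:
        raise ValueError(f"Budget {budget} exceeds the number of records ({n_records})")
    return budget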
davidberenstein1957 (Contributor, Author) commented

@Pringled should be ready to go :)

Pringled merged commit 05cd14f into MinishLab:main Mar 25, 2025