
Conversation

davidberenstein1957 (Contributor) commented Jan 11, 2025

Inspired by the approach described here: https://arxiv.org/abs/2312.15685

  • Added a `filter_by_score` method that takes records and scores them by computing the average distance to the k nearest neighbours in an index. This should probably be extended with a threshold to avoid biasing towards high-density regions pre-deduplication. Afterwards, it filters the records based on a budget, which can be a percentage or an exact number of records to keep. A sketch of the scoring idea follows the usage example below.
  • Added a `self_filter_by_score` method that does the same, but within the existing index.
from datasets import Dataset, load_dataset

from semhash import SemHash

# Load the dataset
dataset = load_dataset("imdb")

# Initialize SemHash with the columns to deduplicate
semhash = SemHash.from_records(records=dataset["train"].to_list(), columns=["text"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate(threshold=0.8).deduplicated

# Filter the records down to 80% of the deduplicated set
filtered_records = semhash.filter_by_score(records=deduplicated_records, budget=0.8).selected

# Save the filtered records
train_dataset = Dataset.from_list(filtered_records)
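
For reference, here is a minimal sketch of the scoring-and-budget idea behind `filter_by_score`, assuming precomputed embeddings and a brute-force nearest-neighbour search in place of SemHash's internal index (the function names are illustrative, not the library's API):

import numpy as np

def knn_distance_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each record by the average distance to its k nearest neighbours."""
    # Pairwise Euclidean distances (brute force; fine for a sketch)
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-matches
    # Average distance to the k nearest neighbours per record
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)

def select_by_budget(records: list, scores: np.ndarray, budget) -> list:
    """Keep the highest-scoring records up to the budget (fraction or absolute count)."""
    n_keep = int(len(records) * budget) if isinstance(budget, float) else budget
    order = np.argsort(scores)[::-1]  # descending: most isolated (diverse) records first
    return [records[i] for i in order[:n_keep]]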

Benchmarking: I could not find existing benchmarks for this, but I think we could benchmark common tasks like text classification and QA pre- and post-filtering at various percentages.

from datasets import Dataset, load_dataset

from semhash import SemHash

# Load the dataset
dataset = load_dataset("imdb")


for threshold in [0.2, 0.4, 0.6, 0.8]:
    for budget in [0.2, 0.4, 0.6, 0.8]:
        # Initialize SemHash with the columns to deduplicate
        semhash = SemHash.from_records(records=dataset["train"].to_list(), columns=["text"])

        # Deduplicate the records at the current threshold
        deduplicated_records = semhash.self_deduplicate(threshold=threshold).deduplicated

        # Filter the records at the current budget
        filtered_records = semhash.filter_by_score(records=deduplicated_records, budget=budget).selected

        # Save the filtered records
        train_dataset = Dataset.from_list(filtered_records)

        # train and evaluate squad
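
As an illustration of the "train and evaluate" step, here is a sketch of a simple text-classification benchmark on the filtered IMDB records. It assumes the records keep their label field, and the TF-IDF + logistic-regression pipeline is just a placeholder for whatever model the benchmark would actually use:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def evaluate(filtered_records: list, test_split) -> float:
    """Train on the filtered records and report accuracy on the test split."""
    texts = [record["text"] for record in filtered_records]
    labels = [record["label"] for record in filtered_records]
    clf = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    predictions = clf.predict(test_split["text"])
    return accuracy_score(test_split["label"], predictions)

# e.g. accuracy = evaluate(filtered_records, dataset["test"])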
…for deduplication

- Added `score_columns` parameter to `SemHash` for scoring records based on specified columns.
- Implemented `_score` method to scale and compute scores for records.
- Updated `deduplicate` and `self_deduplicate` methods to incorporate scoring and budget constraints.
- Refactored initialization and documentation to reflect new parameters and functionality.
- Removed unnecessary score cleanup from the records after sorting.
- Moved the scoring logic to ensure records are sorted by score before deduplication.
- Improved code clarity by consolidating score handling within the deduplication process.
- Renamed `_score` method to `_sort_and_scale_scores` to better reflect its functionality.
- Updated references to the scoring method in the deduplication process for improved readability and accuracy.
- Added a new method `compute_nearest_neighbor_alignment_scores` to calculate embedding similarity based on nearest-neighbour alignment (see the sketch after this list).
- Updated the SemHash initialization to optionally compute alignment scores when creating an instance.
- Removed the unused `score_columns` parameter from the SemHash constructor for clarity.
- Enhanced the deduplication process by integrating alignment scoring, improving overall functionality and performance.
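
A minimal sketch of what such a nearest-neighbour alignment score could look like, assuming cosine similarity over normalized embeddings; the actual method on the index may differ:

import numpy as np

def nearest_neighbor_alignment_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each record by the average cosine similarity to its k nearest neighbours."""
    # Normalize so that dot products are cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
    # Average over the k most similar neighbours per record
    return np.sort(sims, axis=1)[:, -k:].mean(axis=1)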
…lt dataclass

- Added a new `FilterResult` dataclass to encapsulate the results of filtering operations, including the selected and filtered records along with their scores (sketched below).
- Implemented `filter_by_score` and `self_filter_by_score` methods in the `SemHash` class to filter records based on their scores, allowing for budget constraints and sorting options.
- Updated the `Index` class to replace the deprecated `compute_nearest_neighbor_alignment_scores` method with a more streamlined `query_top_k` method for querying top-k records.
- Removed unused parameters and methods to enhance code clarity and maintainability.
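
A sketch of what the `FilterResult` dataclass might look like; the exact field names here are assumptions based on the description above:

from dataclasses import dataclass, field

@dataclass
class FilterResult:
    """Result of a filtering operation (field names assumed from the description)."""
    selected: list  # records kept after filtering
    filtered: list  # records removed by the budget
    selected_scores: list = field(default_factory=list)  # scores for the kept records
    filtered_scores: list = field(default_factory=list)  # scores for the removed records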
davidberenstein1957 marked this pull request as ready for review January 22, 2025 07:19
- Replaced `filter_by_score` with `filter_by_entropy` to compute record diversity using scipy's entropy function
- Modified `query_top_k` method to handle vector queries more robustly
- Added scipy>=1.13.1 as a dependency in pyproject.toml
- Updated README with new entropy filtering examples and documentation
- Renamed methods to use `entropy` instead of `score` for clarity
- Improved filtering functionality with more intuitive parameters like `descending` and `k`; a usage sketch follows below.
- Implemented tests for `self_filter_by_entropy` method with various scenarios
- Added test cases for absolute and percentage-based budget filtering
- Verified sorting order with ascending and descending entropy options
- Included validation tests for invalid budget inputs
- Tested string and dictionary input compatibility for entropy filtering
- Replaced `filter_by_score` with `filter_by_entropy` in README documentation
- Clarified description of entropy-based filtering method
- Maintained existing explanation of filtering functionality
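
Based on the parameters named above, a usage sketch of the entropy-based filtering; the exact signature and defaults are assumptions:

# Keep the 60% highest-entropy (most diverse) records; entropy is computed over
# the distances to the k nearest neighbours, and descending=True keeps the
# highest-entropy records first (parameter semantics assumed)
result = semhash.self_filter_by_entropy(budget=0.6, k=5, descending=True)
diverse_records = result.selected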
…ation

- Removed scipy>=1.13.1 from project dependencies
- Created a custom `entropy_from_distances` function in utils.py to replace scipy's entropy function (sketched below)
- Updated `semhash.py` to use the new entropy calculation method
- Reshaped vector inputs to ensure compatibility with `query_top_k` method
- Simplified entropy calculation with a more explicit implementation
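
A minimal sketch of what an `entropy_from_distances` helper could look like, mirroring scipy's entropy applied to a normalized distance vector (the actual implementation in utils.py may differ):

import numpy as np

def entropy_from_distances(distances: np.ndarray) -> float:
    """Shannon entropy of the distribution obtained by normalizing a distance vector."""
    distances = np.asarray(distances, dtype=float)
    total = distances.sum()
    if total == 0:
        return 0.0  # all distances zero: no diversity among neighbours
    p = distances / total
    p = p[p > 0]  # skip zero entries, as scipy's entropy does
    return float(-np.sum(p * np.log(p)))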
davidberenstein1957 changed the title from "feat: Enhance SemHash with scoring functionality" to "feat: Enhance SemHash with entropy scoring functionality" Jan 26, 2025
- Updated the `_validate_filter_budget` method to include a parameter for top-k records and improved the budget validation logic (see the sketch below).
- Introduced `_filter_by_entropy` method to encapsulate entropy-based filtering functionality, streamlining the `filter_by_entropy` and `self_filter_by_entropy` methods.
- Simplified the filtering process by leveraging the new `_filter_by_entropy` method for better code organization and readability.
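
A sketch of the kind of budget validation described above, assuming a budget can be either an absolute count (int) or a fraction (float); the actual checks in `_validate_filter_budget` may differ:

def validate_filter_budget(budget, n_records: int) -> int:
    """Resolve a fractional or absolute budget into an absolute number of records."""
    if isinstance(budget, bool) or budget <= 0:  # bool is a subclass of int, so reject it explicitly
        raise ValueError(f"Budget must be a positive number, got {budget!r}")
    if isinstance(budget, float):
        if budget > 1:
            raise ValueError("A fractional budget must be in (0, 1]")
        return max(1, int(n_records * budget))
    if budget > n_records:
        raise ValueError(f"Budget {budget} exceeds the number of records ({n_records})")
    return budget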
davidberenstein1957 (Contributor, Author) commented

@Pringled should be ready to go :)

Pringled merged commit 05cd14f into MinishLab:main Mar 25, 2025