A comprehensive evaluation framework for comparing different TF-IDF weighting schemes on the Cranfield collection.
This project evaluates 20 different TF-IDF combinations (4 TF schemes × 5 IDF schemes) on the Cranfield dataset, a classic Information Retrieval benchmark with 1,400 documents and 225 queries.
- **Raw TF**: `tf(t,d) = freq`
  - Simply counts how many times a term appears in the document
- **Double Normalization (K-normalization)**: `tf(t,d) = k + (1 - k) × (freq / max_freq)`
  - Prevents bias towards longer documents
  - k = 0.5 by default (adjustable)
- **Log-Scaled TF**: `tf(t,d) = 1 + log(freq)` if freq > 0, else 0
  - Dampens the effect of term frequency
  - Reduces the impact of very high-frequency terms
- **Normalized TF**: `tf(t,d) = freq / doc_length`
  - Normalizes by document length
  - Gives relative term importance
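The four TF schemes above can be sketched as plain functions. This is illustrative only; the function names and signatures are assumptions, not necessarily those used in `tfidf_evaluator.py`:

```python
import math

def tf_raw(freq: int) -> float:
    # Raw TF: the term count itself
    return float(freq)

def tf_double_norm(freq: int, max_freq: int, k: float = 0.5) -> float:
    # Double normalization: max_freq is the highest term count in the document
    return k + (1 - k) * (freq / max_freq)

def tf_log(freq: int) -> float:
    # Log scaling dampens very frequent terms
    return 1 + math.log(freq) if freq > 0 else 0.0

def tf_normalized(freq: int, doc_length: int) -> float:
    # Relative frequency within the document
    return freq / doc_length
```

Note how double normalization keeps every present term's weight in [k, 1], so a 10,000-word document cannot dominate purely by repetition.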
- **Standard IDF**: `idf(t) = log(N / df)`
  - Classic IDF formulation
  - N = total documents, df = document frequency
- **Smooth IDF**: `idf(t) = log(N / (1 + df)) + 1`
  - Adds smoothing to prevent zero IDF
  - The +1 ensures positive values
- **Max IDF**: `idf(t) = log(max_df / df)`
  - Uses the maximum document frequency as reference
  - Alternative normalization approach
- **Probabilistic IDF**: `idf(t) = log((N - df) / df)`
  - Based on the probability of relevance
  - From Robertson & Spärck Jones
- **Entropy-based IDF**: `idf(t) = 1 - (entropy / max_entropy)`
  - Measures information content
  - Lower entropy = more discriminative term
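A sketch of the five IDF schemes above. Names are illustrative; in particular, the entropy variant is one plausible reading (entropy of the term's frequency distribution over documents, normalized by `log(N)`), since the README does not pin down how `entropy` is computed:

```python
import math

def idf_standard(df: int, n_docs: int) -> float:
    return math.log(n_docs / df)

def idf_smooth(df: int, n_docs: int) -> float:
    # Smoothed so that df == N still yields a positive weight
    return math.log(n_docs / (1 + df)) + 1

def idf_max(df: int, max_df: int) -> float:
    # max_df = document frequency of the most common term
    return math.log(max_df / df)

def idf_probabilistic(df: int, n_docs: int) -> float:
    # Robertson & Spärck Jones; negative when the term appears in most documents
    return math.log((n_docs - df) / df)

def idf_entropy(doc_freqs: list[float], n_docs: int) -> float:
    # doc_freqs: the term's raw frequency in each document containing it
    total = sum(doc_freqs)
    probs = [f / total for f in doc_freqs if f > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1 - entropy / math.log(n_docs)
```

A term spread uniformly over all documents reaches maximum entropy and gets weight 0; a term concentrated in one document gets weight 1.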
Install dependencies with `pip install -r requirements.txt`.

Download the following files from the repository https://github.com/oussbenk/cranfield-trec-dataset:
- `cran.all.1400.xml` - Documents
- `cran.qry.xml` - Queries
- `cranqrel.trec.txt` - Relevance judgments

Place them in the same directory as the Python scripts.
Run `python tfidf_evaluator.py`. This will:
- Load the Cranfield dataset
- Preprocess documents and queries (tokenization, stopword removal, stemming)
- Build the inverted index
- Evaluate all 20 TF-IDF combinations
- Save results to `tfidf_evaluation_results.csv`
Expected runtime: 2-5 minutes, depending on your system.
Run `python visualize_results.py`. This will generate:
- Heatmaps for each metric (MAP, P@10, R@10, MRR, NDCG@10)
- Comparison charts showing top 10 combinations
- Trend analysis showing how metrics vary across schemes
- Summary report with detailed statistics
The system evaluates performance using:
- MAP (Mean Average Precision): Overall ranking quality
- P@10 (Precision at 10): Precision of top 10 results
- R@10 (Recall at 10): Recall of top 10 results
- MRR (Mean Reciprocal Rank): Position of first relevant document
- NDCG@10: Normalized Discounted Cumulative Gain at 10
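The five metrics above can be sketched for binary relevance judgments like Cranfield's. These are minimal reference implementations, not necessarily identical to `EvaluationMetrics` in `tfidf_evaluator.py` (edge-case handling may differ):

```python
import math

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results that are relevant
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant documents found in the top k
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant document (0 if none retrieved)
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

def average_precision(ranked, relevant):
    # Mean of precision values at each relevant document's rank
    hits, score = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    # Binary-relevance NDCG with a log2 rank discount
    dcg = sum(1 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0
```

MAP and MRR are then simply the means of `average_precision` and `reciprocal_rank` over all 225 queries.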
After running the scripts, you'll get:
- tfidf_evaluation_results.csv - Raw results for all combinations
- heatmap_MAP.png - Heatmap for MAP metric
- heatmap_P_at_10.png - Heatmap for P@10 metric
- heatmap_R_at_10.png - Heatmap for R@10 metric
- heatmap_MRR.png - Heatmap for MRR metric
- heatmap_NDCG_at_10.png - Heatmap for NDCG@10 metric
- top_combinations_comparison.png - Bar chart of top 10 combinations
- metric_trends.png - Line plots showing trends
- evaluation_summary.txt - Detailed text report
- Darker red = better performance
- Look for clusters of high performance
- Compare across metrics to find robust combinations
- The script identifies the best TF-IDF combination for each metric
- Pay special attention to MAP as the primary metric
- Consider P@10 for user-facing applications (top results matter most)
- Some schemes optimize for precision (top results)
- Others optimize for recall (finding all relevant documents)
- Choose based on your application needs
In tfidf_evaluator.py, you can:

- Adjust the K-normalization parameter:

      runner.run_experiment(tf='double_norm', idf='standard', k=0.4)

- Add custom TF schemes:

      @staticmethod
      def tf_custom(term_freq, doc_length, max_freq):
          # Your custom formula
          return ...

- Add custom IDF schemes:

      @staticmethod
      def idf_custom(doc_freq, num_docs):
          # Your custom formula
          return ...
Modify the preprocessor initialization:
preprocessor = TextPreprocessor(
use_stemming=True, # Enable/disable stemming
remove_stopwords=True # Enable/disable stopword removal
)

Change retrieval parameters:
# Retrieve more documents
ranked_results = retrieval.search(query_text, top_k=1000)
# Evaluate at different cutoffs
p5 = EvaluationMetrics.precision_at_k(ranked_doc_ids, relevant_docs, 5)
p20 = EvaluationMetrics.precision_at_k(ranked_doc_ids, relevant_docs, 20)
├── tfidf_evaluator.py # Main evaluation script
├── visualize_results.py # Visualization generator
├── requirements.txt # Python dependencies
├── README.md # This file
├── cran.all.1400.xml # Documents (download separately)
├── cran.qry.xml # Queries (download separately)
└── cranqrel.trec.txt # Qrels (download separately)
- Term → List of (doc_id, term_frequency) tuples
- Efficient retrieval for large document collections
- Stores document statistics (length, max term frequency)
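A minimal index along the lines described above. The class and method names here are illustrative and may not match the script's own implementation:

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # term -> list of (doc_id, term_frequency) postings
        self.postings = defaultdict(list)
        # doc_id -> (document length, max term frequency) for TF normalization
        self.doc_stats = {}

    def add_document(self, doc_id, tokens):
        counts = defaultdict(int)
        for t in tokens:
            counts[t] += 1
        for term, freq in counts.items():
            self.postings[term].append((doc_id, freq))
        self.doc_stats[doc_id] = (len(tokens), max(counts.values()))

    def get_document_frequency(self, term):
        # Number of documents containing the term (df in the IDF formulas)
        return len(self.postings[term])
```

Storing `(length, max_freq)` per document is what lets the normalized and double-normalized TF schemes be computed without re-scanning the text.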
- Uses cosine similarity (normalized dot product)
- Query and document vectors in TF-IDF space
- Length normalization for fair comparison
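Cosine scoring over sparse TF-IDF vectors can be sketched as below. This brute-force version scores every document for clarity; the actual `search()` presumably walks only the postings lists of the query terms:

```python
import math

def cosine_search(query_weights, doc_weights, top_k=10):
    """Rank documents by cosine similarity to the query.

    query_weights: {term: tf-idf weight}
    doc_weights:   {doc_id: {term: tf-idf weight}}
    """
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    scores = {}
    for doc_id, weights in doc_weights.items():
        # Dot product over the (sparse) shared terms
        dot = sum(w * weights.get(t, 0.0) for t, w in query_weights.items())
        if dot:
            d_norm = math.sqrt(sum(w * w for w in weights.values()))
            scores[doc_id] = dot / (q_norm * d_norm)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
```

Dividing by both vector norms is the length normalization mentioned above: long documents gain no advantage from sheer size.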
- Indexing: O(N × M) where N=documents, M=avg doc length
- Query: O(Q × K) where Q=query terms, K=avg postings list length
- Memory: O(V × D) where V=vocabulary size, D=avg docs per term
Based on IR literature, you should expect:
- MAP: 0.25 - 0.45 (Cranfield is challenging)
- Best schemes: Typically log-scaled TF with standard/smooth IDF
- Worst schemes: Raw TF without proper normalization
Make sure the Cranfield XML files are in the same directory as the scripts.
If you run out of memory, reduce the vocabulary size:
# Add minimum document frequency threshold
if self.get_document_frequency(term) < 2:
    continue  # Skip rare terms

Check that:
- XML files are properly formatted
- Query IDs in qrels match query file
- Document IDs in qrels match document file
- Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523.
- Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503-520.
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
This project is for educational purposes. The Cranfield dataset is publicly available for research.
For questions or issues, please refer to the original Cranfield dataset repository: https://github.com/oussbenk/cranfield-trec-dataset