A comprehensive evaluation framework for comparing different TF-IDF weighting schemes on the Cranfield collection.
This project evaluates 20 different TF-IDF combinations (4 TF schemes × 5 IDF schemes) on the Cranfield dataset, a classic Information Retrieval benchmark with 1,400 documents and 225 queries.
- **Raw TF**: `tf(t,d) = freq`
  - Simply counts how many times a term appears in the document
- **Double Normalization (K-normalization)**: `tf(t,d) = k + (1 - k) × (freq / max_freq)`
  - Prevents bias towards longer documents
  - k = 0.5 by default (adjustable)
- **Log-Scaled TF**: `tf(t,d) = 1 + log(freq)` if freq > 0, else 0
  - Dampens the effect of term frequency
  - Reduces the impact of very high-frequency terms
- **Normalized TF**: `tf(t,d) = freq / doc_length`
  - Normalizes by document length
  - Gives relative term importance
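The four TF schemes above can be sketched as plain functions. This is illustrative only; the function names and signatures are assumptions, not necessarily those used in `tfidf_evaluator.py`:

```python
import math

def tf_raw(freq: int) -> float:
    # Raw TF: the term count itself
    return float(freq)

def tf_double_norm(freq: int, max_freq: int, k: float = 0.5) -> float:
    # Double normalization: max_freq is the highest term count in the document
    return k + (1 - k) * (freq / max_freq)

def tf_log(freq: int) -> float:
    # Log scaling dampens very frequent terms
    return 1 + math.log(freq) if freq > 0 else 0.0

def tf_normalized(freq: int, doc_length: int) -> float:
    # Relative frequency within the document
    return freq / doc_length
```

Note how double normalization keeps every present term's weight in [k, 1], so a 10,000-word document cannot dominate purely by repetition.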
- **Standard IDF**: `idf(t) = log(N / df)`
  - Classic IDF formulation
  - N = total documents, df = document frequency
- **Smooth IDF**: `idf(t) = log(N / (1 + df)) + 1`
  - Adds smoothing to prevent zero IDF
  - The +1 ensures positive values
- **Max IDF**: `idf(t) = log(max_df / df)`
  - Uses the maximum document frequency as reference
  - Alternative normalization approach
- **Probabilistic IDF**: `idf(t) = log((N - df) / df)`
  - Based on the probability of relevance
  - From Robertson & Spärck Jones
- **Entropy-based IDF**: `idf(t) = 1 - (entropy / max_entropy)`
  - Measures information content
  - Lower entropy = more discriminative term
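A sketch of the five IDF schemes above. Names are illustrative; in particular, the entropy variant is one plausible reading (entropy of the term's frequency distribution over documents, normalized by `log(N)`), since the README does not pin down how `entropy` is computed:

```python
import math

def idf_standard(df: int, n_docs: int) -> float:
    return math.log(n_docs / df)

def idf_smooth(df: int, n_docs: int) -> float:
    # Smoothed so that df == N still yields a positive weight
    return math.log(n_docs / (1 + df)) + 1

def idf_max(df: int, max_df: int) -> float:
    # max_df = document frequency of the most common term
    return math.log(max_df / df)

def idf_probabilistic(df: int, n_docs: int) -> float:
    # Robertson & Spärck Jones; negative when the term appears in most documents
    return math.log((n_docs - df) / df)

def idf_entropy(doc_freqs: list[float], n_docs: int) -> float:
    # doc_freqs: the term's raw frequency in each document containing it
    total = sum(doc_freqs)
    probs = [f / total for f in doc_freqs if f > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1 - entropy / math.log(n_docs)
```

A term spread uniformly over all documents reaches maximum entropy and gets weight 0; a term concentrated in one document gets weight 1.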
Install dependencies with `pip install -r requirements.txt`.

Download the following files from the repository https://github.com/oussbenk/cranfield-trec-dataset:
- `cran.all.1400.xml` - Documents
- `cran.qry.xml` - Queries
- `cranqrel.trec.txt` - Relevance judgments

Place them in the same directory as the Python scripts.
Run `python tfidf_evaluator.py`. This will:
- Load the Cranfield dataset
- Preprocess documents and queries (tokenization, stopword removal, stemming)
- Build the inverted index
- Evaluate all 20 TF-IDF combinations
- Save results to `tfidf_evaluation_results.csv`
Expected runtime: 2-5 minutes, depending on your system.
Run `python visualize_results.py`. This will generate:
- Heatmaps for each metric (MAP, P@10, R@10, MRR, NDCG@10)
- Comparison charts showing top 10 combinations
- Trend analysis showing how metrics vary across schemes
- Summary report with detailed statistics
The system evaluates performance using:
- MAP (Mean Average Precision): Overall ranking quality
- P@10 (Precision at 10): Precision of top 10 results
- R@10 (Recall at 10): Recall of top 10 results
- MRR (Mean Reciprocal Rank): Position of first relevant document
- NDCG@10: Normalized Discounted Cumulative Gain at 10
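The five metrics above can be sketched for binary relevance judgments like Cranfield's. These are minimal reference implementations, not necessarily identical to `EvaluationMetrics` in `tfidf_evaluator.py` (edge-case handling may differ):

```python
import math

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results that are relevant
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant documents found in the top k
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant document (0 if none retrieved)
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

def average_precision(ranked, relevant):
    # Mean of precision values at each relevant document's rank
    hits, score = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    # Binary-relevance NDCG with a log2 rank discount
    dcg = sum(1 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0
```

MAP and MRR are then simply the means of `average_precision` and `reciprocal_rank` over all 225 queries.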
After running the scripts, you'll get:
- tfidf_evaluation_results.csv - Raw results for all combinations
- heatmap_MAP.png - Heatmap for MAP metric
- heatmap_P_at_10.png - Heatmap for P@10 metric
- heatmap_R_at_10.png - Heatmap for R@10 metric
- heatmap_MRR.png - Heatmap for MRR metric
- heatmap_NDCG_at_10.png - Heatmap for NDCG@10 metric
- top_combinations_comparison.png - Bar chart of top 10 combinations
- metric_trends.png - Line plots showing trends
- evaluation_summary.txt - Detailed text report
- Darker red = better performance
- Look for clusters of high performance
- Compare across metrics to find robust combinations
- The script identifies the best TF-IDF combination for each metric
- Pay special attention to MAP as the primary metric
- Consider P@10 for user-facing applications (top results matter most)
- Some schemes optimize for precision (top results)
- Others optimize for recall (finding all relevant documents)
- Choose based on your application needs
In tfidf_evaluator.py, you can:

- Adjust the K-normalization parameter:

      runner.run_experiment(tf='double_norm', idf='standard', k=0.4)

- Add custom TF schemes:

      @staticmethod
      def tf_custom(term_freq, doc_length, max_freq):
          # Your custom formula
          return ...

- Add custom IDF schemes:

      @staticmethod
      def idf_custom(doc_freq, num_docs):
          # Your custom formula
          return ...
Modify the preprocessor initialization:
preprocessor = TextPreprocessor(
use_stemming=True, # Enable/disable stemming
remove_stopwords=True # Enable/disable stopword removal
)

Change retrieval parameters:
# Retrieve more documents
ranked_results = retrieval.search(query_text, top_k=1000)
# Evaluate at different cutoffs
p5 = EvaluationMetrics.precision_at_k(ranked_doc_ids, relevant_docs, 5)
p20 = EvaluationMetrics.precision_at_k(ranked_doc_ids, relevant_docs, 20)
├── tfidf_evaluator.py # Main evaluation script
├── visualize_results.py # Visualization generator
├── requirements.txt # Python dependencies
├── README.md # This file
├── cran.all.1400.xml # Documents (download separately)
├── cran.qry.xml # Queries (download separately)
└── cranqrel.trec.txt # Qrels (download separately)
- Term → List of (doc_id, term_frequency) tuples
- Efficient retrieval for large document collections
- Stores document statistics (length, max term frequency)
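A minimal index along the lines described above. The class and method names here are illustrative and may not match the script's own implementation:

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # term -> list of (doc_id, term_frequency) postings
        self.postings = defaultdict(list)
        # doc_id -> (document length, max term frequency) for TF normalization
        self.doc_stats = {}

    def add_document(self, doc_id, tokens):
        counts = defaultdict(int)
        for t in tokens:
            counts[t] += 1
        for term, freq in counts.items():
            self.postings[term].append((doc_id, freq))
        self.doc_stats[doc_id] = (len(tokens), max(counts.values()))

    def get_document_frequency(self, term):
        # Number of documents containing the term (df in the IDF formulas)
        return len(self.postings[term])
```

Storing `(length, max_freq)` per document is what lets the normalized and double-normalized TF schemes be computed without re-scanning the text.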
- Uses cosine similarity (normalized dot product)
- Query and document vectors in TF-IDF space
- Length normalization for fair comparison
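Cosine scoring over sparse TF-IDF vectors can be sketched as below. This brute-force version scores every document for clarity; the actual `search()` presumably walks only the postings lists of the query terms:

```python
import math

def cosine_search(query_weights, doc_weights, top_k=10):
    """Rank documents by cosine similarity to the query.

    query_weights: {term: tf-idf weight}
    doc_weights:   {doc_id: {term: tf-idf weight}}
    """
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    scores = {}
    for doc_id, weights in doc_weights.items():
        # Dot product over the (sparse) shared terms
        dot = sum(w * weights.get(t, 0.0) for t, w in query_weights.items())
        if dot:
            d_norm = math.sqrt(sum(w * w for w in weights.values()))
            scores[doc_id] = dot / (q_norm * d_norm)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
```

Dividing by both vector norms is the length normalization mentioned above: long documents gain no advantage from sheer size.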
- Indexing: O(N × M) where N=documents, M=avg doc length
- Query: O(Q × K) where Q=query terms, K=avg postings list length
- Memory: O(V × D) where V=vocabulary size, D=avg docs per term
Based on IR literature, you should expect:
- MAP: 0.25 - 0.45 (Cranfield is challenging)
- Best schemes: Typically log-scaled TF with standard/smooth IDF
- Worst schemes: Raw TF without proper normalization
Make sure the Cranfield XML files are in the same directory as the scripts.
If you run out of memory, reduce the vocabulary size:
# Add minimum document frequency threshold
if self.get_document_frequency(term) < 2:
    continue  # Skip rare terms

Check that:
- XML files are properly formatted
- Query IDs in qrels match query file
- Document IDs in qrels match document file
- Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523.
- Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503-520.
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
This project is for educational purposes. The Cranfield dataset is publicly available for research.
For questions or issues, please refer to the original Cranfield dataset repository: https://github.com/oussbenk/cranfield-trec-dataset