TF-IDF Evaluation on Cranfield Dataset

A comprehensive evaluation framework for comparing different TF-IDF weighting schemes on the Cranfield collection.

Overview

This project evaluates 20 different TF-IDF combinations (4 TF schemes × 5 IDF schemes) on the Cranfield dataset, a classic Information Retrieval benchmark with 1,400 documents and 225 queries.

TF-IDF Schemes Implemented

Term Frequency (TF) Schemes

  1. Raw TF: tf(t,d) = frequency

    • Simply counts how many times a term appears in the document
  2. Double Normalization (K-normalization): tf(t,d) = k + (1-k) × (freq / max_freq)

    • Prevents bias towards longer documents
    • k = 0.5 by default (adjustable)
  3. Log Scaled TF: tf(t,d) = 1 + log(freq) if freq > 0, else 0

    • Dampens the effect of term frequency
    • Reduces the impact of very high frequency terms
  4. Normalized TF: tf(t,d) = freq / doc_length

    • Normalizes by document length
    • Gives relative term importance
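The four TF schemes above can be sketched as plain functions. This is a minimal illustration of the formulas, not the exact code in tfidf_evaluator.py; the function names and signatures are assumptions for this sketch:

```python
import math

def tf_raw(freq, doc_length, max_freq):
    # Raw count of the term in the document
    return freq

def tf_double_norm(freq, doc_length, max_freq, k=0.5):
    # K-normalization: scales against the document's most frequent term
    return k + (1 - k) * (freq / max_freq) if max_freq > 0 else 0.0

def tf_log(freq, doc_length, max_freq):
    # Logarithmic damping of very high counts
    return 1 + math.log(freq) if freq > 0 else 0.0

def tf_norm(freq, doc_length, max_freq):
    # Relative frequency within the document
    return freq / doc_length if doc_length > 0 else 0.0
```

All four share the same signature so the evaluator can swap them interchangeably; that uniform-interface design is what makes sweeping 4 × 5 combinations straightforward.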

Inverse Document Frequency (IDF) Schemes

  1. Standard IDF: idf(t) = log(N / df)

    • Classic IDF formulation
    • N = total documents, df = document frequency
  2. Smooth IDF: idf(t) = log(N / (1 + df)) + 1

    • Adds smoothing to prevent zero IDF
    • +1 ensures positive values
  3. Max IDF: idf(t) = log(max_df / df)

    • Uses maximum document frequency as reference
    • Alternative normalization approach
  4. Probabilistic IDF: idf(t) = log((N - df) / df)

    • Based on probability of relevance
    • From Robertson & Sparck Jones
  5. Entropy-based IDF: idf(t) = 1 - (entropy / max_entropy)

    • Measures information content
    • Lower entropy = more discriminative term
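Four of the five IDF formulas translate directly into code (entropy-based IDF is omitted here because it needs the term's full frequency distribution over documents, not just the document frequency). Again a hedged sketch with assumed names, not the project's actual implementation:

```python
import math

def idf_standard(df, N):
    # Classic log(N / df); assumes df > 0
    return math.log(N / df)

def idf_smooth(df, N):
    # Smoothed variant: never zero, and defined even when df = N
    return math.log(N / (1 + df)) + 1

def idf_max(df, N, max_df):
    # Normalizes against the most common term's document frequency
    return math.log(max_df / df)

def idf_probabilistic(df, N):
    # Robertson & Sparck Jones; note it goes negative when df > N/2
    return math.log((N - df) / df) if 0 < df < N else 0.0
```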

Installation

1. Install Python dependencies

pip install -r requirements.txt

2. Download Cranfield Dataset

Download the following files from the repository: https://github.com/oussbenk/cranfield-trec-dataset

  • cran.all.1400.xml - Documents
  • cran.qry.xml - Queries
  • cranqrel.trec.txt - Relevance judgments

Place them in the same directory as the Python scripts.

Usage

Step 1: Run the Evaluation

python tfidf_evaluator.py

This will:

  • Load the Cranfield dataset
  • Preprocess documents and queries (tokenization, stopword removal, stemming)
  • Build the inverted index
  • Evaluate all 20 TF-IDF combinations
  • Save results to tfidf_evaluation_results.csv

Expected runtime: 2-5 minutes, depending on your system.

Step 2: Generate Visualizations

python visualize_results.py

This will generate:

  • Heatmaps for each metric (MAP, P@10, R@10, MRR, NDCG@10)
  • Comparison charts showing top 10 combinations
  • Trend analysis showing how metrics vary across schemes
  • Summary report with detailed statistics

Evaluation Metrics

The system evaluates performance using:

  • MAP (Mean Average Precision): Overall ranking quality
  • P@10 (Precision at 10): Precision of top 10 results
  • R@10 (Recall at 10): Recall of top 10 results
  • MRR (Mean Reciprocal Rank): Position of first relevant document
  • NDCG@10: Normalized Discounted Cumulative Gain at 10
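To make the definitions concrete, here is a minimal sketch of how P@k, MRR (per query), and average precision are computed. The names differ from the actual EvaluationMetrics class; this is for illustration only:

```python
def precision_at_k(ranked_ids, relevant, k):
    # Fraction of the top-k results that are relevant
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k

def reciprocal_rank(ranked_ids, relevant):
    # 1 / rank of the first relevant result; 0 if none retrieved
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

def average_precision(ranked_ids, relevant):
    # Mean of the precision values taken at each relevant hit
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

MAP is then just the mean of average_precision over all 225 queries, and MRR the mean of reciprocal_rank.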

Output Files

After running the scripts, you'll get:

  1. tfidf_evaluation_results.csv - Raw results for all combinations
  2. heatmap_MAP.png - Heatmap for MAP metric
  3. heatmap_P_at_10.png - Heatmap for P@10 metric
  4. heatmap_R_at_10.png - Heatmap for R@10 metric
  5. heatmap_MRR.png - Heatmap for MRR metric
  6. heatmap_NDCG_at_10.png - Heatmap for NDCG@10 metric
  7. top_combinations_comparison.png - Bar chart of top 10 combinations
  8. metric_trends.png - Line plots showing trends
  9. evaluation_summary.txt - Detailed text report

Interpreting Results

Heatmaps

  • Darker red = better performance
  • Look for clusters of high performance
  • Compare across metrics to find robust combinations

Best Combination

  • The script identifies the best TF-IDF combination for each metric
  • Pay special attention to MAP as the primary metric
  • Consider P@10 for user-facing applications (top results matter most)

Trade-offs

  • Some schemes optimize for precision (top results)
  • Others optimize for recall (finding all relevant documents)
  • Choose based on your application needs

Customization

Modify TF/IDF Parameters

In tfidf_evaluator.py, you can:

  1. Adjust K-normalization parameter:

    runner.run_experiment(tf='double_norm', idf='standard', k=0.4)
  2. Add custom TF schemes:

    @staticmethod
    def tf_custom(term_freq, doc_length, max_freq):
        # Your custom formula
        return ...
  3. Add custom IDF schemes:

    @staticmethod
    def idf_custom(doc_freq, num_docs):
        # Your custom formula
        return ...

Preprocessing Options

Modify the preprocessor initialization:

preprocessor = TextPreprocessor(
    use_stemming=True,      # Enable/disable stemming
    remove_stopwords=True   # Enable/disable stopword removal
)

Evaluation Settings

Change retrieval parameters:

# Retrieve more documents
ranked_results = retrieval.search(query_text, top_k=1000)

# Evaluate at different cutoffs
p5 = EvaluationMetrics.precision_at_k(ranked_doc_ids, relevant_docs, 5)
p20 = EvaluationMetrics.precision_at_k(ranked_doc_ids, relevant_docs, 20)

Project Structure

.
├── tfidf_evaluator.py          # Main evaluation script
├── visualize_results.py         # Visualization generator
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── cran.all.1400.xml           # Documents (download separately)
├── cran.qry.xml                # Queries (download separately)
└── cranqrel.trec.txt           # Qrels (download separately)

Technical Details

Inverted Index Structure

  • Term → List of (doc_id, term_frequency) tuples
  • Efficient retrieval for large document collections
  • Stores document statistics (length, max term frequency)
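The structure above can be sketched as follows (a simplified assumption of how the index is built, not the project's exact code):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: [token, ...]} -> (term -> [(doc_id, tf)], per-doc stats)."""
    index = defaultdict(list)
    stats = {}
    for doc_id, tokens in docs.items():
        counts = defaultdict(int)
        for t in tokens:
            counts[t] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
        # Document statistics needed by the TF schemes
        stats[doc_id] = {"length": len(tokens),
                         "max_freq": max(counts.values()) if counts else 0}
    return index, stats
```

At query time only the postings lists for the query's terms are touched, which is what keeps retrieval fast relative to scanning every document.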

Similarity Calculation

  • Uses cosine similarity (normalized dot product)
  • Query and document vectors in TF-IDF space
  • Length normalization for fair comparison
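With sparse TF-IDF vectors represented as term-to-weight dicts, cosine similarity reduces to a few lines (an illustrative sketch; the function name and dict representation are assumptions):

```python
import math

def cosine_similarity(q_vec, d_vec):
    # q_vec, d_vec: {term: tfidf_weight}; only shared terms contribute to the dot product
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
```

Dividing by both vector norms is the length normalization mentioned above: it keeps long documents from dominating purely because they contain more terms.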

Complexity

  • Indexing: O(N × M) where N=documents, M=avg doc length
  • Query: O(Q × K) where Q=query terms, K=avg postings list length
  • Memory: O(V × D) where V=vocabulary size, D=avg docs per term

Expected Results

Based on IR literature, you should expect:

  • MAP: 0.25 - 0.45 (Cranfield is challenging)
  • Best schemes: Typically log-scaled TF with standard/smooth IDF
  • Worst schemes: Raw TF without proper normalization

Troubleshooting

File Not Found Errors

Make sure the Cranfield XML files are in the same directory as the scripts.

Memory Issues

If you run out of memory, reduce the vocabulary size:

# Add minimum document frequency threshold
if self.get_document_frequency(term) < 2:
    continue  # Skip rare terms

No Results

Check that:

  1. XML files are properly formatted
  2. Query IDs in qrels match query file
  3. Document IDs in qrels match document file

References

  1. Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval.
  2. Robertson, S. (2004). Understanding inverse document frequency.
  3. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval.

License

This project is for educational purposes. The Cranfield dataset is publicly available for research.

Contact

For questions or issues, please refer to the original Cranfield dataset repository: https://github.com/oussbenk/cranfield-trec-dataset
