A collection of five automated feature selection scripts for machine learning projects. These scripts help identify the most relevant features, remove redundancy, and optimize your feature set for better model performance.
Feature selection is essential for building efficient, accurate machine learning models. This repository provides five complementary approaches:
- Variance Threshold Selector - Removes features with low or zero variance
- Correlation-Based Selector - Eliminates redundant highly correlated features
- Statistical Test Selector - Selects features based on statistical significance
- Model-Based Selector - Ranks features using ensemble importance from multiple models
- Recursive Feature Eliminator - Finds optimal feature subset through iterative elimination
Each script can be used independently or combined into a complete feature selection pipeline.
Install the required packages:

```bash
pip install numpy pandas scikit-learn scipy
```

Optional, for visualization:

```bash
pip install matplotlib seaborn
```

Removes constant or near-constant features that provide no information.
```python
from variance_threshold_selector import VarianceThresholdSelector
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')
X = df.drop('target', axis=1)

# Initialize and fit selector
selector = VarianceThresholdSelector(threshold=0.01, normalize=True)
selector.fit(X)

# Get report
print(selector.get_report())

# Transform data
X_selected = selector.transform(X)
print(f"Reduced from {X.shape[1]} to {X_selected.shape[1]} features")
```

Parameters:

- `threshold`: Variance threshold (default: 0.01)
- `normalize`: Whether to normalize variance by feature range (default: True)
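To see the underlying mechanic without the custom script, scikit-learn's built-in `VarianceThreshold` applies the same idea; a minimal sketch on a toy DataFrame (not the script's actual implementation):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data: 'const' never varies, 'varied' does
df = pd.DataFrame({'const': [1, 1, 1, 1], 'varied': [1, 2, 3, 4]})

vt = VarianceThreshold(threshold=0.01)
vt.fit(df)

# Only columns whose variance exceeds the threshold survive
kept = df.columns[vt.get_support()].tolist()
print(kept)  # ['varied']
```

Note that scikit-learn's version does not normalize variance by feature range, so on unscaled data the script's `normalize=True` default is the safer behavior.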
Removes redundant features by identifying highly correlated pairs.
```python
from correlation_selector import CorrelationSelector
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Initialize and fit selector
selector = CorrelationSelector(threshold=0.95, method='pearson')
selector.fit(X, y)

# Get correlated pairs report
print(selector.get_report())

# Transform data
X_selected = selector.transform(X)
print(f"Removed {len(selector.removed_features_)} correlated features")
```

Parameters:

- `threshold`: Correlation threshold above which features are considered redundant (default: 0.95)
- `method`: Correlation method: `'pearson'`, `'spearman'`, or `'kendall'` (default: `'pearson'`)
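Correlation-based filtering typically scans the upper triangle of the absolute correlation matrix and drops one feature from each pair above the threshold. A minimal sketch of that logic in plain pandas/NumPy on synthetic data (illustrative only, not the script's implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'a_copy': a + rng.normal(scale=0.01, size=200),  # nearly identical to 'a'
    'b': rng.normal(size=200),                        # independent
})

# Upper triangle of the absolute correlation matrix (avoid double-counting pairs)
corr = df.corr(method='pearson').abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column correlated above 0.95 with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['a_copy']
```

Which member of a pair to drop is a design choice; keeping the one more correlated with the target (which is why the selector accepts `y`) is a common heuristic.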
Selects features with statistically significant relationships to the target.
```python
from statistical_test_selector import StatisticalTestSelector
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Initialize and fit selector
selector = StatisticalTestSelector(alpha=0.05, correction='fdr', test_type='auto')
selector.fit(X, y, task='classification')

# Get statistical test results
print(selector.get_report())

# Transform data
X_selected = selector.transform(X)
print(f"Selected {len(selector.selected_features_)} significant features")
```

Parameters:

- `alpha`: Significance level (default: 0.05)
- `correction`: Multiple testing correction: `'bonferroni'`, `'fdr'`, or `'none'` (default: `'fdr'`)
- `test_type`: Test type: `'auto'`, `'anova'`, `'chi2'`, `'mutual_info'`, or `'f_regression'` (default: `'auto'`)
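The statistics behind this selector are available directly in scikit-learn. A sketch using the per-feature ANOVA F-test (`f_classif`) with a simple Bonferroni correction; the script's `'fdr'` option would use the less conservative Benjamini–Hochberg procedure instead:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# 5 informative features out of 10; the rest are noise
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# One ANOVA F-test per feature against the class labels
_, pvalues = f_classif(X, y)

# Uncorrected selection at alpha = 0.05, then Bonferroni (alpha / n_tests)
selected = np.where(pvalues < 0.05)[0]
bonferroni = np.where(pvalues < 0.05 / X.shape[1])[0]
print(len(selected), len(bonferroni))
```

Corrections matter because testing many features at once inflates the false-positive rate: at alpha = 0.05, roughly 1 in 20 pure-noise features will look significant by chance.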
Ranks features using importance scores from multiple models.
```python
from model_based_selector import ModelBasedSelector
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Initialize and fit selector
selector = ModelBasedSelector(n_features=10, use_permutation=False)
selector.fit(X, y, task='classification')

# Get importance report
print(selector.get_report())

# Transform data
X_selected = selector.transform(X)
print(f"Selected top {X_selected.shape[1]} features")
```

Parameters:

- `n_features`: Number of top features to select (default: None)
- `threshold`: Importance threshold for selection (default: None)
- `use_permutation`: Whether to use permutation importance (default: False)
- `random_state`: Random state for reproducibility (default: 42)
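The core of model-based selection is an importance vector from a fitted model. A sketch with a single random forest (the script ensembles importances from multiple models, which this deliberately does not reproduce):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
top5 = np.argsort(importances)[::-1][:5]  # indices of the 5 most important features
print(top5)
```

Impurity-based importances are known to be biased toward high-cardinality features, which is why the script exposes `use_permutation=True`: permutation importance is slower but less biased.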
Finds optimal feature subset through systematic elimination and cross-validation.
```python
from recursive_feature_eliminator import RecursiveFeatureEliminator
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Initialize and fit eliminator
eliminator = RecursiveFeatureEliminator(step=1, cv=5, scoring='accuracy')
eliminator.fit(X, y)

# Get performance curve
print(eliminator.get_performance_curve())

# Transform data
X_selected = eliminator.transform(X)
print(f"Optimal subset: {X_selected.shape[1]} features")
print(f"Selected: {X_selected.columns.tolist()}")
```

Parameters:

- `estimator`: Model for importance scores (default: RandomForestClassifier)
- `step`: Number of features to remove per iteration (default: 1)
- `cv`: Number of cross-validation folds (default: 5)
- `scoring`: Scoring metric: `'accuracy'`, `'f1'`, `'roc_auc'` (default: `'accuracy'`)
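This procedure is essentially what scikit-learn's `RFECV` implements, which makes it a useful cross-check for the script's results; a small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=1)

# Drop one feature per iteration; score every subset size with 5-fold CV
rfecv = RFECV(RandomForestClassifier(n_estimators=50, random_state=1),
              step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)

print(rfecv.n_features_)  # size of the best-scoring subset
print(rfecv.support_)     # boolean mask over the original features
```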
Combine multiple selectors for comprehensive feature selection:
```python
import pandas as pd
from variance_threshold_selector import VarianceThresholdSelector
from correlation_selector import CorrelationSelector
from statistical_test_selector import StatisticalTestSelector
from model_based_selector import ModelBasedSelector
from recursive_feature_eliminator import RecursiveFeatureEliminator

# Load data
df = pd.read_csv('your_data.csv')
X = df.drop('target', axis=1)
y = df['target']
print(f"Starting with {X.shape[1]} features")

# Step 1: Remove low-variance features
variance_selector = VarianceThresholdSelector(threshold=0.01)
X = variance_selector.fit_transform(X)
print(f"After variance filter: {X.shape[1]} features")

# Step 2: Remove correlated features
corr_selector = CorrelationSelector(threshold=0.95)
X = corr_selector.fit_transform(X, y)
print(f"After correlation filter: {X.shape[1]} features")

# Step 3: Select statistically significant features
stat_selector = StatisticalTestSelector(alpha=0.05, correction='fdr')
X = stat_selector.fit_transform(X, y, task='classification')
print(f"After statistical tests: {X.shape[1]} features")

# Step 4: Select top features by model importance
model_selector = ModelBasedSelector(n_features=20)
X = model_selector.fit_transform(X, y, task='classification')
print(f"After model selection: {X.shape[1]} features")

# Step 5: Find the optimal subset with RFE
rfe = RecursiveFeatureEliminator(step=1, cv=5)
X_final = rfe.fit_transform(X, y)
print(f"Final optimal subset: {X_final.shape[1]} features")
print(f"Selected features: {X_final.columns.tolist()}")
```

| Selector | Best For | Typical Use Case |
|---|---|---|
| Variance Threshold | Removing uninformative features | First step to eliminate constants |
| Correlation-Based | Handling multicollinearity | Linear models, reducing redundancy |
| Statistical Test | Ensuring significance | Validating feature-target relationships |
| Model-Based | General-purpose ranking | Identifying most predictive features |
| RFE | Finding optimal subset | Final optimization, feature interactions |
- Start with variance filtering - Quick way to remove obvious noise
- Apply correlation filtering early - Reduces dimensionality before expensive operations
- Use statistical tests for validation - Ensure features have meaningful relationships
- Combine multiple methods as needed - Different selectors capture different aspects
- Always use cross-validation - Especially with RFE to avoid overfitting
- Monitor performance - Track model metrics throughout the selection process
- Consider domain knowledge - Don't remove features that carry business meaning just because a statistic flags them
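One concrete way to follow the cross-validation and monitoring advice above is to wrap selection and model in a scikit-learn `Pipeline`, so selection is re-fit inside every fold and the held-out split never leaks into the selection step. A sketch with built-in selectors standing in for the scripts:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)

pipe = Pipeline([
    ('variance', VarianceThreshold(threshold=0.01)),
    ('kbest', SelectKBest(f_classif, k=8)),
    ('model', RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Selection is re-fit on each training fold, so the CV scores are leak-free
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```

Fitting any selector on the full dataset before cross-validating, as the quick-start pipeline above does for simplicity, can overstate performance; prefer the pipelined form when reporting final metrics.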