Skip to content

rosalinatorres888/democracy-clustering-analysis

Repository files navigation

Democracy Clustering Analysis

πŸ“Š Two-part unsupervised ML study revealing governance archetypes across 167 countries β€” extended with SHAP explainability and digital freedom dimensions

A research-grade political data science project combining K-means, hierarchical clustering, PCA, t-SNE, and XGBoost surrogate models to uncover democracy patterns in global governance data. Part 1 applies unsupervised clustering to the EIU Democracy Index. Part 2 extends the analysis with Freedom House internet freedom data and SHAP explainability to explain why countries like Hungary and India defy simple classification.

🌐 View Live Interactive Dashboard β†’

Status Python Clustering SHAP Countries License


Two-Part Structure

Part 1 Part 2
File my_democracy-clustering-notebook-torres.ipynb shap_explainer_v2.ipynb
Data EIU Democracy Index EIU 2024 + Freedom House FOTN 2025
Features 5 democracy dimensions 6 dimensions (+ internet freedom)
Method K-means + Hierarchical + PCA + t-SNE AgglomerativeClustering + XGBoost surrogate
Explainability Silhouette + Bootstrap + ARI SHAP TreeExplainer
Clusters 4 named regime archetypes 4 named digital governance profiles

Part 1 β€” EIU Democracy Index Clustering

Data Source

Economist Intelligence Unit (EIU) Democracy Index β€” 167 countries scored across 5 dimensions:

  • Electoral process and pluralism
  • Functioning of government
  • Political participation
  • Political culture
  • Civil liberties

Methodology

1. Preprocessing

  • StandardScaler normalization across all 5 dimensions
  • Cross-sectional dataset (single-year snapshot β€” no time series imputation required)
  • Missing value handling via SimpleImputer

2. Optimal Cluster Selection

  • KElbow visualizer (k=2 to k=10) for distortion score analysis
  • Silhouette scoring across k range:
k Silhouette Score
2 0.459
3 0.385
4 0.297
5 0.253
6 0.248
7 0.261
  • Political science theory (regime type literature) informed final cluster selection
  • Elbow method confirmed diminishing returns beyond k=4

3. Clustering

  • K-means (k-means++ initialization, random_state=42, n_init=10)
  • Hierarchical clustering (Ward's linkage) for structural validation
  • Dendrogram analysis confirming cluster boundaries

4. Dimensionality Reduction

  • PCA (2 components) for cluster visualization and biplot interpretation
  • t-SNE (perplexity calibrated to dataset size) for non-linear pattern detection

5. Validation

  • Bootstrap stability testing (30 iterations, Adjusted Rand Index)
  • Confusion matrix comparison against EIU expert regime classifications
  • Transition country identification via distance-to-centroid scoring

Part 1 Results β€” 4 Named Regime Archetypes

Cluster Name Countries Profile
0 Liberal Democracy 39 High electoral integrity, strong civil liberties, high political participation
1 Participatory Democracy 41 Democratic with strong participation, moderate institutional performance
2 Hybrid Regime 66 Mixed democratic and authoritarian features, weak institutions
3 Electoral Autocracy 21 Limited political freedoms, controlled elections, weak civil society

Key transition countries identified (high distance-to-centroid scores): Ecuador and others at cluster boundaries representing transitional regimes.


Part 2 β€” SHAP Explainability + Digital Freedom Extension

Why Part 2?

Part 1 revealed that countries like Hungary, India, and Turkey resist clean classification by traditional democracy metrics alone. Part 2 adds internet freedom as a sixth dimension to capture digital authoritarianism β€” a governance pattern not captured by EIU's original five dimensions.

Data Sources

  • EIU Democracy Index 2024 β€” 5 core dimensions
  • Freedom House Freedom on the Net (FOTN) 2025 β€” internet freedom scores (0–100)
  • Merged dataset: data/democracy_v2_dataset.csv β€” 167 countries, 6 features + surveillance_gap

4 Digital Governance Profiles

Cluster Name Defining Characteristics
0 Digital Democracies High EIU score + high internet freedom β€” Nordic/Western Europe
1 Constrained Democracies Democratic institutions but restricted digital space
2 Digital Hybrids Mixed governance + selective internet control (Hungary, India, Turkey)
3 Hard Authoritarians Low EIU score + restricted internet β€” China, Russia

SHAP Explainability

Why XGBoost surrogate? Clustering is unsupervised β€” it produces labels but no feature importance scores. The surrogate approach trains XGBoost to predict cluster membership, then uses SHAP TreeExplainer to explain which features drove each country's assignment.

Surrogate architecture:

surrogate = xgb.XGBClassifier(
    n_estimators=300, max_depth=4,
    learning_rate=0.05, subsample=0.8,
    colsample_bytree=0.8, random_state=42
)

SHAP outputs generated:

  • Global feature importance bar chart (all clusters combined)
  • Per-cluster beeswarm plots (4 separate figures)
  • Country-level waterfall plots β€” Hungary and India
  • Decision plot β€” all Digital Hybrids countries
  • Mean absolute SHAP pivot table by feature and cluster

Key finding: For the Digital Hybrids cluster, internet_freedom and political_culture are the dominant features β€” not electoral process. This explains why Hungary scores relatively well on EIU metrics but lands in the hybrid cluster when digital freedom is included.


Tech Stack

Component Technology
Language Python 3.9+
Data manipulation pandas, numpy
Clustering scikit-learn (KMeans, AgglomerativeClustering)
Dimensionality reduction PCA, t-SNE (sklearn)
Visualization matplotlib, seaborn, Plotly Express, plotly.graph_objects
Clustering validation factoextra (via yellowbrick KElbow), silhouette_score, ARI
Explainability SHAP TreeExplainer, XGBoost surrogate
Statistical methods Bootstrap resampling (30 iterations), Ward's linkage

Repository Structure

democracy-clustering-analysis/
β”œβ”€β”€ my_democracy-clustering-notebook-torres.ipynb  # Part 1 β€” EIU clustering analysis
β”œβ”€β”€ shap_explainer_v2.ipynb                        # Part 2 β€” SHAP + digital freedom extension
β”œβ”€β”€ democracy-clustering-notebook-torres.html      # Rendered HTML notebook (Part 1)
β”œβ”€β”€ index.html                                     # GitHub Pages entry point
β”œβ”€β”€ dashboard/                                     # Dashboard components
β”œβ”€β”€ docs/                                          # Documentation
β”œβ”€β”€ react-democracy-viz/src/                       # React visualization components
β”œβ”€β”€ README.md
└── .gitignore

Data files (not committed β€” sourced externally):

  • democracy_index.csv β€” EIU Democracy Index (download from EIU)
  • data/democracy_v2_dataset.csv β€” EIU 2024 + Freedom House FOTN 2025 merged dataset

Installation

git clone https://github.com/rosalinatorres888/democracy-clustering-analysis.git
cd democracy-clustering-analysis

pip install pandas numpy scikit-learn matplotlib seaborn plotly yellowbrick scipy xgboost shap

Run Part 1: Open my_democracy-clustering-notebook-torres.ipynb in Jupyter. Update the data path in Cell 5 to point to your local EIU Democracy Index CSV.

Run Part 2: Open shap_explainer_v2.ipynb. Requires data/democracy_v2_dataset.csv (EIU 2024 + Freedom House FOTN 2025 merged).


Academic Context

Course: IE6400 β€” Data Analytics Engineering Institution: Northeastern University β€” MS Data Analytics Engineering (EDGE Program) Term: Spring 2025

Research applications demonstrated:

  • Unsupervised ML for political science classification
  • Surrogate model explainability for clustering outputs
  • Multi-source data integration (EIU + Freedom House)
  • Digital authoritarianism detection via internet freedom dimensions
  • Transition regime identification at cluster boundaries

Author

Rosalina Torres β€” ML/AI Engineer MS Data Analytics Engineering @ Northeastern University (EDGE Program) Expected Graduation: August 2026 Β· 4.0 GPA


License

MIT License β€” See LICENSE file for details


Part of an ML/AI engineering portfolio demonstrating unsupervised learning, dimensionality reduction, model explainability, and political data science.


Interactive Dashboard

democracy_clustering_dashboard.html β€” a standalone 5-section interactive dashboard built with Plotly.js:

Section Content
01 Country Explorer Radar chart for any country across all 5 EIU dimensions + borderline flags
02 US Decline Bar chart of dimensional scores + historical trend (2015–2024)
03 Democracy Clustering Scatter plot with toggleable cluster boundaries + 3-method comparison
04 Trends Line chart 2017–2024 β€” all countries, US vs Norway, borderline cases
05 Key Insights Research summary with correlation findings

Clustering method comparison (Adjusted Rand Index vs expert labels):

Method ARI Score Notes
Hierarchical Clustering 0.78 Best alignment with EIU expert classifications
K-Means 0.57 Good overall, less precise at regime boundaries
Gaussian Mixture Model 0.52 Probabilistic β€” useful for borderline cases

Feature correlations with regime type:

  • Civil liberties: 0.90
  • Electoral process: 0.86

Open democracy_clustering_dashboard.html directly in any browser β€” no server required.

Live: rosalinatorres888.github.io/democracy-clustering-analysis/index.html


Limitations & Algorithmic Assumptions

This analysis relies on multidimensional clustering and surrogate explainability models. As with any machine learning architecture applied to sociopolitical data, it is critical to state the epistemological boundaries of the pipeline:

  • The Input Data is Quantified Consensus, Not Physical Law: The EIU Democracy Index is a highly engineered composite of qualitative assessments made by human analysts. By feeding these dimensions into a K-means algorithm, the model is not discovering an absolute, objective political reality. Rather, it is successfully mapping the underlying mathematical geometry of expert consensus. The fact that an unlabeled algorithm recreated expert regime classifications (ARI: 0.78) proves the internal structural consistency of that consensus.

  • Time-Series Shifts Reflect Grader Penalties: When the algorithm isolates 2016 as a breakpoint for the United States, it is not independently "discovering" election interference. It is plotting the exact mathematical penalty applied by EIU human graders to the "Political Culture" and "Functioning of Government" dimensions. The value of the clustering model is that it proves this penalty was not just statistical noise β€” it was mathematically severe enough to fundamentally alter the country's position within the high-dimensional space.

  • Surrogate Explainability is Not Ground Truth: In Part 2, an XGBoost surrogate model and SHAP TreeExplainer are used to calculate feature importance for the clusters. Because K-means forces data into spheres, XGBoost is learning how to predict K-means' arbitrary boundaries, not necessarily the ground-truth "soul" of a country. The SHAP values are explicitly designed to explain the mechanics of the algorithmic clustering decision, not to serve as an objective oracle. It transforms an opaque mathematical grouping into an interpretable one.

About

195 countries analyzed

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors