I'm running HDBSCAN on a massive dataset of geospatial data on rodent inspection sites in New York City (from the NYC open data site). In addition to running the algorithm for the latitude/longitude coordinates of each site, I'm adding a third column — the categorical "INSPECTION_TYPE" column, with three possible values: Initial, BAIT, and Compliance. I'm attempting to change the shape of the marker in the scatterplot for each point based on what inspection type occurred at the site by mapping each value to a different shape after one-hot encoding the column.

Here's the issue: introducing INSPECTION_TYPE is causing the clustering to become messy, with points that should be members of a cluster being treated as noise (marked in grey), and some clusters overlapping with others (points in a cluster are desaturated based on proximity to the core). Also, each point appears to be marked with all three INSPECTION_TYPE categories, resulting in points styled as circles, squares, and triangles simultaneously.

For reference, here is the code:

import pandas as pd
import hdbscan
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':

    plot_kwds = {'alpha' : 0.5, 's' : 50, 'linewidth' : 0}

    df = pd.read_csv('Rodent_Inspection_20251121.csv')
    pd.set_option('display.max_columns', None)

    df.dropna(inplace=True)
    df = df.head(n=10000)

    X = df[['LONGITUDE', 'LATITUDE', 'INSPECTION_TYPE']]

    X = pd.get_dummies(X, columns=['INSPECTION_TYPE'], dtype=int)

    X_scaled = StandardScaler().fit_transform(X)

    hdb = hdbscan.HDBSCAN(min_cluster_size=20)
    hdb.fit(X_scaled)

    palette = sns.color_palette('bright', len(set(hdb.labels_)) - (1 if -1 in hdb.labels_ else 0))
    cluster_colors = [sns.desaturate(palette[col], 0.5) if col >= 0 else (0.5, 0.5, 0.5) for col in hdb.labels_]

    marker_map = {
        'INSPECTION_TYPE_Initial': 'o',
        'INSPECTION_TYPE_BAIT': 's',
        'INSPECTION_TYPE_Compliance': '^'
    }

    for col_name, marker_type in marker_map.items():
        plt.scatter(x=X_scaled.T[0], y=X_scaled.T[1], marker=marker_type, c=cluster_colors, **plot_kwds)

    print(f'Number of clusters: {len(set(hdb.labels_)) - 1}')

    plt.show()

Here's what the scatterplot looks like for the code above:

The resulting scatterplot

Zoomed in to show detail

Here's what the scatterplot looks like with the INSPECTION_TYPE column excluded, all else kept the same:

The resulting scatterplot

Zoomed in to show detail

1 Answer

It is hard to tell exactly what is going on in the plot because several separate data sources feed into it (DataFrames, lists, dicts, etc.). It will be much easier to understand and debug if you take a more modular approach.

Take advantage of your df and use dictionary mappings to make things easier.
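As a small illustration of the dictionary-mapping idea, here is a sketch on a made-up frame. The `toy` DataFrame, its values, and the `type_to_marker` and `cluster_to_color` dictionaries are hypothetical stand-ins, not taken from the real inspection data:

```python
import pandas as pd

# Made-up stand-in for a few rows of inspection data (hypothetical values).
toy = pd.DataFrame({
    "INSPECTION_TYPE": ["Initial", "BAIT", "Compliance", "Initial"],
    "cluster": [0, 1, -1, 0],
})

# One dictionary per visual property, keyed by the raw column values.
type_to_marker = {"Initial": "o", "BAIT": "s", "Compliance": "^"}
cluster_to_color = {0: "tab:blue", 1: "tab:orange", -1: "grey"}

# Series.map looks each value up in the dictionary, row by row.
toy["shape"] = toy["INSPECTION_TYPE"].map(type_to_marker)
toy["color"] = toy["cluster"].map(cluster_to_color)

print(toy)
```

Every row then carries its own marker and color, so the plotting step only has to read columns from one place.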

This solution has the following flow:

  • Setup
  • df analysis (scaling, clustering)
  • df visualization prep (colors, markers)
  • visualization

# import the required packages
import pandas as pd
import hdbscan
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# set up dictionaries
plot_kwds = {
    'alpha': 0.5,
    's': 50,
    'linewidth': 0
}

marker_map = {
    'INSPECTION_TYPE_Initial': 'o',
    'INSPECTION_TYPE_BAIT': 's',
    'INSPECTION_TYPE_Compliance': '^'
}

# read in data, do analysis
df = pd.read_csv('Rodent_Inspection_20251121.csv')

df.dropna(inplace=True)
df = df.head(n=10000)

X = df[['LONGITUDE', 'LATITUDE', 'INSPECTION_TYPE']]

X = pd.get_dummies(X, columns=['INSPECTION_TYPE'], dtype=int)
X_scaled = StandardScaler().fit_transform(X)

hdb = hdbscan.HDBSCAN(min_cluster_size=20)
hdb.fit(X_scaled)

# create a single df with all plotting data
# (copy X, not X_scaled: X_scaled is a NumPy array and cannot take named columns)
X_plot = X.copy()
X_plot["cluster"] = hdb.labels_  # new column with the cluster labels

# color maps for clusters
clusters = sorted(c for c in X_plot["cluster"].unique() if c != -1)

palette = sns.color_palette("bright", len(clusters))

cluster_color_map = dict(zip(clusters, palette))

# noise color
cluster_color_map[-1] = (0.5, 0.5, 0.5)

# new column with color mapping
X_plot["color"] = X_plot["cluster"].map(cluster_color_map)

# new column with marker shapes, keyed by the one-hot column name each row belongs to
X_plot["shape"] = ("INSPECTION_TYPE_" + df["INSPECTION_TYPE"]).map(marker_map)

# plt.scatter accepts only one marker per call, so draw each shape group separately
for shape, group in X_plot.groupby("shape"):
    plt.scatter(x=group['LONGITUDE'], y=group['LATITUDE'], marker=shape, c=group['color'].tolist(), **plot_kwds)

plt.show()

Be sure to inspect the df you've created when troubleshooting; having all the info in a single source should help!
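On the clustering side, one plausible reason the clusters get messier is that the scaled one-hot columns feed into the distances HDBSCAN works with, so two sites at the same location but with different inspection types are no longer "close". The sketch below illustrates this on four made-up sites. The coordinates and the `standardize` helper are hypothetical; the helper mimics what `StandardScaler` does with its default settings:

```python
import numpy as np
import pandas as pd

# Four hypothetical sites; the first two share coordinates but differ in type.
sites = pd.DataFrame({
    "LONGITUDE": [-73.95, -73.95, -73.94, -73.90],
    "LATITUDE": [40.70, 40.70, 40.71, 40.75],
    "INSPECTION_TYPE": ["Initial", "BAIT", "Initial", "Compliance"],
})

def standardize(frame):
    """Zero mean, unit variance per column (population standard deviation)."""
    values = frame.to_numpy(dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0)

# Feature matrix with coordinates only, and with the one-hot columns added.
coords_only = standardize(sites[["LONGITUDE", "LATITUDE"]])
with_dummies = standardize(
    pd.get_dummies(sites, columns=["INSPECTION_TYPE"], dtype=int)
)

# Euclidean distance between the two co-located sites in each feature space.
dist_coords = np.linalg.norm(coords_only[0] - coords_only[1])
dist_dummies = np.linalg.norm(with_dummies[0] - with_dummies[1])

print(f"coordinates only: {dist_coords:.3f}")
print(f"with one-hot columns: {dist_dummies:.3f}")
```

If that is what is happening in your data, one option would be to fit HDBSCAN on the two coordinate columns only and use INSPECTION_TYPE purely to pick the marker shape. That would be consistent with the cleaner plot you get when the column is excluded.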
