I'm running HDBSCAN on a large geospatial dataset of rodent inspection sites in New York City (from the NYC Open Data site). In addition to the latitude/longitude coordinates of each site, I'm feeding the algorithm a third column: the categorical "INSPECTION_TYPE" column, which has three possible values (Initial, BAIT, and Compliance). After one-hot encoding that column, I'm trying to give each point in the scatterplot a marker shape that reflects the inspection type at that site, by mapping each value to a different shape.
Here's the issue: introducing INSPECTION_TYPE makes the clustering messy. Points that look like they should belong to a cluster are treated as noise (drawn in grey), and some clusters overlap with others (clustered points are drawn in desaturated cluster colors). In addition, every point appears to be drawn with all three INSPECTION_TYPE markers at once, so each one shows up as a circle, a square, and a triangle simultaneously.
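My working guess for the clustering part is that the one-hot columns dominate the distance once they are standardized. StandardScaler divides each column by its standard deviation, and for a 0/1 column in which a share p of the rows equals 1, that standard deviation is sqrt(p(1 - p)). Two sites at the same location but with different inspection types would then differ in two one-hot columns. Below is a small sketch of that arithmetic; the category shares are made up for illustration (I have not measured the real ones), and the function names are my own.

```python
import math


def scaled_gap(p):
    """Gap between 0 and 1 in a one-hot column after standardization.

    p is the share of rows where the column equals 1, so the column's
    (population) standard deviation is sqrt(p * (1 - p)).
    """
    return 1.0 / math.sqrt(p * (1.0 - p))


def cross_category_distance(p_a, p_b):
    """Euclidean distance contributed by the one-hot columns alone for two
    points at the same coordinates but in different categories a and b."""
    return math.hypot(scaled_gap(p_a), scaled_gap(p_b))


# Hypothetical category shares, NOT taken from the real dataset
shares = {"Initial": 0.6, "BAIT": 0.3, "Compliance": 0.1}
print(cross_category_distance(shares["Initial"], shares["BAIT"]))
```

If this is right, a gap of that size is large compared with the standardized coordinates, which have unit variance, so sites with different inspection types would rarely end up in the same dense region no matter how close they are geographically.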
For reference, here is the code:
import pandas as pd
import hdbscan
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':
    plot_kwds = {'alpha': 0.5, 's': 50, 'linewidth': 0}

    # Load the data, drop incomplete rows, and keep the first 10,000 rows
    df = pd.read_csv('Rodent_Inspection_20251121.csv')
    pd.set_option('display.max_columns', None)
    df.dropna(inplace=True)
    df = df.head(n=10000)

    # One-hot encode INSPECTION_TYPE, then standardize every column
    X = df[['LONGITUDE', 'LATITUDE', 'INSPECTION_TYPE']]
    X = pd.get_dummies(X, columns=['INSPECTION_TYPE'], dtype=int)
    X_scaled = StandardScaler().fit_transform(X)

    hdb = hdbscan.HDBSCAN(min_cluster_size=20)
    hdb.fit(X_scaled)

    # Color each point by cluster label; noise (label -1) is grey
    n_clusters = len(set(hdb.labels_)) - (1 if -1 in hdb.labels_ else 0)
    palette = sns.color_palette('bright', n_clusters)
    cluster_colors = [sns.desaturate(palette[col], 0.5) if col >= 0 else (0.5, 0.5, 0.5) for col in hdb.labels_]

    marker_map = {
        'INSPECTION_TYPE_Initial': 'o',
        'INSPECTION_TYPE_BAIT': 's',
        'INSPECTION_TYPE_Compliance': '^'
    }
    for col_name, marker_type in marker_map.items():
        plt.scatter(x=X_scaled.T[0], y=X_scaled.T[1], marker=marker_type, c=cluster_colors, **plot_kwds)

    print(f'Number of clusters: {n_clusters}')
    plt.show()
Here's what the scatterplot looks like for the code above:
Here's what the scatterplot looks like with the INSPECTION_TYPE column excluded, all else kept the same:
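As for the markers, what I'm aiming for is one scatter call per inspection type, each restricted to the rows of that type, rather than every point being drawn once per marker. A minimal sketch of that idea follows. `rows_by_category` and `scatter_by_category` are hypothetical helper names of my own, `ax` is assumed to be a matplotlib Axes, and the type is read from the raw INSPECTION_TYPE values rather than from the one-hot columns.

```python
def rows_by_category(values, categories):
    """Map each category to the list of row positions holding that value."""
    rows = {cat: [] for cat in categories}
    for i, value in enumerate(values):
        if value in rows:
            rows[value].append(i)
    return rows


def scatter_by_category(ax, x, y, colors, values, marker_map, **kwargs):
    """Call ax.scatter once per category, passing only that category's rows."""
    for cat, idx in rows_by_category(values, marker_map).items():
        if not idx:
            continue  # nothing to draw for this category
        ax.scatter(
            [x[i] for i in idx],
            [y[i] for i in idx],
            c=[colors[i] for i in idx],
            marker=marker_map[cat],
            label=cat,
            **kwargs,
        )
```

In my script this would presumably be called as `scatter_by_category(plt.gca(), X_scaled.T[0], X_scaled.T[1], cluster_colors, df['INSPECTION_TYPE'].tolist(), {'Initial': 'o', 'BAIT': 's', 'Compliance': '^'}, **plot_kwds)`, assuming the raw column values match those three strings exactly.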