I am working on a ternary KDE (kernel density estimate) plot using Matplotlib + SciPy’s gaussian_kde, but the density scaling is not behaving as expected.
The top panel (scatter plots) shows ternary distributions for different groups (X, Y, Z). These look fine.
Original Picture by OP.
ADDED Picture produced by code, see colorbars missing in original picture:
The bottom panel (KDE heatmaps) should show density contours based on the same datasets.
Expected:
The X dataset is sparse, so the KDE should be faint or nearly blank. The Y and Z datasets have more data, so they should have stronger KDE gradients.
Issue:
Even though X has fewer data points, its KDE still appears structured and unexpectedly dense comparable to Y and Z. Should not it be almost blank or significantly fainter compared to Y and Z? Is this due to bandwidth selection, normalization issues, or KDE smoothing?
I have included a sample dataset for testing: kde_ternary_example.csv
You can load the dataset using:
import pandas as pd
# Load dataset from Google Drive link
file_url = ";
df = pd.read_csv(file_url)
print(df.head()) # Preview the dataset
My Current Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Polygon
from scipy.stats import gaussian_kde
# Load dataset
file_url = ";
df = pd.read_csv(file_url)
# Define ternary combinations and corresponding filters
ternary_combinations = [
(['B', 'C', 'A'], 'X'),
(['C', 'D', 'B'], 'Y'),
(['D', 'E', 'C'], 'Z')
]
# Function to convert barycentric to Cartesian coordinates
def barycentric_to_cartesian(data):
triangle_vertices = np.array([[0, 0], [1, 0], [0.5, np.sqrt(3) / 2]]) # Equilateral triangle
return np.dot(data, triangle_vertices)
# Function to check if points are inside the triangle
def is_inside_triangle(points, triangle_vertices):
A, B, C = triangle_vertices
v0, v1, v2 = B - A, C - A, points - A
d00, d01, d11 = np.dot(v0, v0), np.dot(v0, v1), np.dot(v1, v1)
d20, d21 = np.einsum('ij,j->i', v2, v0), np.einsum('ij,j->i', v2, v1)
denom = d00 * d11 - d01 * d01
u = (d11 * d20 - d01 * d21) / denom
v = (d00 * d21 - d01 * d20) / denom
return (u >= 0) & (v >= 0) & (u + v <= 1)
# Function to plot ternary diagram with KDE heatmap
def plot_ternary_diagram():
fig, axes = plt.subplots(1, 3, figsize=(24, 8), constrained_layout=True, facecolor="whitesmoke")
def plot_ternary(ax, data_columns, normative_filter):
filtered_df = df[df['NN'] == normative_filter]
data = filtered_df[data_columns].dropna().to_numpy()
if data.shape[0] == 0:
return # Skip plotting if no data is available
sum_values = np.sum(data, axis=1, keepdims=True)
valid_rows = sum_values[:, 0] > 0
data = data[valid_rows] / sum_values[valid_rows]
cartesian_points = barycentric_to_cartesian(data)
# Scatter plot
ax.scatter(cartesian_points[:, 0], cartesian_points[:, 1], s=20, alpha=0.5,
facecolors='none', edgecolors='black', linewidth=0.6, zorder=1)
# KDE calculation
if np.linalg.matrix_rank(np.cov(cartesian_points.T)) < 2:
cartesian_points += np.random.normal(0, 1e-6, cartesian_points.shape) # Add small noise
kde = gaussian_kde(cartesian_points.T, bw_method=0.25)
x = np.linspace(0, 1, 200)
y = np.linspace(0, np.sqrt(3) / 2, 200)
X, Y = np.meshgrid(x, y)
grid_points = np.vstack([X.ravel(), Y.ravel()]).T
# Mask points outside the triangle
triangle_vertices = np.array([[0, 0], [1, 0], [0.5, np.sqrt(3) / 2]])
mask = is_inside_triangle(grid_points, triangle_vertices)
Z = np.full(grid_points.shape[0], np.nan)
Z[mask] = kde(grid_points[mask].T)
Z = Z.reshape(X.shape)
# Set density limit
max_density = np.nanmax(Z)
Z[Z > 0.4 * max_density] = 0.4 * max_density
Z[Z < 0.03 * max_density] = np.nan
# Plot KDE
contour = ax.contourf(X, Y, Z, levels=12, cmap='viridis', alpha=0.85, zorder=2)
# Draw triangle boundary
ax.add_patch(Polygon(triangle_vertices, closed=True, fill=False, edgecolor='black',
linewidth=2.5, linestyle='dashed', zorder=3))
# Remove axes and set aspect ratio
ax.set_xticks([])
ax.set_yticks([])
ax.set_frame_on(False)
ax.set_aspect('equal')
# Colorbar
cbar = plt.colorbar(contour, ax=ax, orientation='horizontal', fraction=0.05, pad=0.08)
cbar.set_label('Density', fontsize=12)
for ax, (ternary_set, norm_filter) in zip(axes, ternary_combinations):
plot_ternary(ax, ternary_set, norm_filter)
plt.show()
# Generate and save ternary plots with filters
plot_ternary_diagram()
My Questions
Why is the KDE density for X appearing structured and not as faint as expected?
Could bandwidth selection (bw_method = 0.25; 'scott') be causing unwanted smoothing effects?
Am I normalizing or transforming the data incorrectly?
I am working on a ternary KDE (kernel density estimate) plot using Matplotlib + SciPy’s gaussian_kde, but the density scaling is not behaving as expected.
The top panel (scatter plots) shows ternary distributions for different groups (X, Y, Z). These look fine.
Original Picture by OP.
ADDED Picture produced by code, see colorbars missing in original picture:
The bottom panel (KDE heatmaps) should show density contours based on the same datasets.
Expected:
The X dataset is sparse, so the KDE should be faint or nearly blank. The Y and Z datasets have more data, so they should have stronger KDE gradients.
Issue:
Even though X has fewer data points, its KDE still appears structured and unexpectedly dense comparable to Y and Z. Should not it be almost blank or significantly fainter compared to Y and Z? Is this due to bandwidth selection, normalization issues, or KDE smoothing?
I have included a sample dataset for testing: kde_ternary_example.csv
You can load the dataset using:
import pandas as pd
# Load dataset from Google Drive link
file_url = "https://drive.google/uc?id=1p6qwnEom_ybyhbTNLUOThoVB5YE5xcyJ"
df = pd.read_csv(file_url)
print(df.head()) # Preview the dataset
My Current Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Polygon
from scipy.stats import gaussian_kde
# Load dataset
file_url = "https://drive.google/uc?id=1p6qwnEom_ybyhbTNLUOThoVB5YE5xcyJ"
df = pd.read_csv(file_url)
# Define ternary combinations and corresponding filters
ternary_combinations = [
(['B', 'C', 'A'], 'X'),
(['C', 'D', 'B'], 'Y'),
(['D', 'E', 'C'], 'Z')
]
# Function to convert barycentric to Cartesian coordinates
def barycentric_to_cartesian(data):
triangle_vertices = np.array([[0, 0], [1, 0], [0.5, np.sqrt(3) / 2]]) # Equilateral triangle
return np.dot(data, triangle_vertices)
# Function to check if points are inside the triangle
def is_inside_triangle(points, triangle_vertices):
A, B, C = triangle_vertices
v0, v1, v2 = B - A, C - A, points - A
d00, d01, d11 = np.dot(v0, v0), np.dot(v0, v1), np.dot(v1, v1)
d20, d21 = np.einsum('ij,j->i', v2, v0), np.einsum('ij,j->i', v2, v1)
denom = d00 * d11 - d01 * d01
u = (d11 * d20 - d01 * d21) / denom
v = (d00 * d21 - d01 * d20) / denom
return (u >= 0) & (v >= 0) & (u + v <= 1)
# Function to plot ternary diagram with KDE heatmap
def plot_ternary_diagram():
fig, axes = plt.subplots(1, 3, figsize=(24, 8), constrained_layout=True, facecolor="whitesmoke")
def plot_ternary(ax, data_columns, normative_filter):
filtered_df = df[df['NN'] == normative_filter]
data = filtered_df[data_columns].dropna().to_numpy()
if data.shape[0] == 0:
return # Skip plotting if no data is available
sum_values = np.sum(data, axis=1, keepdims=True)
valid_rows = sum_values[:, 0] > 0
data = data[valid_rows] / sum_values[valid_rows]
cartesian_points = barycentric_to_cartesian(data)
# Scatter plot
ax.scatter(cartesian_points[:, 0], cartesian_points[:, 1], s=20, alpha=0.5,
facecolors='none', edgecolors='black', linewidth=0.6, zorder=1)
# KDE calculation
if np.linalg.matrix_rank(np.cov(cartesian_points.T)) < 2:
cartesian_points += np.random.normal(0, 1e-6, cartesian_points.shape) # Add small noise
kde = gaussian_kde(cartesian_points.T, bw_method=0.25)
x = np.linspace(0, 1, 200)
y = np.linspace(0, np.sqrt(3) / 2, 200)
X, Y = np.meshgrid(x, y)
grid_points = np.vstack([X.ravel(), Y.ravel()]).T
# Mask points outside the triangle
triangle_vertices = np.array([[0, 0], [1, 0], [0.5, np.sqrt(3) / 2]])
mask = is_inside_triangle(grid_points, triangle_vertices)
Z = np.full(grid_points.shape[0], np.nan)
Z[mask] = kde(grid_points[mask].T)
Z = Z.reshape(X.shape)
# Set density limit
max_density = np.nanmax(Z)
Z[Z > 0.4 * max_density] = 0.4 * max_density
Z[Z < 0.03 * max_density] = np.nan
# Plot KDE
contour = ax.contourf(X, Y, Z, levels=12, cmap='viridis', alpha=0.85, zorder=2)
# Draw triangle boundary
ax.add_patch(Polygon(triangle_vertices, closed=True, fill=False, edgecolor='black',
linewidth=2.5, linestyle='dashed', zorder=3))
# Remove axes and set aspect ratio
ax.set_xticks([])
ax.set_yticks([])
ax.set_frame_on(False)
ax.set_aspect('equal')
# Colorbar
cbar = plt.colorbar(contour, ax=ax, orientation='horizontal', fraction=0.05, pad=0.08)
cbar.set_label('Density', fontsize=12)
for ax, (ternary_set, norm_filter) in zip(axes, ternary_combinations):
plot_ternary(ax, ternary_set, norm_filter)
plt.show()
# Generate and save ternary plots with filters
plot_ternary_diagram()
My Questions
Why is the KDE density for X appearing structured and not as faint as expected?
Could bandwidth selection (bw_method = 0.25; 'scott') be causing unwanted smoothing effects?
Am I normalizing or transforming the data incorrectly?
Share Improve this question edited Feb 16 at 20:59 pippo1980 3,0963 gold badges17 silver badges39 bronze badges asked Feb 15 at 5:59 Shubhadeep RoyShubhadeep Roy 11 bronze badge 6- input file is missing we cannot run your code, I voted to close this question. Please read [How do I ask a good question? ](stackoverflow/help/how-to-ask) and How to create a Minimal, Reproducible Example – pippo1980 Commented Feb 15 at 17:02
- line 13, in raise FileNotFoundError(f"The file at {file_path} does not exist.") FileNotFoundError: The file at C:\Users\...\example_help.xlsx does not exist. – pippo1980 Commented Feb 15 at 17:11
- does max_density = np.nanmax(Z) retturn max value for just x,y coords where Z values is equivalent for KDE or for values outside it too(outside the triagles)? how such values can be higher than 1 ? is integration over KDE = 100 or 1 ?? how can you get a KDE value for a single point instead of an interval on wich you should integrate the probability value ? [PS my knowlwdge oin statistics is very bad, hope you got my question] – pippo1980 Commented Feb 16 at 18:56
- max_density = np.nanmax(Z) gives three different results see, uploaded picture output of your code, should you normalize the pdf with value from 0 to 1 to compare the three plots ??? – pippo1980 Commented Feb 16 at 20:53
- print('sum Z : ', np.sum(np.where(Z != np.nan))) results in 7960000 , I am missing something about gaussian_kde – pippo1980 Commented Feb 16 at 21:01
3 Answers
Reset to default 0I don't get your statement:
Expected: The X dataset is sparse, so the KDE should be faint or nearly blank. The Y and Z datasets have more data, so they should have stronger KDE gradients.
Gradients , contours , are draw for each of the 3 plots , X,Y,Z with this line
# Plot KDE
contour = ax.contourf(X, Y, Z, levels=12, cmap='viridis', alpha=0.85, zorder=2)
changing keyword levels=12
to levels=2
:
# Plot KDE
contour = ax.contourf(X, Y, Z, levels=2, cmap='viridis', alpha=0.85, zorder=2)
I get:
Z picture/plot (third one) is less sharp ?? isnt it?
but again I am not following, given that google says (for [https://www.google/search?q=kde+density+integrate+to+1][5])
The area under a probability density function should always integrate to 1, representing the fact that the total probability of a distribution should always sum to 100%. Hence, a KDE curve will always have an area under the curve of 1.
And if I got your algorithm right your are calculating KDE for each of the three set of data separately, but I could very well be wrong, hopefully someone more knowledgeable will step in.
from another perspective :
adding to your code in this section:
# KDE calculation
if np.linalg.matrix_rank(np.cov(cartesian_points.T)) < 2:
cartesian_points += np.random.normal(0, 1e-6, cartesian_points.shape) # Add small noise
kde = gaussian_kde(cartesian_points.T, bw_method=0.25)
the following lines:
integr = kde.integrate_box([-np.inf , -np.inf], [np.max(cartesian_points.T[0]), np.max(cartesian_points.T[1])])
print('\nintegr : ', integr)
got from Scipy Documentation about stats.gaussian_kde method gaussian_kde.integrate_box that:
Computes the integral of a pdf over a rectangular interval.
where KDE stands for "Kernel density estimation" while pdf stands for "probability density function"
to get:
# KDE calculation
if np.linalg.matrix_rank(np.cov(cartesian_points.T)) < 2:
cartesian_points += np.random.normal(0, 1e-6, cartesian_points.shape) # Add small noise
kde = gaussian_kde(cartesian_points.T, bw_method=0.25)
integr = kde.integrate_box([-np.inf , -np.inf], [np.max(cartesian_points.T[0]), np.max(cartesian_points.T[1])])
print('\nintegr : ', integr)
I get as output for the three calculated KDE :
integr : 0.9831556923550625
integr : 0.9990799118644755
integr : 0.9990259262300176
that could be indicating that calculated KDE is fine, integration over all estimation has a value of 1.
I dont know if I am getting your question right but if you comment out:
Z[Z > 0.4 * max_density] = 0.4 * max_density
Z[Z < 0.03 * max_density] = np.nan
to :
#Z[Z > 0.4 * max_density] = 0.4 * max_density
#Z[Z < 0.03 * max_density] = np.nan
You get:
for the points that to me look like:
If you look at the color bar values in the original pics you get that at X now you have a more sharp profile (less yellow and suddently different blu shades) between the contour levels, given that in X max density is 40.
Here original graph , with colorbar
While setting max value of colorbar equal for all the graph with:
levels = np.arange(0, 50, 5)
contour = ax.contourf(X, Y, Z, levels=levels, cmap='viridis', alpha=0.85, zorder=2)
and keeping commented out Z
as above
I get:
but again I am not following, given that google says (for https://www.google/search?q=kde+density+integrate+to+1)
The area under a probability density function should always integrate to 1, representing the fact that the total probability of a distribution should always sum to 100%. Hence, a KDE curve will always have an area under the curve of 1.
And if I got your algorithm right your are calculating KDE for each of the three set of data separately, but I could very weel be wrong, hopefully someone more knowledgeable will step in.
Moreover about normalization using:
def global_normalize_2d(data):
return (data - np.nanmin(data)) / (np.nanmax(data) - np.nanmin(data))
and applying it to Z , I get :
That looks like your first run results with commented out:
#Z[Z > 0.4 * max_density] = 0.4 * max_density
#Z[Z < 0.03 * max_density] = np.nan