python - Matplotlib KDE on Ternary Plot – Unexpected Density Scaling

I am working on a ternary KDE (kernel density estimate) plot using Matplotlib + SciPy’s gaussian_kde, but the density scaling is not behaving as expected.

The top panel (scatter plots) shows ternary distributions for different groups (X, Y, Z). These look fine.

Original Picture by OP.

ADDED Picture produced by code, see colorbars missing in original picture:

The bottom panel (KDE heatmaps) should show density contours based on the same datasets.

Expected:

The X dataset is sparse, so the KDE should be faint or nearly blank. The Y and Z datasets have more data, so they should have stronger KDE gradients.

Issue:

Even though X has fewer data points, its KDE still appears structured and unexpectedly dense comparable to Y and Z. Should not it be almost blank or significantly fainter compared to Y and Z? Is this due to bandwidth selection, normalization issues, or KDE smoothing?

I have included a sample dataset for testing: kde_ternary_example.csv

You can load the dataset using:

import pandas as pd

# Load dataset from Google Drive link
file_url = ";
df = pd.read_csv(file_url)

print(df.head())  # Preview the dataset

My Current Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Polygon
from scipy.stats import gaussian_kde

# Load dataset
file_url = ";
df = pd.read_csv(file_url)

# Define ternary combinations and corresponding filters
ternary_combinations = [
    (['B', 'C', 'A'], 'X'),
    (['C', 'D', 'B'], 'Y'),
    (['D', 'E', 'C'], 'Z')
]

# Function to convert barycentric to Cartesian coordinates
def barycentric_to_cartesian(data):
    triangle_vertices = np.array([[0, 0], [1, 0], [0.5, np.sqrt(3) / 2]])  # Equilateral triangle
    return np.dot(data, triangle_vertices)

# Function to check if points are inside the triangle
def is_inside_triangle(points, triangle_vertices):
    A, B, C = triangle_vertices
    v0, v1, v2 = B - A, C - A, points - A
    d00, d01, d11 = np.dot(v0, v0), np.dot(v0, v1), np.dot(v1, v1)
    d20, d21 = np.einsum('ij,j->i', v2, v0), np.einsum('ij,j->i', v2, v1)
    denom = d00 * d11 - d01 * d01
    u = (d11 * d20 - d01 * d21) / denom
    v = (d00 * d21 - d01 * d20) / denom
    return (u >= 0) & (v >= 0) & (u + v <= 1)

# Function to plot ternary diagram with KDE heatmap
def plot_ternary_diagram():
    fig, axes = plt.subplots(1, 3, figsize=(24, 8), constrained_layout=True, facecolor="whitesmoke")
    
    def plot_ternary(ax, data_columns, normative_filter):
        filtered_df = df[df['NN'] == normative_filter]
        data = filtered_df[data_columns].dropna().to_numpy()
        if data.shape[0] == 0:
            return  # Skip plotting if no data is available
        
        sum_values = np.sum(data, axis=1, keepdims=True)
        valid_rows = sum_values[:, 0] > 0
        data = data[valid_rows] / sum_values[valid_rows]
        cartesian_points = barycentric_to_cartesian(data)
        
        # Scatter plot
        ax.scatter(cartesian_points[:, 0], cartesian_points[:, 1], s=20, alpha=0.5, 
                   facecolors='none', edgecolors='black', linewidth=0.6, zorder=1)
        
        # KDE calculation
        if np.linalg.matrix_rank(np.cov(cartesian_points.T)) < 2:
            cartesian_points += np.random.normal(0, 1e-6, cartesian_points.shape)  # Add small noise
        kde = gaussian_kde(cartesian_points.T, bw_method=0.25)
        
        x = np.linspace(0, 1, 200)
        y = np.linspace(0, np.sqrt(3) / 2, 200)
        X, Y = np.meshgrid(x, y)
        grid_points = np.vstack([X.ravel(), Y.ravel()]).T
        
        # Mask points outside the triangle
        triangle_vertices = np.array([[0, 0], [1, 0], [0.5, np.sqrt(3) / 2]])
        mask = is_inside_triangle(grid_points, triangle_vertices)
        Z = np.full(grid_points.shape[0], np.nan)
        Z[mask] = kde(grid_points[mask].T)
        Z = Z.reshape(X.shape)
        
        # Set density limit
        max_density = np.nanmax(Z)
        Z[Z > 0.4 * max_density] = 0.4 * max_density
        Z[Z < 0.03 * max_density] = np.nan
        
        # Plot KDE
        contour = ax.contourf(X, Y, Z, levels=12, cmap='viridis', alpha=0.85, zorder=2)
        
        # Draw triangle boundary
        ax.add_patch(Polygon(triangle_vertices, closed=True, fill=False, edgecolor='black', 
                             linewidth=2.5, linestyle='dashed', zorder=3))
        
        # Remove axes and set aspect ratio
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_frame_on(False)
        ax.set_aspect('equal')
        
        # Colorbar
        cbar = plt.colorbar(contour, ax=ax, orientation='horizontal', fraction=0.05, pad=0.08)
        cbar.set_label('Density', fontsize=12)
    
    for ax, (ternary_set, norm_filter) in zip(axes, ternary_combinations):
        plot_ternary(ax, ternary_set, norm_filter)
    
    plt.show()

# Generate and save ternary plots with filters
plot_ternary_diagram()

My Questions

Why is the KDE density for X appearing structured and not as faint as expected?

Could bandwidth selection (bw_method = 0.25; 'scott') be causing unwanted smoothing effects?

Am I normalizing or transforming the data incorrectly?

I am working on a ternary KDE (kernel density estimate) plot using Matplotlib + SciPy’s gaussian_kde, but the density scaling is not behaving as expected.

The top panel (scatter plots) shows ternary distributions for different groups (X, Y, Z). These look fine.

Original Picture by OP.

ADDED Picture produced by code, see colorbars missing in original picture:

The bottom panel (KDE heatmaps) should show density contours based on the same datasets.

Expected:

The X dataset is sparse, so the KDE should be faint or nearly blank. The Y and Z datasets have more data, so they should have stronger KDE gradients.

Issue:

I have included a sample dataset for testing: kde_ternary_example.csv

You can load the dataset using:

import pandas as pd

# Load dataset from Google Drive link
file_url = "https://drive.google/uc?id=1p6qwnEom_ybyhbTNLUOThoVB5YE5xcyJ"
df = pd.read_csv(file_url)

print(df.head())  # Preview the dataset

My Current Code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Polygon
from scipy.stats import gaussian_kde

# Load dataset
file_url = "https://drive.google/uc?id=1p6qwnEom_ybyhbTNLUOThoVB5YE5xcyJ"
df = pd.read_csv(file_url)

# Define ternary combinations and corresponding filters
ternary_combinations = [
    (['B', 'C', 'A'], 'X'),
    (['C', 'D', 'B'], 'Y'),
    (['D', 'E', 'C'], 'Z')
]

# Function to convert barycentric to Cartesian coordinates
def barycentric_to_cartesian(data):
    triangle_vertices = np.array([[0, 0], [1, 0], [0.5, np.sqrt(3) / 2]])  # Equilateral triangle
    return np.dot(data, triangle_vertices)

# Function to check if points are inside the triangle
def is_inside_triangle(points, triangle_vertices):
    A, B, C = triangle_vertices
    v0, v1, v2 = B - A, C - A, points - A
    d00, d01, d11 = np.dot(v0, v0), np.dot(v0, v1), np.dot(v1, v1)
    d20, d21 = np.einsum('ij,j->i', v2, v0), np.einsum('ij,j->i', v2, v1)
    denom = d00 * d11 - d01 * d01
    u = (d11 * d20 - d01 * d21) / denom
    v = (d00 * d21 - d01 * d20) / denom
    return (u >= 0) & (v >= 0) & (u + v <= 1)

# Function to plot ternary diagram with KDE heatmap
def plot_ternary_diagram():
    fig, axes = plt.subplots(1, 3, figsize=(24, 8), constrained_layout=True, facecolor="whitesmoke")
    
    def plot_ternary(ax, data_columns, normative_filter):
        filtered_df = df[df['NN'] == normative_filter]
        data = filtered_df[data_columns].dropna().to_numpy()
        if data.shape[0] == 0:
            return  # Skip plotting if no data is available
        
        sum_values = np.sum(data, axis=1, keepdims=True)
        valid_rows = sum_values[:, 0] > 0
        data = data[valid_rows] / sum_values[valid_rows]
        cartesian_points = barycentric_to_cartesian(data)
        
        # Scatter plot
        ax.scatter(cartesian_points[:, 0], cartesian_points[:, 1], s=20, alpha=0.5, 
                   facecolors='none', edgecolors='black', linewidth=0.6, zorder=1)
        
        # KDE calculation
        if np.linalg.matrix_rank(np.cov(cartesian_points.T)) < 2:
            cartesian_points += np.random.normal(0, 1e-6, cartesian_points.shape)  # Add small noise
        kde = gaussian_kde(cartesian_points.T, bw_method=0.25)
        
        x = np.linspace(0, 1, 200)
        y = np.linspace(0, np.sqrt(3) / 2, 200)
        X, Y = np.meshgrid(x, y)
        grid_points = np.vstack([X.ravel(), Y.ravel()]).T
        
        # Mask points outside the triangle
        triangle_vertices = np.array([[0, 0], [1, 0], [0.5, np.sqrt(3) / 2]])
        mask = is_inside_triangle(grid_points, triangle_vertices)
        Z = np.full(grid_points.shape[0], np.nan)
        Z[mask] = kde(grid_points[mask].T)
        Z = Z.reshape(X.shape)
        
        # Set density limit
        max_density = np.nanmax(Z)
        Z[Z > 0.4 * max_density] = 0.4 * max_density
        Z[Z < 0.03 * max_density] = np.nan
        
        # Plot KDE
        contour = ax.contourf(X, Y, Z, levels=12, cmap='viridis', alpha=0.85, zorder=2)
        
        # Draw triangle boundary
        ax.add_patch(Polygon(triangle_vertices, closed=True, fill=False, edgecolor='black', 
                             linewidth=2.5, linestyle='dashed', zorder=3))
        
        # Remove axes and set aspect ratio
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_frame_on(False)
        ax.set_aspect('equal')
        
        # Colorbar
        cbar = plt.colorbar(contour, ax=ax, orientation='horizontal', fraction=0.05, pad=0.08)
        cbar.set_label('Density', fontsize=12)
    
    for ax, (ternary_set, norm_filter) in zip(axes, ternary_combinations):
        plot_ternary(ax, ternary_set, norm_filter)
    
    plt.show()

# Generate and save ternary plots with filters
plot_ternary_diagram()

My Questions

Why is the KDE density for X appearing structured and not as faint as expected?

Could bandwidth selection (bw_method = 0.25; 'scott') be causing unwanted smoothing effects?

Am I normalizing or transforming the data incorrectly?

Share Improve this question edited Feb 16 at 20:59 pippo1980 3,0963 gold badges17 silver badges39 bronze badges asked Feb 15 at 5:59 Shubhadeep Roy 11 bronze badge

input file is missing we cannot run your code, I voted to close this question. Please read [How do I ask a good question? ](stackoverflow/help/how-to-ask) and How to create a Minimal, Reproducible Example – pippo1980 Commented Feb 15 at 17:02
line 13, in raise FileNotFoundError(f"The file at {file_path} does not exist.") FileNotFoundError: The file at C:\Users\...\example_help.xlsx does not exist. – pippo1980 Commented Feb 15 at 17:11
does max_density = np.nanmax(Z) retturn max value for just x,y coords where Z values is equivalent for KDE or for values outside it too(outside the triagles)? how such values can be higher than 1 ? is integration over KDE = 100 or 1 ?? how can you get a KDE value for a single point instead of an interval on wich you should integrate the probability value ? [PS my knowlwdge oin statistics is very bad, hope you got my question] – pippo1980 Commented Feb 16 at 18:56
max_density = np.nanmax(Z) gives three different results see, uploaded picture output of your code, should you normalize the pdf with value from 0 to 1 to compare the three plots ??? – pippo1980 Commented Feb 16 at 20:53
print('sum Z : ', np.sum(np.where(Z != np.nan))) results in 7960000 , I am missing something about gaussian_kde – pippo1980 Commented Feb 16 at 21:01

| Show 1 more comment

3 Answers 3

Sorted by: Reset to default 0

I don't get your statement:

Expected: The X dataset is sparse, so the KDE should be faint or nearly blank. The Y and Z datasets have more data, so they should have stronger KDE gradients.

Gradients , contours , are draw for each of the 3 plots , X,Y,Z with this line

# Plot KDE
contour = ax.contourf(X, Y, Z, levels=12, cmap='viridis', alpha=0.85, zorder=2)

changing keyword levels=12 to levels=2 :

# Plot KDE
contour = ax.contourf(X, Y, Z, levels=2, cmap='viridis', alpha=0.85, zorder=2)

I get:

Z picture/plot (third one) is less sharp ?? isnt it?

but again I am not following, given that google says (for [https://www.google/search?q=kde+density+integrate+to+1][5])

The area under a probability density function should always integrate to 1, representing the fact that the total probability of a distribution should always sum to 100%. Hence, a KDE curve will always have an area under the curve of 1.

And if I got your algorithm right your are calculating KDE for each of the three set of data separately, but I could very well be wrong, hopefully someone more knowledgeable will step in.

from another perspective :

adding to your code in this section:

# KDE calculation
if np.linalg.matrix_rank(np.cov(cartesian_points.T)) < 2:
    cartesian_points += np.random.normal(0, 1e-6, cartesian_points.shape)  # Add small noise
kde = gaussian_kde(cartesian_points.T, bw_method=0.25)

the following lines:

integr = kde.integrate_box([-np.inf , -np.inf], [np.max(cartesian_points.T[0]), np.max(cartesian_points.T[1])])        
print('\nintegr : ', integr)

got from Scipy Documentation about stats.gaussian_kde method gaussian_kde.integrate_box that:

Computes the integral of a pdf over a rectangular interval.

where KDE stands for "Kernel density estimation" while pdf stands for "probability density function"

to get:

# KDE calculation
if np.linalg.matrix_rank(np.cov(cartesian_points.T)) < 2:
            cartesian_points += np.random.normal(0, 1e-6, cartesian_points.shape)  # Add small noise
kde = gaussian_kde(cartesian_points.T, bw_method=0.25)
integr = kde.integrate_box([-np.inf , -np.inf], [np.max(cartesian_points.T[0]), np.max(cartesian_points.T[1])])       
print('\nintegr : ', integr)

I get as output for the three calculated KDE :

integr :  0.9831556923550625

integr :  0.9990799118644755

integr :  0.9990259262300176

that could be indicating that calculated KDE is fine, integration over all estimation has a value of 1.

I dont know if I am getting your question right but if you comment out:

Z[Z > 0.4 * max_density] = 0.4 * max_density
Z[Z < 0.03 * max_density] = np.nan

to :

#Z[Z > 0.4 * max_density] = 0.4 * max_density
#Z[Z < 0.03 * max_density] = np.nan

You get:

for the points that to me look like:

If you look at the color bar values in the original pics you get that at X now you have a more sharp profile (less yellow and suddently different blu shades) between the contour levels, given that in X max density is 40.

Here original graph , with colorbar

While setting max value of colorbar equal for all the graph with:

levels = np.arange(0, 50, 5)
        contour = ax.contourf(X, Y, Z, levels=levels, cmap='viridis', alpha=0.85, zorder=2)

and keeping commented out Z as above

I get:

but again I am not following, given that google says (for https://www.google/search?q=kde+density+integrate+to+1)

The area under a probability density function should always integrate to 1, representing the fact that the total probability of a distribution should always sum to 100%. Hence, a KDE curve will always have an area under the curve of 1.

And if I got your algorithm right your are calculating KDE for each of the three set of data separately, but I could very weel be wrong, hopefully someone more knowledgeable will step in.

Moreover about normalization using:

def global_normalize_2d(data):
    return (data - np.nanmin(data)) / (np.nanmax(data) - np.nanmin(data))

and applying it to Z , I get :

That looks like your first run results with commented out:

#Z[Z > 0.4 * max_density] = 0.4 * max_density
#Z[Z < 0.03 * max_density] = np.nan

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

python - Matplotlib KDE on Ternary Plot – Unexpected Density Scaling - Stack Overflow

3 Answers 3

与本文相关的文章

评论列表(0)