I am trying to find a way to fit Log Pearson3 distribution to my streamflow data but I can't find the way how to! Any tips will be much appreciated. Here is the problem:
Both scipy and lmoment3 packages have Pearson3 but they don't have Log Pearson3 distributions to fit! scipy uses the Maximum Likelihood Estimation (MLE) method to fit the distribution and lmoment3 package fits the distributions by using the l-moment method. But as said, they both only have the Pearson3 distribution among their statistical distributions' list.
I calculate the Annual Maximum Flow (AMF) data which returns a time series of say 50 years of streamflow data. Then I use scipy and lmoment3 packages to fit distributions. I thought if I calculate the log of my AMF and then fit Pearson3 and then calculate the antilog in the end, it would be like fitting Log Pearson3, but it seems like its not! There are differences in how parametrs are being estimated in Pearson3 and Log Pearson3!
and I can't find any proper guide online!
Any thoughts on this? below is the code I use:
import os
import numpy as np
import pandas as pd
import scipy.stats as st
import lmoments3 as lm
from lmoments3 import distr, stats
# Define the folder path and get the list of CSV files
folder_path = '.../daily_ts/'
# folder_path = '.../all_sites/'
csv_files = sorted([f for f in os.listdir(folder_path) if f.endswith('.csv')])
# Iterate over each file and process the data
for i, file in enumerate(csv_files[5:6]):
station_code = file.split('_')[0]
# Read the data
df = pd.read_csv(os.path.join(folder_path, file), skiprows=27, names=['Date', 'Flow (ML)', 'Bureau QCode'])
station_code = file.split('_')[0]
# Process the data
df = df.dropna(subset=['Flow (ML)']) # Ensure 'Flow (ML)' has no NaN values
df['Date'] = pd.to_datetime(df['Date']) # Ensure the 'Date' column is in datetime format
df['Year'] = df['Date'].dt.year # Extract the year from the 'Date' column and add it as a new column
df_max_flow = df.loc[df.groupby('Year')['Flow (ML)'].idxmax()] # Max daily per year
df_max_flow_sorted = df_max_flow.sort_values(by='Flow (ML)', ascending=False) # Sort by 'Flow (ML)' in descending order
df_max_flow_sorted['Rank'] = range(1, len(df_max_flow_sorted) + 1)
df_max_flow_sorted['ARI_Empir'] = (df_max_flow_sorted['Rank'].iloc[-1] + 1) / df_max_flow_sorted['Rank']
df_max_flow_sorted['AEP_Empir'] = 1 / df_max_flow_sorted['ARI_Empir']
data = df_max_flow_sorted['Flow (ML)'].values
# Fit Log-Pearson3 using Maximun Likelihood Estimation method (MLE)
dist_name = 'pearson3'
# Use getattr to dynamically get the fitting method command
scipy_dist_fit = getattr(st, dist_name)
# Calculate natural log of the data
log_data = np.log(data)
param = scipy_dist_fit.fit(log_data)
# Applying the Kolmogorov-Smirnov test
ks_stat, p_val = st.kstest(log_data, dist_name, args=param)
print(f"parameters: {param}")
print(f"ks_stat, p_val: {ks_stat, p_val}")
ARI_dict = {}
AEP_dict = {}
data_dict = {}
logdata_dict = {}
# test the results of ARI and AEP according to the fitted distribution
loc = param[1]
scale = param[2]
shape = param[0]
# get the attribute to run SciPy stat in the loop
scipy_dist_fit = getattr(st, dist_name)
# run the SciPy stat
fitted_dist = scipy_dist_fit(shape, loc=loc, scale=scale)
# Calculate the return period and AEP based on the fitted_dist and add relevant columns to the table
AEP = 1 - fitted_dist.cdf(log_data)
ARI = 1 / AEP
# Store AEP and ARI in a dictionary
AEP_dict['AEP_LP3_MLE'] = np.sort(AEP)
ARI_dict['ARI_LP3_MLE'] = np.sort(ARI)[::-1]
data_dict['data'] = data
logdata_dict['log_data'] = log_data
AEP_df = pd.DataFrame(AEP_dict)
ARI_df = pd.DataFrame(ARI_dict)
data_df = pd.DataFrame(data_dict)
logdata_df = pd.DataFrame(logdata_dict)
# Reset indices of all DataFrames to ensure they align correctly
AEP_df = AEP_df.reset_index(drop=True)
ARI_df = ARI_df.reset_index(drop=True)
df_max_flow_sorted = df_max_flow_sorted.reset_index(drop=True)
data_df = data_df.reset_index(drop=True)
logdata_df = logdata_df.reset_index(drop=True)
df_fitDist = pd.concat([df_max_flow_sorted, AEP_df, ARI_df], axis=1)
LP3_FFA = pd.concat([data_df, logdata_df, AEP_df, ARI_df], axis=1)
# data = fitted_dist.ppf(1 - AEP) # Inverse of CDF
flood_100year = np.exp(fitted_dist.ppf(1 - 0.01))
# flood_100year =fitted_dist.ppf(1 - 0.01)
print(f"100-Year Flood = {flood_100year}")
flood_50year = np.exp(fitted_dist.ppf(1 - 0.02))
# flood_50year = fitted_dist.ppf(1 - 0.02)
print(f"50-Year Flood = {flood_50year}")