最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

python - Trying to create predicted curves of CO2 using observed data and using a non-linear equation but my models are not matc

programmeradmin4浏览0评论

its fine I worked out what went wrong. There were 2 issues and I'll explain what happened. The original co2 dataframe has several time-related columns. When I was rescaling time I was meant to use the column 'time' rather than 'years'. In the df, 'time' is made up of fractional years. For x_val I was meant to minus co2_train['year'].values to get the correct time frame.

This is all being done in python. I have a data frame with CO2 data and the relevant columns are 'year' (year CE) and 'co2' (in ppm). The data ranges from 1959 to 2024 and this is my observed data. I split the data into two sections and I put the 1959 to 2004 data into a function called co2_model (has an equation to try to recreate the pattern of the observed data) to get parameters I can use to calculate predicted values of both periods I created previously. However, when I plot the predicted data against the observed data, I get straight lines for the predicted data when I was expecting sine curves (see pictures to see what I mean)

This is my code

# Build model based on equation
def co2_model(x, a, b, A, phi, c):
  # Model containing linear, quadratic and sinusoidal component 
    return (a*x) + (b*(x**2)) + (A*np.sin((2*np.pi*x)+ phi)) + c 

x = (co2['time']).values - co2['time'].values[0]
y = co2['co2'].values

# Split data into training and validation periods 
co2_train = co2[co2['year'] <= 2004]
co2_val = co2[co2['year'] > 2004]

# Rescaling time periods and specifying dependent variable 
x_train = co2_train['year'].values - co2_train['year'].values[0]
y_train = co2_train['co2'].values

x_val = co2_val['year'].values - co2_val['year'].values[0]
y_val = co2_val['co2'].values

# Fit the co2 model to the training data
params_train, covar_train = curve_fit(co2_model, x_train, y_train)
a, b, A, phi, c = params_train  # Extract fitted parameters

#Calculate predicted values for the training and validation data
y_train_predict = co2_model(x_train, a, b, A, phi, c)
y_val_predict = co2_model(x_val, a, b, A, phi, c)

# Plot the original data with the predictions 
plt.figure(figsize=(15, 6))
plt.scatter(x, y, color='grey', label='Observed temperature', alpha=0.5)

# Plot the training period predictions
plt.plot(x_train, y_train_predict, label='Training Period', color='blue', linestyle='--')

# Plot the validation period predictions
plt.plot(x_val, y_val_predict, label='Validation Period', color='red', linestyle='--')

plt.xlabel('Year (CE)')
plt.ylabel('CO$_2$ (ppm)') 

# Show the plot
plt.show()

This is what I get The graph I get

But I was expecting something similar to this but one-half red and the other blue Different curve

Please let me know if I am just making a silly mistake

its fine I worked out what went wrong. There were 2 issues and I'll explain what happened. The original co2 dataframe has several time-related columns. When I was rescaling time I was meant to use the column 'time' rather than 'years'. In the df, 'time' is made up of fractional years. For x_val I was meant to minus co2_train['year'].values to get the correct time frame.

This is all being done in python. I have a data frame with CO2 data and the relevant columns are 'year' (year CE) and 'co2' (in ppm). The data ranges from 1959 to 2024 and this is my observed data. I split the data into two sections and I put the 1959 to 2004 data into a function called co2_model (has an equation to try to recreate the pattern of the observed data) to get parameters I can use to calculate predicted values of both periods I created previously. However, when I plot the predicted data against the observed data, I get straight lines for the predicted data when I was expecting sine curves (see pictures to see what I mean)

This is my code

# Build model based on equation
def co2_model(x, a, b, A, phi, c):
  # Model containing linear, quadratic and sinusoidal component 
    return (a*x) + (b*(x**2)) + (A*np.sin((2*np.pi*x)+ phi)) + c 

x = (co2['time']).values - co2['time'].values[0]
y = co2['co2'].values

# Split data into training and validation periods 
co2_train = co2[co2['year'] <= 2004]
co2_val = co2[co2['year'] > 2004]

# Rescaling time periods and specifying dependent variable 
x_train = co2_train['year'].values - co2_train['year'].values[0]
y_train = co2_train['co2'].values

x_val = co2_val['year'].values - co2_val['year'].values[0]
y_val = co2_val['co2'].values

# Fit the co2 model to the training data
params_train, covar_train = curve_fit(co2_model, x_train, y_train)
a, b, A, phi, c = params_train  # Extract fitted parameters

#Calculate predicted values for the training and validation data
y_train_predict = co2_model(x_train, a, b, A, phi, c)
y_val_predict = co2_model(x_val, a, b, A, phi, c)

# Plot the original data with the predictions 
plt.figure(figsize=(15, 6))
plt.scatter(x, y, color='grey', label='Observed temperature', alpha=0.5)

# Plot the training period predictions
plt.plot(x_train, y_train_predict, label='Training Period', color='blue', linestyle='--')

# Plot the validation period predictions
plt.plot(x_val, y_val_predict, label='Validation Period', color='red', linestyle='--')

plt.xlabel('Year (CE)')
plt.ylabel('CO$_2$ (ppm)') 

# Show the plot
plt.show()

This is what I get The graph I get

But I was expecting something similar to this but one-half red and the other blue Different curve

Please let me know if I am just making a silly mistake

Share Improve this question edited yesterday carpet asked Feb 18 at 4:31 carpetcarpet 11 silver badge1 bronze badge New contributor carpet is a new contributor to this site. Take care in asking for clarification, commenting, and answering. Check out our Code of Conduct. 5
  • 1 Periodicity is something that is really difficult to fit with curve_fit. You might be better with an FFT, which should pick out that annual fluctuation. Alternatively, do separate curve fits to high-pass-filtered and low-pass-filtered versions of your data. – lastchance Commented 2 days ago
  • to add on the comment of lastchance, the term (A*np.sin((2*np.pi*x)+ phi)) does not contain a parameter to fit the frequency of the sine. You have A and phi, which control the amplitude and the phase shift. Although again, there are a lot of examples on stackoverflow that show why fitting sines with curve_fit does not work well – Tino D Commented 2 days ago
  • I suspect what has happened is that you have chosen the wrong fixed period for your sin wave. x and x^2 work no matter what the scaling of x, but the sin(2*np.pi*x)+ phi) will only work correctly if the units of x are calendar years. I suggest that you set A to 5 before plotting the output and see what sort of a mess results. I'd be surprised if any modern least squares optimiser couldn't fit a few coefficients of a Fourier expansion. I have never used this particular one. Although from what @TinoD says this may be the case for this curve_fit - do you have a cite for that? – Martin Brown Commented 2 days ago
  • 1 @MartinBrown I would mainly focus on these two posts: one with why periodic data is hard to fit with optimizers and another with a possible solution to this problem. – Tino D Commented 2 days ago
  • 1 @carpet, Since your model is an addition between an exp and a sin, I recommend you first fit an exp function, remove the results from the data and then use this solution to fit the periodic data – Tino D Commented 2 days ago
Add a comment  | 

2 Answers 2

Reset to default 1

OK, I believe that you are probably plotting the monthly mean data from https://gml.noaa.gov/ccgg/trends/data.html

I've taken the liberty of downloading the link "Mauna Loa CO2 monthly mean data (text)" and stripped it down to the table (plus two header lines). I've assumed your data is in the columns "decimal date" and "monthly average". (There is a tiny flaw here for subsequent Fourier analysis - months don't have precisely the same length. You could correct for it - see the code below - but it makes negligible difference.)

I then did an FFT and looked at the frequency spectrum (below). You get an annual signal (f=1 time per year) and a modest seasonal harmonic (f=2 times per year).

I used the FFT to strip the annual and seasonal harmonic sinusoids from the data, then did a standard quadratic fit to the remaining data with curve_fit.

import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import rfft
from scipy.optimize import curve_fit

PI = math.pi

#---------------------------------------------------------------------

def amplitudePhaseAngle( p, q ):
    '''
    Convert   p.cos(h)+q.sin(h)   to    A.sin(h+phi)
    '''
    A = math.sqrt( p ** 2 + q ** 2 )
    phi = math.atan2( p, q )
    return A, phi

#---------------------------------------------------------------------

def plotSpectrum( freq, absft ):
    '''
    Plot the absolute values of the Fourier coefficients
    '''
    plt.plot( freq[:len(absft)], absft, '-o' )
    plt.grid()
    plt.show()

#---------------------------------------------------------------------

def fitSeries( N, nfreq, ft, w, t ):
   '''
   Fit a (truncated) Fourier series, using only the frequency indices in nfreq[]
   '''
   fit = np.zeros_like( t )
   for n in nfreq:
       if n == 0 or ( N % 2 == 0 and n == N // 2 ):
           A, phi = amplitudePhaseAngle( (1/N) * ft[n].real, 0.0 )
       else:
           A, phi = amplitudePhaseAngle( (2/N) * ft[n].real, -(2/N) * ft[n].imag )
       print( "Adding amplitude, circular frequency, phase =", A, w[n], phi )
       fit += A * np.sin( w[n] * t + phi )
   return fit

#---------------------------------------------------------------------

def quadfit( t, a, b, c ):
    return a + b * t + c * t ** 2

#---------------------------------------------------------------------


############
# Get data #
############
t, CO2 = np.loadtxt( 'co2_mm_mlo.txt', skiprows=2, usecols=[2,3], unpack=True )
t -= t[0]
N = len( t )
#t = np.linspace( 0, t[-1], N )         # use this if you want to correct for the fact that months have different length!
dt = ( t[-1] - t[0] ) / ( N - 1 )


########################################
# Remove the annual periodic component #
########################################
ft = rfft( CO2 )
ft[0] = 0
absft = np.abs( ft )

w = np.linspace( 0, 2 * PI / dt, N, endpoint=False )
dw = w[1] - w[0]
annual   = int( 2 * PI     / dw + 0.5 )          # index of nearest frequency to 1 time per year
harmonic = int( 2 * PI * 2 / dw + 0.5 )          # index of nearest frequency to 2 times per year
Fourier = fitSeries( N, [annual, harmonic], ft, w, t )


##########################################
# Now do a quadratic fit to what is left #
##########################################
CO2rev = CO2 - Fourier
p0 = ( CO2rev[0], 0.0, 0.0 )
popt, pcov = curve_fit( quadfit, t, CO2rev, p0=p0 )
print( 'Quadratic fit a + b.t + c.t^2;  a, b, c = ', *popt )


##################################
# Compare fit with original data #
##################################
plt.plot( t, CO2, 'o', color='gray', markersize=3, label='Observed' )
plt.plot( t, Fourier + quadfit( t, *popt ), 'b-', label='Fitted' )
plt.legend()
plt.show()


##################
# Check spectrum #
##################
plotSpectrum( w / ( 2 * PI ), absft )                      # plot against annual frequency
plotSpectrum( range( len(absft) ), absft )                 # plot against index of frequency

The full fit is given in the output. (You don't have to add the harmonic if you just want to use the single annual sinusoid.)

Adding amplitude, circular frequency, phase = 2.347225509468888 6.290476589299131 0.7749909207496195
Adding amplitude, circular frequency, phase = 1.0205014708960185 12.580953178598262 -2.203183819215078
Quadratic fit a + b.t + c.t^2;  a, b, c =  314.66275250650784 0.7556331890507764 0.013323163806095294

You are missing pulsation in your sin function.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import optimize

Let's import the dataset:

data = pd.read_csv("https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_weekly_mlo.csv", header=35)
data = data.replace({-999.99: None}).dropna(subset="average")
data = data.astype(float)

Then we define the model as follow:

def model(t, a, b, c, A, omega, phi):
    return a * t ** 2 + b * t + c + A * np.sin(2 * np.pi * omega * t + phi)

Where omega is the pulsation of the signal.

Then we select some training set based on year criterion:

q = data["decimal"] < 2004

And regress on this sub dataset:

popt, pcov = optimize.curve_fit(model, data.loc[q, "decimal"], data.loc[q, "average"])

Finally we interpolate and extrapolate for the complete time span:

yhat = model(data["decimal"], *popt)

Which leads to:

As you can see, curve_fit does capture the oscillation when omega parameters is introduced. Anyway the extrapolation is not very good on the long run as the polynomial trend is not enough captured for data before 2004.

Fitting the whole dataset with the same model gives indeed better results:

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论