its fine I worked out what went wrong. There were 2 issues and I'll explain what happened. The original co2 dataframe has several time-related columns. When I was rescaling time I was meant to use the column 'time' rather than 'years'. In the df, 'time' is made up of fractional years. For x_val I was meant to minus co2_train['year'].values to get the correct time frame.
This is all being done in python. I have a data frame with CO2 data and the relevant columns are 'year' (year CE) and 'co2' (in ppm). The data ranges from 1959 to 2024 and this is my observed data. I split the data into two sections and I put the 1959 to 2004 data into a function called co2_model (has an equation to try to recreate the pattern of the observed data) to get parameters I can use to calculate predicted values of both periods I created previously. However, when I plot the predicted data against the observed data, I get straight lines for the predicted data when I was expecting sine curves (see pictures to see what I mean)
This is my code
# Build model based on equation
def co2_model(x, a, b, A, phi, c):
# Model containing linear, quadratic and sinusoidal component
return (a*x) + (b*(x**2)) + (A*np.sin((2*np.pi*x)+ phi)) + c
x = (co2['time']).values - co2['time'].values[0]
y = co2['co2'].values
# Split data into training and validation periods
co2_train = co2[co2['year'] <= 2004]
co2_val = co2[co2['year'] > 2004]
# Rescaling time periods and specifying dependent variable
x_train = co2_train['year'].values - co2_train['year'].values[0]
y_train = co2_train['co2'].values
x_val = co2_val['year'].values - co2_val['year'].values[0]
y_val = co2_val['co2'].values
# Fit the co2 model to the training data
params_train, covar_train = curve_fit(co2_model, x_train, y_train)
a, b, A, phi, c = params_train # Extract fitted parameters
#Calculate predicted values for the training and validation data
y_train_predict = co2_model(x_train, a, b, A, phi, c)
y_val_predict = co2_model(x_val, a, b, A, phi, c)
# Plot the original data with the predictions
plt.figure(figsize=(15, 6))
plt.scatter(x, y, color='grey', label='Observed temperature', alpha=0.5)
# Plot the training period predictions
plt.plot(x_train, y_train_predict, label='Training Period', color='blue', linestyle='--')
# Plot the validation period predictions
plt.plot(x_val, y_val_predict, label='Validation Period', color='red', linestyle='--')
plt.xlabel('Year (CE)')
plt.ylabel('CO$_2$ (ppm)')
# Show the plot
plt.show()
This is what I get The graph I get
But I was expecting something similar to this but one-half red and the other blue Different curve
Please let me know if I am just making a silly mistake
its fine I worked out what went wrong. There were 2 issues and I'll explain what happened. The original co2 dataframe has several time-related columns. When I was rescaling time I was meant to use the column 'time' rather than 'years'. In the df, 'time' is made up of fractional years. For x_val I was meant to minus co2_train['year'].values to get the correct time frame.
This is all being done in python. I have a data frame with CO2 data and the relevant columns are 'year' (year CE) and 'co2' (in ppm). The data ranges from 1959 to 2024 and this is my observed data. I split the data into two sections and I put the 1959 to 2004 data into a function called co2_model (has an equation to try to recreate the pattern of the observed data) to get parameters I can use to calculate predicted values of both periods I created previously. However, when I plot the predicted data against the observed data, I get straight lines for the predicted data when I was expecting sine curves (see pictures to see what I mean)
This is my code
# Build model based on equation
def co2_model(x, a, b, A, phi, c):
# Model containing linear, quadratic and sinusoidal component
return (a*x) + (b*(x**2)) + (A*np.sin((2*np.pi*x)+ phi)) + c
x = (co2['time']).values - co2['time'].values[0]
y = co2['co2'].values
# Split data into training and validation periods
co2_train = co2[co2['year'] <= 2004]
co2_val = co2[co2['year'] > 2004]
# Rescaling time periods and specifying dependent variable
x_train = co2_train['year'].values - co2_train['year'].values[0]
y_train = co2_train['co2'].values
x_val = co2_val['year'].values - co2_val['year'].values[0]
y_val = co2_val['co2'].values
# Fit the co2 model to the training data
params_train, covar_train = curve_fit(co2_model, x_train, y_train)
a, b, A, phi, c = params_train # Extract fitted parameters
#Calculate predicted values for the training and validation data
y_train_predict = co2_model(x_train, a, b, A, phi, c)
y_val_predict = co2_model(x_val, a, b, A, phi, c)
# Plot the original data with the predictions
plt.figure(figsize=(15, 6))
plt.scatter(x, y, color='grey', label='Observed temperature', alpha=0.5)
# Plot the training period predictions
plt.plot(x_train, y_train_predict, label='Training Period', color='blue', linestyle='--')
# Plot the validation period predictions
plt.plot(x_val, y_val_predict, label='Validation Period', color='red', linestyle='--')
plt.xlabel('Year (CE)')
plt.ylabel('CO$_2$ (ppm)')
# Show the plot
plt.show()
This is what I get The graph I get
But I was expecting something similar to this but one-half red and the other blue Different curve
Please let me know if I am just making a silly mistake
Share Improve this question edited yesterday carpet asked Feb 18 at 4:31 carpetcarpet 11 silver badge1 bronze badge New contributor carpet is a new contributor to this site. Take care in asking for clarification, commenting, and answering. Check out our Code of Conduct. 5 |2 Answers
Reset to default 1OK, I believe that you are probably plotting the monthly mean data from https://gml.noaa.gov/ccgg/trends/data.html
I've taken the liberty of downloading the link "Mauna Loa CO2 monthly mean data (text)" and stripped it down to the table (plus two header lines). I've assumed your data is in the columns "decimal date" and "monthly average". (There is a tiny flaw here for subsequent Fourier analysis - months don't have precisely the same length. You could correct for it - see the code below - but it makes negligible difference.)
I then did an FFT and looked at the frequency spectrum (below). You get an annual signal (f=1 time per year) and a modest seasonal harmonic (f=2 times per year).
I used the FFT to strip the annual and seasonal harmonic sinusoids from the data, then did a standard quadratic fit to the remaining data with curve_fit
.
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import rfft
from scipy.optimize import curve_fit
PI = math.pi
#---------------------------------------------------------------------
def amplitudePhaseAngle( p, q ):
'''
Convert p.cos(h)+q.sin(h) to A.sin(h+phi)
'''
A = math.sqrt( p ** 2 + q ** 2 )
phi = math.atan2( p, q )
return A, phi
#---------------------------------------------------------------------
def plotSpectrum( freq, absft ):
'''
Plot the absolute values of the Fourier coefficients
'''
plt.plot( freq[:len(absft)], absft, '-o' )
plt.grid()
plt.show()
#---------------------------------------------------------------------
def fitSeries( N, nfreq, ft, w, t ):
'''
Fit a (truncated) Fourier series, using only the frequency indices in nfreq[]
'''
fit = np.zeros_like( t )
for n in nfreq:
if n == 0 or ( N % 2 == 0 and n == N // 2 ):
A, phi = amplitudePhaseAngle( (1/N) * ft[n].real, 0.0 )
else:
A, phi = amplitudePhaseAngle( (2/N) * ft[n].real, -(2/N) * ft[n].imag )
print( "Adding amplitude, circular frequency, phase =", A, w[n], phi )
fit += A * np.sin( w[n] * t + phi )
return fit
#---------------------------------------------------------------------
def quadfit( t, a, b, c ):
return a + b * t + c * t ** 2
#---------------------------------------------------------------------
############
# Get data #
############
t, CO2 = np.loadtxt( 'co2_mm_mlo.txt', skiprows=2, usecols=[2,3], unpack=True )
t -= t[0]
N = len( t )
#t = np.linspace( 0, t[-1], N ) # use this if you want to correct for the fact that months have different length!
dt = ( t[-1] - t[0] ) / ( N - 1 )
########################################
# Remove the annual periodic component #
########################################
ft = rfft( CO2 )
ft[0] = 0
absft = np.abs( ft )
w = np.linspace( 0, 2 * PI / dt, N, endpoint=False )
dw = w[1] - w[0]
annual = int( 2 * PI / dw + 0.5 ) # index of nearest frequency to 1 time per year
harmonic = int( 2 * PI * 2 / dw + 0.5 ) # index of nearest frequency to 2 times per year
Fourier = fitSeries( N, [annual, harmonic], ft, w, t )
##########################################
# Now do a quadratic fit to what is left #
##########################################
CO2rev = CO2 - Fourier
p0 = ( CO2rev[0], 0.0, 0.0 )
popt, pcov = curve_fit( quadfit, t, CO2rev, p0=p0 )
print( 'Quadratic fit a + b.t + c.t^2; a, b, c = ', *popt )
##################################
# Compare fit with original data #
##################################
plt.plot( t, CO2, 'o', color='gray', markersize=3, label='Observed' )
plt.plot( t, Fourier + quadfit( t, *popt ), 'b-', label='Fitted' )
plt.legend()
plt.show()
##################
# Check spectrum #
##################
plotSpectrum( w / ( 2 * PI ), absft ) # plot against annual frequency
plotSpectrum( range( len(absft) ), absft ) # plot against index of frequency
The full fit is given in the output. (You don't have to add the harmonic if you just want to use the single annual sinusoid.)
Adding amplitude, circular frequency, phase = 2.347225509468888 6.290476589299131 0.7749909207496195
Adding amplitude, circular frequency, phase = 1.0205014708960185 12.580953178598262 -2.203183819215078
Quadratic fit a + b.t + c.t^2; a, b, c = 314.66275250650784 0.7556331890507764 0.013323163806095294
You are missing pulsation in your sin
function.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import optimize
Let's import the dataset:
data = pd.read_csv("https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_weekly_mlo.csv", header=35)
data = data.replace({-999.99: None}).dropna(subset="average")
data = data.astype(float)
Then we define the model as follow:
def model(t, a, b, c, A, omega, phi):
return a * t ** 2 + b * t + c + A * np.sin(2 * np.pi * omega * t + phi)
Where omega
is the pulsation of the signal.
Then we select some training set based on year criterion:
q = data["decimal"] < 2004
And regress on this sub dataset:
popt, pcov = optimize.curve_fit(model, data.loc[q, "decimal"], data.loc[q, "average"])
Finally we interpolate and extrapolate for the complete time span:
yhat = model(data["decimal"], *popt)
Which leads to:
As you can see, curve_fit
does capture the oscillation when omega
parameters is introduced. Anyway the extrapolation is not very good on the long run as the polynomial trend is not enough captured for data before 2004.
Fitting the whole dataset with the same model gives indeed better results:
(A*np.sin((2*np.pi*x)+ phi))
does not contain a parameter to fit the frequency of the sine. You haveA
andphi
, which control the amplitude and the phase shift. Although again, there are a lot of examples on stackoverflow that show why fitting sines withcurve_fit
does not work well – Tino D Commented 2 days agox
andx^2
work no matter what the scaling ofx
, but thesin(2*np.pi*x)+ phi)
will only work correctly if the units ofx
are calendar years. I suggest that you setA
to5
before plotting the output and see what sort of a mess results. I'd be surprised if any modern least squares optimiser couldn't fit a few coefficients of a Fourier expansion. I have never used this particular one. Although from what @TinoD says this may be the case for thiscurve_fit
- do you have a cite for that? – Martin Brown Commented 2 days ago