DanAllan.com

Concise Curve Fitting in Python

When I need to fit a function to some data, Python can seem cumbersome. I’m not sure I could win a race against someone working in Igor or even Excel.

pyMC has received a lot of attention, but for traditional LM least-squared fitting, most users use scipy, which lacks some modern amenities. I need a tool that provides succinct syntax for straightforward tasks, handles data with missing values (a la pandas), and returns results in a form that I can easily plot.

Matt Newville’s lmfit project is a big step forward from scipy. Like many graphical data analysis programs, it can set bounds on individual fit parameters or hold them fixed.

Inspired by some ideas by @vnmanoharan in this discussion, I wrote a fresh interface to lmfit that addresses all the needs listed above. It has been merged into the development version of lmfit. A demo of the key features follows.

The Model class is a flexible, concise curve fitter. I will illustrate fitting example data to an exponential decay.

In [1]:
def decay(t, N, tau):
    return N*np.exp(-t/tau)

The parameters are in no particular order. We'll need some example data. I will use N=7 and tau=3, and I'll add a little noise.

In [2]:
t = np.linspace(0, 5, num=1000)
data = decay(t, 7, 3) + np.random.randn(*t.shape)

Simplest Usage

In [3]:
from lmfit import Model

model = Model(decay, independent_vars=['t'])
result = model.fit(data, t=t, N=10, tau=1)

The Model infers the parameter names by inspecting the arguments of the function, decay. Then I passed the independent variable, t, and initial guesses for each parameter. A residual function is automatically defined, and a least-squared regression is performed.

We can immediately see the best-fit values

In [4]:
result.values
Out[4]:
{'N': 6.8332246334656945, 'tau': 3.0502578166074512}

and easily pass those to the original model function for plotting:

In [5]:
plot(t, data)  # data
plot(t, decay(t=t, **result.values))  # best-fit model
Out[5]:
[<matplotlib.lines.Line2D at 0xb9a28cc>]

We can review the best-fit Parameters in more detail.

In [6]:
result.params
Out[6]:
Parameters([('tau', <Parameter 'tau', value=3.0502578166074512 +/- 0.0675, bounds=[-inf:inf]>), ('N', <Parameter 'N', value=6.8332246334656945 +/- 0.0869, bounds=[-inf:inf]>)])

More information about the fit is stored in the result,which is an lmfit.Mimimizer object.

Specifying Bounds and Holding Parameters Constant

Above, the Model class implicitly builds Parameter objects from keyword arguments of fit that match the argments of decay. You can build the Parameter objects explicity; the following is equivalent.

In [7]:
from lmfit import Parameter

result = model.fit(data, t=t, 
                   N=Parameter(value=10), 
                   tau=Parameter(value=1))
result.params
Out[7]:
Parameters([('tau', <Parameter 'tau', value=3.0502578166074512 +/- 0.0675, bounds=[-inf:inf]>), ('N', <Parameter 'N', value=6.8332246334656945 +/- 0.0869, bounds=[-inf:inf]>)])

By building Parameter objects explicitly, you can specify bounds (min, max) and set parameters constant (vary=False).

In [8]:
result = model.fit(data, t=t, 
                   N=Parameter(value=7, vary=False), 
                   tau=Parameter(value=1, min=0))
result.params
Out[8]:
Parameters([('tau', <Parameter 'tau', value=2.9550822200975864 +/- 0.0417, bounds=[0:inf]>), ('N', <Parameter 'N', value=7 (fixed), bounds=[-inf:inf]>)])

Defining Parameters in Advance

Passing parameters to fit can become unwieldly. As an alternative, you can extract the parameters from model like so, set them individually, and pass them to fit.

In [9]:
params = model.params()
In [10]:
params['N'].value = 10  # initial guess
params['tau'].value = 1
params['tau'].min = 0
In [11]:
result = model.fit(data, params, t=t)
result.params
Out[11]:
Parameters([('tau', <Parameter 'tau', value=3.0502578132121547 +/- 0.0675, bounds=[0:inf]>), ('N', <Parameter 'N', value=6.8332246370863503 +/- 0.0869, bounds=[-inf:inf]>)])

Keyword arguments override params, resetting value and all other properties (min, max, vary).

In [12]:
result = model.fit(data, params, t=t, tau=1)
result.params
Out[12]:
Parameters([('tau', <Parameter 'tau', value=3.0502578166074512 +/- 0.0675, bounds=[-inf:inf]>), ('N', <Parameter 'N', value=6.8332246334656945 +/- 0.0869, bounds=[-inf:inf]>)])

The input parameters are not modified by fit. They can be reused, retaining the same initial value. If you want to use the result of one fit as the initial guess for the next, simply pass params=result.params.

A Helpful Exception

All this implicit magic makes it very easy for the user to neglect to set a parameter. The fit function checks for this and raises a helpful exception.

In [13]:
result = model.fit(data, t=t, tau=1)  # N unspecified
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-6d8fedbef3f8> in <module>()
----> 1 result = model.fit(data, t=t, tau=1)  # N unspecified

/home/dallan/lmfit-py/lmfit/model.pyc in fit(self, data, params, sigma, **kwargs)
    191             raise ValueError("Assign each parameter an initial value by " +
    192                              "passing Parameters or keyword arguments to " +
--> 193                              "fit().")
    194 
    195         # Handle null/missing values.

ValueError: Assign each parameter an initial value by passing Parameters or keyword arguments to fit().

An extra parameter that cannot be matched to the model function will throw a UserWarning, but it will not raise, leaving open the possibility of unforeseen extensions calling for some parameters.

Weighted Fits

Use the sigma argument to perform a weighted fit. If you prefer to think of the fit in term of weights, sigma=1/weights.

In [14]:
weights = np.arange(len(data))
result = model.fit(data, params, t=t, sigma=1./weights)
result.params
Out[14]:
Parameters([('tau', <Parameter 'tau', value=3.096728970589659 +/- 0.113, bounds=[0:inf]>), ('N', <Parameter 'N', value=6.7514922300319453 +/- 0.256, bounds=[-inf:inf]>)])

Handling Missing Data

By default, attemping to fit data that includes a NaN, which conventionally indicates a "missing" observation, raises a lengthy exception. You can choose to drop (i.e., skip over) missing values instead.

In [15]:
data_with_holes = data.copy()
data_with_holes[[5, 500, 700]] = np.nan  # Replace arbitrary values with NaN.

model = Model(decay, independent_vars=['t'], missing='drop')
result = model.fit(data_with_holes, params, t=t)
result.params
Out[15]:
Parameters([('tau', <Parameter 'tau', value=3.0547114484523323 +/- 0.0677, bounds=[0:inf]>), ('N', <Parameter 'N', value=6.8308291273906265 +/- 0.087, bounds=[-inf:inf]>)])

If you don't want to ignore missing values, you can set the model to raise proactively, checking for missing values before attempting the fit.

In [16]:
model = Model(decay, independent_vars=['t'], missing='raise')
result = model.fit(data_with_holes, params, t=t)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-788e0b6b627f> in <module>()
      1 model = Model(decay, independent_vars=['t'], missing='raise')
----> 2 result = model.fit(data_with_holes, params, t=t)

/home/dallan/lmfit-py/lmfit/model.pyc in fit(self, data, params, sigma, **kwargs)
    196         mask = None
    197         if self.missing != 'none':
--> 198             mask = self._handle_missing(data)  # This can raise.
    199             if mask is not None:
    200                 data = data[mask]

/home/dallan/lmfit-py/lmfit/model.pyc in _handle_missing(self, data)
    117         if self.missing == 'raise':
    118             if np.any(isnull(data)):
--> 119                 raise ValueError("Data contains a null value.")
    120         elif self.missing == 'drop':
    121             mask = ~isnull(data)

ValueError: Data contains a null value.

The default setting is missing='none', which does not check for NaNs. This interface is consistent with the statsmodels project.

Null-chekcing relies on pandas.isnull if it is available. If pandas cannot be imported, it silently falls back on numpy.isnan.

Data Alignment

Imagine a collection of time series data with different lengths. It would be convenient to define one sufficiently long array t and use it for each time series, regardless of length. The pandas provides tools for aligning indexed data. And, unlike most wrappers to scipy.leastsq, Model can handle pandas objects out of the box, using its data alignment features.

Here I take just a slice of the data and fit it to the full t. It is automatically aligned to the correct section of t using Series' index.

In [17]:
from pandas import Series

model = Model(decay, independent_vars=['t'])
truncated_data = Series(data)[200:800]  # data points 200-800
t = Series(t)  # all 1000 points
result = model.fit(truncated_data, params, t=t)
result.params
Out[17]:
Parameters([('tau', <Parameter 'tau', value=3.2221825353028226 +/- 0.159, bounds=[0:inf]>), ('N', <Parameter 'N', value=6.5296051307920768 +/- 0.221, bounds=[-inf:inf]>)])

Data with missing entries and an unequal length still aligns properly.

In [18]:
model = Model(decay, independent_vars=['t'], missing='drop')
truncated_data_with_holes = Series(data_with_holes)[200:800]
result = model.fit(truncated_data_with_holes, params, t=t)
result.params
Out[18]:
Parameters([('tau', <Parameter 'tau', value=3.2397946733749583 +/- 0.16, bounds=[0:inf]>), ('N', <Parameter 'N', value=6.5107676500014584 +/- 0.219, bounds=[-inf:inf]>)])

Comments