Re: Robust fitting of data - Installing R (CRAN) packages

Gordon Haverland

On Thu, 8 Nov 2018 15:58:06 -0700
"Gordon Haverland" <ghaverla@...> wrote:
> Like a lot of things in statistics,
This has nothing to do with comparing distributions, but is an example
of what computers can bring you.

The mean is a measure of central tendency. It is not the only one.
The median is the value which is "half way": 50% of the data is below
it and 50% is above. For a unimodal (single mode) distribution, the
mode is the most common value.

For symmetric distributions, the mean, median and mode should all be
equal.

Calculating means (averages, expectations) in the presence of outliers
gives answers different from what should be found. It turns out the
median is a more robust measure of central tendency. If you calculate
the median in the presence of some (not a lot of) outliers, you will
probably do much better than calculating averages.
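
Here is a quick sketch of that in Python (the statistics module), with
made-up numbers, just to show the effect:

    import statistics

    data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 7, 8]   # symmetric around 5

    print(statistics.mean(data))     # 5
    print(statistics.median(data))   # 5
    print(statistics.mode(data))     # 5, the most common value

    # Add one wild outlier and recompute.
    data_with_outlier = data + [500]
    print(statistics.mean(data_with_outlier))    # jumps to 46.25
    print(statistics.median(data_with_outlier))  # barely moves: 5.0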

Numerical Recipes has a function for doing a median fit of a straight
line to data, as opposed to a least squares fit.
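
I don't have the Numerical Recipes code handy, but the idea can be
sketched in Python with numpy/scipy: fit a line by minimizing the sum
of absolute residuals instead of squared residuals, and compare it to
least squares on made-up data with one bad point.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 10.0, 30)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)
    y[5] += 20.0                     # one bad point

    # Least squares: minimizes the sum of squared residuals.
    ls_slope, ls_intercept = np.polyfit(x, y, 1)

    # Median-style fit: minimize the sum of absolute residuals.
    def total_abs_residual(params):
        a, b = params
        return np.sum(np.abs(y - (a * x + b)))

    lad = minimize(total_abs_residual, x0=[ls_slope, ls_intercept],
                   method="Nelder-Mead")
    lad_slope, lad_intercept = lad.x

    print("least squares:", ls_slope, ls_intercept)   # shifted by the bad point
    print("least abs dev:", lad_slope, lad_intercept) # much closer to 2 and 1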

Let's say you have a data set, and you add one point to it. Then you
fit via least squares and you fit via a median method, and you look at
how the parameters of the fitted straight line change as a function of
where this extra point is (you are moving this extra point around).
You are probing the sensitivity of the calculated parameters to the
presence of this extra data point. The values found from least squares
will vary smoothly with the position of this extra data point. The
values of the median fit will change discontinuously as a function of
where this extra point is (there will be jumps in the parameters).
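
Here is that probe sketched in Python, again with made-up data, using
np.polyfit for the least squares fit and a least-absolute-residuals
fit (via scipy.optimize) standing in for the median method:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x0 = np.linspace(0.0, 10.0, 20)
    y0 = 2.0 * x0 + 1.0 + rng.normal(0.0, 0.3, x0.size)

    def lad_slope(x, y):
        """Slope of the line minimizing the sum of absolute residuals."""
        a0, b0 = np.polyfit(x, y, 1)   # least squares starting point
        res = minimize(lambda p: np.sum(np.abs(y - (p[0] * x + p[1]))),
                       x0=[a0, b0], method="Nelder-Mead")
        return res.x[0]

    # Slide the extra point (at x = 12) up and down and watch both slopes.
    for extra_y in np.linspace(-40.0, 40.0, 9):
        x = np.append(x0, 12.0)
        y = np.append(y0, extra_y)
        ls = np.polyfit(x, y, 1)[0]    # drifts smoothly with extra_y
        lad = lad_slope(x, y)          # plateaus and jumps instead
        print(f"extra y = {extra_y:6.1f}  LS slope = {ls:.3f}  "
              f"LAD slope = {lad:.3f}")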

A reasonable thing to do with any data set is to calculate the average
X and Y of the data, and then make up a new data set where you subtract
(<X>,<Y>) from each data point. A least squares fit to this new data
will pass through (0,0). Normally we assume that there is no error in
X, and hence all the error is in Y. But if we have a reasonable amount
of data, the "error" introduced by shifting the data, by subtracting off
the average of X and Y, should not be large. What we are left with is
just calculating the slope of a line that goes through (0,0).
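
A sketch of that centering step (numpy, made-up data):

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 10.0, 25)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.4, x.size)

    xc = x - x.mean()          # subtract <X>
    yc = y - y.mean()          # subtract <Y>

    # Least squares slope of a line through (0,0): sum(x*y) / sum(x*x).
    slope = np.sum(xc * yc) / np.sum(xc * xc)
    intercept = y.mean() - slope * x.mean()   # back in the original coordinates

    print(slope, intercept)    # close to 2 and 1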

Well, there is a way to robustly solve that problem - the Theil-Sen
estimator.

https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator

What you do (in theory) is calculate all the 2-point slopes possible
in the data, sort them, and pick the one in the middle (the median).
The number of slopes you have to calculate becomes ridiculously large
as the number of data points increases, so there are ways to calculate
fewer slopes.
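
The brute force version is only a few lines of Python, and scipy has
the same estimator built in as scipy.stats.theilslopes (the data and
outliers below are made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 10.0, 40)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)
    y[[3, 30]] += 25.0                    # two wild points

    # All 2-point slopes (fine for small n; the count grows as n*(n-1)/2).
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))]
    print("median of pairwise slopes:", np.median(slopes))   # near 2

    # Library version: slope, intercept, and a confidence interval on
    # the slope.
    slope, intercept, lo, hi = stats.theilslopes(y, x)
    print("theilslopes:", slope, intercept)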

Just in case you wanted to look at robust methods.

--

Gord
