

It doesn't sound like you are working with many data points. But I should point out that R holds all of its data in memory. If you don't have enough RAM, it can choke.

If you think R is crashing, the first step is to run your R code from the R prompt or in an R IDE. If it's successful there, then R itself is not the problem.

Jared
On Sat, 10 Nov 2018 07:04:26 -0700
"Gordon Haverland" <ghaverla@...> wrote:
The data I am looking at, ...
Assuming the AD test is the one I need, I started playing.

I generated 2 vectors of Gaussian deviates of the same size with a mean of 0.5 and a SD of 0.1. And it turns out that if you double the number of deviates, the range sort of increases by a factor of 2. Sort of. Maybe.

I sorted the vector to manipulate.

I did an AD test on the two same-length vectors and kSamples produced some kind of output. I then shifted and popped the first and last samples off (which leaves the median at the same value), and did the test again. Repeat, ...

With 20 data points in the original data, by the time I had shift/popped 5 times the AD test still wasn't seeing a significant difference.

Running with 40 data points in the original data, doing the shift/pop 4 times gets me to 32 data points (original range 0.43, new data range 0.18) which is just at the 5% threshold of being declared different.
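
A minimal sketch of the shift/pop experiment described above, in R. The seed, the helper name trim_ends and the choice to trim only the second vector are assumptions made for illustration; the numbers it prints will not match the ranges quoted above. The AD test is only run if the kSamples package is installed.

```r
# Sketch of the shift/pop experiment; seed and sizes are illustrative assumptions.
set.seed(1)

n <- 40
x <- rnorm(n, mean = 0.5, sd = 0.1)          # reference vector, left untouched
y <- sort(rnorm(n, mean = 0.5, sd = 0.1))    # sorted vector that gets trimmed

# Drop the k smallest and k largest values of an already sorted vector
# (hypothetical helper, the equivalent of k shift/pop pairs in Perl).
trim_ends <- function(v, k) {
  v[(k + 1):(length(v) - k)]
}

have_ksamples <- requireNamespace("kSamples", quietly = TRUE)

for (k in 0:4) {
  yk <- trim_ends(y, k)
  cat(sprintf("k=%d  n=%d  range=%.3f  median=%.3f\n",
              k, length(yk), diff(range(yk)), median(yk)))
  if (have_ksamples) {
    # Two-sample Anderson-Darling test of the full vector against the trimmed one.
    print(kSamples::ad.test(x, yk))
  }
}
```

The median stays the same at every step because the same number of points is removed from each end, while the range shrinks.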

So, I am guessing that for my football (soccer) data, I really want at least 40 data points in any "long" vector, and that I want my "short vector" to be probably more than 20.

A problem with looking for patterns in football data is typically not enough data. It is not unusual for a game to end with 0, 1 or 2 goals. That games can be changed from a loss to a tie, a win to a tie, ... by the issuing of a penalty (especially late in the game) results in a lot of hard feelings.

But the problem of insufficient data shows up all over the place. A farmer would like to take in a single sample of 1 teaspoon of soil for a soil test. That would not take long, and is probably easy and cheap. But one small soil test isn't going to provide any useful information. You need some large number of samples, and each sample needs to be larger than a teaspoon. Larger samples, more cost. More samples, more cost. So the cost goes up as more squared.

The cost problem is severely aggravated by the "charge what the market will bear" model of pricing. That model of pricing is predisposed to ignoring the people who need low costs, because of how it samples the market. Once ignored, the MBAs determining prices never consider that segment of the economy ever again. Unless someone starts a new company looking to service this now ignored market segment, this part of the market will continue to be ignored forever. But even when the market changes (shifts) and becomes stable, the MBA need for ever increasing income means that prices will tend to go up all the time. Because it isn't enough to make a profit, you must have increasing profit from year to year.

What the market will bear.

What we need is to put all these MBAs in a big cage with some very hungry bears. And let the bears determine what the price should be.

-- 

Gord
The data I am looking at, is the distribution of possession time in association football (soccer). In particular, the English Premier League.

In most of the big leagues, there are one or two dominant teams in the top league of a country. The Bundesliga in Germany is mostly Bayern Munich. LaLiga in Spain is mostly Real Madrid and Barcelona (sometimes Atletico Madrid as well). And so on.

For years, the EPL had a Top-4. Over the last couple of years, it seems to have expanded into a Top-6. The other 14 teams, I refer to as Rest Of The Pack.

If a Top-6 team plays another Top-6 team, or a ROTP team plays a ROTP team, you might see one team having 65% (or so) possession. But, if a Top-6 team plays a ROTP team, so far this season the highest possession was 81% (to the Top-6 team).

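The K-S test is not sensitive to differences in range (it is most sensitive near the middle of the distribution and least sensitive in the tails). So it is inappropriate for my needs, as range is one place where there should be differences. The Anderson-Darling (AD) test puts more weight on the tails, so it is supposed to be more sensitive to the range.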

I don't know if the other tests in kSamples are appropriate. I am having some problems understanding why some of the tests don't seem to work. I am guessing that R is crashing, that the pipe used to communicate between R and Perl ends up holding only the old content, and so the next read returns the old data.

-- 

Gord
On Fri, 9 Nov 2018 11:57:34 -0700
"Jared " <@jared> wrote:
Gord, that was the best thing I read all week: "Slapping the name robust on something, doesn't mean it does what you think it does."
Reading about when to _NOT_ use Anderson-Darling test, there was a Google snippet that suggested a person could do "real statistics" with Excel. I suspect if you got it to average one number, it would divide by N-1 someplace. :-)

-- 

Gord
On Fri, 9 Nov 2018 11:57:34 -0700
"Jared " <@jared> wrote:
Gord, that was the best thing I read all week: "Slapping the name robust on something, doesn't mean it does what you think it does."

I love it.
Wonderful. :-)

-- 

Gord
Gord, that was the best thing I read all week: "Slapping the name robust on something, doesn't mean it does what you think it does."

I love it.

Jared
On Fri, 9 Nov 2018 11:36:29 -0700
"Jared " <@jared> wrote:
This sounds like regression through the origin?
Yep.

I think your degrees of freedom drop by one, which is fine as long as your data set is not too small.
I believe you drop one as well.

With the intercept dropped, things are calculated differently or have to be interpreted differently.

Be careful.
I think you need to be careful with just about anything robust. Slapping the name robust on something, doesn't mean it does what you think it does.

But, I do like the idea that one has enough data, that you don't need to concern yourself whether you have an odd or even number of data points to calculate the median.

On the distribution side, I am trying to learn about K-S type tests from the R kSamples module. My data range is finite.

-- 

Gord
This sounds like regression through the origin?

I think your degrees of freedom drop by one, which is fine as long as your data set is not too small.

With the intercept dropped, things are calculated differently or have to be interpreted differently. 

Be careful.

Jared
In R, modules are called packages. Some packages need to be compiled from source. Most are binary packages that can be uncompressed and the folder copied to your R library folder. But that's not the easiest way to install packages.

If you haven't installed any packages yet, then R won't have created your personal library yet.

In bash type R and then at the R prompt type install.packages("kSamples"). It will do all the work for you.

It might ask you to create a personal library to store the binary package, unzipped.

Type .libPaths() and you should see the folder where libraries are. One for your personal packages and maybe one for R core packages.
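
A short sketch of how you might check all this from the R prompt. It only looks; it does not install anything. The variable names pkg and installed are made up for the example.

```r
# Show the library folders R searches, in order.
print(.libPaths())

# Check whether kSamples is already available, without attaching it.
pkg <- "kSamples"
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  message(pkg, " is installed.")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg,
          "\") at the R prompt.")
}
```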

Jared
On Thu, 8 Nov 2018 20:24:08 -0700
"Gordon Haverland" <ghaverla@...> wrote:
install.packages("kSamples")
Talking to myself.

I started a short perl script inside emacs with perldb, which has
use Statistics::R;
and creates the R "object".

I created 2 vectors (lists) that were the same length in Perl, and then 'set' them in R, ask R to multiply them together, and then did a 'get' of the result. The result printed fine in Perl (in the debugger).

I then asked the R object to load the kSamples library.

$R->run(q`library(kSamples)`);

For those not familiar with Perl, there is a quoting mechanism involved there (q or qq or others(?)). In this instance, I am quoting with backticks.

One of the particular tests in kSamples is the Anderson-Darling test (which is supposed to be a step or two up from the K-S test). Running the example from the kSamples project at GitHub (or a copy of it), I compared the two vectors using the A-D test.

Relatively painless. It is possible to run the test without directing output anywhere. I am guessing this ends up in some default output variable. But doing something like:
my $o2 = $R->run(q`ad.test(...)`);
captures the output as a text string into the variable $o2, which can be printed directly.

-- 

Gord
Using su - to become root and cd'ing to root's home directory, I started an R shell with the command "R". Which worked. I then issued the command
install.packages("kSamples")

This downloaded, compiled and installed things. As compiling was part of this, source code must be somewhere. The screen output shows the source code is someplace in /tmp, which means it will get deleted at some point. Not really what I was expecting. I'm not sure if the tarball is somewhere permanent. I have not run the newly installed package yet. I didn't see anything which looked like running a test suite against the package, to see that it works.

-- 
Gord
On Thu, 8 Nov 2018 15:58:06 -0700
"Gordon Haverland" <ghaverla@...> wrote:
Like a lot of things in statistics,
This has nothing to do with comparing distributions, but is an example of what computers can bring you.

The mean is a measure of central tendency. It is not the only one. The median is the value which is "half way": 50% is below and 50% is above. For a unimodal (single-moded) distribution, the mode is the most common value.

For symmetric unimodal distributions, the mean, median and mode should all be equal.

Calculating means (averages, expectations) in the presence of outliers gives answers different from what should be found. It turns out the median is a more robust measure of central tendency. If you calculate the median in the presence of some (not a lot of) outliers, you probably do much better than calculating averages.
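
A small illustration in R, with made-up numbers: seven values near 10, then the same values with one gross outlier added.

```r
x     <- c(9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3)  # clean data (illustrative values)
x_out <- c(x, 100)                                  # same data plus one outlier

mean(x)        # 10
median(x)      # 10
mean(x_out)    # 21.25, dragged far away by the single outlier
median(x_out)  # 10.05, barely moved
```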

Numerical Recipes has a function for doing a median fit of a straight line to data. This is as opposed to a least squares fit.

Let's say you have a data set, and you add one point to the data set. Then you fit via least squares and you fit via a median method, and you look at how the parameters of the fitted straight line change as a function of where this extra point is (you are moving this extra point around). You are probing the sensitivity of the calculated parameters to the presence of this extra data point. The values found from least squares will vary smoothly with the position of this extra data point. The values of the median fit will change discontinuously as a function of where this extra point is (there will be jumps in parameters).

A reasonable thing to do with any data set is to calculate the average X and Y of the data, and then make up a new data set where you subtract (<X>,<Y>) from each data point. A least squares fit to this new data will pass through (0,0). Normally we assume that there is no error in X and hence all the error is in Y. But if we have a reasonable amount of data, the "error" in moving the data by subtracting off the average of X and Y should not be large. What we are left with is just to calculate the slope of the line that goes through (0,0).
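
A rough check of the centering idea in R, on simulated data. The true line (intercept 3, slope 2), the noise level and the seed are all assumptions made for the example.

```r
set.seed(2)
x <- runif(30, 0, 10)
y <- 3 + 2 * x + rnorm(30, sd = 0.5)   # made-up straight line plus noise

# Subtract (<X>, <Y>) from every data point.
xc <- x - mean(x)
yc <- y - mean(y)

fit_full   <- lm(y ~ x)         # ordinary least squares with an intercept
fit_center <- lm(yc ~ xc)       # centered data: intercept is zero up to rounding
fit_origin <- lm(yc ~ 0 + xc)   # regression through the origin, slope only

coef(fit_full)
coef(fit_center)
coef(fit_origin)
```

All three fits report the same slope; only the intercept changes.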

Well, there is a way to robustly solve that problem - the Theil-Sen estimator.

https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator

What you do (in theory) is calculate all the 2 point slopes possible in the data, sort them and pick the one in the middle (the median). The number of slopes you have to calculate, n(n-1)/2 for n data points, becomes ridiculously large as the number of data points increases, so there are ways to calculate fewer slopes.
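
A brute-force sketch of that idea in base R, computing every pairwise slope. The function name theil_sen_slope and the test data are made up for the example, and this is the naive all-pairs version, not one of the faster variants.

```r
# Median of all two-point slopes (naive Theil-Sen slope estimate).
theil_sen_slope <- function(x, y) {
  idx <- combn(length(x), 2)            # every index pair i < j, one per column
  dx  <- x[idx[2, ]] - x[idx[1, ]]
  dy  <- y[idx[2, ]] - y[idx[1, ]]
  keep <- dx != 0                       # skip pairs with identical x
  median(dy[keep] / dx[keep])
}

x <- 1:10
y <- 2 * x + 1        # exact straight line with slope 2
y[10] <- 100          # replace the last point with a gross outlier

theil_sen_slope(x, y)     # 2: the outlier only affects 9 of the 45 slopes
coef(lm(y ~ x))["x"]      # least squares slope is pulled well above 2
```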

Just in case you wanted to look at robust methods.

-- 

Gord
How do you test a distribution?

Well you have a set of data. We start by sorting the data. The lowest value has no values below it, so it gets the value (Xl,0). The highest value has no values above it, so it gets the value of (Xh,1). All the other data points are now (Xi, fraction of the other data points that lie below Xi).

You now plot (Xi,Yi). In general, you get some kind of sigmoid (S shaped) curve. It is monotone increasing.
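
A sketch of that construction in R. It reads the Y value as a fraction of ranks, (i-1)/(n-1), which is an assumption about what was meant; the sample itself is simulated. Base R's ecdf() does nearly the same thing but uses i/n, so its lowest point sits at 1/n rather than 0.

```r
set.seed(3)
x <- sort(rnorm(50, mean = 0.5, sd = 0.1))   # simulated, sorted data
n <- length(x)

# 0 for the smallest value, 1 for the largest, evenly spaced ranks in between.
y <- (seq_len(n) - 1) / (n - 1)

# Built-in empirical CDF for comparison (steps of 1/n, ending at 1).
y_ecdf <- ecdf(x)(x)

if (interactive()) {
  plot(x, y, type = "s", xlab = "X", ylab = "fraction below")
}
```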

You could smooth that curve (if you think the distribution is smooth). If you have reason to believe your data (X,Y) is exact, you could fit a cubic spline to the data and specify that the slope at 0 is 0, and the slope at 1 is 0. That will probably introduce a little wiggle to the spline fit, since we really only have slopes of 0 at the extremities of the X variable, and not necessarily at the extremities of our sampled data. The cubic spline I was first taught is fitted by solving a linear system for all the data points at one time. This means a little error in one data point affects all parameters calculated. Which often leads to wiggle. Some splines are "localised"; the Akima spline is one such (family of) spline.

If you know something about the error in your data, you could calculate a smoothing spline through the data.

In any event there are lots of choices as to how to analyze things.

-- 

Gord
Like a lot of things in statistics, you cannot prove that two distributions are the same. What you can show is the probability is larger than something based on some metric you calculate.
On Mon, 29 Oct 2018 09:07:13 -0400 (EDT)
"igoldberg1" <igoldberg1@...> wrote:
WHERE DID YOU HEAR THIS? THERE HAS BEEN NO NEWS OF THIS ANYWHERE ELSE.
NPR.org has a version of this.

-- 

Gord
ghaverla@... (Gordon Haverland)
Mon, 29 Oct 2018 08:20:31 -0700
Re: IBM Nears Deal to Acquire Software Maker Red Hat
They were talking about it this morning on 630 ched. 33 billion all cash deal is what they were reporting.
quilley.larry@... (Larry Quilley)
Mon, 29 Oct 2018 06:56:11 -0700
Re: IBM Nears Deal to Acquire Software Maker Red Hat
Try searching for this.
is a good search term.
Reuters
Bloomberg
et cetera
Or perhaps Red Hat itself:
https://www.redhat.com/en/about/press-releases/ibm-acquire-red-hat-completely-changing-cloud-landscape-and-becoming-world%E2%80%99s-1-hybrid-cloud-provider
mhilarius@... (Maurice Hilarius)
Mon, 29 Oct 2018 06:33:32 -0700
Re: IBM Nears Deal to Acquire Software Maker Red Hat
On Mon, Oct 29, 2018 at 8:07 AM igoldberg1 <igoldberg1@...> wrote:
WHERE DID YOU HEAR THIS? THERE HAS BEEN NO NEWS OF THIS ANYWHERE ELSE.
Not sure what you are specifically referring as no news available.
I do not watch the news and very rarely look for news online.
Yet I have run into both items mentioned earlier.
It is also quite easy to see where it would be in the best interests of the
principals for this news to not be widely touted - - - the repercussions
are sort of large.

Regards

Darald
o1bigtenor@... (o1bigtenor)
Mon, 29 Oct 2018 06:27:27 -0700
Re: IBM Nears Deal to Acquire Software Maker Red Hat
WHERE DID YOU HEAR THIS? THERE HAS BEEN NO NEWS OF THIS ANYWHERE ELSE.

iRA GOLDBERG
igoldberg1@... (igoldberg1)
Mon, 29 Oct 2018 06:07:18 -0700
Re: IBM Nears Deal to Acquire Software Maker Red Hat
IBM is now worse than that.
Have you heard of the Phoenix payroll system paid for by the Government of Canada?
https://www.itworldcanada.com/article/phoenix-payroll-system-timeline-of-the-governments-problems/396407
IBM crapware and failure.
mhilarius@... (Maurice Hilarius)
Sun, 28 Oct 2018 18:44:49 -0700