Descriptive statistics with R, Part 2 -- 2x2 contingency tables -- and a quick taste of the Chi squared test.

Yo yo yo. I'm back.

2x2 tables are critical, so you need to learn how to make them quickly in R.

Luckily it is pretty easy.

Just do your regular table statement like last time, but add in your second variable a la:

table(data$slept_poorly, data$bad_day_at_work)

This spits out the cross tabulation (frequencies) of those two variables:

The first variable you put in the table statement (here, slept_poorly) is on the rows, and the second variable (here, bad_day_at_work) is on the columns. Both are coded 1 and 0, so those are the row and column "names". The days where she slept poorly AND had a bad day at work add up to 5 (the bottom right value, where the 1's intersect).
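If you want to try this without my data, here is a minimal sketch using a made-up data set (20 invented days, not the numbers from this post). The name tab is just something I picked for illustration, and addmargins is a base R function that tacks the row and column totals onto the table.

```r
# Made-up data for illustration only (NOT the data set used in this post):
# 20 days, each variable coded 1 = yes, 0 = no.
data <- data.frame(
  slept_poorly    = c(rep(0, 8), rep(1, 2), rep(0, 5), rep(1, 5)),
  bad_day_at_work = c(rep(0, 10), rep(1, 10))
)

# Rows come from the first argument, columns from the second.
tab <- table(data$slept_poorly, data$bad_day_at_work)
print(tab)

# Pull out one cell by its row and column "names" (given as strings).
both_bad <- tab["1", "1"]   # slept poorly AND had a bad day at work
print(both_bad)

# Same table with row and column totals added around the edges.
print(addmargins(tab))
```

Indexing with "1" in quotes asks for the row or column named 1, which is safer than counting positions.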

Sometimes it is confusing as to what is the row and what is the column, because the table statement doesn't actually print labels telling you which variable is which. You can use another function, xtabs, to get the labels (it ships with R in the stats package, so there is nothing to install). The structure is similar but not exactly the same:

xtabs(~data$slept_poorly+data$bad_day_at_work)

Ok, so the xtabs part takes the place of "table" that you did before.

Then you need the parentheses, followed by the tilde, i.e. ~ (on a US keyboard, Shift plus the key at the upper left). Next comes the variable you want in the rows, followed by a plus (+), then the column variable. You end up with this in the output:

Nice huh?
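Here is a quick sketch of the labelled version, again on made-up data (not the data from this post). It leans on the data= argument of xtabs, which lets you drop the data$ prefixes, and on the dnn argument of table, which is another way to get labels. The name labelled is just for illustration.

```r
# Made-up data for illustration only (NOT the data set used in this post).
data <- data.frame(
  slept_poorly    = c(rep(0, 8), rep(1, 2), rep(0, 5), rep(1, 5)),
  bad_day_at_work = c(rep(0, 10), rep(1, 10))
)

# With data=, the bare variable names become the row and column labels.
labelled <- xtabs(~ slept_poorly + bad_day_at_work, data = data)
print(labelled)

# table() can carry labels too if you pass the dimension names via dnn.
labelled_table <- table(data$slept_poorly, data$bad_day_at_work,
                        dnn = c("slept_poorly", "bad_day_at_work"))
print(labelled_table)
```

Either way the counts are identical to the plain table statement; only the printout gets friendlier.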

Ok, you might say: I don't care about frequencies, I want percentages. Hold your horses.

First, put the results into a new vector like we have done before (you can do this with table or with xtabs). Call it whatever you want. I'll call it frequencies just for the heck of it:

frequencies<-table(data$slept_poorly, data$bad_day_at_work)

Next, do the prop.table statement, just like we did when we were getting the proportions for just one variable. The "variable" you are getting the proportions of will be the vector you just created. You need one more step though: you have to say whether you want the "row" or the "column" percentages. The row percentage is the cell value divided by the sum of that row. The column percentage is the cell value divided by the sum of that column. A 1 means row percentages, and a 2 means column percentages. So to get the column percentages for the vector "frequencies":

prop.table(frequencies,2)

This means that, of all of the days my wife had a bad day at work (the column representing a "1"), 50% of the time she also slept poorly (the row representing the "1"). You can do any other arithmetic operations you want, like we did before (e.g. multiply by 100 to get the standard percent as opposed to just the decimal proportion).
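To see row and column percentages side by side, here is a minimal sketch on made-up data (not the data from this post). The names row_pct and col_pct are just ones I picked. The sums at the end are a handy sanity check that you asked for the margin you meant.

```r
# Made-up data for illustration only (NOT the data set used in this post).
data <- data.frame(
  slept_poorly    = c(rep(0, 8), rep(1, 2), rep(0, 5), rep(1, 5)),
  bad_day_at_work = c(rep(0, 10), rep(1, 10))
)
frequencies <- table(data$slept_poorly, data$bad_day_at_work)

# Margin 1 = row percentages, margin 2 = column percentages.
row_pct <- prop.table(frequencies, 1) * 100
col_pct <- prop.table(frequencies, 2) * 100

print(round(row_pct, 1))
print(round(col_pct, 1))

# Sanity check: each row of row_pct and each column of col_pct sums to 100.
print(rowSums(row_pct))
print(colSums(col_pct))
```

If the wrong set of sums comes out as 100, you used the other margin.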

One final thing. This makes getting the P value for the Chi-squared test REALLY easy.

Do this:

chisq.test(frequencies)

P value is 0.6481!

You will see a warning that the Chi-squared approximation may be incorrect, because not all of the expected values are at least 5. So you may want to do Fisher's exact test instead:

fisher.test(frequencies)

The P value is pretty close to the one from the Chi-squared test.
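If you want to see where that warning comes from, you can work out the expected counts yourself. This is a sketch on made-up data (not the data from this post), so the P value it prints will not match the one above. The names expected and result are just for illustration, and switching tests on "any expected count below 5" is a common rule of thumb, not a hard law.

```r
# Made-up data for illustration only (NOT the data set used in this post).
data <- data.frame(
  slept_poorly    = c(rep(0, 8), rep(1, 2), rep(0, 5), rep(1, 5)),
  bad_day_at_work = c(rep(0, 10), rep(1, 10))
)
frequencies <- table(data$slept_poorly, data$bad_day_at_work)

# Expected count for each cell if the two variables were unrelated:
# (row total * column total) / grand total.
expected <- outer(rowSums(frequencies), colSums(frequencies)) / sum(frequencies)
print(expected)

# Rule of thumb: use Fisher's exact test when any expected count is below 5.
if (any(expected < 5)) {
  result <- fisher.test(frequencies)
} else {
  result <- chisq.test(frequencies)
}
print(result$p.value)
```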

The main benefit of R starts to show here. By putting things into vectors, you can easily manipulate them later. You can put ALL of the results of the chi squared test into one vector for example:

chisquared.results<-chisq.test(frequencies)

This puts all of those pieces of data from the chisquared test into the vector chisquared.results...

Do that, then check the structure of that vector:

str(chisquared.results)

You will see you get a huge printout of a bunch of crap. This is a new vector type, called a LIST. A list is a vector of multiple data types. You may have numbers, text, etc. all in one vector! This is pretty sweet as you will see in the future.

Here is your printout:

You see it is a list of 9 items. (List of 9).

For each vector in the List, you will see the name of that vector. Each new vector starts with a dollar sign ($). So, this list contains: statistic, parameter, p.value, method, data.name, observed, expected, residuals, and stdres (all of the values with a $ in front of them).
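To see what a list is without all the chi-squared clutter, you can build a tiny one by hand. Everything in this sketch (my_list and the three things inside it) is invented purely for illustration.

```r
# A hand-made list mixing a number, some text, and a logical vector.
my_list <- list(
  number = 3.14,
  text   = "hello",
  flags  = c(TRUE, FALSE)
)

# str() shows one line per item, each starting with a dollar sign.
str(my_list)

# names() gives just the item names, without the rest of the printout.
print(names(my_list))

# Pull out one item by name.
print(my_list$text)
```

The str printout has the same shape as the chi-squared one, just with three items instead of nine.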

If you just want to see one of them, you can call it out by typing the statement into the console:

chisquared.results$p.value

The "dataset" containing p.value is the chisquared.results LIST you created with the chisq.test(frequencies) statement (it is really a list, but you can treat it like a mini dataset). You want to see just the P-value contained in that list, so you join the list name and the item name with a dollar sign, as you would with a dataset and a variable. Voilà, you now have a printout of just the p-value.

The cool thing is you can do something like:

pvalue<-chisquared.results$p.value

which will put just the P-value into its own vector. You can then put that p-value into figures, automatically. You will probably do this all of the time in the future, when we get to graphing things. It lets you automate almost anything. It is really, really sweet. Trust me.
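Here is a small sketch of that idea on made-up data (not the data from this post): grab the p-value out of the list and paste it into a piece of text you could later drop onto a plot. The name label and its wording are my own invention, and suppressWarnings is only there to hide the small-expected-count warning.

```r
# Made-up data for illustration only (NOT the data set used in this post).
data <- data.frame(
  slept_poorly    = c(rep(0, 8), rep(1, 2), rep(0, 5), rep(1, 5)),
  bad_day_at_work = c(rep(0, 10), rep(1, 10))
)
frequencies <- table(data$slept_poorly, data$bad_day_at_work)

# Store the whole test result; suppressWarnings() hides the warning about
# small expected counts so the output stays readable.
chisquared.results <- suppressWarnings(chisq.test(frequencies))

# Two equivalent ways to pull one item out of the list.
print(chisquared.results$p.value)
print(chisquared.results[["p.value"]])

# Keep the p-value in its own vector and build a text label from it.
pvalue <- chisquared.results$p.value
label <- paste("Chi-squared test, p =", signif(pvalue, 3))
print(label)
```

Because label is rebuilt from pvalue every time the script runs, the text updates itself whenever the data change.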

Next up, mean, median, standard deviation, and interquartile range....

Pirate Stats - R for Beginners

Wednesday, June 12, 2013
