Hey devoted readers.
The blog has moved to a WordPress account linked to our Clinical and Translational Research Support Center at the University of Louisville. Head over there for weekly updates on research!
http://blog.ctrsc.net/
Peace!
Pirate Stats - R for Beginners
Saturday, August 24, 2013
Thursday, August 15, 2013
The blog will return soon.
I have a few friends who will help me do some more regular updating. Stay tuned!
Tuesday, June 25, 2013
All variable names to lower case - automatically!
Here is the code, in two pieces. I like to do it in two steps instead of one because I tend to reuse the vector of variable names for other things:
var.names<-tolower(colnames(data))
colnames(data)<-var.names
This assumes your dataset is called "data".
Essentially, it takes the column names (i.e., the variable names), puts them into a vector called var.names, then assigns that vector back to the column names of the dataset "data".
The only issue I have run into is if you have two variables with the same name in the same dataset, one in upper case and one in lower case. In that case, it will delete one of them, so you are best to rename the duplicate variable first... or better yet, figure out why the heck you have two variables with the same name in the same dataset!
Take that!
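Here is a minimal sketch with a made-up data frame (the column names and values are invented just for illustration):

```r
# Toy data frame with mixed-case column names (made up for this example)
data <- data.frame(Age = c(30, 45), SexCode = c(1, 0), bp_SYS = c(120, 135))

# Step 1: pull out the column names and lower-case them
var.names <- tolower(colnames(data))

# Step 2: assign the lower-cased names back to the data frame
colnames(data) <- var.names

colnames(data)  # "age" "sexcode" "bp_sys"
```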
Monday, June 17, 2013
Exporting lists to a usable file
If you have a lot of data stored in a list (e.g. output from some statistical test), sometimes you just want to export it to a text file to save it or print or whatever.
It can be difficult.
It doesn't have to be.
Do this:
sink("/Users/timothywiemken/Desktop/output.txt")
function
sink()
Run those three lines of code, where:
"/Users..." is the output location for the file (I put everything on my desktop and save it elsewhere later)
function is the actual function you want to run. For example: chisq.test(data$number_of_glasses, data$heavy_metal_music)
and
sink() closes down the sink function
This will export a text file with whatever name you want (here, output.txt) to your desktop (or whatever location you want).
The output file will contain all of the values typically printed in the R console window... This function just "sinks" that output to a text file. It will be in the same format as in the console window.
Very useful!
Saturday, June 15, 2013
Yup
Well, no one has commented yet so I have no idea if this helps anyone. It does take a lot of time to do the blog, so I probably will not update it much unless I get some traffic. I will continue to update it though -- probably mostly with new things I learn... the downside is that some of the code may be more advanced. Sorry if anyone is actually using this blog... if something changes I may come back to the basics.
Wednesday, June 12, 2013
Descriptive statistics with R, Part 2 -- 2x2 contingency tables -- and a quick taste of the Chi squared test.
Yo yo yo. I'm back.
2x2 tables are critical, so you need to learn how to make them quickly in R.
Luckily it is pretty easy.
Just do your regular table statement like last time, but add in your second variable a la:
table(data$slept_poorly, data$bad_day_at_work)
This spits out the cross tabulation (frequencies) of those two variables:
The first variable you put in the table statement (here, slept_poorly) is on the row, and the second variable (here, bad_day_at_work) is the column. They are both coded 1 and 0, so those are the row and column "names". You see those days where she slept poorly AND had a bad day at work were a total of 5... (the bottom right value, where the 1's intersect).
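Here is a sketch with made-up 0/1 data standing in for the sleep/work example (the values are invented, so they won't exactly match the screenshots):

```r
# Made-up 0/1 data imitating the slept_poorly / bad_day_at_work example
data <- data.frame(
  slept_poorly    = c(1, 1, 0, 0, 1, 0, 1, 1, 0, 1),
  bad_day_at_work = c(1, 0, 0, 1, 1, 0, 1, 1, 0, 1)
)

# Rows = first variable (slept_poorly), columns = second (bad_day_at_work)
table(data$slept_poorly, data$bad_day_at_work)
# the bottom-right cell (1,1) counts days with both = 1
```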
Sometimes it is confusing as to what is the row and what is the column because the table statement doesn't actually give you the labels as to what is what. You can use another function, xtabs (also built into the base R package) to get the labels. The structure is similar but not exactly the same:
xtabs(~data$slept_poorly+data$bad_day_at_work)
Ok, so the xtabs part takes the place of "table" that you did before.
Then you need the parentheses, followed by the tilde, i.e. ~ (shift + the key on the upper left of your keyboard). Next comes the variable you want in the row, followed by a plus (+), then the column variable. You end up with this in the output:
Nice huh?
Ok, you might say... I don't care about frequencies... I want percentages. Hold your horses.
First, put the results into a new vector like we have done before (you can do this with table or with xtabs) -- call it whatever you want. I'll call it frequencies just for the heck of it:
frequencies<-table(data$slept_poorly, data$bad_day_at_work)
Next, do the prop.table statement, just like we did when we were getting the proportions for just one variable. The "variable" you are getting the proportions of will be the vector you just created. You need one more step though... You have to define whether you want the "row" or "column" percentages. The row percentage is the cell value divided by the sum of that row. The column percentage is the cell value divided by the sum of that column. A 1 means row percentages, and a 2 means column percentages. So to get the column percentages for the vector "frequencies":
prop.table(frequencies,2)
This means that, of all of the days my wife had a bad day at work (the column representing a "1"), 50% of the time she also slept poorly (the row representing a "1"). You can do any other arithmetic operations you want, like we did before (e.g. multiply by 100 to get the standardized percent as opposed to just the decimal proportion).
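As a sketch with made-up counts (invented for illustration, so the percentages won't match the post's screenshot):

```r
# Made-up 0/1 data; table() builds the 2x2 frequencies
frequencies <- table(
  slept_poorly    = c(1, 1, 0, 0, 1, 0, 1, 1, 0, 1),
  bad_day_at_work = c(1, 0, 0, 1, 1, 0, 1, 1, 0, 1)
)

prop.table(frequencies, 2)  # 2 = column proportions (each column sums to 1)
prop.table(frequencies, 1)  # 1 = row proportions (each row sums to 1)

round(prop.table(frequencies, 2) * 100, 1)  # as whole percents
```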
One final thing. This makes getting the P value for the Chi-squared test REALLY easy.
Do this:
chisq.test(frequencies)
P value is 0.6481!
You will see a warning that the Chi-squared approximation may be incorrect, because not all of the expected cell counts are greater than 5. So you may want to do Fisher's exact test instead:
fisher.test(frequencies)
P value is pretty close to the same.
The main benefit of R starts to show here. By putting things into vectors, you can easily manipulate them later. You can put ALL of the results of the chi squared test into one vector for example:
chisquared.results<-chisq.test(frequencies)
This puts all of those pieces of data from the chisquared test into the vector chisquared.results...
Do that, then check the structure of that vector:
str(chisquared.results)
You will see you get a huge printout of a bunch of crap. This is a new vector type, called a LIST. A list is a vector of multiple data types. You may have numbers, text, etc. all in one vector! This is pretty sweet as you will see in the future.
Here is your printout:
You see it is a list of 9 items. (List of 9).
For each vector in the List, you will see the name of that vector. Each new vector starts with a dollar sign ($). So, this list contains: statistic, parameter, p.value, method, data.name, observed, expected, residuals, and stdres (all of the values with a $ in front of them).
If you just want to see one of them, you can call it out by typing the statement into the console:
chisquared.results$p.value
The dataset containing the p.value (actually the List, but you are treating it like a dataset because it is kind of like a mini dataset) is the chisquared.results LIST you created with the chisq.test(frequencies) statement. You want to see just the P-value contained in that list, so you separate the dataset from the variable with a dollar sign, as you would with anything else. Voilà, you now have a printout of just the p-value.
The cool thing is you can do something like:
pvalue<-chisquared.results$p.value
which will put just the P-value into its own vector. You can then put that p-value into figures, automatically. You will probably do this all of the time in the future, when we get to graphing things. It lets you be able to automate almost anything. It is really, really sweet. Trust me.
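Putting the whole workflow together with made-up counts (so the p-value will differ from the post's 0.6481):

```r
# Made-up 2x2 frequencies for illustration
frequencies <- table(
  slept_poorly    = c(1, 1, 0, 0, 1, 0, 1, 1, 0, 1),
  bad_day_at_work = c(1, 0, 0, 1, 1, 0, 1, 1, 0, 1)
)

# Store the whole test result as a list (expect the small-counts warning here)
chisquared.results <- chisq.test(frequencies)

names(chisquared.results)  # the pieces: statistic, parameter, p.value, ...

# Pull out just the p-value with $, and drop it into a label for a figure
pvalue <- chisquared.results$p.value
paste("P =", round(pvalue, 4))
```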
Next up, mean, median, standard deviation, and interquartile range....
Tuesday, June 11, 2013
Back home - hope to have a new post tomorrow.
I refused to pay the $10/day for internet in my hotel. Otherwise, I would have posted something already. Good conference though...