Hey devoted readers.
The blog has moved to a wordpress account linked to our Clinical and Translational Research Support Center at the University of Louisville. Head over there for weekly updates on research!
http://blog.ctrsc.net/
Peace!
Saturday, August 24, 2013
Thursday, August 15, 2013
The blog will return soon.
I have a few friends that will help me do some more regular updating. Stay tuned!
Tuesday, June 25, 2013
Change all variable names to lower case - automatically!
here is the code in two pieces. I like to do it in two steps instead of one because I tend to use the vector with the variable names for other things:
var.names<-tolower(colnames(data))
colnames(data)<-var.names
This assumes your dataset is called "data".
Essentially, it takes the column names (i.e., the variable names), puts them into a vector called var.names, and then assigns that vector back to the column names of the dataset "data".
The only issue I have run into is when the same dataset has two variables with the same name, one in upper case and one in lower case. In this case, you can lose one of them. If that happens, rename the duplicate variable first... or better yet, figure out why the heck you have two variables with the same name in the same dataset!
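Here is the whole thing end to end on a toy data frame (the column names and values are made up for illustration), with a safety net for the duplicate-name problem:

```r
# toy data frame with mixed-case column names (names and values made up)
data <- data.frame(Age = c(30, 45), SEX = c("F", "M"), City = c("A", "B"))

var.names <- tolower(colnames(data))    # lower-case every name
if (any(duplicated(var.names))) {       # safety net for the duplicate-name problem
  warning("Lower-casing creates duplicate names - rename those variables first!")
}
colnames(data) <- var.names
colnames(data)   # "age" "sex" "city"
```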
Take that!
Monday, June 17, 2013
Exporting lists to a usable file
If you have a lot of data stored in a list (e.g. output from some statistical test), sometimes you just want to export it to a text file to save it or print or whatever.
It can be difficult.
It doesn't have to be.
Do this:
sink("/Users/timothywiemken/Desktop/output.txt")
function
sink()
run those three lines of code, where:
"/Users..." is the output location for the file (I put everything on my desktop and save it elsewhere later)
function is the actual function you want to run. For example: chisq.test(data$number_of_glasses, data$heavy_metal_music)
and
sink() closes down the sink function
This will export a text file with whatever name you want (here, output.txt) to your desktop (or whatever location you want).
The output file will contain all of the values typically printed in the R console window... This function just "sinks" that output to a text file. It will be in the same format as in the console window.
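A self-contained sketch of the three steps above. I use tempfile() here just so the example runs anywhere; swap in your own path (e.g. your desktop), and the chi-squared call on a made-up 2x2 matrix is just a stand-in for whatever function you actually want to capture:

```r
# sketch: divert a function's console output to a text file
out.file <- tempfile(fileext = ".txt")   # stand-in for your own path

sink(out.file)                                        # start diverting console output
print(chisq.test(matrix(c(3, 5, 2, 10), nrow = 2)))   # any function whose printout you want
sink()                                                # close the sink - don't forget!

readLines(out.file)   # the file holds exactly what the console would have shown
```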
Very useful!
Saturday, June 15, 2013
Yup
Well, no one has commented yet so I have no idea if this helps anyone. It does take a lot of time to do the blog, so I probably will not update it much unless I get some traffic. I will continue to update it though -- probably mostly with new things I learn... the downside is that some of the code may be more advanced. Sorry if anyone is actually using this blog... if something changes I may come back to the basics.
Wednesday, June 12, 2013
Descriptive statistics with R, Part 2 -- 2x2 contingency tables -- and a quick taste of the Chi squared test.
Yo yo yo. I'm back.
2x2 tables are critical, so you need to learn how to make them quickly in R.
Luckily it is pretty easy.
Just do your regular table statement like last time, but add in your second variable a la:
table(data$slept_poorly, data$bad_day_at_work)
This spits out the cross tabulation (frequencies) of those two variables:
The first variable you put in the table statement (here, slept poorly) is on the row, and the 2nd variable (here, bad_day_at_work) is the column. They are both coded 1 and 0, so those are the row and column "names". You see those days where she slept poorly AND had a bad day at work were a total of 5... (the bottom right value - where the 1's intersect).
Sometimes it is confusing as to what is the row and what is the column because the table statement doesn't actually give you the labels as to what is what. You can use another function, xtabs (also built into the base R package) to get the labels. The structure is similar but not exactly the same:
xtabs(~data$slept_poorly+data$bad_day_at_work)
Ok, so the xtabs part takes the place of "table" that you did before.
Then you need the parentheses, followed by the "tilde"... i.e. ~ (shift of the key on the upper left of your keyboard). Next comes the variable you want in the row, followed by a plus (+), then the column variable. You end up with this in the output:
Nice huh?
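A self-contained version you can paste straight in (the daily 0/1 values here are made up):

```r
# made-up daily indicators for the two variables
data <- data.frame(slept_poorly    = c(1, 0, 1, 0, 1, 0),
                   bad_day_at_work = c(1, 1, 0, 0, 1, 0))

# xtabs also takes a data argument, so you can use bare variable names
tab <- xtabs(~ slept_poorly + bad_day_at_work, data = data)
tab   # rows and columns come out labeled
```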
Ok, you might say... I don't care about frequencies... I want percentages. Hold your horses.
First, put the results into a new vector like we have done before (you can do this with table or with xtabs) -- call it whatever you want. I'll call it frequencies just for the heck of it:
frequencies<-table(data$slept_poorly, data$bad_day_at_work)
Next, do the prop.table statement, just like we did when we were getting the proportions for just one variable. The "variable" you are getting the proportions of will be the vector you just created. You need one more step though... you have to define whether you want the "row" or "column" percentages. The row percentage is the cell value divided by the sum of that row. The column percentage is the cell value divided by the sum of that column. A 1 means row percents, and a 2 means column percents. So to get the column percentages for the vector "frequencies":
prop.table(frequencies,2)
This means that, of all of the days my wife had a bad day at work (the column representing a "1"), 50% of the time she also slept poorly (the row representing a "1"). You can do any other arithmetic operations you want, like we did before (e.g. multiply by 100 to get the standard percent as opposed to just the decimal proportion).
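To see the row/column distinction concretely, here is a made-up 2x2 matrix standing in for the table of the two variables (the counts are invented):

```r
# made-up 2x2 counts standing in for table(data$slept_poorly, data$bad_day_at_work)
frequencies <- matrix(c(8, 4, 3, 5), nrow = 2,
                      dimnames = list(slept_poorly    = c("0", "1"),
                                      bad_day_at_work = c("0", "1")))

prop.table(frequencies, 1)                    # 1 = row percents: cell / row total
prop.table(frequencies, 2)                    # 2 = column percents: cell / column total
round(prop.table(frequencies, 2) * 100, 1)    # same thing, as percentages
```

Each row of the margin-1 version sums to 1, and each column of the margin-2 version sums to 1 - a quick way to check you picked the margin you meant.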
One final thing. This makes getting the P value for the Chi-squared test REALLY easy.
Do this:
chisq.test(frequencies)
P value is 0.6481!
You will see a warning that the Chi-squared approximation may be incorrect, because not all of the expected values are greater than 5. So you may want to do Fisher's Exact test instead:
fisher.test(frequencies)
P value is pretty close to the same.
The main benefit of R starts to show here. By putting things into vectors, you can easily manipulate them later. You can put ALL of the results of the chi squared test into one vector for example:
chisquared.results<-chisq.test(frequencies)
This puts all of those pieces of data from the chisquared test into the vector chisquared.results...
Do that, then check the structure of that vector:
str(chisquared.results)
You will see you get a huge printout of a bunch of crap. This is a new vector type, called a LIST. A list is a vector of multiple data types. You may have numbers, text, etc. all in one vector! This is pretty sweet as you will see in the future.
Here is your printout:
You see it is a list of 9 items. (List of 9).
For each vector in the list, you will see the name of that vector. Each new vector starts with a dollar sign ($). So, this list contains: statistic, parameter, p.value, method, data.name, observed, expected, residuals, and stdres (all of the values with a $ in front of them).
If you just want to see one of them, you can call it out by typing the statement into the console:
chisquared.results$p.value
The dataset containing the p.value (actually the LIST, but you are treating it like a dataset because it is kind of like a mini dataset) is the chisquared.results LIST you created with the chisq.test(frequencies) statement. You want to see just the P-value contained in that list, so you separate the dataset from the variable with a dollar sign, as you would with anything else. Voilà, you now have a printout of just the p-value.
The cool thing is you can do something like:
pvalue<-chisquared.results$p.value
which will put just the P-value into its own vector. You can then put that p-value into figures, automatically. You will probably do this all the time in the future, when we get to graphing things. It lets you automate almost anything. It is really, really sweet. Trust me.
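A quick sketch of what that automation looks like, pulling the p-value into a plot title (the 2x2 counts are made up):

```r
# made-up 2x2 counts; run the test and keep the whole result object
frequencies <- matrix(c(8, 4, 3, 5), nrow = 2)
chisquared.results <- chisq.test(frequencies)

# extract the p-value and paste it straight into a figure title
pvalue <- chisquared.results$p.value
barplot(frequencies, beside = TRUE,
        main = paste0("Chi-squared p = ", round(pvalue, 4)))
```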
Next up, mean, median, standard deviation, and interquartile range....
Tuesday, June 11, 2013
Back home - hope to have a new post tomorrow.
I refused to pay the $10/day for internet in my hotel. Otherwise, I would have posted something already. Good conference though...
Thursday, June 6, 2013
Heading to a meeting for a few days...
I may not have any postings for a while, but hang in there. I'll be back.
Wednesday, June 5, 2013
Basic Descriptive Statistics with R - Part 1: Frequencies and Percentages
Good afternoon pals.
Today we are going to look at frequencies and percentages for categorical variables. If I have some time tomorrow, we will do contingency tables with R (e.g. 2x2 tables). We will move on to mean, median, IQR and standard deviation after that.
So doing tables is really easy with the built in table function.
If you want the count (frequencies) of number_of_glasses, just do the following:
table(data$number_of_glasses)
R will spit out the results. The number on the top is the actual value for number_of_glasses. The number on the bottom is the number of cases (e.g. the number of rows) that contain that particular value:
You can see that one day, my wife drank 1 glass of water, and on three days, she drank 5 glasses of water, etc. One day she had 15 glasses.
This table function is a good way to screen your data to see if what is entered makes sense... it is much better than viewing the data in the actual database. If you have strange values, they will be clearly evident. Also, if you have blanks in a variable that is a text (or character) variable, the first value will not have a number on the top (it's a blank), but it will have a number below. This means that you have x number of blanks in that variable.
If your variable is numeric like the number_of_glasses variable and you want to know how many missing values there are, you can do the following:
table(is.na(data$number_of_glasses))
The result will be numbers above the text "TRUE" and "FALSE". Any missing values will show up above the "TRUE" text. If there are no missing values, you won't even have the "TRUE" listed.
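On a made-up vector with two missing days, that looks like this (sum() on the is.na() result is a handy shortcut, since TRUE counts as 1):

```r
# made-up vector of daily glass counts, with two missing days
number_of_glasses <- c(1, 5, NA, 8, 5, NA, 15, 5)

table(is.na(number_of_glasses))   # FALSE = observed days, TRUE = missing days
sum(is.na(number_of_glasses))     # or just count the missings directly
```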
This works great for counts. Now for the best thing R can do - put ANY results into a new variable, so you can manipulate your results as you please with other functions:
counts<-table(data$number_of_glasses)
If you do that, the data shown in my image above will be input into a new variable called "counts". You can change the variable to whatever you want. I always just use counts, because I like it. This variable will not be in the dataset "data", it will be out on its own.
If you type the word counts in your R script and run it, you will see the output, just like we did above. The benefit of doing this is that now you can get the proportions instead of just the counts. To get the proportions, run the counts line first, then do:
prop.table(counts)
You can now see that 5% of the time, she had 1 glass of water, and 15% of the time she had 5 glasses of water.
You can also do some arithmetic stuff to this... for example, if you don't like proportions and converting the decimals to percentages in your head, you can multiply them by 100 in the same statement:
prop.table(counts)*100
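If the long decimals bug you, wrap the whole thing in round() as well (the values below are made up):

```r
counts <- table(c(1, 5, 5, 5, 8, 15))   # made-up glasses-per-day values
round(prop.table(counts) * 100, 1)      # percentages rounded to one decimal place
```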
Not bad ey? I agree.
Peace.
Monday, June 3, 2013
Another way of recoding variables - without ifelse
Ok - hope all is well with everyone today!
I wanted to start by letting everyone know that you can use ifelse statements within other ifelse statements to recode a multi-category variable, much like you can in SAS.
For example, you may have the number_of_glasses variable, which you want to recode into three categories: 0, 1-5, and >5 glasses per day, coded 1, 2, and 3, respectively.
You can do this by using multiple ifelse:
data$glasses.categories<-ifelse(data$number_of_glasses==0, 1,
    ifelse(data$number_of_glasses>0 & data$number_of_glasses<6, 2,
        ifelse(data$number_of_glasses>5, 3, NA)))
The breakdown of this is essentially: if the number of glasses of water is equal to zero, give the new variable glasses.categories a 1; otherwise, do the next ifelse statement, which says if the number of glasses is greater than zero (i.e., at least 1) AND also less than 6 (i.e., 5 and below), give glasses.categories a 2; otherwise, do the final ifelse statement, which gives glasses.categories a 3 if number_of_glasses is greater than 5. The final "else" part says if it doesn't meet any of these criteria, give it an NA, which is R for blank or missing. You always need that last "else" part. Since my criteria should be exhaustive (i.e., all of the possible values of number_of_glasses are covered in my if statements), the only other possibility would be a missing value, so the final NA carries those missing values over.
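Here it is running on a toy data frame (the values, including one missing day, are made up):

```r
# toy dataset (values made up), including one missing day
data <- data.frame(number_of_glasses = c(0, 3, 5, 6, 12, NA))

data$glasses.categories <- ifelse(data$number_of_glasses == 0, 1,
    ifelse(data$number_of_glasses > 0 & data$number_of_glasses < 6, 2,
        ifelse(data$number_of_glasses > 5, 3, NA)))

data$glasses.categories   # 1 2 2 3 3 NA - the missing day stays missing
```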
There is another way to do this as well, without using these ifelse statements (actually there are multiple ways, but I'll show you the one I use most often).
To do the same recode as above, I'll first start by making a new variable that is equal to 1.
data$glasses.categories<-1
So all values in the dataset for this variable will be equal to 1.
Next, recode over these 1's with your first criteria - if the number_of_glasses falls between 1 and 5, give it a 2. You can use a subset of your data by using the straight brackets, like so:
data$glasses.categories[data$number_of_glasses>0 & data$number_of_glasses<6]<-2
This tells R to look at the glasses.categories variable in the dataset "data", but only for those rows or cases where the number_of_glasses variable is between 1 and 5 (inclusive), and to give those a 2. This will recode OVER TOP of the 1's that are already in that variable for those cases... the values that were created in your first step, where you made everything equal to 1.
Finally, recode over the 1's in that variable with your final criteria.
data$glasses.categories[data$number_of_glasses>5]<-3
You will end up with the same values as you did using the ifelse. So the final block of code would be:
data$glasses.categories<-1
data$glasses.categories[data$number_of_glasses>0 & data$number_of_glasses<6]<-2
data$glasses.categories[data$number_of_glasses>5]<-3
******** VERY IMPORTANT NOTE ***********
You need to be careful using this final method in the event you have missing values in your dataset. I would highly recommend one more line of code if your variable has any missing values:
data$glasses.categories[is.na(data$number_of_glasses)]<-NA
This will carry over any missing values from your old variable to your new variable.
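Putting the whole thing together on a toy vector (values made up, with one missing day) shows why that last line matters: without it, the missing day would silently stay a 1 from the first step, because a bracket condition that evaluates to NA just skips that row.

```r
# the whole recode on a toy vector (values made up), missing day included
data <- data.frame(number_of_glasses = c(0, 2, 5, 9, NA))

data$glasses.categories <- 1                                     # everyone starts at 1
data$glasses.categories[data$number_of_glasses > 0 &
                        data$number_of_glasses < 6] <- 2         # 1 to 5 glasses
data$glasses.categories[data$number_of_glasses > 5] <- 3         # more than 5
data$glasses.categories[is.na(data$number_of_glasses)] <- NA     # carry the missing over

data$glasses.categories   # 1 2 2 3 NA
```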
Have fun! Next up... I don't know yet. I'm getting tired of recoding, so we may move on to basic contingency tables and measures of central tendency (mean, median) as well as variation (standard deviation, interquartile range).
Peace!
Saturday, June 1, 2013
I hope this blog is easy to understand.
I am incredibly frustrated today, searching for useful information on log-linear regression models online. Yeah, there are tons of articles explaining them. Are any of them useful or understandable? Not a one. WTF statisticians. Why do you insist on making everything difficult? Are you trying to ensure your job security? This stuff doesn't have to be difficult. I am confident in this.
After all, as Einstein stated,
“If you can't explain it to a six year old, you don't understand it yourself.”
When he said that, I think what he meant was if you can't explain it to a six year old and have them understand it, at least reasonably well, you don't understand it yourself. Of course, you can probably explain anything to a six year old - that doesn't mean they understand it.
Yes, I am acting as the six year old today. Take that.
Recoding variables in R with if-then or if-else statements
Ok, sorry I haven't been posting in a couple of days. This work week was completely insane. Oh well. It's Saturday, and I'm still going to be working on a few things, but I wanted to get this post up for anyone who is actually reading this.
Today is a big day - recoding variables. If you do any type of research, recoding is something you inevitably have to do. When I first started using R, other R users used to tell me "Oh, don't use R for recoding, use something else - R is too difficult for recodes". Once I figured out how to recode stuff, I found this to be completely false. It is pretty easy once you get the hang of it. There are only a couple of things you need to be very careful with, the main thing being what to do with missing data.
I am going to focus on ifelse statements today (if-then is very similar, but ifelse works just like it does in Excel). Since almost everyone using R is going to be at least reasonably familiar with Excel, I will focus on this. Maybe in the next post I'll look at some other methods.
Here is the basic structure of ifelse:
dataset$new.variable<-ifelse(dataset$old.variable=="some value", "what you want the new variable to be if the prior logical statement is true", "what you want the new variable to be if the prior logical statement is false")
Let us recode the "number_of_glasses" variable in our water.csv dataset. Pretend that we are only interested in those days where my wife drank 10 or more glasses of water. We want to create a new variable that is 1 when she drank 10 or more glasses, and 0 if she drank fewer than 10. Here is the code:
data$water.10<-ifelse(data$number_of_glasses>9,1,0)
Pretty simple. From left to right in words:
1) data - we are going to put our new variable back into the same dataset that contains all of our other data. If you don't put data here and just put the variable name, the new variable will be out on its own, not contained in your dataset! This can be useful at times, but here we want it in our regular dataset.
2) the new variable will be called water.10 (separated from the dataset by a dollar sign, as always)
3) gets!
4) ifelse - this is a function built into the base R package
5) data$number_of_glasses>9 - we are taking the number_of_glasses variable from the dataset "data" and making a logical test that is TRUE when this variable is greater than 9 (i.e., greater than or equal to 10). You could also do this: data$number_of_glasses>=10 . I tend to use the lower or upper number and just the greater-than or less-than sign to incorporate the "or equal to" part. It really doesn't matter which one you use.
6) 1 - if the logic is true (e.g. the number of glasses is greater or equal to 10 [or as I put it, greater than 9]), the new variable, water.10 in the dataset "data" will take on a value of 1.
7) 0 - if the logic is false (e.g. the number of glasses is less than 10 [or NOT greater or equal to 10]), the value for water.10 in the dataset "data" will take on a value of 0.
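The seven steps above, run on a toy vector (the values, including one missing day, are made up):

```r
# toy dataset (values made up): 1 if she hit 10+ glasses that day, else 0
data <- data.frame(number_of_glasses = c(3, 10, 15, 9, NA))

data$water.10 <- ifelse(data$number_of_glasses > 9, 1, 0)
data$water.10   # 0 1 1 0 NA - the missing day carries over as missing
```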
I think you can also use R's full if/else control structures (if (condition) {...} else {...}) for this kind of thing. I never use that structure for recodes, but I can see some potential uses. For more info, google "R Control Structures".
_________________________________
Ok this part is REALLY important!
If you have any missing data in the variable you want to recode, the statement above will carry over any missing values. For example, if on day 2 we didn't have the number of glasses of water she drank, the new variable water.10 would also have a missing value for day 2 based on the code above. Typically you want this to happen, so this is a good thing.
If you want any missing values to be recoded with your logic, you need to do something else. I am not an expert in this area, but I can give you one example.
If your logic contains a value that is equal to some value, you can use the %in% operator, instead of the double equals "==". For example:
if you were interested in making a variable where the value was 1 when she drank exactly 10 glasses of water and 0 when she drank any other number (again, missing values will be carried over as missing into the new variable):
data$water.10<-ifelse(data$number_of_glasses==10,1,0)
If you wanted those missing values to be zero, you would change this statement to:
data$water.10<-ifelse(data$number_of_glasses %in% 10,1,0)
Looks strange? I agree.
Im not sure at this point how to carry over missing values if you are using greater or equal to..
The best workaround I can think of is:
data$water.10<-ifelse(data$number_of_glasses>9 | is.na(data$number_of_glasses),1,0)
this would give a 1 to water.10 where the number_of_glasses was greater or equal to 10 OR if it was missing. Ill ask my pal Rob if he has a better approach. I hate having to use AND (&) and OR (|) all of the time.
Have a good weekend! Peace to my pals in Oklahoma.
Today is a big day - recoding variables. If you do any type of research, recoding is something you inevitably have to do. When I first started using R, other R users used to tell me "Oh, don't use R for recoding, use something else - R is too difficult for recodes". Once I figured out how to recode stuff, I found this to be completely false. It is pretty easy once you get the hang of it. There are only a couple of things you need to be very careful with, the main thing being what to do with missing data.
I am going to focus on ifelse statements today (if/then is very similar), since ifelse works just like the IF function in Excel, and almost everyone using R is at least reasonably familiar with Excel. Maybe in the next post I'll look at some other methods.
Here is the basic structure of ifelse:
dataset$new.variable<-ifelse(dataset$old.variable=="some value", "what you want the new variable to be if the prior logical statement is true", "what you want the new variable to be if the prior logical statement is false").
Let us recode the "number_of_glasses" variable in our water.csv dataset. Pretend that we are only interested in those days where my wife drank at least 10 glasses of water. We want to create a new variable that is 1 when she drank 10 or more glasses, and 0 when she drank fewer than 10. Here is the code:
data$water.10<-ifelse(data$number_of_glasses>9,1,0)
Pretty simple. From left to right in words:
1) data - we are going to put our new variable back into the same dataset that contains all of our other data. If you don't put "data" here and just put the variable name, the new variable will be out on its own, not contained in your dataset! This can be useful at times, but here we want it in our regular dataset.
2) the new variable will be called water.10 (separated from the dataset by a dollar sign, as always)
3) gets!
4) ifelse - this is a function built into the base R package
5) data$number_of_glasses>9 - we take the number_of_glasses variable from the dataset "data" and make a logical value that is TRUE when this variable is greater than 9 (i.e. greater than or equal to 10). You could also write data$number_of_glasses>=10 . I tend to use the lower or upper number with just the greater-than or less-than sign to incorporate the "or equal to" part. It really doesn't matter which one you use.
6) 1 - if the logic is true (i.e. the number of glasses is greater than or equal to 10 [or, as I put it, greater than 9]), the new variable water.10 in the dataset "data" will take on a value of 1.
7) 0 - if the logic is false (i.e. the number of glasses is less than 10), the new variable water.10 in the dataset "data" will take on a value of 0.
You can also write this as a full if/else control structure (if the old variable equals something, assign one value; else, assign another). I never use that structure for recoding, but I can see some potential uses. For more info, google "R Control Structures"
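Putting the whole thing together, here is a small self-contained sketch. The numbers are made up for illustration; your real water.csv will differ:

```r
# Toy stand-in for the water data (made-up values, just for illustration)
data <- data.frame(day = 1:5, number_of_glasses = c(8, 12, 10, 3, 11))

# 1 if she drank 10 or more glasses that day, 0 otherwise
data$water.10 <- ifelse(data$number_of_glasses > 9, 1, 0)

data$water.10
# 0 1 1 0 1
```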
_________________________________
Ok this part is REALLY important!
If you have any missing data in the variable you want to recode, the statement above will carry over any missing values. For example, if on day 2 we didn't have the number of glasses of water she drank, the new variable water.10 would also have a missing value for day 2 based on the code above. Typically you want this to happen, so this is a good thing.
If you want any missing values to be recoded with your logic, you need to do something else. I am not an expert in this area, but I can give you one example.
If your logic contains a value that is equal to some value, you can use the %in% operator, instead of the double equals "==". For example:
If you were interested in making a variable where the value is 1 when she drank exactly 10 glasses of water and 0 when she drank any other number (again, missing values will be carried over as missing into the new variable):
data$water.10<-ifelse(data$number_of_glasses==10,1,0)
If you wanted those missing values to be zero, you would change this statement to:
data$water.10<-ifelse(data$number_of_glasses %in% 10,1,0)
Looks strange? I agree.
I'm not sure at this point of a shortcut for recoding missing values when you are using greater than or equal to.
The best workaround I can think of is:
data$water.10<-ifelse(data$number_of_glasses>9 | is.na(data$number_of_glasses),1,0)
This would give a 1 to water.10 where number_of_glasses was greater than or equal to 10 OR where it was missing. I'll ask my pal Rob if he has a better approach. I hate having to use AND (&) and OR (|) all of the time.
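To see the three behaviors side by side, here is a sketch with a made-up vector. The nested-ifelse version at the end is one way (not necessarily the best way) to send missing values to 0 instead of 1:

```r
x <- c(8, 12, NA, 10)  # made-up glasses-per-day values, one missing

ifelse(x > 9, 1, 0)                       # 0 1 NA 1 -- NA carries over
ifelse(x %in% 10, 1, 0)                   # 0 0 0 1  -- NA quietly becomes 0
ifelse(is.na(x), 0, ifelse(x > 9, 1, 0))  # 0 1 0 1  -- handle NA explicitly first
```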
Have a good weekend! Peace to my pals in Oklahoma.
Wednesday, May 29, 2013
Basic R Operators
Good morning pirates!
Here is a quick overview of the most common operators. By operators I mean things like "equal", "not equal", "greater than", etc.
This will be a rough one.
equal: ==
not equal: !=
greater than: >
less than: <
greater or equal to: >=
less or equal to: <=
I think that is about it.
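Each of these operators returns a logical value (TRUE or FALSE), which you can try right in the console:

```r
5 == 5   # TRUE
5 != 5   # FALSE
5 > 4    # TRUE
5 <= 4   # FALSE
```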
So how do you use these? Let us take the subset function from the previous post:
Using equal (double equal sign):
data.metal<-subset(data, data$heavy_metal_music==1)
That would give you all of the cases in "data" where heavy_metal_music is equal to 1 (e.g. "yes")
Using not equal:
data.metal<-subset(data, data$heavy_metal_music!=1)
That would give you all of the cases in "data" where heavy_metal_music is not equal to 1 (e.g. "no" or anything else entered into that cell -- this is where data quality becomes important!)
I won't give an example of the other operators, as by now you can probably tell, you just change the operator in the statement to the operators listed above.
Yes, my font changed. I copied the data.metal subset statement from a prior post and decided not to change the font back after the paste. Take that! I'm such a rebel.
__________________________________________________________
Ok, so how do you specify if the data are missing or not missing?
As you may know, R treats empty cells (i.e. missing data) as NA. It will put NA into empty cells in columns that otherwise have numbers in them (we will get back to data types later). If the cells have text in them or a non-numeric character (hyphen, slash, colon, etc.), R will actually leave that cell truly blank.
To figure out what variable type you have, you can type the following:
class(data$variable)
Of course, you would change the word 'variable' to the name of the variable whose type (i.e. class) you want to see. R treats integers, numbers, and factors the same -- NA is missing.
So if you want to keep all of the cases in your subset where heavy_metal_music is missing, you can do the following (again, this only works for variables that are numeric -- because R puts NA in the blank cell):
data.metal<-subset(data, is.na(data$heavy_metal_music))
The is.na(variable) part tells R to keep the rows where heavy_metal_music is NA (or missing).
To specify NOT MISSING:
data.metal<-subset(data, !is.na(data$heavy_metal_music))
Might look familiar -- just like not equal uses the exclamation mark for NOT, is.na uses it the same.
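Here is a quick sketch with a made-up version of the data so you can see both subsets work (the real heavy_metal_music variable will of course have different values):

```r
# Made-up data: day 3 has a missing value for heavy_metal_music
data <- data.frame(day = 1:4, heavy_metal_music = c(1, 0, NA, 1))

missing.rows  <- subset(data, is.na(data$heavy_metal_music))
complete.rows <- subset(data, !is.na(data$heavy_metal_music))

nrow(missing.rows)    # 1 (just day 3)
nrow(complete.rows)   # 3 (days 1, 2, and 4)
```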
_____________________________________________________________
So what if the variable is a 'character' variable (i.e. it has text in it)?
data.metal<-subset(data, data$heavy_metal_music=="")
The double quotes specify a textual "blank" in the cell. You could say:
data.metal<-subset(data, data$heavy_metal_music!="")
to mean not equal blank.
You have to be careful here though: sometimes a space in that cell (for example, if you had data in a cell in Excel but hit the space bar to clear it out instead of backspace or delete) will not be caught by the "" comparison. In this case, you need to say " " -- double quotes with a space in between. If you have two spaces, you need double quotes with two spaces in between, etc.
It's kind of a pain, so make sure your data are clean before you put them into R.
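If your version of R has the trimws function (it is in base R in recent versions), one way to catch both empty cells and space-only cells is to strip the whitespace before comparing. A sketch with made-up values:

```r
x <- c("yes", "", " ", "no")  # made-up values: one empty cell, one space-only cell

x == ""            # TRUE only for the truly empty cell
trimws(x) == ""    # TRUE for the empty cell AND the space-only cell
```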
Keep it realz.
Tuesday, May 28, 2013
Welcome back - let's learn how to specify variables in your code and subset!
'twas a long weekend that did not feel so long. Regardless, I'm back to work, so let's talk about an important function: subsetting. First, we have to be confident in our ability to call out specific variables in our data.
Since you can be working with multiple datasets at once in R, you always need to specify the dataset and the variable within that dataset. There are some tricks to get around this, but I won't talk about them because they don't always do what you think they do. If you are interested, look into ??attach and ??detach.
Regardless, if your dataset is called "data" and you want to do something to the "number_of_glasses" variable, you need to specify both of them in your code. The dollar sign ($) is what does this for you, as follows:
data$number_of_glasses
That's all there is to that. Always specify the dataset AND the variable, with the dollar sign in between. If you don't, you will get an error saying that the object is not found. For example, if you just typed:
number_of_days
You would get the following error:
Error: object 'number_of_days' not found
Alternatively, if you type:
data$number_of_days
and run it, you will see the actual values for the number_of_days variable within the "data" dataset.
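As a quick sketch (with made-up values):

```r
# Made-up mini dataset
data <- data.frame(day = 1:3, number_of_days = c(2, 5, 9))

data$number_of_days
# 2 5 9
```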
______________________________________________________________________
Ok I mentioned that first because we are now going to learn to subset, and without the above explanation, the subset function may not make sense to you.
Often, it is useful to make a subset of a dataset. For example, if you want a dataset where the
heavy_metal_music variable is equal to 1, you can do this... very easily.
data.metal<-subset(data, data$heavy_metal_music==1)
Starting from the left:
1) data.metal - this is the new dataset that will contain the subsetted data from your original dataset
2) <- your friendly neighborhood gets operator! As always, this puts the information from the right side of the symbol into whatever you specify on the left side.
3) subset - this is a function built into the base R package... for... drumroll... subsetting!
4) data - this is your old dataset containing all of the information
5) data$heavy_metal_music - this is the dataset and the variable on which you want to subset. now the earlier comments probably make sense.
6) ==1 - I told you early on that R doesn't use the equals sign as the gets operator (that's what <- is for). R uses the double equals sign to mean equals. So you are saying where heavy_metal_music is equal to 1.
In words:
the new dataset "data.metal" gets a subset of the dataset "data", where heavy_metal_music (in the "data" dataset) is equal to 1.
Hope that makes sense.
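A tiny runnable sketch, with made-up values for heavy_metal_music:

```r
# Made-up data: she listened to metal on days 1 and 3
data <- data.frame(day = 1:4, heavy_metal_music = c(1, 0, 1, 0))

data.metal <- subset(data, data$heavy_metal_music == 1)
nrow(data.metal)   # 2 -- only days 1 and 3 survive the subset
```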
Next posting will be on other basic R operators. For example, how do we specify greater than, less than, greater or equal to, not equal to, etc.
Following that, I think we can move on to some basic data recodes, using other subsetting functions and if-else statements.
Have a non-crappy Tuesday!
Thursday, May 23, 2013
Merging databases
Good morning pals,
Im gonna tell you how to merge you some data.
The first database (assuming it is called "data") is the one that I showed you the other day, with my wife's drinking-water habits.
The data I want to merge contain only two variables. One is the day, 1-22 (two extra days in this database), and the second variable is ran_out_of_gas (1 for yes, 0 for no). She lets her car's gas tank get pretty low, so the risk of running out of gas is high. You may notice that the first variable, Day, has a capital "D". This is important to note. This database is called "data1", just because.
Ok, to merge these, you need to know two things: 1) which variable you want to "merge on", and 2) whether you want to keep all of the cases or just the cases in the dataset "data", which has 2 fewer cases.
Let's first assume we want all of the cases in data, and we want to remove the two extra cases in data1.
First, read in your two datasets:
data<-read.csv("/Users/timothywiemken/Desktop/data.csv")
data1<-read.csv("/Users/timothywiemken/Desktop/data1.csv")
Next, merge them into a new dataset called "merged"
merged<-merge(data, data1, by.x="day", by.y="Day", all.x=T)
Let's talk about this statement from left to right.
1) "merged" - the far left before the gets operator (<-) is the new dataset you want to make after merging things.
2) "merge" - this is the function built into the base R package to merge two datasets (only two at a time! if you need to merge more than two, you need to merge "merged" with your third dataset")
3) "data" - this is the first dataset you want to merge. This is called the "x" dataset.
4) "data1" - this is the second dataset you want to merge. This is called the "y" dataset.
5) "by.x" - this specifies the variable in the "x" dataset (see #3) that you will merge on. It needs to be the same variable as the one in the "y" dataset, but it can have more or fewer cases (i.e. rows). Here, we specify "day" in lowercase, as this is the variable in this dataset that matches one in the "y" dataset.
6) "by.y" - here we specify "Day" with the capital "D", as this is the variable that matches the variable "day" in the "x" dataset. This one has more cases in it, but that doesn't matter.
7) "all.x=T" - this tells R to keep all of the data in the "x" dataset (see #3) and drop any extra cases in the "y" dataset. If there are more cases in "x" than in "y", you will still have all of the "x" cases. You can switch this to "all.y=T" if you want all of the cases in "y". You can specify "all=T" if you want all of the cases in BOTH datasets. The "T" stands for TRUE. You can type out the word TRUE (all capitals), but R uses "T" and "F" as acceptable shortcuts for writing out TRUE and FALSE.
That's it! You should be able to merge these two datasets now.
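If you want to try the merge without the actual .csv files, here is a sketch with two tiny made-up data frames standing in for data and data1:

```r
# Made-up stand-ins for the two .csv files
data  <- data.frame(day = 1:3, number_of_glasses = c(4, 11, 7))
data1 <- data.frame(Day = 1:5, ran_out_of_gas = c(0, 1, 0, 0, 1))

merged <- merge(data, data1, by.x = "day", by.y = "Day", all.x = TRUE)
nrow(merged)   # 3 -- the two extra days in data1 were dropped
```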
Wednesday, May 22, 2013
Just realized...
... that quicktime player does screen capture with audio.
I have too much text in these posts, so I'm gonna start doing a video for most of them to add to the flavor. I'll go back to some of the old ones and update as time allows.
t
A tip for making all of your variable names lower case automatically!
Yo yo yo.
I've been using this code snippet a lot lately. Most of our databases have variable names that mix upper- and lowercase letters. As you may have guessed by now, I'm not a huge fan of uppercase - it's hard to remember which variables have which case of text.
So.
After you read in your dataset (since you are a master after the "how to read in your data" post), paste the following code into the console or the R script editor. Once you paste it, highlight the two lines and hit command+enter (on the Mac!) to run it.
test<-tolower(colnames(data))
colnames(data)<-test
What is happening here is as follows:
The first line takes the column names (your variable names!) from the dataset called "data" (if you named your data something else, you need to change this). The colnames function is built into the base R package, so you don't need any particular package to make this work. The tolower function is also built into base R. So this line takes the column names from your dataset, data, and converts everything to lowercase text. You can also use "toupper" if you prefer all uppercase text. You may notice that on the far left (before the gets "<-" operator), I have the word "test". What happens here is that R takes all of the lowercase names and puts them into a vector out on its own. You use this vector in the next line.
The second line takes the variable names that are all lowercase in the "test" vector and applies them to the column names of the dataset, data. Again, if your dataset is named something other than data, you need to change this.
Are you starting to see the utility of always keeping your dataset named "data"? I hope so. It is pretty handy.
This is a little more advanced than I was hoping to get at this point, but it is such a useful snippet, and I use it pretty much every time I import a dataset, so I figured it would be useful to stick it at the beginning.
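Here is the snippet end to end on a made-up dataset, so you can see the before and after:

```r
# Made-up dataset with mixed-case variable names
data <- data.frame(FirstName = c("a", "b"), LastAge = c(1, 2))

colnames(data)   # "FirstName" "LastAge"

test <- tolower(colnames(data))
colnames(data) <- test

colnames(data)   # "firstname" "lastage"
```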
This blog post wasn't funny, you say? No humor? Yeah, you are right. It was a long day, so I have no humor left. I have two humeri, but I don't think that is what you were hoping for.
A quick note about variable (vector) names in R
I forgot to mention this earlier.
Don't put any characters other than numbers, letters, periods, or underscores in your variable (vector) names. If you do, R will automatically convert them to periods.
For example, if your variable name in the .csv file is Number of Days Without Pooping, after you read in the file, R will change the variable name to Number.of.Days.Without.Pooping.
I also highly recommend only using lower case letters for everything. It makes your life easier. If one variable is called FirstName, another is called firstname, and a third is called Firstname, R will see all of these as different, because, after all -- it is case sensitive.
Did I mention R is case sensitive?
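Under the hood, read.csv (with its default check.names = TRUE) cleans up column names with the make.names function, so you can preview what a name will become:

```r
# Preview how a messy column name will be converted on import
make.names("Number of Days Without Pooping")
# "Number.of.Days.Without.Pooping"

# Case matters: these are three different names to R
unique(c("FirstName", "firstname", "Firstname"))
# "FirstName" "firstname" "Firstname" -- nothing collapses
```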
Here it is! Reading data into R
Ok - so this isn't really hard at all. When I was first starting, I made it much more difficult than it should have been. For the sake of nostalgia, I'm going to make it more complicated than it needs to be here too. Skip to the end of this if you don't want the extra crap.
There are three main things you need to know about reading data into R - at least, things that I have found make this process easier.
Numero Uno
Your database should be in a .csv file. For those of you not familiar with .csv, it stands for comma-separated values. If your data are in Excel, you can just choose File - Save As - and change the file type to .csv. You will get a bunch of warnings; just click OK to all of them. This comma-separated-values business just means that every cell in your Excel file is separated by a comma. So R reads this in and knows that the comma is a 'delimiter' - a consistent thing that separates out each of your variables and data points. The main issue with .csv is that you cannot have any strange characters, otherwise R will not read it in. For example, if you have a degrees symbol in your database, this will not work. To be safe, the only characters you should have in your database are commas and decimal points. You can get away with hyphens and underscores as well. In fact, you can read in a bunch of other stuff, but do you really need this junk? If you have a variable that is "comments" and it has a bunch of crazy characters, do you really plan to use R to analyze it? You may want qualitative analysis software instead. If you have a temperature variable with degrees symbols, clean that crap up. If you have non-numeric values in your database (like a special character), R will get pissed. Just clean up your data as much as you can before you try to move it to R. It will save you some major headaches.
Numero dos
If you don't know what I'm talking about with regard to the dataset you want to read into R, here is the test database I will use in most of my examples. This database is a collection of factors associated with my wife using multiple glasses for water in one day. Is it necessary to use 5 glasses a day for water? Probably not. I want to know what factors are associated with using more than 1 glass a day. There are 5 variables (R terminology for a variable is sometimes a vector) in this dataset: 1) day = 1 to 20 (examining glass usage over 20 days), 2) number_of_glasses = how many glasses did she use for water that day?, 3) bad_day_at_work = one if she had a bad day and zero if she had a good day, 4) heavy_metal_music = one if she listened to Slayer on the way home from work and zero if she did not, and 5) slept_poorly = yes if she slept badly the night before and no if she slept ok. Obviously these are fake data, since she listens to Slayer every day. A real dataset would have other data types - this one has only "numeric" and "character" data. Numeric are the numbers, and character is the slept_poorly variable since it is text. More on this later.
What was the point of number 2 here? Ha, number two. I dont think there was a point - just to show you the database Ill be using. This isnt actually one of the important things you need to know to make R using easier. You can download the data here (it is an excel file, so you will need to save as a .csv) https://www.dropbox.com/s/9xtebfcg9ouzyn6/water.xlsx
Numero tres
Use a standard value for any missing values. I highly recommend negative 1 (-1) for any blank spaces. Blanks would also work. Don't use something like 999 because some variables can actually take the value of 999. If you use one standard value for all missing values (like if I didnt know the value for slept_poorly on day 20), it makes your life easier in the future.
Ok that might be all you need to know for now. There are many other ways to get data into R, but this is the one I use most often, so it is the one I will show you to keep things simple.
1)Now, download your data and put it on your desktop (remember to convert to .csv!).
2) Go to R Studio, open a new R script (File - new - R Script)
3) type the following (not exactly the following - see below to what you need to change to make it work on your computer):
data<-read.csv("/Users/timothywiemken/Desktop/water.csv", na.strings=-1)
So you need to change the part in the quotes to the path of the file on your desktop. If you are on a Mac, the path always starts with /Users/ the next part is the name of your computer, followed by Desktop (if it is on your desktop) then the name of the file. Close your quotes after that.
If you are on a PC, this path will be something like "C:/Desktop/water.csv"
Remember R is CASE SENSITIVE, so be sure to capitalize stuff you need to capitalize. On a Mac, to get the path, right click the file and choose "get info". On a PC right click and choose "properties". The path of the file will be in the window that opens. An example on a mac is as follows (under the "where" part) -- this location does not include the file name so be sure to add it after the last slash. This says /Users/timothywiemken/Desktop . Not only would I need to add quotes around it, but I would need to add another slash after Desktop, followed by the file name, water.csv.
If I had use -1 for any missing values, the na.strings=-1 part will tell R that anything that is -1 in the dataset should be converted to missing (e.g. delete that value). You can change this to whatever if you use something other than -1. For example, if you decide to use 999, since I told you not to, you could say: na.strings=999.
Ok back to this, because you may wonder what the hell it all means:
data<-read.csv("/Users/timothywiemken/Desktop/water.csv", na.strings=-1)
the first part, data, is just what I am going to call the dataframe I am reading into R. I tend to call pretty much everything data, data1, data2, data3, etc. You may think I have no imagination. This is debatable. The main reason I call every dataset data, is that it makes copying and pasting code from one project to another. You end up having to type and retype the dataframe name all of the time in R, so it is easier to just always call the data the same thing. The "<-" is the "gets" operator. It is kind of like an equals sign, but in computer science, equals means something else (equals, means the arithmetic equals sign). If you were to say this whole line of code out loud, it would read something like this (without the stuff in parentheses):
The data frame "data" GETS (<-) the CSV file READ (read.csv) from the location "/Users/timothywiemken/Desktop/water.csv". If there are any negative 1's in the data frame, these will be set to missing.
Make sense? I thought so!
Pretty easy? Yup!
A lot of text and pictures for a simple procedure? Indeed.
So your R studio window will looks something like this, once you type in the code and hit command+enter (again, on a mac):
You see on the top right window? It says, Under DATA, data, 20 obs. of 5 variables. We know there are 20 observations (rows - or 20 days in the study), and 5 variables. There ya go. You can click this and it will open the dataset for you to view. You cannot really edit the dataset in this window. However, you can fix the .csv file on your desktop, and re-read in the data.
Anything you do in the console or the R script viewer does not change the actual data frame (the physical .csv file). So no worries there. It is like always working in the "work" library in SAS.
You can make any changes you want and just re-read in your data.
R works almost exclusively in your computers RAM (so you need a lot of free RAM), not in the ROM.
If you don't like R Studio, you can view your dataset by typing in the following code into the console or R script viewer and running it:
View(data)
You can change data to whatever your data frame's name is - if you decide you dont want to use data all of the time.
Ok, that is enough for now. You should be able to read in your data by now. If not, leave a comment and Ill clarify anything!
RIP Jeffrey.
There are three main things you need to know about reading data into R -- at least, three things I have found that make the process easier.
Numero Uno
Your database should be in a .csv file. For those of you not familiar with .csv, it stands for comma separated values. If your data are in Excel, you can just choose File - Save As - and change the file type to .csv. You will get a bunch of warnings; just click OK to all of them. This comma separated values business just means that every cell in your Excel file is separated by a comma. So R reads this in and knows that the comma is a 'delimiter' - a consistent thing that separates out each of your variables and data points. The main issue with .csv is that you cannot have any strange characters, otherwise R may not read it in correctly. For example, if you have a degrees symbol in your database, this will not work. To be safe, the only characters you should have in your database are commas and decimal points. You can get away with hyphens and underscores as well. In fact, you can read in a bunch of other stuff, but do you really need this junk? If you have a variable that is "comments" and it has a bunch of crazy characters, do you really plan to use R to analyze it? You may want qualitative analysis software instead -- and if you have a temperature variable with degrees symbols, clean that crap up. If you have non-numeric values in your database (like a special character), R will get pissed. Just clean up your data as much as you can before you try to move it to R. It will save you some major headaches.
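If you want to see the comma business for yourself, here is a little sketch from inside R. The toy data frame, values, and temporary file are made up purely for illustration:

```r
# Write a tiny data frame out as .csv, then peek at the raw text.
toy <- data.frame(day = 1:3, number_of_glasses = c(3, 1, 2))
f <- tempfile(fileext = ".csv")     # a throwaway file path
write.csv(toy, f, row.names = FALSE)
readLines(f)                        # each line is just values separated by commas
back <- read.csv(f)                 # and read.csv() reverses the trip
```

The first line of the raw text is the header row (the variable names), and every line after it is one row of data, comma-delimited. That is the entire file format.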
Numero dos
If you don't know what I'm talking about with regard to the dataset you want to read into R, check out the following image. It is a test database I will use in most of my examples. This database is a collection of factors associated with my wife using multiple glasses for water in one day. Is it necessary to use 5 glasses a day for water? Probably not. I want to know what factors are associated with using more than 1 glass a day. There are 5 variables in this dataset (in R terminology, each column of a data frame is a vector): 1) day = 1 to 20 (examining glass usage over 20 days), 2) number_of_glasses = how many glasses did she use for water that day?, 3) bad_day_at_work = one if she had a bad day and zero if she had a good day, 4) heavy_metal_music = one if she listened to Slayer on the way home from work and zero if she did not, and 5) slept_poorly = yes if she slept badly the night before and no if she slept ok. Obviously these are fake data, since she listens to Slayer every day. A real dataset would have other data types - this one has only "numeric" and "character" data. Numeric are the numbers, and character is the slept_poorly variable since it is text. More on this later.
What was the point of number 2 here? Ha, number two. I don't think there was a point - just to show you the database I'll be using. This isn't actually one of the important things you need to know to make using R easier. You can download the data here (it is an Excel file, so you will need to save it as a .csv): https://www.dropbox.com/s/9xtebfcg9ouzyn6/water.xlsx
Numero tres
Use a standard value for any missing values. I highly recommend negative one (-1) for any blank spaces. Blank cells would also work. Don't use something like 999, because some variables can actually take the value of 999. If you use one standard value for all missing values (like if I didn't know the value for slept_poorly on day 20), it makes your life easier in the future.
Ok that might be all you need to know for now. There are many other ways to get data into R, but this is the one I use most often, so it is the one I will show you to keep things simple.
1) Now, download your data and put it on your desktop (remember to convert to .csv!).
2) Go to R Studio, open a new R script (File - new - R Script)
3) type the following (not exactly the following - see below for what you need to change to make it work on your computer):
data<-read.csv("/Users/timothywiemken/Desktop/water.csv", na.strings=-1)
So you need to change the part in the quotes to the path of the file on your desktop. If you are on a Mac, the path always starts with /Users/, the next part is your username, followed by Desktop (if the file is on your desktop), then the name of the file. Close your quotes after that.
If you are on a PC, this path will be something like "C:/Users/yourname/Desktop/water.csv"
Remember R is CASE SENSITIVE, so be sure to capitalize stuff you need to capitalize. On a Mac, to get the path, right click the file and choose "get info". On a PC right click and choose "properties". The path of the file will be in the window that opens. An example on a mac is as follows (under the "where" part) -- this location does not include the file name so be sure to add it after the last slash. This says /Users/timothywiemken/Desktop . Not only would I need to add quotes around it, but I would need to add another slash after Desktop, followed by the file name, water.csv.
If I had used -1 for any missing values, the na.strings=-1 part will tell R that anything that is -1 in the dataset should be converted to missing (i.e., treated as NA). You can change this to whatever value you use if it is something other than -1. For example, if you decide to use 999, even though I told you not to, you could say: na.strings=999.
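Here is a minimal sketch of na.strings in action. It uses read.csv's text= argument so we don't need an actual file, and the values are made up; I quote "-1" in the sketch as the belt-and-braces form of the same idea:

```r
# Three rows; -1 in number_of_glasses marks a missing value.
csv_text <- "day,number_of_glasses\n1,3\n2,-1\n3,2"
df <- read.csv(text = csv_text, na.strings = "-1")
df$number_of_glasses   # 3 NA 2 -- the -1 became NA
sum(is.na(df))         # how many missing cells total: 1
```

Once the value is NA, functions like mean() will respect it (e.g. mean(df$number_of_glasses, na.rm = TRUE) skips it) instead of treating -1 as a real number of glasses.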
Ok back to this, because you may wonder what the hell it all means:
data<-read.csv("/Users/timothywiemken/Desktop/water.csv", na.strings=-1)
the first part, data, is just what I am going to call the data frame I am reading into R. I tend to call pretty much everything data, data1, data2, data3, etc. You may think I have no imagination. This is debatable. The main reason I call every dataset data is that it makes copying and pasting code from one project to another easier. You end up typing and retyping the data frame name all of the time in R, so it is easier to just always call the data the same thing. The "<-" is the "gets" (assignment) operator. It is kind of like an equals sign, but not quite: in R, the double equals sign (==) is the one that tests whether two things are equal. If you were to say this whole line of code out loud, it would read something like this (without the stuff in parentheses):
The data frame "data" GETS (<-) the CSV file READ (read.csv) from the location "/Users/timothywiemken/Desktop/water.csv". If there are any negative 1's in the data frame, these will be set to missing.
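A quick sketch of the difference between the gets operator and the equality test (x and y here are just throwaway names):

```r
x <- 5    # assignment: the object x "gets" the value 5
x == 5    # comparison: asks "is x equal to 5?" -> TRUE
x == 4    # -> FALSE
y = 10    # a single = also assigns at the top level, but <- is the R convention
```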
Make sense? I thought so!
Pretty easy? Yup!
A lot of text and pictures for a simple procedure? Indeed.
So your R Studio window will look something like this, once you type in the code and hit command+enter (again, on a Mac):
See the top right window? Under Data, it says: data, 20 obs. of 5 variables. We know there are 20 observations (rows - or 20 days in the study) and 5 variables. There ya go. You can click this and it will open the dataset for you to view. You cannot really edit the dataset in this window. However, you can fix the .csv file on your desktop and re-read in the data.
Anything you do in the console or the R script viewer does not change the actual data frame (the physical .csv file). So no worries there. It is like always working in the "work" library in SAS.
You can make any changes you want and just re-read in your data.
R works almost exclusively in your computer's RAM (so you need a lot of free RAM), not on your hard drive.
If you don't like R Studio, you can view your dataset by typing in the following code into the console or R script viewer and running it:
View(data)
You can change data to whatever your data frame's name is - if you decide you don't want to use data all of the time.
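Besides View(), a few built-in functions are handy for poking at a data frame from the console. The stand-in data frame below is made up so the calls have something to chew on; with the real water data you would just use your data object directly:

```r
# A stand-in data frame (2 variables, 20 rows) purely for illustration.
data <- data.frame(day = 1:20,
                   number_of_glasses = sample(1:5, 20, replace = TRUE))
head(data)      # first 6 rows
str(data)       # each variable's name, type, and first few values
dim(data)       # rows and columns: 20 2 here (20 5 for the water data)
summary(data)   # min/median/mean/max etc. per variable
```

str() in particular is a quick sanity check that R read your variables in as the types you expected (numeric vs. character).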
Ok, that is enough for now. You should be able to read in your data. If not, leave a comment and I'll clarify anything!
RIP Jeffrey.
Scared yet?
Good morning world.
Today will be a suck day, but let's try to make it a little better.
If you are a little intimidated by R thus far (yup, this was written like this on purpose), don't be. Nearly everything I have mentioned thus far is solely to get you acquainted with R. We will continue to go over all of these concepts (help, packages, the console, R script editor, etc.) as we begin to do real work with R. I find it easier to get the ball rolling by discussing the topics we already went over -- then do some work for a while -- once we come back to the topics, they will make more sense (trust me... muuuuhaaaahahaha).
Just remember, R is the
Yes. I like cats. You will see them from time to time (or all of the time) on this blog.
I may start to do some videos too if I can find a good (i.e., free) screen capture software for the Mac. Maybe this will be easier than reading my incoherent ramblings.
Next topic will be the first real topic that should get you started in R: reading in data.
Adios!
Tuesday, May 21, 2013
My wife is crazy.
Yup, this posting has nothing to do with R - you should probably expect this pretty often since I tend to get distracted easily.
My wife comes downstairs laughing, saying "hahah, I just read your blog and it is soooooo nerdy". Yup. Let us see who is the nerd when my R skills create world peace. You thought world peace would come from Wyld Stallyns music? Sorry Bill S. Preston esq. and Ted Theodore Logan. Your music sucks. R will rule the world.
It may, in fact, align the planets into universal harmony - only time will tell. Or Glenn Danzig. He knows. He might tell.
What is this package business and why do I care?
See, I told you that was the title.
R Packages
As I mentioned previously, R can do a fair amount right after you download it. However, as you begin to want to do more, you need to install various packages, created by R users (many times, very prominent people), to make the software do what you want.
A package is kind of like a SAS macro, packaged into a file that integrates into the R DNA (it's like a wonderful retrovirus). A package can do one thing, or it can do many, many different things (this is the case most of the time).
For example, if you want to export some high resolution figures as a PDF file (a good vector format for publication quality charts), you can install the package "Cairo".
To install packages, type the following into the R console (then hit enter) or into the R script viewer (then hit command+enter on a Mac):
install.packages("Cairo")
Yes, you need the quotes around the word Cairo.
Yes. You may have just realized that R is indeed case sensitive. This is a critical concept that is not the same in all software packages. R sees the word Cairo as different from cairo. If you use a lower case "c", you will get an error. If you don't use the quotes, you will get an error.
When you run that function, you will see a bunch of crap start to show in the console window -- it is installing the package. Some packages are not available for certain versions of R. I find it critical not to update R to the latest version right away (this is why I said to avoid version 3 in my first post -- packages need to be re-written to work on a new major version -- that is a lot of work for the people who create packages, so it may take some time for certain packages to work in new versions). I am a chronic updater who always wants the latest version of everything, so it is difficult for me to wait... it is critical to do this though. You only have to install a package once on your computer.
Once you install the package, you need to open the package to allow R to use the functions within the package.
To open the package, type the following in the console or R script viewer:
library(Cairo)
This time, don't use quotes, but keep the capitalization for Cairo. You can also type the following (it works almost the same -- the difference is that library() stops with an error if the package is not installed, while require() just returns FALSE with a warning):
require(Cairo)
You only have to open the library once per R session. Once you close R and open it back up, you will need to re-open the library with one of the above scripts.
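To see how library() and require() behave differently when things go wrong, here is a sketch that runs anywhere; it uses stats (which ships with every R install) and a deliberately fake package name:

```r
ok <- require(stats, quietly = TRUE)   # TRUE: stats ships with R, so it loads
# require() on a missing package warns and returns FALSE instead of stopping:
missing_ok <- suppressWarnings(require(notARealPackage, quietly = TRUE))
ok           # TRUE
missing_ok   # FALSE
# library(notARealPackage) would instead throw an error and halt the script.
```

That return value is why require() is handy inside scripts that want to check for a package and react, while library() is the better default at the top of a script, since you want a loud failure if a package is missing.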
Many packages have their own help files. Once you install the package, try to type the following (change the word Cairo to whatever the package name you want the help for):
help(Cairo)
You will see in the console that there is no specific help file for Cairo. You can then fall back to ??Cairo, as mentioned in the prior post.
help(base) does work, however. This is the help file for the base R application.
There are tons of packages for R, which allow R to do pretty much whatever you want. If it doesn't do what you want, you are probably just doing something wrong. Once you get skilled (ok Napoleon), if a particular package still doesn't do what you want, you can email the package creator -- you can usually find their email address by looking for the help file on google, as outlined in the prior post (the PDF file that I showed). I have done this before with very good results. Please don't overly bother package creators though. They spend a lot of time making these things, and don't need to be harassed by people who may or may not know what they are doing.
One of the most important points on this blog: always cite the packages you use in any manuscript or whatever you are publishing. To get the appropriate citation for a package, type the following in the R console (swapping out Cairo for the name of the package you want the citation for):
citation("Cairo")
Yup, this time you need the quotes. Confusing, huh? There is a pattern, though: install.packages() and citation() take the package name as a character string (so it needs quotes), while library() and require() are lenient and will quietly quote a bare name for you. Oh well. You can't win 'em all - if you win any of them, be pleased.
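A sketch of the quoting behavior, using packages that come with R so it runs anywhere:

```r
library("stats")         # quoted: works
library(stats)           # bare name: also works; library() quotes it for you
cit <- citation("base")  # citation() (like install.packages()) needs the quotes
print(cit)               # the citation text to paste into your manuscript
```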
Ok - I promise - next post is about reading data into R.
Arrivederci.
Help files and Other good websites
Yo yo yo.
R has very in-depth help files associated with pretty much everything it does. You can access the help files in many ways. First, you can Google your question and end up at a website with a PDF of the package you want help for (usually at cran.r-project.org) or just a general website with information. For example, if I Google "R glmulti" (glmulti is a package in R - we will talk about packages in the next post), you will see something like below. The second option here (the PDF) is the help file for the glmulti package. It gives you all the info you need for the package. Unfortunately, R package help files are very convoluted.
Your next option is to type two question marks followed by the function you are interested in getting help for -- in either the console (then hitting enter), or in the R script viewer (see 2 posts ago for the differentiation between these). For example, I may be interested in finding the help file for the table function in R (table is a built in function that gives you the frequencies of a variable... like proc freq for SAS). In this case, I would type ??table in the R console or in the R script viewer as follows:
Once you hit command+enter (if you type it into the R script viewer as I have) or enter (if you type it into the console at the bottom), you will see some info in the bottom right window (see the Search Results, with a bunch of crap over there? Yup, that's it).
Next, you need to scroll down in that window to actually find the function you want the help for. There are multiple sections here. The first section is Vignettes, the second is Code Demonstrations, and the third is Help files. You may have different numbers of options here depending on how many packages you have installed. For example, the first option in my Code Demonstrations (as you can see in the picture above) is graphics::Hershey. The way this is laid out is the package name, followed by two colons, followed by the function name. So if you wanted help for the function Hershey, you could type ??Hershey in the console or R script viewer and this would be one of the options. The part before the two colons is the package in which the function resides. So if you don't have the graphics package installed, you won't see this option. Since the Hershey demo uses the table function, it shows up here. Many functions also use table, so you see a bunch of crap when you run ??table. If you scroll down into the Help files section, you will see base::table. R comes with a number of "packages" pre-installed. Two of the most commonly used are "base" (the base R application) and "stats". So if you see anything with base or stats before the two colons, these are just the basic functions built into R. If you scroll down, you will eventually see something like this (the last option here is base::table -- click it and you will see the R help file for table).
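The whole help machinery in one place (table is just the example function here):

```r
# ?table and help(table) are the same exact-match lookup;
# ??table is shorthand for help.search("table"), the fuzzy search.
h <- help("table")            # returns the location(s) of the help page
hits <- help.search("table")  # what ??table runs under the hood
length(h) >= 1                # TRUE: a help page for table exists
```

The exact-match forms (? / help()) only find pages whose name matches; the fuzzy forms (?? / help.search()) scan all installed packages, which is why they return that long list of sections.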
I still find the R help files to be... well.. not that helpful. They are just too damn complicated. Over time, you kind of get used to the way things are written and can make your way through them.
I find the website www.statmethods.net (AKA Quick R) to be extremely helpful and very basic (sometimes too basic -- if that is possible). If you have a question, I would start with this website. The UCLA biostats site is also extremely helpful for R, SAS, SPSS, and STATA (http://www.ats.ucla.edu/stat/ ).
These two sites should be a good way to start with R. Next up - what is this package business, and why do I care?
Questions?
Hey pals - to my surprise, it appears that some people are actually reading this blog. If that is the case and it isn't just a bunch of bots trolling Blogger, please leave some comments. I'm more than happy to discuss any topics, as long as they are not too advanced for where I am currently at on the blog. We can talk offline as well if your questions are more advanced.
The R Studio Interface
The R Studio Interface - looks fancy! But not really.
Once you install R and R Studio and open R Studio, it will look something like this (I'm on a Mac, so PC users may see a slightly different view):
Yours will also not have anything in the right column or the lower column (I'm currently running some analyses). Your view will also be a white background instead of the black background. You can change all of the look and whatnot in the R Studio preferences (on a mac it is the R Studio dropdown at the top - then Preferences). I think my view is twilight or something like that. Wait. It can't be twilight. I wouldn't allow anything with that word after the terrible books and movies. Ok, I had to look. The "appearance" setting in the preferences is "Tomorrow Night Bright". It works the best for my constant staring at the computer screen. White backgrounds are just too harsh.
The bottom black panel here is the R console. It works exactly like it does in the regular R or R64 app that you installed before R studio (remember that R studio actually just uses the R app - it is just a front end to run the R application -- R studio is just much prettier). You can type any commands into the console and hit enter (or return) to run them. I don't like the console much - I just use it to view the output of any functions I run (so in this respect, it serves kind of like the output window in SAS and SPSS, but you can also type commands in the console window).
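For example, you can type something like this straight into the console and hit enter after each line (the numbers here are just made up):

```r
# Typed directly at the console prompt - each line runs when you hit enter
x <- c(2, 4, 6)   # make a small numeric vector called x
mean(x)           # the output (4) prints right below in the console
```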
The top black window is the one I use. This is the R script viewer. You can open a new one (I do this every time I start a new project) by clicking File and then New R Script. Actually, when you first open R studio you will only see the console window, it will just be larger. When you open a new R script, the console will get smaller and the R script window will show. You can resize stuff as you would any table or something in MS Word (the cursor changes when you hover over the gray line separating the two windows). This is where I type all of my commands. I like this because it is easier to save an R script than to save the whole workspace (this term will come back later -- it is essentially the entire environment you are working with). Regardless, type stuff here - you can hit enter as much as you want. To run the commands you type, you can do one of two things: 1) if the command is only on one line, you can click somewhere on that line (you don't have to highlight the line or anything) and hit command+enter (on a mac). I'm not sure what the equivalent is on a PC, but I think it is probably control+enter; and 2) if the commands are on multiple lines, just highlight the entire section and hit command+enter. This window functions just like the SAS editor window and the SPSS syntax editor window. You type your commands and run them. The output shows in the lower console window. As you can see in my screenshot above, I have a lot of these 'editor' windows open (yup, I still speak in SAS language) - again, this works just like the SAS editor window.
The white box at the top right shows any dataframes (the terminology R users use often synonymously with data set), vectors (we will come back to this later, but a vector can be thought of as a variable - the variable may or may not be attached to a dataframe), functions, lists, etc. The lower window can be modified to show whatever you want. I use it pretty much solely for viewing graphics. This is a major benefit of R - the graphics are awesome.
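To make those terms concrete, here is a tiny made-up example of the kinds of objects that show up in that top-right window:

```r
# A vector - think of it as a single variable
age <- c(34, 51, 28)

# A dataframe - R's version of a data set, built here from vectors
patients <- data.frame(id = 1:3, age = age)

# A list can hold almost anything - vectors, dataframes, model output, etc.
results <- list(data = patients, ages = age)

# str() shows the structure of any object - handy for poking at lists
str(results)
```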
I forgot about one thing the other day - if you are a mac user, you will need to install Xquartz to view graphics (download here: http://xquartz.macosforge.org/landing/ ).
So that is the overview of how R studio looks. Next up, help files and some useful websites to get started. Then we will focus on getting data into R and move on to recoding variables. The bane of my existence.
Peace.
Monday, May 20, 2013
Grammar and Spelling Errors
By the way. I refuse to proofread anything on here, so there will be spelling and grammatical errors. Is it really that critical for this type of blog? If you think so, sorry. Unfortunately this will not change my plans for world domination... I mean, my plans for writing this blog. My goal is to just get info out there so everyone can use this wonderful software.
Starting at the beginning. With the dinosaurs. When we rode them around to get to McDonalds.
What version of R to use?
Well, R version 3 just came out. I wouldn't recommend it quite yet. Try to download 2.15.3 if you can find it. After you install it, I HIGHLY recommend installing R Studio after it (just google R studio download). R Studio is a much nicer interface than the basic R program. It looks nicer and makes doing most things just plain easier.
R Studio is just a front-end to R. So it actually uses the version of R that you have installed on your computer. You can't just install R Studio by itself; it won't work.
How does R work?
So R (the "base R") does a lot of stuff. It isn't like SAS and SPSS though. To do many things, you have to install various packages created by other users. We will go over this later.
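The basic pattern for packages looks like this (MASS is just an example name here - it ships with most R installs; swap in whatever package you actually need):

```r
# Install a package from CRAN once (skipped here if it is already installed)
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}

# Load the package in each new R session before using its functions
library(MASS)
```

Installing happens once per machine; loading with library() happens once per session.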
Are these packages reliable?
Yeah, I think so. The R network requires a lot of legwork for someone creating a package, so anyone willing to do this is more than likely pretty reliable. The benefit of packages is the flexibility of R. Essentially, this allows you to do whatever you want - and any new method is pretty much immediately available in R. SAS and SPSS take years to add some methods.
My favorite things about R
1) Graphics are incredible
2) Flexibility, like an olympic gymnast.
2) Yup, 2 again. R rules.
3) Once the language "clicks" with you, it becomes pretty intuitive. This can take some time though.
4) User created packages allow for immediate implementation of all kinds of useful functions.
5) Since I mentioned functions - you can write functions in R. A function is just a fancy word for a computer program that does whatever you want. We will talk about this in depth later. I may ask my pal Dr. Kelley to write this section (I may have to start calling him Dr. Special K... it has a better ring to it).
6) It's free
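Since functions came up in point 5, here is a minimal sketch of what writing one looks like (the function name and its job are made up purely for illustration):

```r
# A tiny user-written function: percent of missing values in a vector
pct_missing <- function(x) {
  100 * sum(is.na(x)) / length(x)
}

pct_missing(c(1, NA, 3, NA))   # returns 50
```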
Why I hate R
Yup, daily I pretty much hate R too. It's a love-hate thing.
1) It can be very complicated if you want to do fancy stuff
2) It has a decent learning curve (unless you learn how to use it from me)
3) You need a bunch of packages to do a lot of stuff - it can be hard to remember which package does what.
4) All of the documentation online is written by people that make everything overly complicated. This is extremely frustrating.
Take that. Next posting will be how to read in your data. The most important part.
First Post - What to Expect from this Blog
Who are you?
Well pals, my name is Timothy Wiemken. I am an Assistant Professor of Medicine in the University of Louisville Division of Infectious Diseases and am the Assistant Director of Epidemiology and Biostatistics at the University of Louisville Clinical and Translational Research Support Center. Fancy titles huh? Does it mean anything to you? Probably not. Check out our center at www.ctrsc.net.
Why should I listen to what you have to say - I mean, the blog is pirate stats???
Well har-de-har smarty pants. I learned statistical computing using SPSS and Epi Info a number of years ago. I was never quite pleased with either - there were too many clicks in SPSS, and Epi Info - well, 'nuff said. After a while of using SPSS, I learned the SPSS syntax, which eliminated the clicking and re-clicking, but it just was not as flexible as I wanted it to be. I later learned SAS when I was getting my master's at Saint Louis University from an amazing professor, Dr. Q. John Fu. I didn't get proficient in SAS until I used it for a number of years at my first job in infectious diseases, as a data analyst. SAS was great, but it, like SPSS and most other packages, was expensive. SAS is also not particularly flexible unless you are awesome at macro coding... which I am not. R (AKA: ARRRRR, hence the pirate stats) seemed to eliminate these issues. R is the most flexible of all the statistical software I have used, and is completely free. Download it at www.r-project.org.
Why do I need a tutorial on R?
The R language is more of a computer science language, whereas SAS and SPSS are more biostatistical. At least this is what my computer science pal tells me. I am definitely not a computer scientist (not because I don't want to be... mainly because I don't want to go to school again. One PhD is enough). After thinking like SPSS and SAS for so long, it was difficult to begin to use R. One of my friends, Dr. Guy Brock at the UofL School of Public Health was instrumental in getting me started. Dr. Rob Kelley, my computer science pal (also an Assistant Prof in Infectious Diseases) has also been extremely helpful in allowing me to think more like a computer scientist. Thinking differently is critical to begin to use R - so those of you who already know another language may find it difficult to switch (as I did).
So can you actually teach me how to use R?
Maybe - particularly if you aren't a total moron. Actually, I think anyone can learn. How my blog may be different from others is as follows: 1) I am not a computer scientist and tend to think more like an epidemiologist -- I don't think theoretically and focus on strictly practical applications of biostatistics and data management; 2) I was proficient in other languages (SPSS and SAS) before and had to re-learn how to think biostats for the R language -- I can provide some comparisons to other packages, which I think makes it easier to switch (it did for me); 3) most of the other blogs are overly complicated -- for practical application, you don't have to know all this extra crap -- I will focus on simple application to get you through what you will actually need to use R for any basic to intermediate biostatistical analysis (99% of the application that anyone actually reading this blog will want to see).
So take that -- read on. If you like it, let me know. If you don't, go to Facebook to complain like the rest of the world. If you like it, enjoy being able to use R -- it is totally sweet, free, and amazing. Did I mention it was sweet? You may get diabetes from it.
--- Timothy Wiemken, PhD MPH CIC (certified in infection control and epidemiology)