OK, sorry I haven't posted in a couple of days. This work week was completely insane. Oh well. It's Saturday, and I'm still going to be working on a few things, but I wanted to get this post up for anyone who is actually reading this.
Today is a big day - recoding variables. If you do any type of research, recoding is something you inevitably have to do. When I first started using R, other R users used to tell me "Oh, don't use R for recoding, use something else - R is too difficult for recodes". Once I figured out how to recode stuff, I found this to be completely false. It is pretty easy once you get the hang of it. There are only a couple of things you need to be very careful with, the main thing being what to do with missing data.
I am going to focus on ifelse statements today. (A regular if/else is similar, but ifelse works just like the IF function does in Excel.) Since almost everyone using R is going to be at least reasonably familiar with Excel, I will focus on this. Maybe in the next post, I'll look at some other methods.
Here is the basic structure of ifelse:
dataset$new.variable <- ifelse(dataset$old.variable == "some value", "what you want the new variable to be if the prior logical statement is true", "what you want the new variable to be if the prior logical statement is false")
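To make that template concrete, here is a minimal sketch on a made-up vector. The names old.variable and new.variable and the values in them are invented for illustration and are not from the water.csv data.

```r
# Hypothetical vector, made up for illustration
old.variable <- c("yes", "no", "yes", "maybe")

# ifelse checks each element in turn: it returns the second argument
# where the test is TRUE and the third argument where it is FALSE
new.variable <- ifelse(old.variable == "yes", "match", "no match")

print(new.variable)
```

Every element gets its own answer, which is exactly how dragging an IF formula down a column behaves in Excel.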
Let us recode the "number_of_glasses" variable in our water.csv dataset. Pretend that we are only interested in those days where my wife drank at least 10 glasses of water. We want to create a new variable that is 1 when she drank 10 or more glasses, and 0 when she drank fewer than 10. Here is the code:
data$water.10<-ifelse(data$number_of_glasses>9,1,0)
Pretty simple. From left to right in words:
1) data - we are going to put our new variable back into the same dataset that contains all of our other data. If you don't put data here, and just put the variable name, the new variable will be out on its own, not contained in your dataset! This can be useful at times, but here we want it to be in our regular dataset.
2) the new variable will be called water.10 (separated from the dataset by a dollar sign, as always)
3) gets!
4) ifelse - this is a function built into the base R package
5) data$number_of_glasses>9 - we are taking the number_of_glasses variable from the dataset "data" and making a logical value that is true when this variable is greater than 9 (i.e. greater than or equal to 10). You could also do this: data$number_of_glasses>=10 . I tend to use the lower or upper number and just the greater or less than sign to incorporate the "or equal to" part. For whole-number counts like these it doesn't matter which one you use; if the variable could take fractional values (say 9.5), use >=10 instead.
6) 1 - if the logic is true (i.e. the number of glasses is greater than or equal to 10 [or as I put it, greater than 9]), the new variable, water.10, in the dataset "data" will take on a value of 1.
7) 0 - if the logic is false (i.e. the number of glasses is less than 10 [or NOT greater than or equal to 10]), water.10 in the dataset "data" will take on a value of 0.
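The walk-through above can be tried on a small stand-in dataset. This is only a sketch: the data frame below is made up (the real water.csv may have different columns and values), and water.10.alone is a hypothetical name used to show what happens when you leave off the dataset.

```r
# Made-up stand-in for water.csv; the real file's contents may differ
data <- data.frame(day = 1:5,
                   number_of_glasses = c(8, 10, 12, 9, 11))

# 1 when she drank 10 or more glasses, 0 otherwise
data$water.10 <- ifelse(data$number_of_glasses > 9, 1, 0)

# For whole-number counts, >= 10 gives the same answer as > 9
same <- identical(data$water.10,
                  ifelse(data$number_of_glasses >= 10, 1, 0))

# Without "data$" on the left, the result is a free-standing vector,
# not a column of the data frame
water.10.alone <- ifelse(data$number_of_glasses > 9, 1, 0)

print(data)
print(same)
```

Printing data shows the new water.10 column sitting next to the original variables, while water.10.alone lives on its own in the workspace.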
R also has a regular if ... else ... control structure. It tests a single condition rather than every row of a variable, so I don't ever use it for recoding, but I can see some potential uses. For more info, google "R Control Structures".
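For comparison, here is a rough sketch of a plain if/else next to ifelse. The values and the names glasses.today, flag, glasses and flags are made up for illustration.

```r
# A plain if/else tests one value at a time
glasses.today <- 12   # made-up single value
if (glasses.today > 9) {
  flag <- 1
} else {
  flag <- 0
}

# ifelse tests every element of a vector in one call,
# which is what you want when recoding a whole variable
glasses <- c(8, 12, 10)   # made-up values
flags <- ifelse(glasses > 9, 1, 0)

print(flag)
print(flags)
```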
_________________________________
Ok this part is REALLY important!
If you have any missing data in the variable you want to recode, the statement above will carry over the missing values. For example, if on day 2 we didn't have the number of glasses of water she drank, the new variable, water.10, would also have a missing value for day 2 based on the code above. Typically you want this to happen, so this is a good thing.
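Here is a small sketch of that carry-over, using a made-up data frame with a missing value on day 2 (the values are invented, not taken from water.csv).

```r
# Made-up data with a missing count on day 2
data <- data.frame(day = 1:4,
                   number_of_glasses = c(8, NA, 12, 10))

# NA > 9 is NA, so ifelse returns NA for that row
data$water.10 <- ifelse(data$number_of_glasses > 9, 1, 0)

print(data)

# Count the values, including the missing one
print(table(data$water.10, useNA = "ifany"))
```

The missing count on day 2 stays missing in water.10 rather than being silently turned into a 0 or a 1.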
If you want any missing values to be recoded with your logic, you need to do something else. I am not an expert in this area, but I can give you one example.
If your logic tests whether the variable is equal to some value, you can use the %in% operator instead of the double equals "==". For example:
If you were interested in making a variable where the value was 1 when she drank exactly 10 glasses of water and 0 when she drank any other number (again, missing values will be carried over as missing into the new variable):
data$water.10<-ifelse(data$number_of_glasses==10,1,0)
If you wanted those missing values to be zero, you would change this statement to:
data$water.10<-ifelse(data$number_of_glasses %in% 10,1,0)
Looks strange? I agree.
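A side-by-side sketch may make it look less strange. The vector below is made up, and with.equals and with.in are hypothetical names. The difference comes from the test itself: NA == 10 is NA, while NA %in% 10 is FALSE.

```r
# Made-up counts with one missing value
glasses <- c(8, NA, 10, 12)

# "==" returns NA for the missing value, so ifelse returns NA too
with.equals <- ifelse(glasses == 10, 1, 0)

# %in% returns FALSE for the missing value, so ifelse returns 0
with.in <- ifelse(glasses %in% 10, 1, 0)

print(with.equals)
print(with.in)
```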
I'm not sure at this point what the best way is to recode missing values if you are using greater than or equal to.
The best workaround I can think of is:
data$water.10<-ifelse(data$number_of_glasses>9 | is.na(data$number_of_glasses),1,0)
This would give a 1 to water.10 where number_of_glasses was greater than or equal to 10 OR where it was missing. I'll ask my pal Rob if he has a better approach. I hate having to use AND (&) and OR (|) all of the time.
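Here is a sketch of that workaround on a made-up vector, plus one possible variation that sends missing values to 0 instead of 1. The variation is my own assumption of how you might do it, not something from Rob, and the names missing.as.one and missing.as.zero are hypothetical.

```r
# Made-up counts with one missing value
glasses <- c(8, NA, 10, 12)

# Missing values become 1: NA | TRUE evaluates to TRUE
missing.as.one <- ifelse(glasses > 9 | is.na(glasses), 1, 0)

# Missing values become 0: FALSE & NA evaluates to FALSE
missing.as.zero <- ifelse(!is.na(glasses) & glasses > 9, 1, 0)

print(missing.as.one)
print(missing.as.zero)
```

Either way you still need an AND or an OR, so this does not get around the extra typing.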
Have a good weekend! Peace to my pals in Oklahoma.