Tuesday, June 25, 2013

All variable names to lower case - automatically!

Here is the code in two pieces.  I like to do it in two steps instead of one because I tend to reuse the vector with the variable names for other things:

var.names<-tolower(colnames(data))
colnames(data)<-var.names


This is assuming your dataset is called "data"

Essentially, it takes the column names (e.g. variable names), puts them into a vector called var.names, then you assign that vector to the column names of the dataset "data".

The only issue I have run into is if you have two variables with the same name, but one in upper case and one in lower case, in the same dataset.  After the conversion, it can end up deleting one of them.  You are best to rename the duplicate variable first... or better yet, figure out why the heck you have two variables with the same name in the same dataset!
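If you want to check for that collision before you rename anything, here is a minimal sketch.  The data frame and column names are made up for illustration -- note "BMI" and "bmi" will collide once everything is lower case:

```r
# Toy data frame standing in for your dataset "data" (made-up columns)
data <- data.frame(Age = c(30, 41), BMI = c(22.1, 27.8), bmi = c(1, 2))

var.names <- tolower(colnames(data))

# Check for duplicates BEFORE assigning, so you can rename first if needed
any(duplicated(var.names))   # TRUE here, because of BMI/bmi

colnames(data) <- var.names
```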

Take that!


Monday, June 17, 2013

Exporting lists to a usable file

If you have a lot of data stored in a list (e.g. output from some statistical test), sometimes you just want to export it to a text file to save it or print or whatever.

It can be difficult.

It doesn't have to be.

do this:

sink("/Users/timothywiemken/Desktop/output.txt")

function

sink()

run those three lines of code, where:
"/Users..." is the output location for the file (I put everything on my desktop and save it elsewhere later)

function is the actual function you want to run.  For example:  chisq.test(data$number_of_glasses, data$heavy_metal_music)

and

sink() closes down the sink function

This will export a text file with whatever name you want (here, output.txt) to your desktop (or whatever location you want).

The output file will contain all of the values typically printed in the R console window... This function just "sinks" that output to a text file.  It will be in the same format as in the console window.
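Here is the whole pattern as a self-contained sketch.  I've swapped in tempfile() and a made-up summary() call so it runs anywhere -- use your own path and your own function in real life:

```r
# Sink console output to a file -- tempfile() stands in for your own path,
# e.g. something on your desktop
out.file <- tempfile(fileext = ".txt")

sink(out.file)
summary(c(1, 5, 5, 10, 15))   # any function whose printout you want captured
sink()                        # close the sink -- don't forget this part!

readLines(out.file)           # the captured console output, line by line
```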

Very useful!

Saturday, June 15, 2013

Yup

Well, no one has commented yet so I have no idea if this helps anyone.  It does take a lot of time to do the blog, so I probably will not update it much unless I get some traffic.   I will continue to update it though -- probably mostly with new things I learn... the downside is that some of the code may be more advanced.  Sorry if anyone is actually using this blog...  if something changes I may come back to the basics.

Wednesday, June 12, 2013

Descriptive statistics with R, Part 2 -- 2x2 contingency tables -- and a quick taste of the Chi squared test.

Yo yo yo.  I'm back.

2x2 tables are critical, so you need to learn how to make them quickly in R.

Luckily it is pretty easy.

Just do your regular table statement like last time, but add in your second variable a la:

table(data$slept_poorly, data$bad_day_at_work)

This spits out the cross tabulation (frequencies) of those two variables:



The first variable you put in the table statement (here, slept_poorly) is on the rows, and the second variable (here, bad_day_at_work) is on the columns.  They are both coded 1 and 0, so those are the row and column "names".  You can see there were a total of 5 days where she slept poorly AND had a bad day at work (the bottom right value, where the 1's intersect).

Sometimes it is confusing as to what is the row and what is the column because the table statement doesn't actually give you the labels as to what is what.  You can use another function, xtabs (also built into the base R package) to get the labels.  The structure is similar but not exactly the same:

xtabs(~data$slept_poorly+data$bad_day_at_work)

Ok, so the xtabs part takes the place of "table" that you did before.

Then you need the parentheses, followed by the "tilde"... e.g. ~ (shift of the backtick key on the upper left of your keyboard).  Next comes the variable you want in the rows, followed by a plus (+), then the column variable.  You end up with this in the output:


Nice huh?

Ok, you might say... I don't care about frequencies... I want percentages.  Hold your horses.

First, put the results into a new vector like we have done before (you can do this with table or with xtabs) -- call it whatever you want.  I'll call it frequencies just for the heck of it:

frequencies<-table(data$slept_poorly, data$bad_day_at_work)

Next, do the prop.table statement, just like we did when we were getting the proportions for just one variable.  The "variable" you are getting the proportions of will be the vector you just created.  You need one more step though... You have to define whether you want the "row" or "column" percentages.  The row percentage will be the cell value divided by the sum of that row.  The column percentage is the cell value divided by the sum of that column.  A 1 means give me the row percentages, and a 2 means give me the column percentages.  So to get the column percentages for the vector "frequencies":

prop.table(frequencies,2)



This means that, of all of the days my wife had a bad day at work (the column representing a "1"), 50% of the time she also slept poorly (the row representing the "1").  You can do any other arithmetic operations you want, like we did before (e.g. multiply by 100 to get the standardized percent as opposed to just the decimal proportion).
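To see the row and column versions side by side, here is a runnable sketch with made-up 0/1 vectors standing in for the real sleep and work variables:

```r
# Made-up 0/1 data standing in for data$slept_poorly and data$bad_day_at_work
slept_poorly    <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)
bad_day_at_work <- c(1, 0, 0, 0, 1, 1, 0, 0, 1, 0)

frequencies <- table(slept_poorly, bad_day_at_work)

prop.table(frequencies, 1)        # row percentages: cell / row total
prop.table(frequencies, 2)        # column percentages: cell / column total
prop.table(frequencies, 2) * 100  # same thing, as percents
```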

One final thing.  This makes getting the P value for the Chi-squared test REALLY easy.

Do this:

chisq.test(frequencies)



P value is 0.6481!

You will see a warning that the Chi-squared approximation may be incorrect, because not all of the expected values are greater than 5.  So you may want to do Fisher's exact test instead:

fisher.test(frequencies)


P value is pretty close to the same.

The main benefit of R starts to show here.  By putting things into vectors, you can easily manipulate them later.  You can put ALL of the results of the chi squared test into one vector for example:

chisquared.results<-chisq.test(frequencies)

This puts all of those pieces of data from the chisquared test into the vector chisquared.results...

Do that, then check the structure of that vector:

str(chisquared.results)

You will see you get a huge printout of a bunch of crap.  This is a new object type, called a LIST.  A list is like a vector that can hold multiple data types.  You may have numbers, text, etc. all in one object!  This is pretty sweet, as you will see in the future.

Here is your printout:


You see it is a list of 9 items.  (List of 9).

For each element in the list, you will see the name of that element.  Each new one starts with a dollar sign ($).  So, this list contains: statistic, parameter, p.value, method, data.name, observed, expected, residuals, and stdres (all of the values with a $ in front of them).

If you just want to see one of them, you can call it out by typing the statement into the console:

chisquared.results$p.value

The dataset containing the p.value (actually the list, but you are treating it like a dataset because it is kind of like a mini dataset) is the chisquared.results LIST you created with the chisq.test(frequencies) statement.  You want to see just the p-value contained in that list, so you separate the dataset from the variable with a dollar sign, as you would with anything else.  Voilà, you now have a printout of just the p-value.

The cool thing is you can do something like:

pvalue<-chisquared.results$p.value

which will put just the P-value into its own vector.  You can then put that p-value into figures, automatically.  You will probably do this all of the time in the future, when we get to graphing things.  It lets you be able to automate almost anything.  It is really, really sweet.  Trust me.
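Here is the whole chain as a runnable sketch, with made-up 0/1 data standing in for my real variables (expect the small-expected-values warning on data this tiny):

```r
# Made-up data standing in for the sleep/work variables
slept_poorly    <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)
bad_day_at_work <- c(1, 0, 0, 0, 1, 1, 0, 0, 1, 0)
frequencies <- table(slept_poorly, bad_day_at_work)

# Stash the whole test result, then pull out just the piece you want
chisquared.results <- chisq.test(frequencies)
pvalue <- chisquared.results$p.value

round(pvalue, 4)   # ready to drop into a figure label or a report
```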

Next up, mean, median, standard deviation, and interquartile range....


Tuesday, June 11, 2013

Back home - hope to have a new post tomorrow.

I refused to pay the $10/day for internet in my hotel.  Otherwise, I would have posted something already.  Good conference though...

Thursday, June 6, 2013

Wednesday, June 5, 2013

Basic Descriptive Statistics with R - Part 1: Frequencies and Percentages

Good afternoon pals.

Today we are going to look at frequencies and percentages for categorical variables.  If I have some time tomorrow, we will do contingency tables with R (e.g. 2x2 tables).  We will move on to mean, median, IQR and standard deviation after that.


So doing tables is really easy with the built in table function.

If you want the count (frequencies) of number_of_glasses, just do the following:

table(data$number_of_glasses)

R will spit out the results.  The number on the top is the actual value for number_of_glasses.  The number on the bottom is the number of cases (e.g. the number of rows) that contain that particular value:



You can see that one day, my wife drank 1 glass of water, and on three days, she drank 5 glasses of water, etc.  One day she had 15 glasses.

This table function is a good way to screen your data to see if what is entered makes sense... it is much better than viewing the data in the actual database.  If you have strange values, they will be clearly evident.  Also, if you have blanks in a variable that is a text (or character) variable, the first value will not have a number on the top (it's a blank), but it will have a number below.  This means that you have that many blanks in that variable.

If your variable is numeric like the number_of_glasses variable and you want to know how many missing values there are, you can do the following:

table(is.na(data$number_of_glasses))

The result will be numbers above the text "TRUE" and "FALSE".  Any missing values will show up above the "TRUE" text.  If there are no missing values, you won't even have the "TRUE" listed.

This works great for counts.  Now for the best thing R can do - put ANY results into a new variable, so you can manipulate your results as you please with other functions:

counts<-table(data$number_of_glasses)

If you do that, the data shown in my image above will be input into a new variable called "counts".  You can change the variable to whatever you want.  I always just use counts, because I like it.  This variable will not be in the dataset "data", it will be out on its own.

If you type the word counts in your R script and run it, you will see the output, just like we did above.  The benefit of doing this is that now you can get the proportions instead of just the counts.  To get the proportions, you need to run the counts line first, then do:

prop.table(counts)


You can now see that 5% of the time, she had 1 glass of water, and 15% of the time she had 5 glasses of water.

You can also do some arithmetic stuff to this... for example, if you don't like proportions and converting the decimals to percentages in your head, you can multiply them by 100 in the same statement:

prop.table(counts)*100
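Putting the whole post together, here is a runnable sketch with a made-up number_of_glasses vector (including one missing day):

```r
# Made-up stand-in for data$number_of_glasses, with one missing day
number_of_glasses <- c(1, 5, 5, 5, 8, 10, 15, NA)

table(number_of_glasses)          # frequencies (note: NA is dropped here)
table(is.na(number_of_glasses))   # the TRUE column counts the missing values

counts <- table(number_of_glasses)
prop.table(counts) * 100          # percentages instead of proportions
```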

Not bad ey?  I agree.

Peace.

Monday, June 3, 2013

Another way of recoding variables - without ifelse

Ok - hope all is well with everyone today!

I wanted to start by letting everyone know that you can use ifelse statements within other ifelse statements to recode a multi-category variable, much like you can in SAS.

For example, you may have the number_of_glasses variable, which you want to recode into three categories, 0, 1-5, and >5 glasses per day, coded 1,2,and 3, respectively.

You can do this by using multiple ifelse:

data$glasses.categories<-ifelse(data$number_of_glasses==0, 1,
ifelse(data$number_of_glasses>0 & data$number_of_glasses<6, 2, ifelse(data$number_of_glasses>5,3,NA)))

The breakdown of this is essentially saying: if the number of glasses of water is equal to zero, give the new variable, glasses.categories, a 1; otherwise, do the next ifelse statement, which says if the number of glasses is greater than zero (e.g. starts at 1) AND is also less than 6 (e.g. 5 and below), give the new variable glasses.categories a 2; otherwise do the final ifelse statement, which says give the new variable glasses.categories a 3 if the variable number_of_glasses is greater than 5.  The final "else" part says if it doesn't meet any of these criteria, give it an NA, which is R for blank or missing.  You always need that last "else" part.  Since my criteria should be exhaustive (e.g. all of the possible values of number_of_glasses are covered by my if statements), the only other possibility is a missing value, so the final NA carries over those missing values.
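Here it is as a runnable sketch on made-up data, with all of the parentheses balanced (the NA day shows the carry-over in action):

```r
# Made-up glasses data, including a missing day
data <- data.frame(number_of_glasses = c(0, 2, 5, 7, NA))

data$glasses.categories <- ifelse(data$number_of_glasses == 0, 1,
                            ifelse(data$number_of_glasses > 0 & data$number_of_glasses < 6, 2,
                             ifelse(data$number_of_glasses > 5, 3, NA)))

data$glasses.categories   # 1 2 2 3 NA -- the missing day stays missing
```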

There is another way to do this as well, without using these ifelse statements (actually there are multiple ways, but I'll show you the one I use most often).

To do the same recode as above, I'll first start by making a new variable that is equal to 1.

data$glasses.categories<-1

So all values in the dataset for this variable will be equal to 1.

Next, recode over these 1's with your first criterion - if number_of_glasses falls between 1 and 5, give it a 2.  You can use a subset of your data by using the square brackets, like so:

data$glasses.categories[data$number_of_glasses>0 & data$number_of_glasses<6]<-2

This tells R to look at the glasses.categories variable in the dataset "data", but only those rows or cases where the number_of_glasses variable is between 1 and 5 (inclusive), and give those a 2.  This will recode OVER TOP of the 1's that are already in that variable for those cases... the values that were created in your first step, where you made everything equal to 1.

Finally, recode over the 1's in that variable with your final criteria.

data$glasses.categories[data$number_of_glasses>5]<-3


You will end up with the same values as you did using the ifelse.  So the final block of code would be:

data$glasses.categories<-1
data$glasses.categories[data$number_of_glasses>0 & data$number_of_glasses<6]<-2
data$glasses.categories[data$number_of_glasses>5]<-3

******** VERY IMPORTANT NOTE ***********

You need to be careful using this final method in the event you have missing values in your dataset. I would highly recommend one more line of code if your variable has any missing values:

data$glasses.categories[is.na(data$number_of_glasses)]<-NA

This will carry over any missing values from your old variable to your new variable.
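Here is the full block as a runnable sketch on made-up data.  One tweak of mine: I wrapped each condition in which(), because a logical index that contains NA will throw an error in a subscripted assignment ("NAs are not allowed in subscripted assignments") -- which() drops those NA positions:

```r
# Made-up glasses data with a missing day.  which() keeps the NA row from
# choking the assignments; the last line then carries the missing value over.
data <- data.frame(number_of_glasses = c(0, 1, 3, 5, 6, 10, NA))

data$glasses.categories <- 1
data$glasses.categories[which(data$number_of_glasses > 0 & data$number_of_glasses < 6)] <- 2
data$glasses.categories[which(data$number_of_glasses > 5)] <- 3
data$glasses.categories[is.na(data$number_of_glasses)] <- NA

data$glasses.categories   # 1 2 2 2 3 3 NA
```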


Have fun!  Next up... I don't know yet.  I'm getting tired of recoding, so we may move on to basic contingency tables and measures of central tendency (mean, median) as well as variation (standard deviation, interquartile range).

Peace!




Saturday, June 1, 2013

I hope this blog is easy to understand.

I am incredibly frustrated today, searching for useful information on log-linear regression models online.  Yeah, there are tons of articles explaining them.  Are any of them useful or understandable?  Not a one.  WTF statisticians.  Why do you insist on making everything difficult?  Are you trying to ensure your job security?  This stuff doesn't have to be difficult.  I am confident in this. 

After all, as Einstein stated,

“If you can't explain it to a six year old, you don't understand it yourself.”


When he said that, I think what he meant was if you can't explain it to a six year old and have them understand it, at least reasonably well, you don't understand it yourself.  Of course, you can probably explain anything to a six year old - that doesn't mean they understand it.

Yes,  I am acting as the six year old today.  Take that.

Recoding variables in R with if-then or if-else statements

Ok, sorry I haven't been posting in a couple of days.  This work week was completely insane.  Oh well.  It's Saturday, and I'm still going to be working on a few things, but I wanted to get this post up for anyone who is actually reading this.

Today is a big day - recoding variables.  If you do any type of research, recoding is something you inevitably have to do.  When I first started using R, other R users used to tell me "Oh, don't use R for recoding, use something else - R is too difficult for recodes".  Once I figured out how to recode stuff, I found this to be completely false.  It is pretty easy once you get the hang of it.  There are only a couple of things you need to be very careful with, the main thing being what to do with missing data.

I am going to focus on ifelse statements today (if-then is very similar, but ifelse works just like it does in Excel).  Since almost everyone using R is going to be at least reasonably familiar with Excel, I will focus on this.  Maybe in the next post, I'll look at some other methods.

Here is the basic structure of ifelse:

dataset$new.variable<-ifelse(dataset$old.variable=="some value", "what you want the new variable to be if the prior logical statement is true", "what you want the new variable to be if the prior logical statement is false")

Let us recode the "number_of_glasses" variable in our water.csv dataset.  Pretend that we are only interested in those days where my wife drank 10 or more glasses of water in a day.  We want to create a new variable that is 1 when she drank 10 or more glasses, and a 0 if she drank less than 10.  Here is the code:

data$water.10<-ifelse(data$number_of_glasses>9,1,0)

Pretty simple.  From left to right in words:

1) data - we are going to put our new variable back into the same dataset that contains all of our other data.  If you don't put data here, and just put the variable name, the new variable will be out on its own, not contained in your dataset!  This can be useful at times, but here we want it to be in our regular dataset.
2) the new variable will be called water.10 (separated from the dataset by a dollar sign, as always)
3) gets!
4) ifelse - this is a function built into the base R package
5) data$number_of_glasses>9 - we are taking the number_of_glasses variable from the dataset "data" and making a logical test: is this variable greater than 9 (e.g. greater than or equal to 10)?  You could also do this:  data$number_of_glasses>=10 .  I tend to use the lower or upper number and just the greater-than or less-than sign to incorporate the "or equal to" part.  It really doesn't matter which one you do.
6) 1 - if the logic is true (e.g. the number of glasses is greater or equal to 10 [or as I put it, greater than 9]), the new variable, water.10 in the dataset "data" will take on a value of 1.
7) 0 - if the logic is false (e.g. the number of glasses is less than 10 [or NOT greater or equal to 10]), the value for water.10 in the dataset "data" will take on a value of 0.
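The breakdown above, runnable end to end on made-up data:

```r
# Made-up stand-in for the water dataset
data <- data.frame(number_of_glasses = c(3, 9, 10, 15))

data$water.10 <- ifelse(data$number_of_glasses > 9, 1, 0)
data$water.10   # 0 0 1 1 -- a 1 for each day with 10 or more glasses
```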

I think you can also use R's regular if/else control structure: if the old variable equals some value, set the new variable to one thing, else set it to something else.  I don't ever use this structure, but I can see some potential uses.  For more info, google "R Control Structures".

_________________________________

Ok this part is REALLY important!

If you have any missing data in the variable you want to recode, the statement above will carry over any missing values.  For example, if on day 2 we didn't have the number of glasses of water she drank, the new variable, water.10, would also have a missing value for day 2 based on the code above.  Typically you want this to happen, so this is a good thing.

If you want any missing values to be recoded with your logic, you need to do something else.  I am not an expert in this area, but I can give you one example.

If your logic contains a value that is equal to some value, you can use the %in% operator, instead of the double equals "==".  For example:

If you were interested in making a variable where the value is 1 when she drank exactly 10 glasses of water and 0 when she drank any other number (again, missing values will be carried over as missing into the new variable):

data$water.10<-ifelse(data$number_of_glasses==10,1,0)

If you wanted those missing values to be zero, you would change this statement to:

data$water.10<-ifelse(data$number_of_glasses %in% 10,1,0)

Looks strange?  I agree.
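A quick side-by-side on made-up data makes the difference obvious (one missing day included):

```r
# Made-up data with one missing day
number_of_glasses <- c(8, 10, NA, 12)

ifelse(number_of_glasses == 10, 1, 0)     # 0 1 NA 0 -- the NA carries over
ifelse(number_of_glasses %in% 10, 1, 0)   # 0 1 0 0  -- the NA becomes a 0
```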

I'm not sure at this point how to recode missing values if you are using greater than or equal to...

The best workaround I can think of is:

data$water.10<-ifelse(data$number_of_glasses>9 | is.na(data$number_of_glasses),1,0)

this would give a 1 to water.10 where the number_of_glasses was greater than or equal to 10 OR was missing.  I'll ask my pal Rob if he has a better approach.  I hate having to use AND (&) and OR (|) all of the time.

Have a good weekend!  Peace to my pals in Oklahoma.