Pirate Stats - R for Beginners: May 2013

Wednesday, May 29, 2013

Basic R Operators

Good morning pirates!

Here is a quick overview of the most common operators. By operators I mean things like "equal", "not equal", "greater than", etc.

This will be a rough one.

equal: ==
not equal: !=
greater than: >
less than: <
greater or equal to: >=
less or equal to: <=

I think that is about it.

So how do you use these? Let us take the subset function from the previous post:

Using equal (double equal sign):
data.metal<-subset(data, data$heavy_metal_music==1)
That would give you all of the cases in "data" where heavy_metal_music is equal to 1 (e.g. "yes")

Using not equal:
data.metal<-subset(data, data$heavy_metal_music!=1)
That would give you all of the cases in "data" where heavy_metal_music is not equal to 1 (e.g. "no" or anything else entered into that cell -- this is where data quality becomes important!). Note that subset drops cases where heavy_metal_music is missing (NA) in both of these examples.

I won't give an example of each of the other operators. As you can probably tell by now, you just swap the operator in the statement for one of the operators listed above.
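For anyone who wants to see it anyway, here is a minimal sketch of the other operators in action. The toy data frame and its values are made up for illustration; they are not the real water dataset. Inside subset you can also write the variable name without the data$ prefix, since subset looks the name up in the dataset you give it.

```r
# Made-up toy data, for illustration only
data <- data.frame(
  day = 1:6,
  number_of_glasses = c(1, 5, 3, 2, 4, 1)
)

# Greater than: days with more than 3 glasses
many <- subset(data, number_of_glasses > 3)

# Less or equal to: days with 2 glasses or fewer
few <- subset(data, number_of_glasses <= 2)

# Greater or equal to: days with at least 3 glasses
some <- subset(data, number_of_glasses >= 3)

many$day  # 2 5
few$day   # 1 4 6
some$day  # 2 3 5
```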

Yes, my font changed. I copied the data.metal subset statement from a prior post and I decided to not change the font back after the paste. Take that! I'm such a rebel.
__________________________________________________________

Ok, so how do you specify if the data are missing or not missing?

As you may know, R uses NA to mark missing data. When you read in a file, R puts NA into the empty cells of columns that otherwise hold numbers (we will get back to data types later). If a column holds text, or any cell in it has a non-numeric character (hyphen, slash, colon, etc.), R reads the whole column as text and leaves its empty cells truly blank.

To figure out what variable type you have, you can type the following:
class(data$variable)
Of course, you would change the word 'variable' to the name of the variable whose type (i.e. class) you want to see. R treats integers and other numbers the same way -- NA is missing. Factors can hold NA too, but if a text column was read in as a factor, its blank cells usually show up as an empty "" level rather than NA.
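Here is a small sketch of the blank-cell behavior described above. The three-row CSV text is made up, and stringsAsFactors = FALSE is set explicitly so the text column comes in as character regardless of your R version.

```r
# A tiny made-up CSV; day 2 has blanks in both other columns
csv_text <- "day,heavy_metal_music,slept_poorly
1,1,yes
2,,
3,0,no"

data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(data$heavy_metal_music)  # "integer"
class(data$slept_poorly)       # "character"

# The blank in the number column became NA...
is.na(data$heavy_metal_music[2])  # TRUE

# ...but the blank in the text column is an empty string
data$slept_poorly[2] == ""        # TRUE
```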

So if you want to keep all of the cases in your subset where heavy_metal_music is missing, you can do the following (again, this only works for variables that are numeric -- because R puts NA in the blank cell):

data.metal<-subset(data, is.na(data$heavy_metal_music))

The is.na(variable) part tells R to keep the cases where heavy_metal_music is NA (or missing).

To specify NOT MISSING:

data.metal<-subset(data, !is.na(data$heavy_metal_music))

Might look familiar -- just like not equal uses the exclamation mark for NOT, is.na uses it the same way.
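A quick sketch with made-up values shows how the missing and not-missing subsets split up a dataset, and also that the == and != subsets quietly leave the NA cases out.

```r
# Made-up toy data with two missing values
data <- data.frame(
  day = 1:5,
  heavy_metal_music = c(1, 0, NA, 1, NA)
)

missing_rows <- subset(data, is.na(heavy_metal_music))
present_rows <- subset(data, !is.na(heavy_metal_music))

nrow(missing_rows)  # 2
nrow(present_rows)  # 3

# Comparisons against NA are neither TRUE nor FALSE, so subset drops those rows
metal     <- subset(data, heavy_metal_music == 1)
not_metal <- subset(data, heavy_metal_music != 1)

nrow(metal)      # 2
nrow(not_metal)  # 1
```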

_____________________________________________________________

So what if the variable is a 'character' variable (i.e. it has text in it)?

data.metal<-subset(data, data$heavy_metal_music=="")

The pair of double quotes with nothing in between specifies a textual "blank" in the cell. You could say:

data.metal<-subset(data, data$heavy_metal_music!="")
to mean not equal blank.

You have to be careful here though, as sometimes a space in that cell (for example, if you had data in a cell in Excel, but hit the space bar to clear it out instead of backspace or delete) will not be caught by comparing against "". In this case, you need to compare against " " (double quotes with a space in between). If you have two spaces, you need double quotes with two spaces in between, etc.

It's kind of a pain, so make sure your data are clean before you put them into R.
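If the data are already in R, one way around counting spaces is to trim the whitespace before comparing. This is only a sketch with made-up values, and it assumes your R has the trimws function (it is part of base R from version 3.2.0 onward).

```r
# Made-up toy data: one true blank, one single space, one double space
data <- data.frame(
  day = 1:4,
  slept_poorly = c("yes", "", " ", "  "),
  stringsAsFactors = FALSE
)

# trimws() strips leading and trailing whitespace, so every kind of
# "blank" collapses to the empty string
blank <- trimws(data$slept_poorly) == ""
sum(blank)  # 3

# Optionally turn those blanks into real missing values
data$slept_poorly[blank] <- NA
sum(is.na(data$slept_poorly))  # 3
```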

Keep it realz.

Tuesday, May 28, 2013

Welcome back - let's learn how to specify variables in your code and subset!

'twas a long weekend that did not feel so long. Regardless, I'm back to work so let's talk about an important function, subsetting. First, we have to be confident in our abilities to call out specific variables in our data.

Since you can be working with multiple datasets at once in R, you always need to specify the dataset and the variable within that dataset. There are some other tricks to get around this (I won't talk about them because they don't always do what you think they do. If you are interested, look into ??attach and ??detach).

Regardless, if your dataset is called "data" and you want to do something to the "number_of_glasses" variable, you need to specify both of them in your code. The dollar sign ($) is what does this for you, as follows:

data$number_of_glasses

That's all there is to that. Always specify the dataset AND the variable with the dollar sign in between. If you don't, you will get an error saying that the object is not found. For example, if you just typed:

number_of_glasses

You would get the following error:

Error: object 'number_of_glasses' not found

Alternatively, if you type:

data$number_of_glasses
and run it.

You will see the actual values for the number_of_glasses variable within the "data" dataset.
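As a small sketch of the dollar sign at work (the three-row data frame is made up, and number_of_days is deliberately a column that does not exist in it):

```r
# Made-up toy data
data <- data.frame(
  day = 1:3,
  number_of_glasses = c(2, 5, 1)
)

# The dollar sign pulls one variable out of the dataset
data$number_of_glasses  # 2 5 1

# The bare name does not exist on its own
# (FALSE, assuming nothing by that name was created earlier in the session)
exists("number_of_glasses")

# Asking for a column that is not in the dataset gives NULL, not an error
is.null(data$number_of_days)  # TRUE
```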

______________________________________________________________________

Ok I mentioned that first because we are now going to learn to subset, and without the above explanation, the subset function may not make sense to you.

Often, it is useful to make a subset of a dataset. For example, if you want a dataset where the heavy_metal_music variable is equal to 1, you can do this... very easily.

data.metal<-subset(data, data$heavy_metal_music==1)

Starting from the left:

1) data.metal - this is the new dataset that will contain the subsetted data from your original dataset
2) <- your friendly neighborhood gets operator! As always, this puts the information from the right side of the symbol into whatever you specify on the left side.
3) subset - this is a function built into the base R package... for... drumroll... subsetting!
4) data - this is your old dataset containing all of the information
5) data$heavy_metal_music - this is the dataset and the variable on which you want to subset. Now the earlier comments probably make sense.
6) ==1 - I told you early on that R doesn't use the equals sign as the gets operator (that's what <- is for). R uses the double equals sign to mean equals. So you are saying where heavy_metal_music is equal to 1.

In words:

the new dataset "data.metal" gets a subset of the dataset "data", where heavy_metal_music (in the "data" dataset) is equal to 1.

Hope that makes sense.
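Here is the same statement run on a made-up five-row dataset, as a sketch. It also shows a shorter spelling: because subset looks up variable names inside the dataset you hand it, you can leave off the data$ part and get the same result.

```r
# Made-up toy data
data <- data.frame(
  day = 1:5,
  heavy_metal_music = c(1, 0, 1, 1, 0)
)

# The statement from the post
data.metal <- subset(data, data$heavy_metal_music == 1)

# The shorter spelling gives the same rows
data.metal2 <- subset(data, heavy_metal_music == 1)

nrow(data.metal)                    # 3
identical(data.metal, data.metal2)  # TRUE
```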

Next posting will be on other basic R operators. For example, how do we specify greater than, less than, greater or equal to, not equal to, etc.

Following that, I think we can move on to some basic data recodes, using other subsetting functions and if-else statements.

Have a non crappy Tuesday!

Thursday, May 23, 2013

Merging databases

Good morning pals,

I'm gonna tell you how to merge you some data.

The first database (assume it is called "data") is the one that I showed you the other day, with my wife's drinking water habits:

The data I want to merge is as follows and contains only two variables. One is the day, 1-22 (two extra days in this database), and the second variable is ran_out_of_gas (1 for yes, 0 for no). She lets her car gas tank get pretty low, so the risk of running out of gas is high. You may notice that the first variable, Day has a capital "D". This is important to note. This database is called "data1", just because.

Ok, to merge these, you need to know two things: 1) which variable you want to "merge on", and 2) whether you want to keep all of the cases from both datasets or just the cases that are in the dataset "data", which has 2 fewer cases.

Let's first assume we want all of the cases in data, and we want to remove the two extra cases in data1.

First, read in your two datasets:

data<-read.csv("/Users/timothywiemken/Desktop/data.csv")
data1<-read.csv("/Users/timothywiemken/Desktop/data1.csv")

Next, merge them into a new dataset called "merged"

merged<-merge(data, data1, by.x="day", by.y="Day", all.x=T)

Let's talk about this statement from left to right.

1) "merged" - the far left before the gets operator (<-) is the new dataset you want to make after merging things.
2) "merge" - this is the function built into the base R package to merge two datasets (only two at a time! If you need to merge more than two, you need to merge "merged" with your third dataset).
3) "data" - this is the first dataset you want to merge. This is called the "x" dataset.
4) "data1" - this is the second dataset you want to merge. This is called the "y" dataset.
5) "by.x" - this specifies the variable in the "x" dataset (see #3) that you will merge on. It needs to be the same variable as the one in the "y" dataset, but it can have more or fewer cases (i.e. rows). Here, we specify "day" in lowercase, as this is the variable in this dataset that matches one in the "y" dataset.
6) "by.y" - here we specify "Day" with the capital "D", as this is the variable that matches the variable "day" in the "x" dataset. This one has more cases in it, but that doesn't matter.
7) "all.x=T" - this tells R to keep all of the data in the "x" dataset (see #3) and drop any extra cases in the "y" dataset. If there are more cases in "x" than in "y", you will still have all of the "x" cases. You can switch this to "all.y=T" if you want all of the cases in "y". You can specify "all=T" if you want all of the cases in BOTH datasets. The "T" stands for TRUE. You can type out the word TRUE (all capitals), but R accepts "T" and "F" as shortcuts for TRUE and FALSE. Typing out TRUE is the safer habit, since T and F are ordinary variables that can be overwritten.

That's it! You should be able to merge these two datasets now.
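If you want to try this without the files, here is a sketch using two small made-up data frames (3 days in data, 5 days in data1) in place of the real ones. It compares all.x with all so you can see which cases survive.

```r
# Made-up stand-ins for the two datasets
data  <- data.frame(day = 1:3, number_of_glasses = c(2, 5, 1))
data1 <- data.frame(Day = 1:5, ran_out_of_gas = c(0, 1, 0, 0, 1))

# Keep every case in "data" (x); the extra days in data1 are dropped
merged <- merge(data, data1, by.x = "day", by.y = "Day", all.x = TRUE)
nrow(merged)   # 3
names(merged)  # "day" "number_of_glasses" "ran_out_of_gas"

# Keep every case from BOTH datasets; days 4 and 5 get NA for number_of_glasses
merged_all <- merge(data, data1, by.x = "day", by.y = "Day", all = TRUE)
nrow(merged_all)                           # 5
sum(is.na(merged_all$number_of_glasses))   # 2
```

Note that the merged key column takes its name from the "x" dataset, so it comes out as lowercase day.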

Wednesday, May 22, 2013

Just realized...

... that quicktime player does screen capture with audio.

I have too much text in these posts, so I'm gonna start doing a video for most of them to add to the flavor. I'll go back to some of the old ones and update as time allows.

t

A tip for making all of your variable names lower case automatically!

Yo yo yo.

I've been using this code snippet a lot lately. Most of our databases have variable names that are mixed upper and lower case letters. As you may have guessed by now, I'm not a huge fan of upper case - it's hard to remember which variables have which case of text.

So.

After you read in your dataset (since you are a master after the "how to read in your data" post)... paste the following code into the console or the R script editor. Once you paste it all, highlight the two lines and hit command+enter (on the mac!) to run it.

test<-tolower(colnames(data))
colnames(data)<-test

What is happening here is as follows:

The first line takes the column names (your variable names!) from the dataset called data (if you named your data something else, you need to change this). The colnames function is built into the base R package, so you don't need any particular package to make this work. The tolower function is also built into the base R package. So this line takes the column names from your dataset, data, and converts everything to lowercase text. You can also use "toupper" if you prefer all upper case text. You may notice that on the far left (before the gets "<-" operator), I have the word "test". What happens here is that R takes all of the lowercase names and puts them into a vector out on its own. You use this in the next line.

The second line takes the variable names that are all lowercase in the "test" vector and applies them to the column names of the dataset, data. Again, if your dataset is named something other than data, you need to change this.
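Here is a sketch of the same idea on a made-up dataset with messy column names. It does both steps in one line, skipping the in-between "test" vector; the two-line version above works just as well.

```r
# Made-up toy data with mixed-case column names
data <- data.frame(
  Day = 1:2,
  Number_Of_Glasses = c(2, 5),
  SLEPT_POORLY = c("yes", "no")
)

# Lowercase the column names and put them straight back
colnames(data) <- tolower(colnames(data))

colnames(data)  # "day" "number_of_glasses" "slept_poorly"
```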

Are you starting to see the utility of always keeping your dataset named "data"? I hope so. It is pretty handy.

This is a little more advanced than I was hoping to get at this point, but it is such a useful function, and I use it pretty much every time I import a dataset, so I figured it would be useful to stick this at the beginning.

This blog post wasn't funny you say? No humor? Yeah, you are right. It was a long day, so I have no humor left. I have two humerus's, but I don't think that is what you were hoping for.

A quick note about variable (vector) names in R

I forgot to mention this earlier.

Don't put any characters other than numbers, letters, periods, or underscores in your variable (vector) names. If you do, R will automatically convert them to periods.

For example, if your variable name in the .csv file is Number of Days Without Pooping, after you read in the file, it will change the variable name to Number.of.Days.Without.Pooping.

I also highly recommend only using lower case letters for everything. It makes your life easier. If one variable is called FirstName, another is called firstname, and a third is called Firstname, R will see all of these as different, because, after all -- it is case sensitive.

Did I mention R is case sensitive?
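You can watch the renaming happen with a short sketch. The one-row CSV text and the First-Name column are made up for illustration. The make.names function is what does the converting, and read.csv applies it by default through its check.names argument.

```r
# make.names() shows what R will turn a name into
make.names("Number of Days Without Pooping")
# "Number.of.Days.Without.Pooping"

# A made-up one-row CSV with awkward column names
csv_text <- "Number of Days Without Pooping,First-Name\n3,Bob"

# Default behavior: spaces and hyphens become periods
fixed <- names(read.csv(text = csv_text))
fixed  # "Number.of.Days.Without.Pooping" "First.Name"

# check.names = FALSE keeps the names exactly as they are in the file
raw <- names(read.csv(text = csv_text, check.names = FALSE))
raw    # "Number of Days Without Pooping" "First-Name"
```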

Here it is! Reading data into R

Ok - so this isn't really hard at all. When I was first starting, I made it much more difficult than it should have been. For the sake of nostalgia, I'm going to make it more complicated than it needs to be here too. Read the end of this if you don't want the extra crap.

There are three main things you need to know about reading data into R. At least, these are the things that I have found to make this process easier.

Numero Uno
Your database should be in a .csv file. For those of you not familiar with .csv, it stands for comma separated values. If your data are in Excel, you can just choose file - save as - and change the file type to .csv. You will get a bunch of warnings, just click ok to all of them. This comma separated values business just means that every cell in your Excel file is separated by a comma. So R reads this in and knows that the comma is a 'delimiter' - a consistent thing that separates out each of your variables and data points.

The main issue with .csv is that strange characters can keep R from reading the file in correctly. For example, if you have a degrees symbol in your database, expect trouble. To be safe, the only characters you should have in your database besides letters and numbers are commas and decimal points. You can get away with hyphens and underscores as well. In fact, you can read in a bunch of other stuff, but do you really need this junk? If you have a variable that is "comments" and it has a bunch of crazy characters, do you really plan to use R to analyze it? You may want a qualitative analysis software instead -- if you have a temperature variable with degrees symbols -- clean that crap up. If you have non-numeric values (like a special character) in a column that should be numbers, R will get pissed. Just clean up your data as much as you can before you try to move it to R. It will save you some major headaches.

Numero dos
If you don't know what I'm talking about with regard to the dataset that you want to read into R, check out the following image. It is a test database I will use in most of my examples. This database is a data collection of factors associated with my wife using multiple glasses for water in one day. Is it necessary to use 5 glasses a day for water? Probably not. I want to know what factors are associated with using more than 1 glass a day. There are 5 variables (R terminology for a variable is sometimes a vector) in this dataset:

1) day = 1 to 20 (examining glass usage over 20 days)
2) number_of_glasses = how many glasses did she use for water that day?
3) bad_day_at_work = one if she had a bad day and zero if she had a good day
4) heavy_metal_music = one if she listened to slayer on the way home from work and zero if she did not
5) slept_poorly = yes if she slept badly the night before and no if she slept ok

Obviously these are fake data, since she listens to slayer every day. A real dataset would have other data types - this has only "numeric" and "character" data. Numeric are the numbers and character is the slept_poorly variable since it is text. More on this later.

What was the point of number 2 here? Ha, number two. I don't think there was a point - just to show you the database I'll be using. This isn't actually one of the important things you need to know to make using R easier. You can download the data here (it is an Excel file, so you will need to save as a .csv) https://www.dropbox.com/s/9xtebfcg9ouzyn6/water.xlsx

Numero tres
Use a standard value for any missing values. I highly recommend negative 1 (-1) for any blank spaces, as long as -1 cannot be a real value for any of your variables. Blanks would also work. Don't use something like 999 because some variables can actually take the value of 999. If you use one standard value for all missing values (like if I didn't know the value for slept_poorly on day 20), it makes your life easier in the future.

Ok that might be all you need to know for now. There are many other ways to get data into R, but this is the one I use most often, so it is the one I will show you to keep things simple.

1) Now, download your data and put it on your desktop (remember to convert to .csv!).
2) Go to R Studio, open a new R script (File - new - R Script)
3) type the following (not exactly the following - see below to what you need to change to make it work on your computer):

data<-read.csv("/Users/timothywiemken/Desktop/water.csv", na.strings=-1)

So you need to change the part in the quotes to the path of the file on your desktop. If you are on a Mac, the path always starts with /Users/, the next part is your user name (the name of your home folder), followed by Desktop (if the file is on your desktop), then the name of the file. Close your quotes after that.

If you are on a PC, this path will be something like "C:/Users/yourname/Desktop/water.csv" (with forward slashes, and your own user name in place of yourname).

Remember R is CASE SENSITIVE, so be sure to capitalize stuff you need to capitalize. On a Mac, to get the path, right click the file and choose "get info". On a PC right click and choose "properties". The path of the file will be in the window that opens. An example on a Mac is as follows (under the "where" part) -- this location does not include the file name so be sure to add it after the last slash. This says /Users/timothywiemken/Desktop . Not only would I need to add quotes around it, but I would need to add another slash after Desktop, followed by the file name, water.csv.

If I had used -1 for any missing values, the na.strings=-1 part tells R that anything that is -1 in the dataset should be converted to missing (i.e. that value becomes NA). You can change this if you use something other than -1. For example, if you decide to use 999, even though I told you not to, you could say: na.strings=999.

Ok back to this, because you may wonder what the hell it all means:

data<-read.csv("/Users/timothywiemken/Desktop/water.csv", na.strings=-1)

The first part, data, is just what I am going to call the dataframe I am reading into R. I tend to call pretty much everything data, data1, data2, data3, etc. You may think I have no imagination. This is debatable. The main reason I call every dataset data is that it makes copying and pasting code from one project to another easier. You end up having to type and retype the dataframe name all of the time in R, so it is easier to just always call the data the same thing. The "<-" is the "gets" operator. It is kind of like an equals sign, but in computer science, equals means something else (equals means the arithmetic equals sign). If you were to say this whole line of code out loud, it would read something like this (without the stuff in parentheses):

The data frame "data" GETS (<-) the CSV file READ (read.csv) from the location "/Users/timothywiemken/Desktop/water.csv". If there are any negative 1's in the data frame, these will be set to missing.
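To see the na.strings part do its job without needing the real file, here is a sketch that writes a tiny made-up .csv to a temporary folder and reads it back. The file name water_example.csv and its contents are invented for this example, and the -1 is written in quotes here, which is how na.strings is usually given.

```r
# Write a small made-up .csv to a temporary location
csv_path <- file.path(tempdir(), "water_example.csv")
writeLines(
  c("day,number_of_glasses,slept_poorly",
    "1,3,yes",
    "2,-1,no",
    "3,5,-1"),
  csv_path
)

# Check that the path really points at a file before reading it
file.exists(csv_path)  # TRUE

# Read it in, converting every -1 to missing
data <- read.csv(csv_path, na.strings = "-1", stringsAsFactors = FALSE)

data$number_of_glasses  # 3 NA 5
data$slept_poorly       # "yes" "no" NA
sum(is.na(data))        # 2
```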

Make sense? I thought so!

Pretty easy? Yup!

A lot of text and pictures for a simple procedure? Indeed.

So your R studio window will look something like this, once you type in the code and hit command+enter (again, on a mac):

You see on the top right window? It says, Under DATA, data, 20 obs. of 5 variables. We know there are 20 observations (rows - or 20 days in the study), and 5 variables. There ya go. You can click this and it will open the dataset for you to view. You cannot really edit the dataset in this window. However, you can fix the .csv file on your desktop, and re-read in the data.

Anything you do in the console or the R script viewer changes only R's copy of the data (the data frame), not the physical .csv file. So no worries there. It is like always working in the "work" library in SAS.

You can make any changes you want and just re-read in your data.

R works almost exclusively in your computer's RAM (so you need a lot of free RAM), not on the hard drive.

If you don't like R Studio, you can view your dataset by typing in the following code into the console or R script viewer and running it:

View(data)

You can change data to whatever your data frame's name is - if you decide you don't want to use data all of the time.
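View opens a spreadsheet-style window, which is not always what you want. As a sketch, a few other base R functions give a quick look at a dataset right in the console (the 20-row data frame here is made up):

```r
# Made-up toy data: 20 days
data <- data.frame(
  day = 1:20,
  number_of_glasses = rep(c(1, 3, 5, 2), times = 5)
)

dim(data)   # 20 2 (rows, then columns)
head(data)  # the first six rows
str(data)   # each variable's type and first few values
summary(data$number_of_glasses)  # min, quartiles, mean, max
```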

Ok, that is enough for now. You should be able to read in your data by now. If not, leave a comment and I'll clarify anything!

RIP Jeffrey.

Scared yet?

Good morning world.

Today will be a suck day, but let's try to make it a little better.

If you are a little intimidated by R thus far (yup, this was written like this on purpose), don't be. Nearly everything I have mentioned thus far is solely to get you acquainted with R. We will continue to go over all of these concepts (help, packages, the console, R script editor, etc.) as we begin to do real work with R. I find it easier to get the ball rolling by discussing the topics we already went over -- then do some work for a while -- once we come back to the topics, they will make more sense (trust me... muuuuhaaaahahaha).

Just remember, R is the

Yes. I like cats. You will see them from time to time (or all of the time) on this blog.

I may start to do some videos too if I can find a good (e.g. free) screen capture software for the mac. Maybe this will be easier than reading my incoherent ramblings.

Next topic will be the first real topic that should get you started in R: reading in data.

Adios!

Tuesday, May 21, 2013

My wife is crazy.

Yup, this posting has nothing to do with R - you should probably expect this pretty often since I tend to get distracted easily.

My wife comes downstairs laughing, saying "hahah, I just read your blog and it is soooooo nerdy". Yup. Let us see who is the nerd when my R skills create world peace. You thought world peace would come from Wyld Stallyns music? Sorry Bill S. Preston esq. and Ted Theodore Logan. Your music sucks. R will rule the world.

It may, in fact, align the planets into universal harmony - only time will tell. Or Glenn Danzig. He knows. He might tell.

What is this package business and why do I care?

See, I told you that was the title.

R Packages

As I mentioned previously, after you download R, it does a fair amount of stuff. However, as you begin to want to do more, you need to install various packages, created by R users (many times, very prominent people), to make the software do what you want.

A package is kind of like a SAS macro, packaged into a file that integrates into the R DNA (it's like a wonderful retrovirus). Packages can do one function, or can have the ability to do many, many different things (this is the case most of the time).

For example, if you want to export some high resolution figures as a PDF file (a good vector format for publication quality charts), you can install the package "Cairo".

To install packages, type the following into the R console (then hit enter) or into the R script viewer (then hit command+enter -- on mac):

install.packages("Cairo")

Yes, you need the quotes around the word Cairo.

Yes. You may have just realized that R is indeed case sensitive. This is a critical concept that is not the same in all software packages. R sees the word Cairo as different than cairo. If you use the lower case "c", R will complain that the package is not available. If you don't use the quotes, you will get an error.

When you run that function, you will see a bunch of crap starting to show in the console window -- it is installing the package. Some packages are not available for certain versions of R. I find it critical to not update R to the latest version (this is why I said to avoid version 3 in my first post -- all of the packages need to be re-written in order for them to work on this version -- this is a lot of work for people that create packages so it may take some time to get certain packages to work in new versions). I am a chronic updater who always wants the latest version of everything, so it is difficult for me to wait... it is critical to do this though. You only have to install the package once on your computer.
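Since you only need to install a package once, a small helper can check first and install only when the package is missing. This is just a sketch: install_if_missing is a name made up for this example, not something built into R. It is tried out here on "stats", which ships with R, so nothing actually gets downloaded.

```r
# Hypothetical helper: install a package only if it is not already installed.
# Returns TRUE if it had to install, FALSE if the package was already there.
install_if_missing <- function(pkg) {
  already_there <- pkg %in% rownames(installed.packages())
  if (already_there) {
    return(invisible(FALSE))
  }
  install.packages(pkg)
  invisible(TRUE)
}

# "stats" comes with R, so this installs nothing
did_install <- install_if_missing("stats")
did_install  # FALSE
```

For a package like Cairo you would call install_if_missing("Cairo"), quotes and capital C included.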

Once you install the package, you need to open the package to allow R to use the functions within the package.

To open the package, type the following in the console or R script viewer:

library(Cairo)

This time, don't use quotes, but keep the capitalization for Cairo. You can also type the following (it works the same.. I'm not sure why they both work or what the difference is):

require(Cairo)

You only have to open the library once per R session. Once you close R and open it back up, you will need to re-open the library with one of the above scripts.
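For the curious, the two do differ in one way that is easy to see: when the package is not installed, library stops with an error, while require gives a warning and hands back FALSE. The sketch below shows this with a deliberately fake package name (notARealPackage12345 is made up so that it will not be found); character.only = TRUE just lets the name be passed as quoted text.

```r
# A package that ships with R opens fine either way
library(stats)

fake_pkg <- "notARealPackage12345"  # made-up name, assumed not to exist

# require() returns FALSE (with a warning) when the package is missing
ok <- suppressWarnings(
  require(fake_pkg, character.only = TRUE, quietly = TRUE)
)
ok  # FALSE

# library() stops with an error instead; tryCatch lets us see that safely
result <- tryCatch(
  library(fake_pkg, character.only = TRUE),
  error = function(e) "error"
)
result  # "error"
```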

Many packages have their own help files. Once you install the package, try to type the following (change the word Cairo to whatever the package name you want the help for):

help(Cairo)

You will see in the console that there is no specific help file for Cairo. You can then revert to the ??Cairo, as mentioned in the prior post.

help(base) does work, however. This is the help file for the base R application.

There are tons of packages for R, which allow R to do pretty much whatever you want. If it doesn't do what you want, you are probably just doing something wrong. Once you get skilled (ok Napoleon), if a particular package still doesn't do what you want, you can email the package creator -- you can usually find their email address by looking for the help file on google, as outlined in the prior post (the PDF file that I showed). I have done this before with very good results. Please don't overly bother package creators though. They spend a lot of time making these things, and don't need to be harassed by people who may or may not know what they are doing.

The most important part of this blog is to make sure that you always cite the packages you use in any manuscript or whatever you are publishing. To get the appropriate citation for the packages you are using, type the following in the R console (swapping out Cairo with the name of the package you want the citation for):

citation("Cairo")

Yup, this time you need the quotes. Confusing huh? Yeah, I still haven't quite figured out why quotes are needed sometimes and not others. Oh well. You can't win 'em all - if you win any of them, be pleased.
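As a sketch that runs even without Cairo installed: calling citation with nothing in the parentheses gives the citation for R itself, and toBibtex turns it into a BibTeX entry if your reference manager wants one.

```r
# Citation for R itself (no package name needed)
r_citation <- citation()
print(r_citation)

# The same citation as a BibTeX entry
toBibtex(r_citation)
```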

Ok - I promise - next post is about reading data into R.

Arrivederci.

Help files and Other good websites

Yo yo yo.

R has very in-depth help files associated with pretty much everything it does. You can access the help files in many ways. First, you can google your question and end up at a website with a PDF of the packages you want help for (usually at cran.r-project.org) or just a general website with information. For example, if I google "R glmulti" (glmulti is a package in R - we will talk about packages in the next blog), you will see something like below. The second option here (the PDF) is the help file for the glmulti package. It gives you all the info you need for the package. Unfortunately R package help files are very convoluted.

Your next option is to type two question marks followed by the function you are interested in getting help for -- in either the console (then hitting enter), or in the R script viewer (see 2 posts ago for the differentiation between these). For example, I may be interested in finding the help file for the table function in R (table is a built in function that gives you the frequencies of a variable... like proc freq for SAS). In this case, I would type ??table in the R console or in the R script viewer as follows:

Once you hit command+enter (if you type it into the R script viewer as I have) or enter (if you type it into the console at the bottom), you will see some info on the bottom right window (see the Search Results R, with a bunch of crap over there? Yup, that's it).

Next, you need to scroll down in that window to actually find the function you want the help for. There are multiple sections here. The first section is Vignettes, the second is Code Demonstrations, and the third is Help files. You may have different numbers of options here depending on how many packages you have installed. For example, the first option in my Code Demonstrations (as you can see in the picture above) is graphics::Hershey. The way this is laid out is the package name followed by two colons, followed by the name of the function or demo. So if you wanted help for Hershey, you could type ??Hershey in the console or R script viewer and this would be one of the options. The part before the two colons is the package in which it resides. So if you don't have the graphics package installed, you won't see this option. Since the Hershey entry mentions table, it shows up here. Many other entries mention table too, so you see a bunch of crap when you run the ??table.

If you scroll down into the Help files section, you will see base::table. R, after you download it, has a number of "packages" pre-installed. Two of the most commonly used are "base" (the base R application) and "stats". So if you see anything with base or stats before the two colons, these are just the basic functions built into R. If you scroll down, you will eventually see something like this (the last option here is base::table -- click it and you will see the R help file for table).
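If you already know the exact function name, a single question mark skips the search and goes straight to the help page. The sketch below shows both routes in their spelled-out forms (help is the same as ? and help.search is the same as ??), followed by a tiny made-up example of what table actually does. The help calls are wrapped in interactive() so they only open when you are sitting at R yourself.

```r
if (interactive()) {
  help("table")         # same as typing ?table : straight to the help page
  help.search("table")  # same as typing ??table : search all the help files
}

# What table does: count how often each value shows up (made-up values)
music <- c("slayer", "danzig", "slayer", "slayer")
counts <- table(music)

counts[["slayer"]]  # 3
counts[["danzig"]]  # 1
```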

I still find the R help files to be... well.. not that helpful. They are just too damn complicated. Over time, you kind of get used to the way things are written and can make your way through them.

I find the website www.statmethods.net (AKA Quick R) to be extremely helpful and very basic (sometimes too basic -- if that is possible). If you have a question, I would start with this website. The UCLA biostats site is also extremely helpful for R, SAS, SPSS, and STATA (http://www.ats.ucla.edu/stat/ ).

These two sites should be a good way to start with R. Next up - what is this package business, and why do I care?

Questions?

Hey pals - to my surprise, it appears that some people are actually reading this blog. If this is the case and it isn't just a bunch of bots trolling blogger, please leave some comments. I'm more than happy to discuss any topics, as long as they are not too advanced for where I am currently at on the blog. We can talk offline as well if your questions are more advanced.

The R Studio Interface

The R Studio Interface - looks fancy! But not really.

Once you install R and R studio and open it, it will look something like this (I'm on a mac, so PC users may find a slightly different view):

Yours will also not have anything in the right column or the lower column (I'm currently running some analyses). Your view will also be a white background instead of the black background. You can change all of the look and whatnot in the R Studio preferences (on a mac it is the R studio dropdown at the top - then preferences). I think my view is twilight or something like that. Wait. It can be twilight. I wouldn't allow anything with that word after the terrible books and movies. Ok I had to look. The "appearance" setting in the preferences is "Tomorrow Night Bright". It works the best for my constant staring at the computer screen. White backgrounds are just too harsh.

The bottom black panel here is the R console. It works exactly like it does in the regular R or R64 app that you installed before R studio (remember that R studio actually just uses the R app - it is just a front end to run the R application -- R studio is just much prettier). You can type any commands into the console and hit enter (or return) to run them. I dont like the console much - I just use it to view the output of any functions I run (so in this respect, it serves kind of like the output window in SAS and SPSS, but you can also type commands in the console window).

The top black window is the one I use. This is the R script viewer. You can open a new one (I do this every time I start a new project) by clicking File and then creating a new R script. Actually, when you first open R Studio you will only see the console window; it will just be larger. When you open a new R script, the console will get smaller and the R script window will show. You can resize the panels as you would a table in MS Word (the cursor changes when you hover over the gray line separating the two windows). This is where I type all of my commands. I like this because it is easier to save an R script than to save the whole workspace (this term will come back later -- it is essentially the entire environment you are working with).

Regardless, type stuff here - you can hit enter as much as you want. To run the commands you type, you can do one of two things: 1) if the command is only on 1 line, click somewhere on that line (you don't have to highlight the line or anything) and press command+enter (on a Mac; on a PC it is control+enter), or 2) if the commands are on multiple lines, just highlight the entire section and hit command+enter.

This window functions just like the SAS editor window and the SPSS syntax editor window. You type your commands and run them. The output shows in the lower console window. As you can see on my screenshot above, I have a lot of these 'editor' windows (yup, I still speak in SAS language) open - again, this works just like the SAS editor window.
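Here is a minimal sketch of what you might type into a new R script to practice running commands. The object names (ages, age_summary) and the numbers are made up for illustration.

```r
# A one-line command: click anywhere on this line and press command+enter
ages <- c(34, 51, 27, 45)   # made-up numbers, just for practice

# Another one-liner: the result prints in the console below
mean(ages)

# A command spread over several lines: highlight all of it, then run
age_summary <- summary(
  ages
)
age_summary
```

If you only click on the first line of the multi-line command and run it, R Studio may or may not pick up the rest depending on your version, so highlighting the whole thing is the safe habit.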

The white box at the top right shows any dataframes (the term R users often use synonymously with data set), vectors (we will come back to this later, but a vector can be thought of as a variable - the variable may or may not be attached to a dataframe), functions, lists, etc. The lower window can be modified to show whatever you want. I use it pretty much solely for viewing graphics. This is a major benefit of R - the graphics are awesome.
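To make those terms a little more concrete, here is a small sketch you can run. All of the names (heights, pirates, parrots) are hypothetical and only there for illustration.

```r
# A vector on its own: a variable that is not attached to any dataframe
heights <- c(170, 182, 165)

# A dataframe: a data set whose columns are vectors
pirates <- data.frame(
  name    = c("Anne", "Jack", "Mary"),
  parrots = c(2, 0, 1)
)

# A column pulled out of a dataframe with $ is itself a vector
pirates$parrots

# ls() lists everything in your workspace -- the same objects
# that show up in the top right box in R Studio
ls()
```

After running this, heights and pirates should both appear in that top right box.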

I forgot about one thing the other day - if you are a mac user, you will need to install Xquartz to view graphics (download here: http://xquartz.macosforge.org/landing/ ).

So that is the overview of how R studio looks. Next up, help files and some useful websites to get started. Then we will focus on getting data into R and move on to recoding variables. The bane of my existence.

Peace.

Monday, May 20, 2013

Grammar and Spelling Errors

By the way. I refuse to proofread anything on here, so there will be spelling and grammatical errors. Is it really that critical for this type of blog? If you think so, sorry. Unfortunately this will not change my plans for world domination... I mean, my plans for writing this blog. My goal is to just get info out there so everyone can use this wonderful software.

Starting at the beginning. With the dinosaurs. When we rode them around to get to McDonalds.

What version of R to use?

Well, R version 3 just came out. I wouldn't recommend it quite yet. Try to download 2.15.3 if you can find it. After you install it, I HIGHLY recommend installing R Studio after it (just google R studio download). R Studio is a much nicer interface than the basic R program. It looks nicer and makes doing most things just plain easier.

R Studio is just a front-end to R. So it actually uses the version of R that you have installed on your computer. You can't just install R Studio by itself; it won't work.
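If you are not sure which version of R your R Studio is using, you can ask R itself. This is just a quick sketch using two built-in calls; type it into the console or a script.

```r
# Print the version of R that is currently running
R.version.string

# The same information as a version number you can compare against
getRversion()

# TRUE if the running version is at least 2.15.3
getRversion() >= "2.15.3"
```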

How does R work?

So R (the "base R") does a lot of stuff. It isn't like SAS and SPSS though. To do many things, you have to install various packages created by other users. We will go over this later.
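As a rough preview, installing and loading a package usually looks something like the sketch below. The package name "ggplot2" is only an example, and the install line is commented out so nothing gets downloaded by accident.

```r
# Install a package from the internet (only needed once per computer).
# "ggplot2" is just an example name; remove the # to actually run it.
# install.packages("ggplot2")

# Load a package so you can use its functions (needed every session).
# "stats" ships with R, so this works without installing anything.
library(stats)

# Check whether a package is already installed on your computer
has_stats <- "stats" %in% rownames(installed.packages())
has_stats
```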

Are these packages reliable?

Yeah, I think so. The R network requires a lot of legwork for someone creating a package, so anyone willing to do this is more than likely pretty reliable. The benefit of packages is the flexibility of R. Essentially, this allows you to do whatever you want - and any new method is pretty much immediately available in R. SAS and SPSS take years to add some methods.

My favorite things about R
1) Graphics are incredible
2) Flexibility, like an olympic gymnast.
2) Yup, 2 again. R rules.
3) Once the language "clicks" with you, it becomes pretty intuitive. This can take some time though.
4) User created packages allow for immediate implementation of all kinds of useful functions.
5) Since I mentioned functions - you can write functions in R. A function is just a fancy word for a computer program that does whatever you want. We will talk about this in depth later. I may ask my pal Dr. Kelley to write this section (I may have to start calling him Dr. Special K... it has a better ring to it).
6) It's free
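
Since number 5 mentions writing your own functions, here is a minimal sketch of one. The function name bmi and the numbers are made up for illustration; this is not something from a package.

```r
# Define a function: takes weight in kg and height in meters, returns BMI
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Use it on one person
bmi(80, 2)

# Or on whole vectors at once (one result per person)
bmi(c(80, 60), c(2, 1.5))
```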

Why I hate R
Yup, daily I pretty much hate R too. It's a love-hate thing.
1) It can be very complicated if you want to do fancy stuff
2) It has a decent learning curve (unless you learn how to use it from me)
3) You need a bunch of packages to do a lot of stuff - it can be hard to remember which package does what.
4) All of the documentation online is written by people that make everything overly complicated. This is extremely frustrating.

Take that. Next posting will be how to read in your data. The most important part.

First Post - What to Expect from this Blog

Who are you?

Well pals, my name is Timothy Wiemken. I am an Assistant Professor of Medicine in the University of Louisville Division of Infectious Diseases and am the Assistant Director of Epidemiology and Biostatistics at the University of Louisville Clinical and Translational Research Support Center. Fancy titles huh? Does it mean anything to you? Probably not. Check out our center at www.ctrsc.net.

Why should I listen to what you have to say - I mean, the blog is pirate stats???

Well har-de-har smarty pants. I learned statistical computing using SPSS and Epi Info a number of years ago. I was never quite pleased with either - there were too many clicks in SPSS, and Epi Info - well, 'nuff said. After a while of using SPSS, I learned the SPSS syntax, which eliminated the clicking and re-clicking, but it just was not as flexible as I wanted it to be. I later learned SAS when I was getting my master's at Saint Louis University from an amazing professor, Dr. Q. John Fu. I didn't get proficient in SAS until I used it for a number of years at my first job in infectious diseases, as a data analyst. SAS was great, but it, like SPSS and most other packages, was expensive. SAS is also not particularly flexible unless you are awesome at Macro coding... which I am not. R (AKA: ARRRRR, hence the pirate stats) seemed to eliminate these issues. R is the most flexible of all the software capable of statistical computing (of those I have used), and is completely free. Download it at www.r-project.org.

Why do I need a tutorial on R?

The R language is more of a computer science language, whereas SAS and SPSS are more biostatistical. At least this is what my computer science pal tells me. I am definitely not a computer scientist (not because I don't want to be... mainly because I don't want to go to school again. One PhD is enough). After thinking like SPSS and SAS for so long, it was difficult to begin to use R. One of my friends, Dr. Guy Brock at the UofL School of Public Health was instrumental in getting me started. Dr. Rob Kelley, my computer science pal (also an Assistant Prof in Infectious Diseases) has also been extremely helpful in allowing me to think more like a computer scientist. Thinking differently is critical to begin to use R - so those of you who already know another language may find it difficult to switch (as I did).

So can you actually teach me how to use R?

Maybe - particularly if you aren't a total moron. Actually, I think anyone can learn. How my blog may be different from others is as follows: 1) I am not a computer scientist and tend to think more like an epidemiologist -- I don't think theoretically and focus on strictly practical applications of biostatistics and data management, 2) I was proficient in other languages (SPSS and SAS) before and had to re-learn how to think biostats for the R language -- I can provide some comparisons to other packages which I think makes it easier to switch (it did for me), 3) most of the other blogs are overly complicated -- for practical application, you don't have to know all this extra crap -- I will focus on simple application to get you through what you will actually need to use R for any basic to intermediate biostatistical analysis (99% of the application that anyone actually reading this blog will want to see).

So take that -- read on. If you like it, let me know. If you don't, go to Facebook to complain like the rest of the world. If you like it, enjoy being able to use R -- it is totally sweet, free, and amazing. Did I mention it was sweet? You may get diabetes from it.

--- Timothy Wiemken, PhD MPH CIC (certified in infection control and epidemiology)