Wednesday, May 22, 2013

Here it is! Reading data into R

OK, so this isn't really hard at all.  When I was first starting, I made it much more difficult than it should have been.  For the sake of nostalgia, I'm going to make it more complicated than it needs to be here too.  Skip to the end of this if you don't want the extra crap.

There are three main things you need to know about reading data into R.  At least, these are the things I have found that make the process easier.

Numero Uno
Your database should be in a .csv file.  For those of you not familiar with .csv, it stands for comma-separated values.  If your data are in Excel, you can just choose File - Save As and change the file type to .csv.  You will get a bunch of warnings; just click OK to all of them.  This comma-separated values business just means that the cells in each row of your Excel file are separated by commas.  So R reads this in and knows that the comma is a 'delimiter' - a consistent thing that separates out each of your variables and data points.

The main issue with .csv is strange characters, which can keep R from reading the file the way you want.  For example, a degrees symbol in your database can cause problems.  To be safe, stick to letters, numbers, commas, and decimal points.  You can get away with hyphens and underscores as well.  In fact, you can read in a bunch of other stuff, but do you really need this junk?  If you have a variable that is "comments" and it has a bunch of crazy characters, do you really plan to use R to analyze it?  You may want qualitative analysis software instead.  If you have a temperature variable with degrees symbols, clean that crap up.  If a column that should be numeric contains non-numeric values (like a special character), R will get pissed and treat the whole column as text.  Just clean up your data as much as you can before you try to move it to R.  It will save you some major headaches.
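To see what a stray character does, here is a small sketch.  The two tiny CSV snippets and the column names (day, temp) are made up for illustration and are not from the water dataset.  The text= argument lets read.csv read from a string instead of a file, and "deg" stands in for a degrees symbol.

```r
# Made-up CSV text for illustration only (not the water data).
clean_csv <- "day,temp\n1,70\n2,71\n3,69\n"
dirty_csv <- "day,temp\n1,70\n2,71deg\n3,69\n"

# stringsAsFactors = FALSE keeps text as plain character in any R version.
clean <- read.csv(text = clean_csv, stringsAsFactors = FALSE)
dirty <- read.csv(text = dirty_csv, stringsAsFactors = FALSE)

class(clean$temp)  # integer: every value parsed as a number
class(dirty$temp)  # character: one stray "deg" turned the whole column into text

# One way to rescue the column: strip everything except digits,
# decimal points, and minus signs, then convert back to numbers.
dirty$temp_fixed <- as.numeric(gsub("[^0-9.-]", "", dirty$temp))
dirty$temp_fixed
```

Cleaning the file before it ever gets to R is still less work than patching it afterwards.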

Numero Dos
If you don't know what I'm talking about with regard to the dataset you want to read into R, check out the following image.  It is a test database I will use in most of my examples.  This database is a collection of factors associated with my wife using multiple glasses for water in one day.  Is it necessary to use 5 glasses a day for water?  Probably not.  I want to know what factors are associated with using more than 1 glass a day.  There are 5 variables in this dataset (in R, a variable is often called a vector):

1) day = 1 to 20 (examining glass usage over 20 days)
2) number_of_glasses = how many glasses she used for water that day
3) bad_day_at_work = one if she had a bad day and zero if she had a good day
4) heavy_metal_music = one if she listened to Slayer on the way home from work and zero if she did not
5) slept_poorly = yes if she slept badly the night before and no if she slept OK

Obviously these are fake data, since she listens to Slayer every day.  A real dataset would have other data types; this one has only "numeric" and "character" data.  The numeric variables are the ones holding numbers, and slept_poorly is character since it is text.  More on this later.

What was the point of number 2 here?  Ha, number two.  I don't think there was a point - just to show you the database I'll be using.  This isn't actually one of the important things you need to know to make using R easier.  You can download the data here (it is an Excel file, so you will need to save it as a .csv):  https://www.dropbox.com/s/9xtebfcg9ouzyn6/water.xlsx


Numero Tres
Use a standard value for any missing values.  I highly recommend negative 1 (-1) for any blank spaces, as long as -1 can't be a real value in your data.  Blanks would also work.  Don't use something like 999 because some variables can actually take the value of 999.  If you use one standard value for all missing values (like if I didn't know the value for slept_poorly on day 20), it makes your life easier in the future.
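If you go the blank route, here is a quick sketch of how blanks behave.  The rows are made up (only the column names come from the water data), and it assumes the blank cells are truly empty, with no spaces hiding in them.

```r
# Made-up rows; the blank cells are truly empty.
csv_text <- "day,number_of_glasses,slept_poorly\n1,3,yes\n2,,no\n3,2,\n"

blanks <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.na(blanks$number_of_glasses)  # a blank in a numeric column becomes NA on its own
blanks$slept_poorly              # a blank in a text column stays as an empty string ""

# Telling R that "" also means missing makes both columns behave the same way.
blanks2 <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "")
is.na(blanks2$slept_poorly)
```

So blanks work, but text columns need that extra nudge, which is one more reason a single standard code is easier to live with.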

Ok that might be all you need to know for now.  There are many other ways to get data into R, but this is the one I use most often, so it is the one I will show you to keep things simple.

1) Now, download your data and put it on your desktop (remember to convert to .csv!).
2) Go to RStudio and open a new R script (File - New - R Script).
3) Type the following (not exactly the following - see below for what you need to change to make it work on your computer):

data<-read.csv("/Users/timothywiemken/Desktop/water.csv", na.strings=-1)

So you need to change the part in the quotes to the path of the file on your desktop.  If you are on a Mac, the path always starts with /Users/, the next part is your user account name (the name of your home folder, not the name of your computer), followed by Desktop (if the file is on your desktop), then the name of the file.  Close your quotes after that.

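If you are on a PC, this path will be something like "C:/Users/yourname/Desktop/water.csv", where yourname is your Windows account name.  Use forward slashes as shown (or doubled backslashes); a single backslash will not work inside an R string.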

Remember R is CASE SENSITIVE, so be sure to capitalize stuff you need to capitalize.  On a Mac, to get the path, right click the file and choose "Get Info".  On a PC, right click and choose "Properties".  The path of the file will be in the window that opens.  An example on a Mac is as follows (under the "Where" part).  This location does not include the file name, so be sure to add it after the last slash.  This says /Users/timothywiemken/Desktop.  Not only would I need to add quotes around it, but I would need to add another slash after Desktop, followed by the file name, water.csv.
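If typing out the full path feels error-prone, base R can build it or find it for you.  This is just a sketch, assuming water.csv is sitting in your Desktop folder.  file.choose() opens a file picker and only works in an interactive session, so it is left commented out here.

```r
# "~" is shorthand for your home folder; on a Mac that is /Users/<your account name>.
# (On Windows, "~" often points to Documents instead, so check what you get.)
path <- file.path("~", "Desktop", "water.csv")
path               # "~/Desktop/water.csv"
path.expand(path)  # the same path with "~" spelled out

# Check the file is really there before trying to read it.
if (file.exists(path)) {
  # "-1" in quotes is the text form of the same missing-value code.
  data <- read.csv(path, na.strings = "-1")
} else {
  message("No file found at ", path.expand(path))
}

# Or skip typing the path and pick the file from a dialog (interactive use only):
# data <- read.csv(file.choose(), na.strings = "-1")
```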



Since I used -1 for any missing values, the na.strings=-1 part tells R that anything that is -1 in the dataset should be converted to missing (i.e., R stores it as NA instead of a value).  You can change this if you use something other than -1.  For example, if you decide to use 999, even though I told you not to, you could say: na.strings=999.
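Here is a small sketch of na.strings in action.  The rows are made up; only the column names come from the water data.  The R documentation describes na.strings as a character vector, so the code is written in quotes here, and c() lets you list several codes at once.

```r
# Made-up rows using -1 as the missing-value code.
csv_text <- "day,number_of_glasses,slept_poorly\n18,4,yes\n19,-1,no\n20,2,-1\n"

demo <- read.csv(text = csv_text, na.strings = "-1", stringsAsFactors = FALSE)
demo$number_of_glasses  # 4 NA 2
demo$slept_poorly       # "yes" "no" NA

# Count the missing values in each column.
colSums(is.na(demo))

# More than one code can be treated as missing at the same time.
demo2 <- read.csv(text = csv_text, na.strings = c("-1", "999", ""),
                  stringsAsFactors = FALSE)
```

Note that the code applies to every column, text ones included, which is exactly why one standard value is handy.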

Ok back to this, because you may wonder what the hell it all means:

data<-read.csv("/Users/timothywiemken/Desktop/water.csv", na.strings=-1)

The first part, data, is just what I am going to call the data frame I am reading into R.  I tend to call pretty much everything data, data1, data2, data3, etc.  You may think I have no imagination.  This is debatable.  The main reason I call every dataset data is that it makes copying and pasting code from one project to another easier.  You end up having to type and retype the data frame name all of the time in R, so it is easier to just always call the data the same thing.  The "<-" is the "gets" operator (officially, the assignment operator).  It is kind of like an equals sign, but in programming an equals sign can mean other things (in R, a double equals sign, ==, tests whether two things are equal), so R uses <- for storing something under a name.  If you were to say this whole line of code out loud, it would read something like this (without the stuff in parentheses):

The data frame "data" GETS (<-) the CSV file READ (read.csv) from the location "/Users/timothywiemken/Desktop/water.csv".  If there are any negative 1's in the data frame, these will be set to missing.
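If the "gets" idea is still fuzzy, here is a tiny sketch of storing versus asking.  The name glasses is made up for the example.

```r
glasses <- 5      # "glasses GETS 5": store the value 5 under the name glasses
glasses           # typing the name prints what is stored: 5

glasses == 5      # two equals signs ASK a question: is glasses equal to 5?  TRUE
glasses == 3      # FALSE

glasses <- glasses + 1  # the right side is worked out first, then stored
glasses                 # now 6
```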

Make sense?  I thought so!

Pretty easy?  Yup!

A lot of text and pictures for a simple procedure?  Indeed.

So your RStudio window will look something like this once you type in the code and hit command+enter (on a Mac; on a PC it is Ctrl+Enter):




You see the top right window?  It says, under Data: data, 20 obs. of 5 variables.  We know there are 20 observations (rows, or 20 days in the study) and 5 variables.  There ya go.  You can click this and it will open the dataset for you to view.  You cannot really edit the dataset in this window.  However, you can fix the .csv file on your desktop and re-read the data.

Anything you do in the console or the R script viewer changes only the copy of the data in R (the data frame), not the physical .csv file.  So no worries there.  It is like always working in the "work" library in SAS.

You can make any changes you want and just re-read in your data.
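A minimal sketch to prove it to yourself.  It writes a tiny made-up CSV to a temporary file (a stand-in for water.csv), so nothing on your desktop gets touched.

```r
# Write a tiny made-up CSV to a temporary file (stand-in for water.csv).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(day = 1:3, number_of_glasses = c(5, 2, 4)),
          tmp, row.names = FALSE)

data <- read.csv(tmp)
data$number_of_glasses[1] <- 100  # change the copy that lives in R

# Re-reading the file brings back the original value, because the
# file on disk was never touched.
data <- read.csv(tmp)
data$number_of_glasses[1]  # 5 again

unlink(tmp)  # tidy up the temporary file
```

Changes only reach the disk if you write them out yourself, for example with write.csv().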

R works almost exclusively in your computer's RAM (so you need a lot of free RAM), not on the hard drive.
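If you are curious how much RAM a dataset takes up, object.size() will tell you.  This sketch uses a made-up data frame with the same shape as the water data (20 rows, 5 columns); the values are invented.

```r
# A made-up data frame the same shape as the water data (20 rows, 5 columns).
demo <- data.frame(day = 1:20,
                   number_of_glasses = rep(c(1, 3, 5, 2), 5),
                   bad_day_at_work = rep(c(0, 1), 10),
                   heavy_metal_music = rep(c(1, 0), 10),
                   slept_poorly = rep(c("yes", "no"), 10))

# How much memory the object takes up while it sits in RAM.
print(object.size(demo), units = "Kb")
```

A dataset this small is nothing to worry about; memory only becomes an issue with much bigger files.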

If you don't like clicking around in RStudio, you can view your dataset by typing the following code into the console or R script viewer and running it:

View(data)

You can change data to whatever your data frame's name is, if you decide you don't want to use data all of the time.
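A few other base R functions give you a quick look at the data right in the console.  The rows below are a made-up stand-in so the example runs on its own; with the real file you would already have data from read.csv().

```r
# Made-up stand-in rows so the example runs on its own.
data <- data.frame(day = 1:3,
                   number_of_glasses = c(5, 2, 4),
                   bad_day_at_work = c(1, 0, 1),
                   heavy_metal_music = c(1, 1, 0),
                   slept_poorly = c("yes", "no", "yes"),
                   stringsAsFactors = FALSE)

dim(data)      # rows and columns (the real file gives 20 and 5)
names(data)    # the variable names
head(data)     # the first few rows
str(data)      # each variable with its type and first values
summary(data)  # quick summaries of every variable
```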

OK, that is enough for now.  You should be able to read in your data by now.  If not, leave a comment and I'll clarify anything!

RIP Jeffrey.


