Wednesday, June 5, 2013

Basic Descriptive Statistics with R - Part 1: Frequencies and Percentages

Good afternoon pals.

Today we are going to look at frequencies and percentages for categorical variables.  If I have some time tomorrow, we will do contingency tables with R (e.g. 2x2 tables).  We will move on to mean, median, IQR and standard deviation after that.


Doing tables is really easy with the built-in table function.

If you want the count (frequencies) of number_of_glasses, just do the following:

table(data$number_of_glasses)

R will spit out the results.  The number on the top is the actual value for number_of_glasses.  The number on the bottom is the number of cases (i.e. the number of rows) that contain that particular value:



You can see that one day, my wife drank 1 glass of water, and on three days, she drank 5 glasses of water, etc.  One day she had 15 glasses.
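If you want to try this without your own dataset, here is a minimal sketch.  The data frame below is made up for illustration: only the counts for 1, 5 and 15 glasses follow what I described above, and every other value is an assumption.

```r
# Hypothetical data: 20 days of glass counts.
# Only the counts for 1, 5 and 15 glasses follow the post; the rest are invented.
data <- data.frame(
  number_of_glasses = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5,
                        5, 6, 6, 7, 8, 8, 9, 10, 12, 15)
)

# Top row of the output: each distinct value.
# Bottom row: the number of rows holding that value.
table(data$number_of_glasses)
```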

This table function is a good way to screen your data to see if what was entered makes sense... it is much better than viewing the data in the actual database.  If you have strange values, they will be clearly evident.  Also, if you have blanks in a variable that is a text (or character) variable, the first value will not have anything on the top (it's a blank), but it will have a number below.  This means that you have that many blanks in that variable.
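Here is a small sketch of the blank-value case.  The variable name drink and its values are invented for illustration, they are not part of my dataset.

```r
# Hypothetical character variable with two blank entries
drink <- c("water", "", "tea", "water", "", "coffee")

# The blank sorts first, so the first column has no label on top
# and the number of blanks underneath
table(drink)

# Counting the blanks directly gives the same number
sum(drink == "")
```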

If your variable is numeric like the number_of_glasses variable and you want to know how many missing values there are, you can do the following:

table(is.na(data$number_of_glasses))

The result will be counts below the labels "FALSE" and "TRUE".  Any missing values will be counted under "TRUE".  If there are no missing values, you won't even have "TRUE" listed.
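A quick sketch of the missing-value check, using a made-up vector rather than my real data.  Keep in mind that table() leaves NA values out by default; the useNA argument shown below is one way to make them appear in the same table as the real values.

```r
# Hypothetical vector with two missing values
glasses <- c(1, 5, NA, 5, 15, NA, 5)

# FALSE = not missing, TRUE = missing
table(is.na(glasses))

# table() drops NA by default; useNA = "ifany" adds an NA column when there are any
table(glasses, useNA = "ifany")

# If you only need the number of missing values
sum(is.na(glasses))
```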

This works great for counts.  Now for the best thing R can do - put ANY results into a new variable, so you can manipulate your results as you please with other functions:

counts<-table(data$number_of_glasses)

If you do that, the data shown in my image above will be stored in a new variable called "counts".  You can name the variable whatever you want.  I always just use counts, because I like it.  This variable will not be in the dataset "data", it will be out on its own.

If you type the word counts in your R script and run it, you will see the output, just like we did above.  The benefit of doing this is that now you can get the proportions instead of just the counts.  To get the proportions, run the line that creates counts first, then do:

prop.table(counts)


You can now see that 5% of the time, she had 1 glass of water, and 15% of the time she had 5 glasses of water.

You can also do some arithmetic stuff to this... for example, if you don't like proportions and converting the decimals to percentages in your head, you can multiply them by 100 in the same statement:

prop.table(counts)*100
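To put the whole thing together, here is a rough sketch on a tiny made-up vector.  The names glasses and percentages are my own picks for illustration, and the rounding and the side-by-side layout are just one way you might tidy up the output.

```r
# Hypothetical data: five days of glass counts
glasses <- c(1, 5, 5, 5, 15)

# Store the counts, then turn them into percentages rounded to one decimal
counts <- table(glasses)
percentages <- round(prop.table(counts) * 100, 1)
percentages

# Counts and percentages side by side
cbind(count = counts, percent = percentages)
```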

Not bad, eh?  I agree.

Peace.
