Merging databases

Good morning pals,

I'm gonna tell you how to merge you some data.

The first database (assume it is called "data") is the one that I showed you the other day, with my wife's drinking water habits. Its day variable is called "day", all lowercase.

The data I want to merge in contains only two variables. One is the day, 1-22 (two more days than the first database), and the second is ran_out_of_gas (1 for yes, 0 for no). She lets her car gas tank get pretty low, so the risk of running out of gas is high. You may notice that the first variable, Day, has a capital "D". This is important to note, because R is case sensitive, so "Day" and "day" are different names. This database is called "data1", just because.

OK, to merge these, you need to know two things: 1) which variable you want to "merge on", and 2) whether you want to keep all of the cases from both datasets or just the cases in the dataset "data", which has 2 fewer cases.
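Before merging, it can help to check both of those things in R. Here is a minimal sketch using small made-up stand-ins for "data" and "data1" (the variable glasses_of_water and all of the values are invented for illustration; they are not the real data):

```r
# Made-up stand-ins for the two datasets (values are invented for illustration)
data  <- data.frame(day = 1:20, glasses_of_water = rep(c(6, 8), 10))
data1 <- data.frame(Day = 1:22, ran_out_of_gas = rep(c(0, 1), 11))

# 1) Which variable can we merge on? Look at the variable names.
names(data)    # "day" "glasses_of_water"
names(data1)   # "Day" "ran_out_of_gas"

# 2) How many cases does each dataset have?
nrow(data)     # 20
nrow(data1)    # 22

# Which days are in data1 but not in data?
extra_days <- setdiff(data1$Day, data$day)
extra_days     # 21 22
```

With the real files you would run the same checks right after read.csv().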

Let's first assume we want all of the cases in data, and we want to drop the two extra cases in data1.

First, read in your two datasets:

data<-read.csv("/Users/timothywiemken/Desktop/data.csv")
data1<-read.csv("/Users/timothywiemken/Desktop/data1.csv")

Next, merge them into a new dataset called "merged":

merged<-merge(data, data1, by.x="day", by.y="Day", all.x=T)

Let's talk about this statement from left to right.

1) "merged" - the far left, before the gets operator (<-), is the new dataset you want to make after merging things.
2) "merge" - this is the function built into base R to merge two datasets (only two at a time! If you need to merge more than two, you need to merge "merged" with your third dataset).
3) "data" - this is the first dataset you want to merge. This is called the "x" dataset.
4) "data1" - this is the second dataset you want to merge. This is called the "y" dataset.
5) "by.x" - this specifies the variable in the "x" dataset (see #3) that you will merge on. It needs to hold the same information as the matching variable in the "y" dataset, but it can have more or fewer cases (i.e., rows). Here, we specify "day" in lowercase, as this is the variable in this dataset that matches one in the "y" dataset.
6) "by.y" - here we specify "Day" with the capital "D", as this is the variable that matches the variable "day" in the "x" dataset. This one has more cases in it, but that doesn't matter.
7) "all.x=T" - this tells R to keep all of the cases in the "x" dataset (see #3) and drop any extra cases in the "y" dataset. If there are more cases in "x" than in "y", you will still have all of the "x" cases, and the "y" variables will be missing (NA) for the cases with no match. You can switch this to "all.y=T" if you want all of the cases in "y". You can specify "all=T" if you want all of the cases in BOTH datasets. The "T" stands for TRUE. R accepts "T" and "F" as shortcuts for TRUE and FALSE, but they are ordinary variables that can be overwritten, so typing out TRUE (all capitals) is the safer habit.
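To see what the three "all" options in #7 do, here is a small sketch you can run on its own. The two data frames are made-up stand-ins for "data" and "data1" (glasses_of_water and all of the values are invented for illustration):

```r
# Made-up stand-ins for the two datasets (values are invented for illustration)
data  <- data.frame(day = 1:20, glasses_of_water = rep(c(6, 8), 10))
data1 <- data.frame(Day = 1:22, ran_out_of_gas = rep(c(0, 1), 11))

# Keep every case in "x" (data); the two extra days in "y" are dropped
keep_x <- merge(data, data1, by.x = "day", by.y = "Day", all.x = TRUE)
nrow(keep_x)     # 20

# Keep every case in "y" (data1); days 21 and 22 get NA for the "x" variables
keep_y <- merge(data, data1, by.x = "day", by.y = "Day", all.y = TRUE)
nrow(keep_y)     # 22
tail(keep_y, 2)  # look at the two unmatched days

# Keep every case in BOTH datasets
keep_all <- merge(data, data1, by.x = "day", by.y = "Day", all = TRUE)
nrow(keep_all)   # 22

# The merged key takes its name from by.x, so it is "day", not "Day"
names(keep_x)    # "day" "glasses_of_water" "ran_out_of_gas"
```

In this toy example all.y and all give the same result, because every day in "data" also appears in "data1".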

That's it! You should be able to merge these two datasets now.
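If you ever have a third dataset, the "only two at a time" rule from #2 means you merge twice. A rough sketch, assuming a hypothetical third dataset called "data2" with a variable hours_of_sleep (both names and all values are invented for this example):

```r
# Made-up stand-ins for the two datasets (values are invented for illustration)
data  <- data.frame(day = 1:20, glasses_of_water = rep(c(6, 8), 10))
data1 <- data.frame(Day = 1:22, ran_out_of_gas = rep(c(0, 1), 11))

# Hypothetical third dataset, invented for this example
data2 <- data.frame(day = 1:20, hours_of_sleep = rep(c(7, 8), 10))

# First merge: the same statement as in the post, with TRUE typed out
merged <- merge(data, data1, by.x = "day", by.y = "Day", all.x = TRUE)

# Second merge: "merged" is now the "x" dataset and data2 is the "y" dataset.
# Both call the variable "day", so a single "by" is enough.
merged2 <- merge(merged, data2, by = "day", all.x = TRUE)

names(merged2)   # "day" "glasses_of_water" "ran_out_of_gas" "hours_of_sleep"
nrow(merged2)    # 20
```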

Pirate Stats - R for Beginners

Thursday, May 23, 2013
