The merge function in R allows you to combine two data frames, much like a JOIN in SQL combines tables.
merge, however, does not allow more than two data frames to be joined at once, so joining several data frames takes several lines of code.
This post explains how to merge multiple data frames in one line of code using base R. We will be using the Reduce function, part of Funprog in base R v3.4.3. Funprog contains a suite of higher-order functions which provide simple alternatives to laborious, long-winded coding solutions.
merge is essentially the “join” of the R world. Whilst this post is not about the fine workings of merge, I will give a brief introduction.
Merge takes two data frames, x and y, and combines them based on one or more shared columns. Rows are combined where the data of these shared columns are equal, meaning we can combine columns from different data frames that refer to the same piece of data. For instance, take the following two data frames:
height <- data.frame("Character" = c("Luke", "Han", "Leia"), "height" = c("1.75m", "1.85m", "1.5m"))
gender <- data.frame("Character" = c("Luke", "Han", "Leia"), "gender" = c("m", "m", "f"))
It is clear that the two data frames refer to the same characters, but it may be more useful to us if the two were combined into a single data frame. This is where merge comes in.
merge takes the following structure:
merge(x = height, y = gender, by = "Character")
Here, we are looking to combine the height and gender data frames where the Character columns are equal. To continue the SQL analogy, x is the left-hand table, y is the right-hand table, and merge is the JOIN operation. By default, merge keeps only the rows that have a match in both data frames, which corresponds to an INNER JOIN; setting all.x = TRUE gives the equivalent of a LEFT JOIN. The “by” component is our “ON” clause. For example:
SELECT * FROM height INNER JOIN gender ON height.Character = gender.Character
The merge function gives us the following output:
  Character height gender
1       Han  1.85m      m
2      Leia   1.5m      f
3      Luke  1.75m      m
This is the result we were expecting, but what if we introduce a third data frame?
eyeColour <- data.frame("Character" = c("Luke","Han","Leia"), "eye_colour" = c("Brown", "Blue", "Brown"))
merge does not allow us to simply add our eyeColour data frame as a third input (we only have the x and y parameters available). That’s where Reduce comes in.
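For comparison, here is a minimal sketch of what combining the three data frames looks like without Reduce, using only the data frames defined in this post. The names step_1, step_2 and nested are illustrative only.

```r
height <- data.frame("Character" = c("Luke", "Han", "Leia"),
                     "height" = c("1.75m", "1.85m", "1.5m"))
gender <- data.frame("Character" = c("Luke", "Han", "Leia"),
                     "gender" = c("m", "m", "f"))
eyeColour <- data.frame("Character" = c("Luke", "Han", "Leia"),
                        "eye_colour" = c("Brown", "Blue", "Brown"))

# Two-step version: merge the first pair, save the result, then merge in the third
step_1 <- merge(x = height, y = gender, by = "Character")
step_2 <- merge(x = step_1, y = eyeColour, by = "Character")

# Equivalent nested version: the inner merge runs first
nested <- merge(merge(height, gender, by = "Character"),
                eyeColour, by = "Character")

print(step_2)
```

Both versions work, but every additional data frame adds another saved intermediate or another level of nesting.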
Reduce takes a function and sequentially applies it to a given list of inputs, in our case a list of data frames. For example, imagine we have a function f which accepts two arguments, and a list of objects (a, b, c). Reduce(f, list(a, b, c)) would perform the following action:
f(f(a, b), c)
where the function f is first applied to a and b, and is then applied to the output of that first application and c (by default, Reduce works through the list from left to right). This allows us to avoid running and saving the intermediate result ourselves, like this:
output_1 <- f(a, b)
output_2 <- f(output_1, c)
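To see the order in which Reduce applies a function, the sketch below uses a hypothetical two-argument function f that does no real work and simply records how it was called. The strings "a", "b" and "c" stand in for the list elements.

```r
# Hypothetical helper: rather than computing anything, f writes out its own call
f <- function(a, b) paste0("f(", a, ", ", b, ")")

# Default behaviour: Reduce works through the list from left to right
left_fold <- Reduce(f, list("a", "b", "c"))

# right = TRUE reverses the direction of the fold
right_fold <- Reduce(f, list("a", "b", "c"), right = TRUE)

print(left_fold)
print(right_fold)
```

Here left_fold is the string "f(f(a, b), c)", while right_fold is "f(a, f(b, c))". For merge on a shared key column the direction mostly affects the order of the resulting columns, but for other functions it can change the result.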
With merge we have an example of a function that performs an action on two inputs. Reduce takes two main parameters: f, which stands for function, and x, which represents a vector or list. Reduce will sequentially apply the function f to the elements of x.
In our example, the function that we want to apply is merge, and the vector which we want to apply it to is a list of our data frames. First off, let’s try the following:
Reduce(merge, list(height, gender, eyeColour))
Perfect! But what if we wanted to specify the parameters within our merge function call? Well, we could define our own function which merges two data frames with specified parameters:
Reduce(function(x,y) merge(x = x, y = y, by = "Character"), list(height, gender, eyeColour))
Here, we have specified our f as a custom function, which takes two parameters and applies the merge function to them. Within this custom function, we have specified our by parameter, which may be necessary for longer or more complex uses of merge.
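As a sketch of why passing extra parameters can matter, the example below uses cut-down versions of this post's data frames (an assumption made purely for illustration) in which some characters are missing from some tables. The variable names inner and outer are hypothetical.

```r
# Illustrative subsets of the post's data: Han has no gender row,
# and Leia has no eye colour row
height <- data.frame("Character" = c("Luke", "Han", "Leia"),
                     "height" = c("1.75m", "1.85m", "1.5m"))
gender <- data.frame("Character" = c("Luke", "Leia"),
                     "gender" = c("m", "f"))
eyeColour <- data.frame("Character" = c("Luke", "Han"),
                        "eye_colour" = c("Brown", "Blue"))

frames <- list(height, gender, eyeColour)

# Default merge behaviour: only characters present in every data frame survive
inner <- Reduce(function(x, y) merge(x = x, y = y, by = "Character"), frames)

# all = TRUE keeps every character and fills the gaps with NA
outer <- Reduce(function(x, y) merge(x = x, y = y, by = "Character", all = TRUE),
                frames)

print(inner)
print(outer)
```

With these subsets, inner contains only Luke, whereas outer keeps all three characters and marks the missing values as NA.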
The function that we passed to Reduce is known in the world of functional programming as a lambda function, or an anonymous function: a single-use function that is not named and saved. Functional programming is a principle around which R is built, and can provide many smart and elegant ways to achieve things that would otherwise require large amounts of coding. We may explore more of the functional programming features of R in future blog posts; for now, the following link provides a nice overview of the most used techniques:
by Jon Willis
- Feb 16, 2018 How to merge multiple data frames using base R