Making the most of box plots

At first glance, the box plot (or box-and-whisker plot) is a fairly unassuming little chart, but it contains a wealth of information about the underlying distribution of a data set. It is ideal for comparing the characteristics of two or more data sets, and with a bit of effort can also be used to highlight any outliers. For these reasons, box plots are widely used in academic papers and analytical reports, but they are not so common in everyday life.

Because box plots are so useful, they are often one of our graphics of choice in the dashboards we create. Their compact design means they can communicate important information without taking up too much valuable screen space. However, we sometimes find that our clients and their users are not confident in their interpretation of the box plot. To an extent this is sector-dependent; clients working in healthcare are often more comfortable with this format, whereas those in social housing may be a little less familiar.

What is a box plot?

A box plot is a concise way of displaying information about the median, range and quartiles of a data set. Usually it will be drawn with an accompanying axis or number line for context. The example box plot below shows the distribution of ages of 100 tenants.

box plot.png
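For readers who like to see the numbers behind the picture, the five values a box plot draws can be computed directly. The following is a minimal sketch using Python's standard library; the ages are invented for illustration and the function name five_number_summary is our own. Different tools use slightly different conventions for calculating quartiles, so the quartiles shown by another package may differ a little from these.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the whole population, so the
    # quartiles always fall between the minimum and the maximum.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical tenant ages, invented for illustration only.
ages = [19, 23, 27, 31, 34, 38, 42, 47, 55, 63, 78]
print(five_number_summary(ages))  # (19, 29.0, 38.0, 51.0, 78)
```

The box spans Q1 to Q3, the line inside it marks the median, and the whiskers reach out to the minimum and maximum.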

Box plots are particularly good for comparing different distributions at a glance. For example, the box plots below show the distribution of the ages of tenants in four different housing areas. Straight away, we can see that the green area has the oldest tenants on average, while in the blue area the range is smaller, so the ages of tenants are less spread out.

box plot compare.png
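The comparison we make by eye can also be expressed in numbers: the position of the median line tells us which group is older, and the range or the width of the box tells us how spread out the ages are. The sketch below assumes two hypothetical areas with made-up ages; the names spread_summary and areas are ours and not part of any tool mentioned here.

```python
import statistics

def spread_summary(values):
    """Return the median, range and interquartile range of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": median, "range": max(values) - min(values), "iqr": q3 - q1}

# Hypothetical ages for two housing areas, invented for illustration only.
areas = {
    "green": [52, 58, 61, 66, 70, 74, 79, 83, 88],
    "blue": [38, 40, 41, 43, 44, 46, 47, 49, 50],
}

for name, ages in areas.items():
    print(name, spread_summary(ages))
# green {'median': 70.0, 'range': 36, 'iqr': 18.0}
# blue {'median': 44.0, 'range': 12, 'iqr': 6.0}
```

In this made-up data the green area has the higher median, while the blue area has the smaller range and the narrower box.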

The problem with box plots

Even though box plots are taught in secondary schools as part of the GCSE curriculum, many people are not entirely sure what the plot is showing them. We cover box plots in our Introduction to Housing Analytics course, and there too we have found that people hold a number of misunderstandings about them.

There have been academic studies devoted to understanding people’s misinterpretations of box plots and similar charts.[1][2] These studies found three common misunderstandings:

  • That the middle line shows the mean rather than the median
  • That the whiskers do not represent any data points other than the maximum and minimum
  • That the sizes of the box sections represent the frequency of observations

It is easy to see how each of these misunderstandings arises. After all, the mean is a more commonly used measure of central tendency than the median; the whiskers often do not include any indication of the number of values they represent; and it is somewhat intuitive to conclude that a larger box represents more observations – after all, this is the case for column charts, pie charts and many others.
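The third misunderstanding is perhaps the easiest to check with a small calculation. The sketch below, which uses invented ages and a helper name of our own (section_sizes), splits a data set at its quartiles and reports both the width of each section and the number of values that fall inside it. It assumes the "inclusive" quartile convention from Python's standard library; other conventions can shift the counts slightly when values sit exactly on a boundary.

```python
import statistics

def section_sizes(values):
    """Return (label, width, count) for each of the four box plot sections."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    edges = [min(values), q1, median, q3, max(values)]
    labels = ["lower whisker", "lower box", "upper box", "upper whisker"]
    sections = []
    for i, label in enumerate(labels):
        low, high = edges[i], edges[i + 1]
        if i == len(labels) - 1:
            # The final section also includes the maximum value itself.
            count = sum(low <= v <= high for v in values)
        else:
            count = sum(low <= v < high for v in values)
        sections.append((label, high - low, count))
    return sections

# Hypothetical, positively skewed tenant ages, invented for illustration only.
ages = [18, 19, 20, 21, 22, 23, 24, 25, 27, 29, 32, 36, 41, 48, 58, 72]

for label, width, count in section_sizes(ages):
    print(f"{label}: width {width} years, {count} tenants")
```

With these ages the upper whisker is almost ten times as wide as the lower one, yet every section contains four tenants. A wider section means the values in it are more spread out, not that there are more of them.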

How can we improve box plots?

To improve box plots and make them more accessible to the average report reader or dashboard user, we need to try to address the three points outlined above.

Regarding the confusion between mean and median, Microsoft Excel allows us to add a mean marker to our box plot. By itself and without any explanation this may not be an improvement - in fact, it might make things even more unclear. However, with a small table to accompany the plot, we can clearly show each of the key statistics. Having a table alongside the box plot also means that the dashboard or report caters to both types of readers: those who prefer graphical displays and those who prefer raw numbers.

box plot mean.png

Adding the mean in the example above also has the benefit of informing the reader about the skewness of the data. In this example the mean is higher than the median, suggesting that the data is positively skewed (bunched up towards lower ages and then a long “tail” towards the higher ages).
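The same comparison of mean and median can be made without drawing the chart. This is a rough sketch with invented ages; describe_skew is a hypothetical helper, and comparing the two averages is only a rule of thumb for the direction of skew rather than a formal measure.

```python
import statistics

def describe_skew(values):
    """Return (mean, median, direction) using the mean-versus-median rule of thumb."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        direction = "positive"   # long tail towards higher values
    elif mean < median:
        direction = "negative"   # long tail towards lower values
    else:
        direction = "none"
    return mean, median, direction

# Hypothetical tenant ages with a few much older tenants, for illustration only.
ages = [21, 23, 24, 26, 27, 29, 31, 35, 52, 82]
print(describe_skew(ages))  # (35, 28.0, 'positive')
```

Here the two oldest tenants pull the mean up to 35 while the median stays at 28, which is the pattern described above.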

Another option that Excel gives us can help to address the second and third points. We can choose to display each of the data points on the plot, which gives us this output:

box plot points.png

We can see that this immediately clears up the misconception that the whiskers at the top and bottom only represent the maximum and minimum values. It also confirms that each of the four sections of the box plot represents the same number of people, about a quarter of the tenants. In this example some sections appear to contain fewer dots than others, but the number in each section is actually the same: some tenants were the same age, so their dots are laid on top of each other.

In conclusion, don't be afraid to use box plots in reports or dashboards. Do keep in mind, though, that not every reader will be familiar with them, so consider adding extra information to the plot to make it more understandable and accessible.

by Dominic Nelson