Box/Whisker and Violin Plots

This chart supports two related chart types: Box and Whisker plots and Violin plots. A Box and Whisker plot is a statistical chart that shows the five-number summary of a dataset: the minimum, first quartile, median, third quartile, and maximum. Violin plots visualize the distribution of quantitative data using density curves; they also embed a box plot, showing the data density in relation to the statistics that box plots highlight.
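As a rough illustration of the five-number summary, the sketch below computes it with Python's standard library. The function name and sample scores are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; Immerse may compute quartiles differently.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    # method="inclusive" is an assumed convention, not necessarily the one
    # Immerse uses.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical opponent scores for one home team.
scores = [88, 92, 95, 101, 104, 107, 110, 113, 125]
print(five_number_summary(scores))
```

The box spans Q1 to Q3, the center line marks the median, and the whiskers extend toward the minimum and maximum.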

The chart requires one categorical dimension for grouping the data, and one numeric measure whose distribution will be displayed within each group.

Features | Quantity | Notes
Required Dimensions | 1 | Must be categorical; Dimension 1 = X-Axis
Required Measures | 1 | Must be quantitative; Measure 1 = Y-Axis

Use a Box Plot chart to get a clear picture of how your data is distributed within different categories and to compare those distributions. Box Plots help you see where most of your data lies, how consistent or varied it is, and whether there are any unusual data points.

Box Plot vs Violin Plot

You can alternate between the two plot types under the "Graphical Settings" section of the Settings panel on the right side of the chart editor.

Scaling

By default, Box and Violin plots will auto-scale to highlight the core spread of your data and prevent extreme outliers from compressing the view. This ensures that the box and violin shapes, which represent the majority of your data points, remain clear and easy to interpret, even when very large or small outliers are present.
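One way to picture this auto-scaling is sketched below. It assumes the common Tukey convention, in which whiskers stop at the most extreme values within 1.5 times the interquartile range of the box; the source does not state which rule Immerse applies, and the function name and data are hypothetical.

```python
import statistics

def axis_range(values, show_outliers=False):
    """Return the (low, high) y-axis range for one group of values."""
    ordered = sorted(values)
    if show_outliers:
        # Full data range, extreme values included.
        return ordered[0], ordered[-1]
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed rule: keep only values inside the 1.5 * IQR fences.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    return inside[0], inside[-1]

# Hypothetical airtimes with one extreme value.
airtimes = [95, 100, 105, 110, 115, 120, 125, 130, 600]
print(axis_range(airtimes))                      # core spread only
print(axis_range(airtimes, show_outliers=True))  # full range
```

With the extreme value excluded, the axis covers a narrow range and the box stays readable; including it stretches the axis and compresses the box.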

Outliers

Outlier points can be toggled on by clicking the "Show outliers" option in the Settings panel on the right. If toggled on, the chart will re-scale to show the full data range.

Not all outliers are shown on the chart: only the 10 highest and 10 lowest outliers per group are plotted.
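The per-group cap can be sketched as follows. The 1.5 times IQR fence used to decide what counts as an outlier is an assumption (the source does not define it), and `plotted_outliers` is a hypothetical helper, not an Immerse API.

```python
import statistics

def plotted_outliers(values, per_side=10):
    """Return (lowest outliers, highest outliers), each capped at per_side."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed definition: outliers lie beyond the 1.5 * IQR fences.
    low = [v for v in ordered if v < q1 - 1.5 * iqr]
    high = [v for v in ordered if v > q3 + 1.5 * iqr]
    # Keep the most extreme points on each side.
    return low[:per_side], high[max(0, len(high) - per_side):]

# Hypothetical group: 100 typical values plus 15 very large ones.
values = list(range(100, 200)) + [1000 + i for i in range(15)]
low, high = plotted_outliers(values)
print(len(low), len(high))
```

Here 15 values fall above the upper fence, but only the 10 largest would be plotted.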

Due to high query complexity, outlier points are disabled by default. This can be changed by setting the ui/enable_box_plot_outliers_default feature flag to ON.

Color Palette

By default, Box Plot charts are colored by the selected categorical dimension. Use the palette picker to change the categorical palette, or to choose a solid color if you do not want categorical coloring. Box Plots support color mappings.

Data and Formatting Settings

Sorting

By default, Box Plots will sort by # Records, so that the group with the most values is displayed first. Other sorting options can be chosen from the drop-down menu.

# of Groups

This controls how many groups from the selected categorical dimension will be displayed. The default is set to 50.
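Taken together, the default sort and the group limit behave roughly like the sketch below: count the records in each group, order groups by that count, and keep the first N. The function name and sample rows are hypothetical.

```python
from collections import Counter

def groups_to_display(rows, max_groups=50):
    """Return group names ordered by record count, limited to max_groups."""
    counts = Counter(group for group, _ in rows)
    # most_common sorts by count, largest first; ties keep first-seen order.
    return [group for group, _ in counts.most_common(max_groups)]

# Hypothetical (home_team, road_team_final_score) rows.
rows = [
    ("CLE", 98), ("BKN", 110), ("CLE", 101),
    ("NOP", 120), ("CLE", 95), ("BKN", 105),
]
print(groups_to_display(rows, max_groups=2))
```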

Violin Distribution Precision

The violin density curves are drawn by dividing the chart's current y-axis scale into a set number of bins. This setting controls the number of bins used to draw the violins. The default is 60 and the maximum is 200.

If you find the violin curves appear jagged or imprecise, increasing this setting will improve their visual accuracy, although it will slightly increase query complexity.
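The binning idea can be sketched as a simple histogram over the visible y-axis range. This is only an illustration of why more bins give a smoother outline; the actual query Immerse runs is not described in the source, and `violin_bin_counts` is a hypothetical name.

```python
def violin_bin_counts(values, y_min, y_max, bins=60):
    """Count how many values fall into each equal-width bin on the y-axis."""
    if bins < 1 or y_max <= y_min:
        raise ValueError("need at least one bin and y_max > y_min")
    width = (y_max - y_min) / bins
    counts = [0] * bins
    for v in values:
        if v < y_min or v > y_max:
            continue  # outside the visible axis range
        # Values exactly at y_max go into the last bin.
        index = min(int((v - y_min) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical scores on a 90-120 axis, using 3 bins for readability.
print(violin_bin_counts([90, 95, 100, 100, 119, 120], 90, 120, bins=3))
```

Each count corresponds to the width of the violin at that height, so a larger bin count samples the shape at more points.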

Center Line

Choose whether the line within the box plot represents the median or the mean.
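The choice matters most for skewed data, where the two statistics can differ substantially. A small sketch with hypothetical airtimes:

```python
import statistics

# Hypothetical airtimes with one very long flight.
airtimes = [100, 105, 110, 115, 600]

median = statistics.median(airtimes)  # middle value, resistant to extremes
mean = statistics.mean(airtimes)      # average, pulled up by the long flight
print(median, mean)
```

In this sample the median stays at 110, while the single 600 value pulls the mean up to 206.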

Formatting Settings

You can use custom measure formats for the values in your chart. See Customizing Measure and Date Formats.

Example 1: Understanding Box Plots

In our first example, we explore a professional basketball dataset. In the chart below, we've created a Box Plot from an NBA games dataset, choosing home_team as our dimension and road_team_final_score as our measure. This setup allows us to analyze how many points each home team typically gives up to its opponents.

Looking at the chart, you can quickly see how powerful Box Plots are for making inferences. It's immediately obvious which teams are strong at limiting opponent points when playing at home, and which are not. Cleveland is clearly one of the best defensive teams: you can see they have the lowest first quartile (the bottom of the box) and the lowest median (the center line in the box) compared to any other team shown.

Box plots also reveal inconsistency. Consider New Orleans next to Brooklyn. While both teams have similar medians and third quartiles, New Orleans' box plot has much longer whiskers in both directions. This clearly indicates a greater variance in the number of points they give up while playing at home, showing they are a less consistent defensive team than Brooklyn.

Example 2: Understanding Violin Plots

Turn on the Violin Plot for an even deeper look at this data. The violin shapes reveal the density and clustering within our data: in other words, how many points each home team commonly gives up to opponents.

Looking at Cleveland again, it's now even more obvious that they're a strong defensive team. The violin plot clearly shows significant clusters of opponent scores around and below the 100-point mark, indicating their consistent ability to limit scoring. Conversely, weaker defensive teams like San Antonio have the majority of their opponents' scores clustered above 110 points.

You can also clearly visualize consistency. Consider Miami: their violin plot shows virtually all of their opponents' scores clustered consistently between 90 and 120 points.

Visualizing common occurrences in our data gives us insight into its underlying patterns and typical behaviors. It shows not just the full range of outcomes, but where the data concentrates and how likely different values are.

Example 3: Exploring Outliers

This example shows how outliers affect our Box Plots, and how the chart keeps central tendencies readable when the data range is large. For this example, we'll explore a dataset on airline flights. Our dimension will be dest_state (destination state) and our measure will be airtime, showing the destinations with the longest flight times.

By default, with outliers turned off, the chart automatically scales to focus on the core distribution of the data, ensuring that the box plots remain clearly visible and easy to interpret. As you can see, Hawaii, to no surprise, has by far the highest flight times among the states shown.

But how does this view change if we turn on outliers?

When we enable outliers, you'll immediately notice how the chart's scale adjusts to fully include these extreme values.

For efficiency, Immerse only displays the 10 highest and 10 lowest outliers for each group, if applicable.

Observe Washington, which previously showed higher-than-normal flight times and now clearly displays many extreme outliers. This likely indicates a significant number of overseas flights to this state, pulling the overall range much wider.

In contrast, some states have few or no outliers, showing that their flight times are much more consistently clustered around the main body of the data, with very few unusually long flights for those destinations.

Outliers are invaluable for identifying unusual or extreme data points that fall far outside the typical range, providing a complete picture of your data's spread.

Displaying outliers can impact performance. We recommend enabling outliers only when you specifically need to investigate these extreme values; otherwise, keeping them off ensures faster loading and a focused view of your core data.


Last updated 7 months ago