2 Fun with R: A look at the Dutch voter

2.1 Introduction: Fun with R

This chapter showcases a number of ‘cool’ things you can do with R: read, filter and combine data sets, create bar charts and scatter plots, and even plot information on maps. It is intended as a ‘guided tour’. Don’t worry if you can’t write this code yet; just try to recognize the building blocks. We’ll explain each one in detail in the chapters that follow.

In the chapter there are some ✏️ Exercise blocks where we ask you to adapt the code to get a different result. You should be able to complete these without much difficulty. There are also some 🧩 Challenges where we challenge you a bit more to think about or tinker with the code – feel free to skip these if you want and just enjoy the tour.

2.2 Dutch politics

For this tutorial, we will use voting results from the Dutch 2023 parliamentary elections, as published by the election board. We will later enrich these with demographic data about the various Dutch municipalities from CBS (the Dutch statistics bureau). Both data sets can be downloaded from their respective sources, but to make things easy we use cleaned-up versions from this repository.

Voting data

After running the code below with the ► Run Code button, you will see a data frame on the screen. Data frames are the main data object in R. Note that it might take a while for your browser to load all the R packages needed for this page: you will see a dot next to the button while it’s loading, and the button will change color once it’s ready to be clicked.

The code below downloads and reads a CSV file with Dutch election results per municipality:
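As a minimal sketch of this step, the snippet below reads a tiny CSV that is typed inline, so it runs without a download. The column names follow the voting data described in this section; the rows, the municipality codes and all numbers are invented for illustration and are not real election results. For the real file you would pass its URL or path to read_csv() instead of the inline text.

```r
library(readr)

# Inline stand-in for the real CSV file.
# All codes and numbers below are made up for illustration.
csv_text <- "gm,municipality,party,votes,count
X01,Amsterdam,Party A,40.0,4000
X01,Amsterdam,Party B,60.0,6000
X02,Rotterdam,Party A,55.0,2750
X02,Rotterdam,Party B,45.0,2250
"

# I() marks the string as literal data rather than a file path
votes <- read_csv(I(csv_text), show_col_types = FALSE)
votes
```

Printing the result shows the column names together with their types, which is a quick way to check that the file was read as intended.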

This results in a data frame called votes. A data frame (or tibble) is a rectangular data structure where each row generally represents a unit of analysis or measurement (e.g., a case or respondent), and each column represents a piece of information about that row, e.g., a measured or computed variable. In this case, each row represents the result of one party in one municipality.

Can you guess what the columns mean? Each column also lists a column type (<chr>, <dbl>). What do these stand for?

Info

Note that you cannot break anything in these example code boxes! Every time you hit the ► Run Code button, it will start over from scratch. Moreover, if you feel that you have messed up the code, you can press the ↻ Start Over button and the original code will run.

Columns in the voting data

Variable	Type	Explanation
gm	character	Unique code for each municipality
municipality	character	Unique name of each municipality
party	character	The name of the political party
votes	numeric (dbl)	The vote share (percentage) within that municipality
count	numeric (dbl)	The absolute number of votes for that party

Cleaning data

R also has powerful commands to clean up and restructure data. For example, the code below shows the results for a single municipality (Amsterdam), ordered by vote share.
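A minimal sketch of such a pipeline, using a small hand-typed votes table so that it runs on its own. The party names are placeholders and the numbers are invented, not real results.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the votes data frame (invented numbers)
votes <- tribble(
  ~municipality, ~party,    ~votes,
  "Amsterdam",   "Party A", 30,
  "Amsterdam",   "Party B", 45,
  "Amsterdam",   "Party C", 25,
  "Rotterdam",   "Party A", 50,
  "Rotterdam",   "Party B", 50
)

amsterdam <- votes |>
  filter(municipality == "Amsterdam") |>  # keep only the Amsterdam rows
  arrange(desc(votes))                    # largest vote share first
amsterdam
```

filter() selects rows and arrange() sorts them; wrapping a column in desc() reverses the sort order.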

Change the code to show the results for the municipality of Rotterdam, ordered by party (starting from A).

Hint:

You can change the name of the municipality directly, as well as the column on which to sort. What happens if you simply remove the desc(..) function, i.e. use arrange(*column*) instead of arrange(desc(*column*))?

votes |> filter(______) |> arrange(______)

🎓 Solution:

Change the column name for the ordering, as well as the name of the municipality. Drop ‘desc’ to default back to ascending order:

votes |>
  filter(municipality == "Rotterdam") |>
  arrange(party)

Do you know the biggest parties? You can check the Wikipedia pages for e.g. PvdA/GL and PVV if you’re curious – the Netherlands has an extremely interesting and dynamic party landscape: of the top-10 parties, more than half were founded in recent years or decades!

Info

The code box above is an exercise with a single correct solution. You need to change some code and click the ► Run Code button to submit your answer. Don’t worry about making mistakes: you can try as often as you want and start over if needed. Some exercises (like this one) also contain a 💡 Show Hint button, which you can click if you are stuck.

If you submit an incorrect answer, R will do its best to pinpoint where the problem is – but of course it’s only a computer, so it’s not always very helpful… It will also sometimes fail to recognize a submission as correct if it’s different from its own solution. So, if you’re sure you’re getting the right results but R won’t give you full marks for it, just congratulate yourself for being original!

2.3 Simple visualizations

Besides the ability to read and clean up data, R has a very powerful visualization suite called ggplot2 (often referred to simply as ggplot). Although this can be hard to master, there are many useful resources to learn more and it can be very rewarding (and fun!) to make nice visualizations.

Let’s start with a bar chart of the vote share of the top-10 parties in Amsterdam:
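One way such a chart could be written is sketched below. It uses a toy table with twelve placeholder parties and invented vote shares, so the result is only meant to show the shape of the code, not real results.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy stand-in for the votes data: 12 placeholder parties, invented shares
votes <- tibble(
  municipality = "Amsterdam",
  party = paste("Party", LETTERS[1:12]),
  votes = c(22, 18, 14, 11, 9, 7, 6, 5, 3, 2.5, 1.5, 1)
)

top10_plot <- votes |>
  filter(municipality == "Amsterdam") |>  # one municipality
  slice_max(votes, n = 10) |>             # the ten largest vote shares
  ggplot(aes(x = votes, y = party)) +     # share on the x axis, party on the y axis
  geom_col()                              # one bar per row
top10_plot
```

Note that the data steps are chained with |>, while the ggplot layers are added with +.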

Tidyverse pipelines: |>

In the tidyverse, we often apply multiple operations to a data set. In this case, we filter it, slice the top 10 parties, and then plot it. A powerful way to do this in R, and especially in the tidyverse, is to put the pipe operator (|>) between operations: it makes the output of the first operation automatically become the first (input) argument of the second. So, votes |> filter(...) is the same as filter(votes, ...).
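The equivalence between the piped and the nested form can be checked directly. The sketch below uses a toy votes table with placeholder parties and invented numbers.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the votes data (invented numbers)
votes <- tribble(
  ~municipality, ~party,    ~votes,
  "Amsterdam",   "Party A", 30,
  "Amsterdam",   "Party B", 45,
  "Rotterdam",   "Party A", 50,
  "Rotterdam",   "Party B", 50
)

# Piped form: read from left to right, top to bottom
piped <- votes |>
  filter(municipality == "Amsterdam") |>
  arrange(desc(votes))

# Nested form: the same calls, read from the inside out
nested <- arrange(filter(votes, municipality == "Amsterdam"), desc(votes))

identical(piped, nested)
```

Both forms give the same data frame; the piped version mainly helps readability once more than one or two steps are involved.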

Make it beautiful

The graph above was functional, but not as pretty or informative as it could be. Let’s make two simple tweaks: order the parties by vote share, and fill the bars in a color that matches the party profile. With apologies to the Americans, we color traditional right-wing parties blue (and call them liberal!), and traditional left-wing parties red:
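A sketch of these two tweaks on toy data follows. The parties, their vote shares and the color assignments in party_colors are all hypothetical placeholders; the real chart uses the actual party names and its own color choices.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy stand-in for the Amsterdam results (placeholder parties, invented shares)
amsterdam <- tibble(
  party = c("Left party", "Right party", "Other party 1", "Other party 2"),
  votes = c(35, 30, 20, 15)
)

# Hypothetical color mapping: one named color per party we want to style
party_colors <- c("Left party" = "firebrick", "Right party" = "steelblue")

pretty_plot <- amsterdam |>
  ggplot(aes(x = votes, y = reorder(party, votes), fill = party)) +
  geom_col() +
  scale_fill_manual(values = party_colors) +  # custom fill color per party
  guides(fill = "none") +                     # the y axis already names the parties
  labs(x = "Vote share (%)", y = NULL)
pretty_plot
```

reorder(party, votes) sorts the bars by vote share instead of alphabetically, and scale_fill_manual() replaces the default palette with the named colors.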

Can you run this graph for Groningen? Why are two parties displayed in grey? Can you give them nice colors?

Info

Don’t worry about completely understanding the ggplot syntax – we will explain it in much more detail later and in any case it can take a while to grow on you. But as a very short primer: the idea behind ggplot is that you link the geoms (geometric objects) appearing in the graph with the columns in your data by supplying an aesthetic mapping: the x axis is mapped to the votes column, the party column is mapped to the fill color, etc. This produces a sensible default mapping which can then be customized by adding (+) additional elements to it such as the scale and guide information in this graph.

2.4 Combining data sets to visualize patterns

The code above showed a simple bar plot based on voting results. To understand why people vote for a certain party, we can combine voting records with local demographics: does voting depend on the degree of urbanisation, education, ethnicity or average age in a municipality?

For this, let’s load a data set of municipality-level statistics provided by the Dutch stats bureau CBS:

(The Netherlands might seem like a small country, but there’s still quite a bit of regional and local diversity in terms of both demographics and voting behavior!)

Columns in the demographics data

Variable	Explanation
gm	Unique code for each municipality
municipality	Unique name of each municipality
v01_pop	Total population
v57_density	Population density
v43_nl	Share of residents without migration history
v122_disposable	Average disposable income
v132_income	Average income
v142_wealth	Average household wealth
v153_pension	Share of residents receiving pension
v212_distance_hospital	Average distance to nearest hospital
v216_distance_school	Average distance to nearest school
v225_density_restaurants	Density of restaurants
c_65plus	Share of residents aged 65 and older

Combining voting and demographic data

As you might have noticed, both demographics and votes have a gm column that identifies the municipality. This makes it relatively easy to join these two data sets:

As you can see, in Amsterdam the left-wing/progressive PvdA/GL party did very well, and the percentage of inhabitants without a migration history is relatively low.

You can also see that the demographics per municipality are repeated for the different parties. This might seem awkward, but since e.g. population doesn’t vary per party it’s the only way to combine municipality-level information with the voting results that are there for each municipality and party. It will also turn out that this is actually a fairly useful representation when we start analysing and visualising it.
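The join and the repetition it causes can be sketched on two toy tables. The sketch uses dplyr's inner_join(), which may differ from the join function used for the real data; the municipalities, codes and numbers are invented.

```r
library(dplyr)
library(tibble)

# Toy voting data: one row per municipality and party (invented numbers)
votes <- tribble(
  ~gm,   ~municipality, ~party,    ~votes,
  "X01", "Town A",      "Party A", 40,
  "X01", "Town A",      "Party B", 60,
  "X02", "Town B",      "Party A", 55,
  "X02", "Town B",      "Party B", 45
)

# Toy demographics: one row per municipality (invented numbers)
demographics <- tribble(
  ~gm,   ~municipality, ~v01_pop, ~v43_nl,
  "X01", "Town A",      100000,   60,
  "X02", "Town B",      20000,    90
)

# Without `by`, the join uses all columns that share a name
combined <- inner_join(votes, demographics)
combined
```

The result has one row per municipality and party, so the population of each town appears once for every party. dplyr also prints a message naming the columns it joined on.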

Joining on different columns

You can see that it actually joined on two columns, gm and municipality – the default behavior is to join on all columns with identical names. In this case, this is perfectly fine as the municipality column is also identical between the two data sets. If needed, you can specify which columns to join on using by =, but we would generally just rename or delete columns so that columns with identical names also have identical meaning and it’s safe to join on them.
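To illustrate the difference, the sketch below joins two toy tables (invented codes and numbers) first with an explicit by argument and then after dropping the duplicate column.

```r
library(dplyr)
library(tibble)

# Toy tables with invented values
votes <- tribble(
  ~gm,   ~municipality, ~party,    ~votes,
  "X01", "Town A",      "Party A", 40,
  "X02", "Town B",      "Party A", 55
)
demographics <- tribble(
  ~gm,   ~municipality, ~v01_pop,
  "X01", "Town A",      100000,
  "X02", "Town B",      20000
)

# Joining on gm only: the other shared column is kept twice,
# as municipality.x and municipality.y
by_gm <- inner_join(votes, demographics, by = "gm")
names(by_gm)

# Alternative: drop the duplicate column first, then join on what is left
clean <- inner_join(votes, select(demographics, -municipality), by = "gm")
names(clean)
```

The second variant follows the advice above: once only the key column is shared, there is no ambiguity about what the join matches on.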

A first scatter plot

Now that we have our combined data, we can make a scatter plot of support for the anti-immigrant PVV party against the percentage of residents without a migration history: does living in a place with many migrants make one more likely to vote PVV?
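A sketch of such a scatter plot on synthetic data is shown below. The combined table here is generated from simple formulas, with fictional municipality codes; the pattern in it is built in purely so there is something to look at and says nothing about real voters.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Synthetic municipalities (fictional codes, generated values)
n <- 30
municipalities <- tibble(
  gm = sprintf("X%02d", 1:n),
  v43_nl = seq(55, 95, length.out = n),
  v57_density = seq(4000, 100, length.out = n)
)

# Synthetic vote shares for two parties, stacked into one long table
combined <- bind_rows(
  mutate(municipalities, party = "PVV", votes = 5 + 0.25 * v43_nl + 2 * sin(1:n)),
  mutate(municipalities, party = "BBB", votes = 40 - 10 * log10(v57_density) + cos(1:n))
)

pvv_plot <- combined |>
  filter(party == "PVV") |>                        # one party, so one dot per municipality
  ggplot(mapping = aes(x = v43_nl, y = votes)) +
  geom_point()
pvv_plot
```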

Change the code to show the relation between living in a less densely populated municipality and voting for the BBB (farmers’ party). See the info box below for a description of columns in the data set.

🎓 Solution:

Change the name of the party to "BBB" directly, and remember to put it in quotes since it’s a literal text. Change the x column in the mapping to v57_density – no quotes needed, as for R it’s not a literal text but the name of a column that R knows about.

combined |>
  filter(party == "BBB") |>
  ggplot(mapping = aes(x = v57_density, y = votes)) +
  geom_point()

Have a good look at the resulting graph. What does each dot represent? Is there a relation between population density and voting for the farmers’ party? Is it a linear relationship?

Why do we need to filter before making the graph? What would each dot represent if we skipped the filter?

Make it beautiful

The plot you made above is informative: we can clearly see a (non-linear) relationship between the population density of a municipality and the support for the farmers’ party.

However, we can make it a lot nicer to look at, and possibly add more information. We can change how it looks by changing labels, changing the theme, and other customizations. In general, it’s good practice to never include ‘technical’ terms like v43_nl in a graph, but rather use a name that makes sense to your readers. A good title and axis labels can make a graph understandable to a reader even without reading the text.

More information is generally added by including additional aesthetics. For example, we could use the size and color of points to represent total population and percentage of the population without an immigration background. This would allow us to see three relations in the same graph: perhaps the farmers’ party is really channeling nativist sentiment? How much information to include in a graph is always a trade-off: more information can also mean more clutter and make the graph harder to interpret.
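The sketch below shows one possible way to add these elements. It runs on synthetic data (fictional towns, values generated from formulas), and the colors, labels and the choice to label the three highest-scoring towns are illustrative assumptions rather than the settings of the original figure.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Synthetic data for one party (fictional towns, generated values)
n <- 30
combined <- tibble(
  municipality = paste("Town", 1:n),
  party = "BBB",
  v57_density = seq(100, 4000, length.out = n),
  v01_pop = seq(5000, 300000, length.out = n),
  v43_nl = seq(95, 55, length.out = n),
  votes = 40 - 10 * log10(v57_density) + cos(1:n)
)

# Helper that picks the rows to label: the three highest vote shares
highlight <- function(d) slice_max(d, votes, n = 3)

fancy_plot <- combined |>
  filter(party == "BBB") |>
  ggplot(aes(x = v57_density, y = votes)) +
  geom_point(aes(size = v01_pop, color = v43_nl), alpha = 0.6) +  # two extra aesthetics
  geom_text(data = highlight, aes(label = municipality), vjust = -1) +
  scale_x_log10() +                                               # spread out the low densities
  scale_color_gradient(low = "orange", high = "darkgreen") +
  labs(
    title = "Support for BBB by population density (synthetic data)",
    x = "Population density (log scale)",
    y = "Vote share (%)",
    size = "Population",
    color = "% without migration history"
  ) +
  theme_minimal()
fancy_plot
```

Every label that reaches the reader is set in labs(), so technical column names such as v43_nl never appear in the figure.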

You don’t need to understand all of the code above. In fact, we made it relatively complex to show off many of the cool things you can do with ggplot, such as using custom scales, alpha values, and adding text for selected cases to highlight them. However, we would like to challenge you to see if the code makes sense, and especially to play around with it a bit.

Can you use other columns?
Can you change the colors?
Can you change the names?

As you can see, the final plot used a lot of additional commands after the initial ggplot and geom_point specification. In general, this is how ggplot works: it makes it really easy to make a decent-looking, informative graph, and you can then add additional functions to customize anything you want until you have exactly the graph you need. For more example graphs (with their source code!), see the R Graph Gallery. For a more in-depth explanation of visualization principles (also with R source code), see the excellent and free book by Kieran Healy on Data Visualization.

2.5 Simple statistics

Even though this part of the book is not focused on statistics, it’s hard to pass an opportunity to test a nice hypothesis.

Let’s do a correlation test first between voting and population density. Note that the statistics commands are not from tidyverse, so they don’t tend to work nicely in a pipeline. Instead, we first filter the data for a single party, and then use $ to refer to a column:
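As a sketch, the correlation test could look as follows. The data are synthetic, with a negative relation between density and vote share built in by construction, so the outcome illustrates the mechanics only.

```r
library(dplyr)
library(tibble)

# Synthetic data (generated values, not real results)
n <- 30
combined <- tibble(
  party = "BBB",
  v57_density = seq(100, 4000, length.out = n),
  votes = 40 - 10 * log10(v57_density) + cos(1:n)
)

# Step 1: filter to a single party and store the result
bbb <- combined |> filter(party == "BBB")

# Step 2: pass the two columns to cor.test() using $
result <- cor.test(bbb$v57_density, bbb$votes)
result
```

The printed result contains the correlation estimate, its confidence interval and the p-value; individual parts are available as result$estimate and result$p.value.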

We can also do a multiple regression to see whether the percentage of inhabitants with a migration history might be a confounder:
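A regression along these lines can be sketched with lm(). The data below are synthetic (generated from formulas), so the coefficients carry no substantive meaning.

```r
library(dplyr)
library(tibble)

# Synthetic data (generated values, not real results)
n <- 30
combined <- tibble(
  party = "BBB",
  v57_density = seq(100, 4000, length.out = n),
  v43_nl = 75 + 15 * sin(1:n),
  votes = 40 - 10 * log10(v57_density) + cos(1:n)
)

bbb <- combined |> filter(party == "BBB")

# Vote share modelled on density, controlling for v43_nl
model <- lm(votes ~ v57_density + v43_nl, data = bbb)
summary(model)
```

The formula puts the dependent variable on the left of ~ and the predictors, separated by +, on the right. Because lm() takes the data frame through its data argument, columns can be named directly in the formula.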

What do the results mean about support for the BBB party? Are municipality-level statistics informative about individual voter motivations?

Can you model the vote share for another party such as the progressive PvdA/GL?

2.6 Plotting maps

As a final example of a cool visualization, let’s plot a map of the Netherlands showing a party’s vote share per municipality:

Shape files

In ggplot, maps are plotted by treating each geographical region as a shape, and then plotting these shapes.

For the Netherlands, we prepared a shapefile that matches the municipalities from the dataset used above:

As you can see, this is a special data frame with one or more rows for each municipality, each represented as a ‘multi-polygon’.

Now, we can plot this map using a regular ggplot command, using the geom_sf geometrical object, which can directly use the geom column from the shapefile. We can also specify the fill color of regions, for example using the province column:
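To show the mechanics without the real shapefile, the sketch below builds a toy map of three unit squares with the sf package and fills them by a province column. The region codes, province names and shapes are invented placeholders.

```r
library(sf)
library(ggplot2)

# Helper that builds a 1 x 1 square polygon with its lower-left corner at (x0, y0)
square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Toy stand-in for the shapefile: three fictional regions
shapes <- st_sf(
  gm = c("X01", "X02", "X03"),
  province = c("North", "North", "South"),
  geom = st_sfc(square(0, 1), square(1, 1), square(0, 0))
)

# geom_sf() finds the geometry column on its own; fill is mapped like any other aesthetic
toy_map <- ggplot(shapes) +
  geom_sf(aes(fill = province))
toy_map
```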

Shape files

We provide a custom shape file for Dutch municipalities because the Netherlands is not included in a standard R package; the shapes were kindly provided by the statistics bureau. You can generally find shape files for other countries online: some are built into R packages, while others still need to be converted. For more information and many US-centric examples, check out Data Visualization Chapter 7: Draw Maps.

Plotting vote results

Now, let’s join the shape file with the voting results, and make a plot of support for PVV per municipality:
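A sketch of this join-then-plot step on toy data follows. The three square regions and the vote shares are invented, and the fill colors are an arbitrary choice.

```r
library(sf)
library(dplyr)
library(ggplot2)
library(tibble)

# Helper that builds a 1 x 1 square polygon with its lower-left corner at (x0, y0)
square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Toy shapes and toy vote shares (invented values)
shapes <- st_sf(
  gm = c("X01", "X02", "X03"),
  geom = st_sfc(square(0, 1), square(1, 1), square(0, 0))
)
votes <- tibble(
  gm = rep(c("X01", "X02", "X03"), times = 2),
  party = rep(c("PVV", "BBB"), each = 3),
  votes = c(10, 20, 30, 15, 5, 25)
)

# Keep one party so each region matches exactly one row, then join on gm
pvv <- votes |> filter(party == "PVV")
map_data <- inner_join(shapes, pvv, by = "gm")

vote_map <- ggplot(map_data) +
  geom_sf(aes(fill = votes)) +
  scale_fill_gradient(low = "white", high = "darkblue")
vote_map
```

Putting the shapes on the left side of the join keeps the result a spatial data frame, so geom_sf() can still find the geometry.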

Let’s play around with the code above a bit:

Can you make a map for the BBB or PvdA/GL instead?
Can you change the color so it matches the party family?
As a real challenge, can you make a map of disposable income per municipality?