December comes again, and everybody is looking back at the past year. Seeing everyone’s Spotify Wrapped, I felt a little left out, since I don’t stream music. However, I have QuodLibet set up to scrobble listens to ListenBrainz, so let’s use Arrow and dplyr to generate our own retrospective. It won’t be exactly the same though, since I don’t have stuff like genre data easily available.

Setup

We’ll load all the libraries we’re using first.

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
library(duckdb)
Loading required package: DBI
library(ggplot2)

Our dataset is stored as partitioned Parquet files, created using another script that grabs data from the ListenBrainz API.

ds <- open_dataset("listens", partitioning = c("year", "month"))
ds
FileSystemDataset with 92 Parquet files
listened_at: int64
artist_name: string
track_name: string
artist_mbids: list<element: string>
recording_mbid: string
release_mbid: string
year: int32
month: int32

Let’s get cracking!

Disclaimer: my R/dplyr is terrible, so apologies for any mistakes or unidiomatic patterns.

Total Rows

First, let’s get an idea of how big the dataset is.

ds %>% count() %>% collect()

This…isn’t big data. It’s not even medium data. It’s pretty small. So the tools we’re going to use here are massive overkill, but that’s okay.

Top Songs of the Year

First of all, let’s set up part of the query that we’re going to need a lot: filtering down to listens from the year 2021.

# This is sketchy because my R is bad
year2021 <- as.numeric(lubridate::ymd_hms("2021-01-01 00:00:00"))
ds2021 <- ds %>% 
  filter(year == 2021) %>%
  select(listened_at, artist_name, track_name, month)

And now a simple GROUP BY gets us top songs of the year:

ds2021 %>%
  group_by(track_name) %>% 
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()

Top Artists of the Year

Getting the top artists is also straightforward:

ds2021 %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()

Unfortunately, due to data quality, this isn’t perfect. For instance, we have both “doriko” and “doriko feat.初音ミク”. We can manually address this with case_when:

ds2021 %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "音阙诗听",
      TRUE ~ artist_name)
    ) %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()

This doesn’t scale too well, but it gets us what we’re after. I’m surprised 石进 beats out everyone else!

Top by Month

Now let’s dig down deeper. For this, we’ll need to bring out the big guns: window functions! Arrow itself as of this writing (the development version of v7.0.0) lacks direct support for them. However, that’s no problem: the whole point of Arrow is to facilitate painless, costless data exchange across systems, and so the Arrow R package can seamlessly send data to DuckDB for further analysis, without copying data!

ds2021 %>% 
  group_by(month) %>% 
  count(track_name) %>% 
  to_duckdb() %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  collect()

Just for fun, we can see the generated SQL too:

ds2021 %>% 
  group_by(month) %>% 
  count(track_name) %>% 
  to_duckdb() %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  show_query()
<SQL>
SELECT "month", "track_name", "n", "rank"
FROM (SELECT "month", "track_name", "n", "rank", RANK() OVER (PARTITION BY "month" ORDER BY "rank") AS "q01"
FROM (SELECT "month", "track_name", "n", RANK() OVER (PARTITION BY "month" ORDER BY "n" DESC) AS "rank"
FROM "arrow_002") "q01") "q02"
WHERE ("q01" <= 3)
ORDER BY "month", "rank"

I’m…not certain that’s optimal, but I didn’t have to work it out by hand, so I won’t complain!

And we can do the same for artists:

ds2021 %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "音阙诗听",
      TRUE ~ artist_name)
    ) %>%
  group_by(month) %>%
  count(artist_name) %>%
  to_duckdb() %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  collect()

Fun!

Listens Over Time

Finally, let’s look at all our data, not just 2021. We’ll take the 10 most-listened songs, then plot how the listen count has accumulated over time.

fixed_tracks <- ds %>%
  mutate(track_name = case_when(
    str_detect(track_name, fixed("Yoru wo Kakeru")) ~ "Run through the Night",
    str_detect(track_name, fixed("夜を駆ける")) ~ "Run through the Night",
    str_detect(track_name, fixed("Grand Escape")) ~ "Grand Escape",
    str_detect(track_name, fixed("メルト")) ~ "Melt",
    str_detect(track_name, fixed("キャットフード")) ~ "Cat Food",
    str_detect(track_name, fixed("ロミオとシンデレラ")) ~ "Romeo and Cinderella",
    str_detect(track_name, fixed("夕日坂")) ~ "Sunset Hill",
    TRUE ~ track_name
    ))
top10 <- fixed_tracks %>%
  group_by(track_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(10) %>% 
  collect()
top10

Next, we’ll use a join to filter our original dataset, keeping only the top 10 songs:

top10_over_time <- fixed_tracks %>%
  select(listened_at, track_name) %>%
  inner_join(top10) %>%
  select(listened_at, track_name)
top10_over_time %>% collect()

Finally, we’ll swap over to DuckDB and compute a cumulative sum:

cumulative_listens <- top10_over_time %>%
  select(listened_at, track_name) %>%
  # Get something to cumsum
  mutate(n = 1) %>%
  group_by(track_name) %>%
  to_duckdb() %>%
  arrange(listened_at) %>%
  mutate(listens = cumsum(n)) %>%
  to_arrow() %>%
  mutate(listened_at = cast(listened_at, timestamp("s"))) %>%
  arrange(listened_at) %>%
  collect()
cumulative_listens %>% head(10)

(We send the data back to Arrow to convert to timestamps since the DuckDB integration gives a type error if we do it beforehand.)

ggplot(data = cumulative_listens, aes(x = listened_at, y = listens, group = track_name, colour = track_name)) +
  ggtitle("Cumulative Listens of Top 10 Songs Over Time") +
  xlab("# of listens") +
  ylab("Song") +
  labs(colour = "Song") +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "bottom", legend.box = "vertical") +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))

We can do the same for artists.

fixed_artists <- ds %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("Kalafina")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("FictionJunction")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "interestingcn",
      str_detect(artist_name, fixed("石进")) ~ "Shi Jin",
      str_detect(artist_name, fixed("光田康典")) ~ "Yasunori Mitsuda",
      str_detect(artist_name, fixed("Yasunori Mitsuda, Millenial Fair")) ~ "Yasunori Mitsuda",
      str_detect(artist_name, fixed("植松伸夫")) ~ "Nobuo Uematsu",
      str_detect(artist_name, fixed("Nobuo Uematsu / Ririko Yamabuki")) ~ "Nobuo Uematsu",
      str_detect(artist_name, fixed("スピッツ")) ~ "Spitz",
      str_detect(artist_name, fixed("天門")) ~ "TENMON",
      str_detect(artist_name, fixed("ヨルシカ")) ~ "Yorushika",
      str_detect(artist_name, fixed("Dj Okawari, Emily Styler")) ~ "DJ Okawari",
      TRUE ~ artist_name)
    )
top10_artists <- fixed_artists %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(10) %>% 
  collect()
top10_artists
top10_artists_over_time <- fixed_artists %>%
  select(listened_at, artist_name) %>%
  inner_join(top10_artists) %>%
  select(listened_at, artist_name)
cumulative_artists <- top10_artists_over_time %>%
  select(listened_at, artist_name) %>%
  mutate(n = 1) %>%
  group_by(artist_name) %>%
  to_duckdb() %>%
  arrange(listened_at) %>%
  mutate(listens = cumsum(n)) %>%
  to_arrow() %>%
  mutate(listened_at = cast(listened_at, timestamp("s"))) %>%
  arrange(listened_at) %>%
  collect()
cumulative_artists %>% head(10)
ggplot(data = cumulative_artists, aes(x = listened_at, y = listens, group = artist_name, colour = artist_name)) +
  ggtitle("Cumulative Listens of Top 10 Artists Over Time") +
  xlab("# of listens") +
  ylab("Artist") +
  labs(colour = "Artist") +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "bottom", legend.box = "vertical") +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))

---
title: "Roll Your Own “Spotify Wrapped” with Arrow, R, and ListenBrainz"
output: html_notebook
---

December comes again, and everybody is looking back at the past year.
Seeing everyone's Spotify Wrapped, I felt a little left out, since I don't stream music.
However, I have [QuodLibet](https://quodlibet.readthedocs.io/en/latest/) set up to scrobble listens to [ListenBrainz](https://listenbrainz.org/), so let's use [Arrow](https://arrow.apache.org/) and [dplyr](https://dplyr.tidyverse.org/) to generate our own retrospective.
It won't be exactly the same though, since I don't have stuff like genre data easily available.

## Setup

We'll load all the libraries we're using first.

```{r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
library(duckdb)
library(ggplot2)
```

Our dataset is stored as partitioned Parquet files, created using another [script](dump.py) that grabs data from the ListenBrainz API.

```{r}
ds <- open_dataset("listens", partitioning = c("year", "month"))
ds
```

Let's get cracking!

_Disclaimer: my R/dplyr is terrible, so apologies for any mistakes or unidiomatic patterns._

## Total Rows

First, let's get an idea of how big the dataset is.

```{r}
ds %>% count() %>% collect()
```
This…isn't big data.
It's not even medium data.
It's pretty small.
So the tools we're going to use here are massive overkill, but that's okay.

## Top Songs of the Year

First of all, let's set up part of the query that we're going to need a lot: filtering down to listens from the year 2021.

```{r}
ds2021 <- ds %>% 
  filter(year == 2021) %>%
  select(listened_at, artist_name, track_name, month)
```

And now a simple GROUP BY gets us top songs of the year:

```{r}
ds2021 %>%
  group_by(track_name) %>% 
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()
```

## Top Artists of the Year

Getting the top artists is also straightforward:

```{r}
ds2021 %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()
```

Unfortunately, due to data quality, this isn't perfect.
For instance, we have both "doriko" and "doriko feat.初音ミク".
We can manually address this with `case_when`:

```{r}
ds2021 %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "音阙诗听",
      TRUE ~ artist_name)
    ) %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()
```

This doesn't scale too well, but it gets us what we're after.
I'm surprised [石进](https://zh.wikipedia.org/wiki/%E7%9F%B3%E8%BF%9B) beats out everyone else!

## Top by Month

Now let's dig down deeper.
For this, we'll need to bring out the big guns: window functions!
Arrow itself as of this writing (the development version of v7.0.0) lacks direct support for them.
However, that's no problem: the whole point of Arrow is to facilitate painless, costless data exchange across systems, and so the Arrow R package can seamlessly send data to [DuckDB](https://duckdb.org/) for further analysis, without copying data!

```{r}
ds2021 %>% 
  group_by(month) %>% 
  count(track_name) %>% 
  to_duckdb() %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  collect()
```
Just for fun, we can see the generated SQL too:

```{r}
ds2021 %>% 
  group_by(month) %>% 
  count(track_name) %>% 
  to_duckdb() %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  show_query()
```
I'm…not certain that's optimal, but I didn't have to work it out by hand, so I won't complain!

And we can do the same for artists:

```{r}
ds2021 %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "音阙诗听",
      TRUE ~ artist_name)
    ) %>%
  group_by(month) %>%
  count(artist_name) %>%
  to_duckdb() %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  collect()
```

Fun!

## Listens Over Time

Finally, let's look at all our data, not just 2021. 
We'll take the 10 most-listened songs, then plot how the listen count has accumulated over time.

```{r}
fixed_tracks <- ds %>%
  mutate(track_name = case_when(
    str_detect(track_name, fixed("Yoru wo Kakeru")) ~ "Run through the Night",
    str_detect(track_name, fixed("夜を駆ける")) ~ "Run through the Night",
    str_detect(track_name, fixed("Grand Escape")) ~ "Grand Escape",
    str_detect(track_name, fixed("メルト")) ~ "Melt",
    str_detect(track_name, fixed("キャットフード")) ~ "Cat Food",
    str_detect(track_name, fixed("ロミオとシンデレラ")) ~ "Romeo and Cinderella",
    str_detect(track_name, fixed("夕日坂")) ~ "Sunset Hill",
    TRUE ~ track_name
    ))
top10 <- fixed_tracks %>%
  group_by(track_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(10) %>% 
  collect()
top10
```
Next, we'll use a join to filter our original dataset, keeping only the top 10 songs:

```{r}
top10_over_time <- fixed_tracks %>%
  select(listened_at, track_name) %>%
  inner_join(top10) %>%
  select(listened_at, track_name)
top10_over_time %>% collect()
```

Finally, we'll swap over to DuckDB and compute a cumulative sum:

```{r}
cumulative_listens <- top10_over_time %>%
  select(listened_at, track_name) %>%
  # Get something to cumsum
  mutate(n = 1) %>%
  group_by(track_name) %>%
  to_duckdb() %>%
  arrange(listened_at) %>%
  mutate(listens = cumsum(n)) %>%
  to_arrow() %>%
  mutate(listened_at = cast(listened_at, timestamp("s"))) %>%
  arrange(listened_at) %>%
  collect()
cumulative_listens %>% head(10)
```

(We send the data back to Arrow to convert to timestamps since the DuckDB integration gives a type error if we do it beforehand.)

```{r, fig.width = 10, fig.height = 10}
ggplot(data = cumulative_listens, aes(x = listened_at, y = listens, group = track_name, colour = track_name)) +
  ggtitle("Cumulative Listens of Top 10 Songs Over Time") +
  xlab("# of listens") +
  ylab("Song") +
  labs(colour = "Song") +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "bottom", legend.box = "vertical") +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))
```

We can do the same for artists.

```{r}
fixed_artists <- ds %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("Kalafina")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("FictionJunction")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "interestingcn",
      str_detect(artist_name, fixed("石进")) ~ "Shi Jin",
      str_detect(artist_name, fixed("光田康典")) ~ "Yasunori Mitsuda",
      str_detect(artist_name, fixed("Yasunori Mitsuda, Millenial Fair")) ~ "Yasunori Mitsuda",
      str_detect(artist_name, fixed("植松伸夫")) ~ "Nobuo Uematsu",
      str_detect(artist_name, fixed("Nobuo Uematsu / Ririko Yamabuki")) ~ "Nobuo Uematsu",
      str_detect(artist_name, fixed("スピッツ")) ~ "Spitz",
      str_detect(artist_name, fixed("天門")) ~ "TENMON",
      str_detect(artist_name, fixed("ヨルシカ")) ~ "Yorushika",
      str_detect(artist_name, fixed("Dj Okawari, Emily Styler")) ~ "DJ Okawari",
      TRUE ~ artist_name)
    )
top10_artists <- fixed_artists %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(10) %>% 
  collect()
top10_artists
```

```{r}
top10_artists_over_time <- fixed_artists %>%
  select(listened_at, artist_name) %>%
  inner_join(top10_artists) %>%
  select(listened_at, artist_name)
cumulative_artists <- top10_artists_over_time %>%
  select(listened_at, artist_name) %>%
  mutate(n = 1) %>%
  group_by(artist_name) %>%
  to_duckdb() %>%
  arrange(listened_at) %>%
  mutate(listens = cumsum(n)) %>%
  to_arrow() %>%
  mutate(listened_at = cast(listened_at, timestamp("s"))) %>%
  arrange(listened_at) %>%
  collect()
cumulative_artists %>% head(10)
```

```{r, fig.width = 10, fig.height = 10}
ggplot(data = cumulative_artists, aes(x = listened_at, y = listens, group = artist_name, colour = artist_name)) +
  ggtitle("Cumulative Listens of Top 10 Artists Over Time") +
  xlab("# of listens") +
  ylab("Artist") +
  labs(colour = "Artist") +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "bottom", legend.box = "vertical") +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))
```

