Roll Your Own “Spotify Wrapped” with Arrow, R, and ListenBrainz

by Rakka   Tags:

View the original rendered R Notebook.

December comes again, and everybody is looking back at the past year. Seeing everyone’s Spotify Wrapped, I felt a little left out, since I don’t stream music. However, I have QuodLibet set up to scrobble listens to ListenBrainz, so let’s use Arrow and dplyr to generate our own retrospective. It won’t be exactly the same though, since I don’t have stuff like genre data easily available.

Setup

We’ll load all the libraries we’re using first.

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
library(duckdb)
library(ggplot2)

Our dataset is stored as partitioned Parquet files, created using another script that grabs data from the ListenBrainz API.

ds <- open_dataset("listens", partitioning = c("year", "month"))
ds
FileSystemDataset with 92 Parquet files
listened_at: int64
artist_name: string
track_name: string
artist_mbids: list<element: string>
recording_mbid: string
release_mbid: string
year: int32
month: int32

Let’s get cracking!

Disclaimer: my R/dplyr is terrible, so apologies for any mistakes or unidiomatic patterns.

Total Rows

First, let’s get an idea of how big the dataset is.

ds %>% count() %>% collect()
int n
53405

This…isn’t big data. It’s not even medium data. It’s pretty small. So the tools we’re going to use here are massive overkill, but that’s okay.

Top Songs of the Year

First of all, let’s set up part of the query that we’re going to need a lot: filtering down to listens from the year 2021.

ds2021 <- ds %>%
  filter(year == 2021) %>%
  select(listened_at, artist_name, track_name, month)

And now a simple GROUP BY gets us top songs of the year:

ds2021 %>%
  group_by(track_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()
chr track_name int listens
夕日坂 400
Tiny Stars 228
メルト (MikuV3 by Cillia) 191
キャットフード 165
ODDS&ENDS 156
劣等上等 149
ロミオとシンデレラ 119
Eternal Snow 107
春はゆく 105
New Future 103
Myself 102
メルト (Otomachi Una by Cillia) 90
Yoru wo Kakeru (夜を駆ける / Run Through the Night) 89
歌に形はないけれど (doriko) 86
I♥U 83
Eternal Snow (Diamond Dust mix) 81
夢路 80
大喜 78
星屑ユートピア 78
羞花 77
Gimme×Gimme 76
面影ワープ 76
一百万个可能(翻唱:Christine Welch) 74
ふたつの影 72
Returns 71

Top Artists of the Year

Getting the top artists is also straightforward:

ds2021 %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()
chr artist_name int listens
石进 994
nano.RIPE 460
ヨルシカ 413
The National 400
Changin' My Life 394
doriko 388
ryo feat. 初音ミク 281
Yasunori Mitsuda 233
澁谷かのん(CV.伊達さゆり)、唐可可(CV.Liyuu) 228
Kan Gao 216
TENMON (天門) 212
doriko feat.初音ミク 201
音阙诗听 200
八王子P feat. 初音ミク 193
ryo (supercell) feat. 初音ミク 156
Tomoya Naka 154
こはならむ 154
RAISE A SUILEN 152
Spitz 150
纯白, 花儿不哭 feat. 洛天依 137
Yuki Kajiura 135
Poppin'Party 133
Aimer 131
RADWIMPS 127
K.D. 123

Unfortunately, due to data quality, this isn’t perfect. For instance, we have both “doriko” and “doriko feat.初音ミク”. We can manually address this with case_when:

ds2021 %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "音阙诗听",
      TRUE ~ artist_name)
    ) %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()
chr artist_name int listens
石进 994
doriko 667
ryo (supercell) 535
nano.RIPE 460
音阙诗听 417
ヨルシカ 413
The National 400
Changin' My Life 394
Hanatan 304
Yasunori Mitsuda 233
澁谷かのん(CV.伊達さゆり)、唐可可(CV.Liyuu) 228
Yuki Kajiura 225
Kan Gao 216
TENMON (天門) 212
八王子P feat. 初音ミク 193
こはならむ 154
Tomoya Naka 154
RAISE A SUILEN 152
Spitz 150
纯白, 花儿不哭 feat. 洛天依 137
Poppin'Party 133
Aimer 131
RADWIMPS 127
K.D. 123
Belle 120

This doesn’t scale too well, but it gets us what we’re after. I’m surprised 石进 beats out everyone else!

Top by Month

Now let’s dig down deeper. For this, we’ll need to bring out the big guns: window functions! Arrow itself as of this writing (the development version of v7.0.0) lacks direct support for them. However, that’s no problem: the whole point of Arrow is to facilitate painless, costless data exchange across systems, and so the Arrow R package can seamlessly send data to DuckDB for further analysis, without copying data!

ds2021 %>%
  group_by(month) %>%
  count(track_name) %>%
  to_duckdb() %>%
  mutate(rank = min_rank(desc(n))) %>%
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  collect()
int month chr track_name dbl n dbl rank
1 星屑ユートピア 35 1
1 月影とブランコ 25 2
1 夢路 23 3
2 ダブルラリアット [Double Bass Drum Remix 2021] (feat. 巡音ルカ) 31 1
2 キャットフード 19 2
2 面影ワープ 18 3
2 嘘月 18 3
3 劣等上等 86 1
3 花的微笑 29 2
3 みんなみくみくにしてあげる♪ 28 3
4 メルト (MikuV3 by Cillia) 102 1
4 メルト (Otomachi Una by Cillia) 31 2
4 劣等上等 28 3
4 羞花 28 3
5 Myself 37 1
5 Heart Chrome 30 2
5 ODDS&ENDS 30 2
6 New Future 24 1
6 ODDS&ENDS 21 2
6 ふたつの影 21 2
6 Myself 21 2
7 春はゆく 79 1
7 一百万个可能(翻唱:Christine Welch) 64 2
7 寅时 43 3
8 夕日坂 71 1
8 Tiny Stars 71 1
8 ODDS&ENDS 38 3
8 中华缘木娘 (Chinese Yuanmu Girl) 38 3
9 夕日坂 175 1
9 Tiny Stars 107 2
9 ODDS&ENDS 31 3
10 夕日坂 55 1
10 キャットフード 33 2
10 Great Wide Unknown (Bonus Song) 27 3
11 夕日坂 38 1
11 歌に形はないけれど 33 2
11 Tiny Stars 21 3
12 夕日坂 27 1
12 song for you(サニーピースver.) 26 2
12 涙だけ / feat. IA* 15 3

Just for fun, we can see the generated SQL too:

ds2021 %>%
  group_by(month) %>%
  count(track_name) %>%
  to_duckdb() %>%
  mutate(rank = min_rank(desc(n))) %>%
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  show_query()
<SQL>
SELECT "month", "track_name", "n", "rank"
FROM (SELECT "month", "track_name", "n", "rank", RANK() OVER (PARTITION BY "month" ORDER BY "rank") AS "q01"
FROM (SELECT "month", "track_name", "n", RANK() OVER (PARTITION BY "month" ORDER BY "n" DESC) AS "rank"
FROM "arrow_002") "q01") "q02"
WHERE ("q01" <= 3)
ORDER BY "month", "rank"

I’m…not certain that’s optimal, but I didn’t have to work it out by hand, so I won’t complain!

And we can do the same for artists:

ds2021 %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "音阙诗听",
      TRUE ~ artist_name)
    ) %>%
  group_by(month) %>%
  count(artist_name) %>%
  to_duckdb() %>%
  mutate(rank = min_rank(desc(n))) %>%
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  collect()
int month chr artist_name dbl n dbl rank
1 nano.RIPE 156 1
1 ヨルシカ 102 2
1 The National 57 3
2 nano.RIPE 72 1
2 ヨルシカ 38 2
2 鬱P 31 3
2 石进 31 3
3 石进 568 1
3 RAISE A SUILEN 82 2
3 ヨルシカ 68 3
4 ryo (supercell) 152 1
4 石进 98 2
4 The National 66 3
5 Changin' My Life 70 1
5 ryo (supercell) 63 2
5 tilt-six 53 3
6 石进 108 1
6 Changin' My Life 73 2
6 ryo (supercell) 66 3
7 音阙诗听 101 1
7 Aimer 87 2
7 Tomoya Naka 82 3
8 doriko 96 1
8 ryo (supercell) 89 2
8 纯白, 花儿不哭 feat. 洛天依 84 3
9 doriko 193 1
9 石进 123 2
9 澁谷かのん(CV.伊達さゆり)、唐可可(CV.Liyuu) 107 3
10 doriko 122 1
10 Kan Gao 91 2
10 Hanatan 77 3
11 まらしぃ 86 1
11 doriko 70 2
11 K.D. 43 3
12 doriko 32 1
12 サニーピース 26 2
12 Yuki Kajiura 15 3
12 傘村トータ 15 3

Fun!

Listens Over Time

Finally, let’s look at all our data, not just 2021. We’ll take the 10 most-listened songs, then plot how the listen count has accumulated over time.

fixed_tracks <- ds %>%
  mutate(track_name = case_when(
    str_detect(track_name, fixed("Yoru wo Kakeru")) ~ "Run through the Night",
    str_detect(track_name, fixed("夜を駆ける")) ~ "Run through the Night",
    str_detect(track_name, fixed("Grand Escape")) ~ "Grand Escape",
    str_detect(track_name, fixed("メルト")) ~ "Melt",
    str_detect(track_name, fixed("キャットフード")) ~ "Cat Food",
    str_detect(track_name, fixed("ロミオとシンデレラ")) ~ "Romeo and Cinderella",
    str_detect(track_name, fixed("夕日坂")) ~ "Sunset Hill",
    TRUE ~ track_name
    ))
top10 <- fixed_tracks %>%
  group_by(track_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(10) %>%
  collect()
top10
chr track_name int listens
Romeo and Cinderella 725
Sunset Hill 432
Grand Escape 356
Melt 350
Pink Rabbits 280
Carin at the Liquor Store 266
Run through the Night 253
Cat Food 237
Tiny Stars 228
Give me!! 198

Next, we’ll use a join to filter our original dataset, keeping only the top 10 songs:

top10_over_time <- fixed_tracks %>%
  select(listened_at, track_name) %>%
  inner_join(top10) %>%
  select(listened_at, track_name)
top10_over_time %>% collect()
int listened_at chr track_name
1422810718 Pink Rabbits
1422844759 Pink Rabbits
1423278815 Pink Rabbits
1423358788 Pink Rabbits
1423447698 Pink Rabbits
1423958002 Pink Rabbits
1423972881 Pink Rabbits
1424052333 Pink Rabbits
1424143077 Pink Rabbits
1424198856 Pink Rabbits
(data trimmed…)

Finally, we’ll swap over to DuckDB and compute a cumulative sum:

cumulative_listens <- top10_over_time %>%
  select(listened_at, track_name) %>%
  # Get something to cumsum
  mutate(n = 1) %>%
  group_by(track_name) %>%
  to_duckdb() %>%
  arrange(listened_at) %>%
  mutate(listens = cumsum(n)) %>%
  to_arrow() %>%
  mutate(listened_at = cast(listened_at, timestamp("s"))) %>%
  arrange(listened_at) %>%
  collect()
cumulative_listens %>% head(10)
S3: POSIXct listened_at chr track_name dbl n dbl listens
2015-01-01 12:53:58 Pink Rabbits 1 1
2015-01-01 13:46:05 Pink Rabbits 1 2
2015-01-01 19:56:49 Pink Rabbits 1 3
2015-01-02 23:41:07 Pink Rabbits 1 4
2015-01-03 11:09:45 Pink Rabbits 1 5
2015-01-03 12:52:05 Pink Rabbits 1 6
2015-01-03 18:08:59 Pink Rabbits 1 7
2015-01-04 11:16:51 Pink Rabbits 1 8
2015-01-04 11:50:30 Pink Rabbits 1 9
2015-01-05 16:17:46 Pink Rabbits 1 10

(We send the data back to Arrow to convert to timestamps since the DuckDB integration gives a type error if we do it beforehand.)

ggplot(data = cumulative_listens, aes(x = listened_at, y = listens, group = track_name, colour = track_name)) +
  ggtitle("Cumulative Listens of Top 10 Songs Over Time") +
  xlab("# of listens") +
  ylab("Song") +
  labs(colour = "Song") +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "bottom", legend.box = "vertical") +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))
Cumulative listens of the top 10 songs in my library over time, since 2014.
“Romeo and Cinderella” is by FAR the winner, though “Sunset Hill” also rocketed up pretty fast. Can you guess when the pandemic started?

We can do the same for artists.

fixed_artists <- ds %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("Kalafina")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("FictionJunction")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "interestingcn",
      str_detect(artist_name, fixed("石进")) ~ "Shi Jin",
      str_detect(artist_name, fixed("光田康典")) ~ "Yasunori Mitsuda",
      str_detect(artist_name, fixed("Yasunori Mitsuda, Millenial Fair")) ~ "Yasunori Mitsuda",
      str_detect(artist_name, fixed("植松伸夫")) ~ "Nobuo Uematsu",
      str_detect(artist_name, fixed("Nobuo Uematsu / Ririko Yamabuki")) ~ "Nobuo Uematsu",
      str_detect(artist_name, fixed("スピッツ")) ~ "Spitz",
      str_detect(artist_name, fixed("天門")) ~ "TENMON",
      str_detect(artist_name, fixed("ヨルシカ")) ~ "Yorushika",
      str_detect(artist_name, fixed("Dj Okawari, Emily Styler")) ~ "DJ Okawari",
      TRUE ~ artist_name)
    )
top10_artists <- fixed_artists %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(10) %>%
  collect()
top10_artists
chr artist_name int listens
Yuki Kajiura 4270
The National 2970
Yasunori Mitsuda 2327
TENMON 2211
Yorushika 1800
Shi Jin 1689
Nobuo Uematsu 1587
RADWIMPS 1504
interestingcn 1488
doriko 993
top10_artists_over_time <- fixed_artists %>%
  select(listened_at, artist_name) %>%
  inner_join(top10_artists) %>%
  select(listened_at, artist_name)
cumulative_artists <- top10_artists_over_time %>%
  select(listened_at, artist_name) %>%
  mutate(n = 1) %>%
  group_by(artist_name) %>%
  to_duckdb() %>%
  arrange(listened_at) %>%
  mutate(listens = cumsum(n)) %>%
  to_arrow() %>%
  mutate(listened_at = cast(listened_at, timestamp("s"))) %>%
  arrange(listened_at) %>%
  collect()
cumulative_artists %>% head(10)
S3: POSIXct listened_at chr artist_name dbl n dbl listens
2014-05-26 12:09:38 Yuki Kajiura 1 1
2014-05-26 12:23:06 Yasunori Mitsuda 1 1
2014-05-26 12:28:42 Yasunori Mitsuda 1 2
2014-05-26 12:35:01 Yuki Kajiura 1 2
2014-05-27 11:20:10 Yuki Kajiura 1 3
2014-05-27 11:30:44 Yuki Kajiura 1 4
2014-05-27 11:43:46 Yasunori Mitsuda 1 3
2014-05-27 11:55:56 Yuki Kajiura 1 5
2014-05-27 12:00:16 Yuki Kajiura 1 6
2014-05-27 12:04:23 Yuki Kajiura 1 7
ggplot(data = cumulative_artists, aes(x = listened_at, y = listens, group = artist_name, colour = artist_name)) +
  ggtitle("Cumulative Listens of Top 10 Artists Over Time") +
  xlab("# of listens") +
  ylab("Artist") +
  labs(colour = "Artist") +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "bottom", legend.box = "vertical") +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))
Cumulative listens of songs by the top 10 artists in my library over time, since 2014.
This one's less dramatic, though Yuki Kajiura has been a consistent winner from the start. It's interesting watching artists like Yorushika, who started off flat and then suddenly rocketed up.

ListenBrainz Analysis

Added 2021-12-16.

It turns out ListenBrainz does its own Year in Music, some of which is attached below. They don’t do data cleaning, presumably because what you should do is go update the ListenBrainz database and retag your library (which I should be better about doing, for sure).

ListenBrainz's top artists of the year for me.
ListenBrainz's top releases (albums) of the year for me.
ListenBrainz's top recordings (songs) of the year for me.