Roll Your Own “Spotify Wrapped” with Arrow, R, and ListenBrainz

December 10, 2021 by Rakka • Tags:

View the original rendered R Notebook.

December comes again, and everybody is looking back at the past year. Seeing everyone’s Spotify Wrapped, I felt a little left out, since I don’t stream music. However, I have QuodLibet set up to scrobble listens to ListenBrainz, so let’s use Arrow and dplyr to generate our own retrospective. It won’t be exactly the same though, since I don’t have stuff like genre data easily available.

Setup

We’ll load all the libraries we’re using first.

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
library(duckdb)
library(ggplot2)

Our dataset is stored as partitioned Parquet files, created using another script that grabs data from the ListenBrainz API.

ds <- open_dataset("listens", partitioning = c("year", "month"))
ds

FileSystemDataset with 92 Parquet files
listened_at: int64
artist_name: string
track_name: string
artist_mbids: list<element: string>
recording_mbid: string
release_mbid: string
year: int32
month: int32

Let’s get cracking!

Disclaimer: my R/dplyr is terrible, so apologies for any mistakes or unidiomatic patterns.

Total Rows

First, let’s get an idea of how big the dataset is.

ds %>% count() %>% collect()

int n
53405

This…isn’t big data. It’s not even medium data. It’s pretty small. So the tools we’re going to use here are massive overkill, but that’s okay.

Top Songs of the Year

First of all, let’s set up part of the query that we’re going to need a lot: filtering down to listens from the year 2021.

ds2021 <- ds %>%
  filter(year == 2021) %>%
  select(listened_at, artist_name, track_name, month)

And now a simple GROUP BY gets us top songs of the year:

ds2021 %>%
  group_by(track_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()

chr track_name	int listens
夕日坂	400
Tiny Stars	228
メルト (MikuV3 by Cillia)	191
キャットフード	165
ODDS&ENDS	156
劣等上等	149
ロミオとシンデレラ	119
Eternal Snow	107
春はゆく	105
New Future	103
Myself	102
メルト (Otomachi Una by Cillia)	90
Yoru wo Kakeru (夜を駆ける / Run Through the Night)	89
歌に形はないけれど (doriko)	86
I♥U	83
Eternal Snow (Diamond Dust mix)	81
夢路	80
大喜	78
星屑ユートピア	78
羞花	77
Gimme×Gimme	76
面影ワープ	76
一百万个可能（翻唱：Christine Welch）	74
ふたつの影	72
Returns	71

Top Artists of the Year

Getting the top artists is also straightforward:

ds2021 %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()

chr artist_name	int listens
石进	994
nano.RIPE	460
ヨルシカ	413
The National	400
Changin' My Life	394
doriko	388
ryo feat. 初音ミク	281
Yasunori Mitsuda	233
澁谷かのん(CV.伊達さゆり)、唐可可(CV.Liyuu)	228
Kan Gao	216
TENMON (天門)	212
doriko feat.初音ミク	201
音阙诗听	200
八王子P feat. 初音ミク	193
ryo (supercell) feat. 初音ミク	156
Tomoya Naka	154
こはならむ	154
RAISE A SUILEN	152
Spitz	150
纯白, 花儿不哭 feat. 洛天依	137
Yuki Kajiura	135
Poppin'Party	133
Aimer	131
RADWIMPS	127
K.D.	123

Unfortunately, due to data quality, this isn’t perfect. For instance, we have both “doriko” and “doriko feat.初音ミク”. We can manually address this with case_when:

ds2021 %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "音阙诗听",
      TRUE ~ artist_name)
    ) %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(25) %>%
  collect()

chr artist_name	int listens
石进	994
doriko	667
ryo (supercell)	535
nano.RIPE	460
音阙诗听	417
ヨルシカ	413
The National	400
Changin' My Life	394
Hanatan	304
Yasunori Mitsuda	233
澁谷かのん(CV.伊達さゆり)、唐可可(CV.Liyuu)	228
Yuki Kajiura	225
Kan Gao	216
TENMON (天門)	212
八王子P feat. 初音ミク	193
こはならむ	154
Tomoya Naka	154
RAISE A SUILEN	152
Spitz	150
纯白, 花儿不哭 feat. 洛天依	137
Poppin'Party	133
Aimer	131
RADWIMPS	127
K.D.	123
Belle	120

This doesn’t scale too well, but it gets us what we’re after. I’m surprised 石进 beats out everyone else!

Top by Month

Now let’s dig down deeper. For this, we’ll need to bring out the big guns: window functions! Arrow itself as of this writing (the development version of v7.0.0) lacks direct support for them. However, that’s no problem: the whole point of Arrow is to facilitate painless, costless data exchange across systems, and so the Arrow R package can seamlessly send data to DuckDB for further analysis, without copying data!

ds2021 %>%
  group_by(month) %>%
  count(track_name) %>%
  to_duckdb() %>%
  mutate(rank = min_rank(desc(n))) %>%
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  collect()

int month	chr track_name	dbl n	dbl rank
1	星屑ユートピア	35	1
1	月影とブランコ	25	2
1	夢路	23	3
2	ダブルラリアット [Double Bass Drum Remix 2021] (feat. 巡音ルカ)	31	1
2	キャットフード	19	2
2	面影ワープ	18	3
2	嘘月	18	3
3	劣等上等	86	1
3	花的微笑	29	2
3	みんなみくみくにしてあげる♪	28	3
4	メルト (MikuV3 by Cillia)	102	1
4	メルト (Otomachi Una by Cillia)	31	2
4	劣等上等	28	3
4	羞花	28	3
5	Myself	37	1
5	Heart Chrome	30	2
5	ODDS&ENDS	30	2
6	New Future	24	1
6	ODDS&ENDS	21	2
6	ふたつの影	21	2
6	Myself	21	2
7	春はゆく	79	1
7	一百万个可能（翻唱：Christine Welch）	64	2
7	寅时	43	3
8	夕日坂	71	1
8	Tiny Stars	71	1
8	ODDS&ENDS	38	3
8	中华缘木娘 (Chinese Yuanmu Girl)	38	3
9	夕日坂	175	1
9	Tiny Stars	107	2
9	ODDS&ENDS	31	3
10	夕日坂	55	1
10	キャットフード	33	2
10	Great Wide Unknown (Bonus Song)	27	3
11	夕日坂	38	1
11	歌に形はないけれど	33	2
11	Tiny Stars	21	3
12	夕日坂	27	1
12	song for you(サニーピースver.)	26	2
12	涙だけ　/ feat. IA*	15	3

Just for fun, we can see the generated SQL too:

ds2021 %>%
  group_by(month) %>%
  count(track_name) %>%
  to_duckdb() %>%
  mutate(rank = min_rank(desc(n))) %>%
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  show_query()

<SQL>
SELECT "month", "track_name", "n", "rank"
FROM (SELECT "month", "track_name", "n", "rank", RANK() OVER (PARTITION BY "month" ORDER BY "rank") AS "q01"
FROM (SELECT "month", "track_name", "n", RANK() OVER (PARTITION BY "month" ORDER BY "n" DESC) AS "rank"
FROM "arrow_002") "q01") "q02"
WHERE ("q01" <= 3)
ORDER BY "month", "rank"

I’m…not certain that’s optimal, but I didn’t have to work it out by hand, so I won’t complain!

And we can do the same for artists:

ds2021 %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "音阙诗听",
      TRUE ~ artist_name)
    ) %>%
  group_by(month) %>%
  count(artist_name) %>%
  to_duckdb() %>%
  mutate(rank = min_rank(desc(n))) %>%
  slice_min(n = 3, order_by = rank) %>%
  arrange(month, rank) %>%
  collect()

int month	chr artist_name	dbl n	dbl rank
1	nano.RIPE	156	1
1	ヨルシカ	102	2
1	The National	57	3
2	nano.RIPE	72	1
2	ヨルシカ	38	2
2	鬱P	31	3
2	石进	31	3
3	石进	568	1
3	RAISE A SUILEN	82	2
3	ヨルシカ	68	3
4	ryo (supercell)	152	1
4	石进	98	2
4	The National	66	3
5	Changin' My Life	70	1
5	ryo (supercell)	63	2
5	tilt-six	53	3
6	石进	108	1
6	Changin' My Life	73	2
6	ryo (supercell)	66	3
7	音阙诗听	101	1
7	Aimer	87	2
7	Tomoya Naka	82	3
8	doriko	96	1
8	ryo (supercell)	89	2
8	纯白, 花儿不哭 feat. 洛天依	84	3
9	doriko	193	1
9	石进	123	2
9	澁谷かのん(CV.伊達さゆり)、唐可可(CV.Liyuu)	107	3
10	doriko	122	1
10	Kan Gao	91	2
10	Hanatan	77	3
11	まらしぃ	86	1
11	doriko	70	2
11	K.D.	43	3
12	doriko	32	1
12	サニーピース	26	2
12	Yuki Kajiura	15	3
12	傘村トータ	15	3

Fun!

Listens Over Time

Finally, let’s look at all our data, not just 2021. We’ll take the 10 most-listened songs, then plot how the listen count has accumulated over time.

fixed_tracks <- ds %>%
  mutate(track_name = case_when(
    str_detect(track_name, fixed("Yoru wo Kakeru")) ~ "Run through the Night",
    str_detect(track_name, fixed("夜を駆ける")) ~ "Run through the Night",
    str_detect(track_name, fixed("Grand Escape")) ~ "Grand Escape",
    str_detect(track_name, fixed("メルト")) ~ "Melt",
    str_detect(track_name, fixed("キャットフード")) ~ "Cat Food",
    str_detect(track_name, fixed("ロミオとシンデレラ")) ~ "Romeo and Cinderella",
    str_detect(track_name, fixed("夕日坂")) ~ "Sunset Hill",
    TRUE ~ track_name
    ))
top10 <- fixed_tracks %>%
  group_by(track_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(10) %>%
  collect()
top10

chr track_name	int listens
Romeo and Cinderella	725
Sunset Hill	432
Grand Escape	356
Melt	350
Pink Rabbits	280
Carin at the Liquor Store	266
Run through the Night	253
Cat Food	237
Tiny Stars	228
Give me!!	198

Next, we’ll use a join to filter our original dataset, keeping only the top 10 songs:

top10_over_time <- fixed_tracks %>%
  select(listened_at, track_name) %>%
  inner_join(top10) %>%
  select(listened_at, track_name)
top10_over_time %>% collect()

int listened_at	chr track_name
1422810718	Pink Rabbits
1422844759	Pink Rabbits
1423278815	Pink Rabbits
1423358788	Pink Rabbits
1423447698	Pink Rabbits
1423958002	Pink Rabbits
1423972881	Pink Rabbits
1424052333	Pink Rabbits
1424143077	Pink Rabbits
1424198856	Pink Rabbits
(data trimmed…)

Finally, we’ll swap over to DuckDB and compute a cumulative sum:

cumulative_listens <- top10_over_time %>%
  select(listened_at, track_name) %>%
  # Get something to cumsum
  mutate(n = 1) %>%
  group_by(track_name) %>%
  to_duckdb() %>%
  arrange(listened_at) %>%
  mutate(listens = cumsum(n)) %>%
  to_arrow() %>%
  mutate(listened_at = cast(listened_at, timestamp("s"))) %>%
  arrange(listened_at) %>%
  collect()
cumulative_listens %>% head(10)

S3: POSIXct listened_at	chr track_name	dbl n	dbl listens
2015-01-01 12:53:58	Pink Rabbits	1	1
2015-01-01 13:46:05	Pink Rabbits	1	2
2015-01-01 19:56:49	Pink Rabbits	1	3
2015-01-02 23:41:07	Pink Rabbits	1	4
2015-01-03 11:09:45	Pink Rabbits	1	5
2015-01-03 12:52:05	Pink Rabbits	1	6
2015-01-03 18:08:59	Pink Rabbits	1	7
2015-01-04 11:16:51	Pink Rabbits	1	8
2015-01-04 11:50:30	Pink Rabbits	1	9
2015-01-05 16:17:46	Pink Rabbits	1	10

(We send the data back to Arrow to convert to timestamps since the DuckDB integration gives a type error if we do it beforehand.)

ggplot(data = cumulative_listens, aes(x = listened_at, y = listens, group = track_name, colour = track_name)) +
  ggtitle("Cumulative Listens of Top 10 Songs Over Time") +
  xlab("# of listens") +
  ylab("Song") +
  labs(colour = "Song") +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "bottom", legend.box = "vertical") +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))

Cumulative listens of the top 10 songs in my library over time, since 2014. — “Romeo and Cinderella” is by FAR the winner, though “Sunset Hill” also rocketed up pretty fast. Can you guess when the pandemic started?

We can do the same for artists.

fixed_artists <- ds %>%
  mutate(
    artist_name = case_when(
      str_detect(artist_name, fixed("doriko")) ~ "doriko",
      str_detect(artist_name, fixed("ryo")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("supercell")) ~ "ryo (supercell)",
      str_detect(artist_name, fixed("花たん")) ~ "Hanatan",
      str_detect(artist_name, fixed("梶浦由記")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("Kalafina")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("FictionJunction")) ~ "Yuki Kajiura",
      str_detect(artist_name, fixed("音阙诗听")) ~ "interestingcn",
      str_detect(artist_name, fixed("石进")) ~ "Shi Jin",
      str_detect(artist_name, fixed("光田康典")) ~ "Yasunori Mitsuda",
      str_detect(artist_name, fixed("Yasunori Mitsuda, Millenial Fair")) ~ "Yasunori Mitsuda",
      str_detect(artist_name, fixed("植松伸夫")) ~ "Nobuo Uematsu",
      str_detect(artist_name, fixed("Nobuo Uematsu / Ririko Yamabuki")) ~ "Nobuo Uematsu",
      str_detect(artist_name, fixed("スピッツ")) ~ "Spitz",
      str_detect(artist_name, fixed("天門")) ~ "TENMON",
      str_detect(artist_name, fixed("ヨルシカ")) ~ "Yorushika",
      str_detect(artist_name, fixed("Dj Okawari, Emily Styler")) ~ "DJ Okawari",
      TRUE ~ artist_name)
    )
top10_artists <- fixed_artists %>%
  group_by(artist_name) %>%
  summarize(listens = n()) %>%
  arrange(desc(listens)) %>%
  head(10) %>%
  collect()
top10_artists

chr artist_name	int listens
Yuki Kajiura	4270
The National	2970
Yasunori Mitsuda	2327
TENMON	2211
Yorushika	1800
Shi Jin	1689
Nobuo Uematsu	1587
RADWIMPS	1504
interestingcn	1488
doriko	993

top10_artists_over_time <- fixed_artists %>%
  select(listened_at, artist_name) %>%
  inner_join(top10_artists) %>%
  select(listened_at, artist_name)
cumulative_artists <- top10_artists_over_time %>%
  select(listened_at, artist_name) %>%
  mutate(n = 1) %>%
  group_by(artist_name) %>%
  to_duckdb() %>%
  arrange(listened_at) %>%
  mutate(listens = cumsum(n)) %>%
  to_arrow() %>%
  mutate(listened_at = cast(listened_at, timestamp("s"))) %>%
  arrange(listened_at) %>%
  collect()
cumulative_artists %>% head(10)

S3: POSIXct listened_at	chr artist_name	dbl n	dbl listens
2014-05-26 12:09:38	Yuki Kajiura	1	1
2014-05-26 12:23:06	Yasunori Mitsuda	1	1
2014-05-26 12:28:42	Yasunori Mitsuda	1	2
2014-05-26 12:35:01	Yuki Kajiura	1	2
2014-05-27 11:20:10	Yuki Kajiura	1	3
2014-05-27 11:30:44	Yuki Kajiura	1	4
2014-05-27 11:43:46	Yasunori Mitsuda	1	3
2014-05-27 11:55:56	Yuki Kajiura	1	5
2014-05-27 12:00:16	Yuki Kajiura	1	6
2014-05-27 12:04:23	Yuki Kajiura	1	7

ggplot(data = cumulative_artists, aes(x = listened_at, y = listens, group = artist_name, colour = artist_name)) +
  ggtitle("Cumulative Listens of Top 10 Artists Over Time") +
  xlab("# of listens") +
  ylab("Artist") +
  labs(colour = "Artist") +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "bottom", legend.box = "vertical") +
  guides(colour = guide_legend(nrow = 2, byrow = TRUE))

Cumulative listens of songs by the top 10 artists in my library over time, since 2014. — This one's less dramatic, though Yuki Kajiura has been a consistent winner from the start. It's interesting watching artists like Yorushika, who started off flat and then suddenly rocketed up.

ListenBrainz Analysis

Added 2021-12-16.

It turns out ListenBrainz does its own Year in Music, some of which is attached below. They don’t do data cleaning, presumably because what you should do is go update the ListenBrainz database and retag your library (which I should be better about doing, for sure).

ListenBrainz's top artists of the year for me.

ListenBrainz's top releases (albums) of the year for me.

ListenBrainz's top recordings (songs) of the year for me.