Warmest 100
Justin Warren gave a talk at Pycon on the Warmest 100, which is an attempt at predicting the Triple J Hottest 100 by analysing social media posts.
I’ve often wondered if the listeners only vote for what’s played on the radio the most. So I set out to see if that was true. (It’ll give me a chance to try Python’s pandas library.)
Gathering the data
I needed two lists:
- The Hottest 100 for 2015.
- The Triple J playlist for 2015.
Getting the Hottest 100 results
The list was scraped from the Hottest 100 2015 web page, using BeautifulSoup.
curl 'http://www.abc.net.au/triplej/hottest100/15/countdown/' > results.html
from bs4 import BeautifulSoup
import re
import pandas
def get():
results = []
with open('results.html') as f:
for t in BeautifulSoup(f, 'html.parser').find_all('div', class_='hottest-countdown-track'):
# Extract position, artist, title
pos = t.find(class_='hottest-countdown-position').string
title = t.find(class_='title').string
artists = [t.find(class_='artist').string]
# Sanitize (pull featured artists from title)
matches = re.search(r' {Ft. ([^}]+)}', title)
if (matches):
artists += matches.group(1).split('/')
title = re.sub(r' {Ft. ([^}]+)}', '', title)
# Append to list of dicts
results.append(
{'rank': pos, 'name': title, 'artists': tuple(sorted(artists))})
# Return as a DataFrame
return pandas.DataFrame(results)
I cleaned it up a bit, such as removing {Radio edit} from one song title.
Getting the playlist for all of 2015
Justin pointed me to @triplejplays which is a feed of songs being played, but it turns out Triple J has an api which lets me enter a date range, and gives json back with plenty more data if needed.
Because the api is used for lazy-loading, it only returns 10 songs at a time. So to get the full year’s playlist took 9833 requests.
start='2015-01-22T01:00:00'
end='2016-01-22T01:00:00'
# Loop from start to end date
while [[ $(date -d $start +"%s") -lt $(date -d $end +"%s") ]]; do
echo $start >&2
# The URL and output filename
url="http://music.abcradio.net.au/api/v1/plays/search.json?station=triplej&from=$start&order=asc"
output_fn="raw/$start"
# Download the list
curl $url > $output_fn
# Get the time of the last song played
start=$(jq -r '.items[].played_time' $output_fn |tail -n1)
# Add one second so it doesn't include the last song next time
start=$(date -u -d "$start + 1 second" +"%Y-%m-%dT%H:%M:%S")
done
It pulled in 1.3G of json data. I used jq to extract just the parts we needed, (title and artist):
for i in raw/*; do echo $i; jq -r '[.items[].recording | {name: .title, artists: [.artists[].name]}]' $i > clean/$(basename $i); done
Did a bit of manual data sanitizing at this point, like replacing “DOWNTOWN” with “Downtown”. The data wasn’t great, and I didn’t spend much time cleaning it up.
Eventually we have the playlist - 16261 songs played 98330 times.
Analysis
I’ll load the hottest and warmest lists and merge them.
>>> import hottest, warmest
>>> h = hottest.get()
>>> w = warmest.get()
>>> m = h.merge(w, on='name', how='left', suffixes=('', '_plays'))[['rank_votes', 'name', 'plays', 'rank_plays']]
Now we can look at the top 10 voted songs:
>>> m.head(10)
rank_votes name plays rank_plays
0 1 Hoops 145 50.5
1 2 King Kunta 138 76.0
2 3 Lean On 171 13.5
3 4 The Less I Know The Better 153 32.0
4 5 Let It Happen 109 227.5
5 6 The Trouble With Us 115 190.5
6 7 Do You Remember 181 6.0
7 8 The Buzz 122 148.0
8 9 Can't Feel My Face 119 166.5
9 10 Magnets 139 70.0
From this it looks like there is not a strong relationship between plays and votes. A song like Let It Happen was voted 5th even though it was the 227th most played song. Maybe it was played during peak times?
On the other hand, Lean On and Do You Remember are near the top in both plays and votes.
Let’s look at the bottom 10:
>>> m.tail(10)
rank_votes name plays rank_plays
90 91 Be Your Shadow 120 159.0
91 92 No One 232 3.0
92 93 Indian Summer 110 220.0
93 94 Ghost 236 2.0
94 95 Nobody Really Cares... 124 137.5
95 96 Rumour Mill 124 137.5
96 97 Twilight Driving 121 153.0
97 98 Be Together 120 159.0
98 99 True Friends 90 335.5
99 100 Hell Boy 115 190.5
Here we see Golden Features No One and Halsey’s Ghost were second and third most played songs, but rated lowly in the vote.
“People only vote for what gets played on the radio” - this is somewhat true. There are no songs in the Hottest 100 that were not played on Triple J. But Courtney Barnett’s Depreston made the list after being played only 23 times. Big Jet Plane made it after being played only 12 times. So it’s not fair to say that people only vote for songs that were on high rotation throughout the year.