There was a time when I listened to the Top 30 charts every week. Not so much anymore. But apparently the top hits are still interesting to scientists. A Belgian team from the University of Antwerp, for instance, has analysed the last 20 years of dance hit songs using machine learning. The goal of Drs. Dorien Herremans, David Martens, and Kenneth Sörensen was to identify which musical features, if any, guarantee that a song will be a hit, and whether those features have evolved over time.
In what could serve as a textbook example of data science, the team first captured two chart listings and then cross-referenced them with data on each song from The Echo Nest, a music data platform with information on over 36 million songs. This database was also used to construct the Million Song Dataset, a gigantic dataset that comes in really handy when you need to test performance.
Eight characteristics were selected for further study, including “duration,” “tempo,” etc. First, top songs were placed on a time scale, to see whether there was any significant evolution. You can see this in the charts below.
It seems clear that the trend is toward shorter songs that are a bit more uptempo. Mostly, songs are getting louder (that ugly phenomenon is called the Loudness War), and less “danceable.” It’s not clear to me what the latter attribute is based on, as The Echo Nest has not released its formula for it. You can play around with more parameters on the authors’ website widget.
However, that is just putting data in fancy graphs, so-called “descriptive” data science. The real data science work was to split the data into a training set and a test set, fit five different models on the training set, and then use the test set to verify which model best predicted whether a dance song would become a hit (top 10) or not (in the top 40 but outside the top 10). Given the evolution over time, the team did this only for the last five years in the dataset.
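To make the labelling and splitting step a bit more concrete, here is a minimal Python sketch. It is not the team's code: the song names, the peak positions, the 80/20 split, and the helper names `label_song` and `train_test_split` are all made up for illustration. Only the hit/non-hit thresholds come from the description above.

```python
import random

def label_song(peak_position):
    """Label a song by its peak chart position.

    1 = hit (top 10), 0 = non-hit (positions 11-40),
    None = outside the top 40, so not part of the study.
    """
    if 1 <= peak_position <= 10:
        return 1
    if 11 <= peak_position <= 40:
        return 0
    return None

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical peak positions for ten made-up songs.
peaks = [1, 5, 10, 11, 25, 40, 3, 38, 17, 8]
songs = [("song_%d" % i, peak) for i, peak in enumerate(peaks)]

labelled = [(name, label_song(peak)) for name, peak in songs]
train, test = train_test_split(labelled, test_fraction=0.2)
print(len(train), "training songs,", len(test), "test songs")
```

A model would then be fitted on `train` only, and `test` would stay untouched until it is time to compare the candidate models.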
Multiple models gave good results, with logistic regression being the best. It gave 83% true positives (a predicted hit being indeed a hit) and 32% true negatives (a non-hit being indeed a non-hit). That is quite good in the data science world, where everything above 70% is currently considered “interesting.” In this case, it offers an interesting possibility to identify hits based on musical characteristics only.
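For readers wondering how such percentages are usually worked out, here is a small Python sketch. The labels and predictions are invented and have nothing to do with the study's data. It uses the common textbook definitions (true positive rate = correctly predicted hits divided by all actual hits, and likewise for non-hits), which is an assumption on my part and may not be exactly how the authors report their figures.

```python
def confusion_counts(actual, predicted):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def rates(actual, predicted):
    """Return (true positive rate, true negative rate)."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr

# Invented example: 1 = hit, 0 = non-hit.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 1, 1]

tpr, tnr = rates(actual, predicted)
print("true positive rate: %.0f%%" % (100 * tpr))
print("true negative rate: %.0f%%" % (100 * tnr))
```

In this toy example the model catches most of the hits but also flags many non-hits as hits, which is the kind of pattern a high true positive figure combined with a low true negative figure points to.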
The authors suggest ways to extend the model, such as adding lyrical analysis or broadening the data set to other musical styles. Best of all, they offer the possibility for anyone to upload audio and test the “hit potential.” The website then gives back a probability for it being (or becoming) a hit. Sooo… I just had to try it with one of the Skeptoid episodes. I uploaded episode 400, “It’s just Science,” where Brian Dunning raps (or at least tries to) about the virtues of science. The result?
Episode 400 has an 82% chance of becoming a dance hit! So dance away, skeptics!
Thanks to the Skeptoid Research mailing list for digging up the original source.