How many mistletoes does a successful Christmas song need?

Hearing "mistletoe" in almost every other Christmas song felt a bit bizarre. While I was exposed to some Christmas culture growing up back home in Saudi Arabia, I didn't understand the relevance of mistletoe to Christmas. I eventually learned it and, like any sane person, my first instinct was to quantify it... okay, fine: like any sane data scientist.

What this is:

I scraped the lyrics of 69 popular Christmas songs and compared how often each word occurs in them to its "average" occurrence (scraped from a playlist of the 500 greatest songs) to find the words most unique to Christmas lyrics and how often they occur.

The results:

Turns out, 'mistletoe' was only the 14th most Christmas-unique word according to this analysis; you can hover just over the 70 mark on the x-axis and find it.

The Code:

Scraping the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
PATH = r"D:\Program Files\Selenium"  # raw string so the backslashes are not read as escape sequences
browser = webdriver.Chrome()

Fetching the Christmas Playlist

In [2]:
browser.get("https://www.countryliving.com/life/entertainment/g29326536/best-christmas-songs/")
songs = browser.find_elements_by_class_name("listicle-slide-hed-text")
In [3]:
len(songs)
Out[3]:
69
In [4]:
songLyrics = pd.DataFrame(columns=['title', 'artist', 'lyrics'])
for song in songs:
    s = song.text.split(" by ")
    if(len(s) == 1):
        s = song.text.split(" from ")
    songLyrics = songLyrics.append({'title': s[0].replace("\"", ""), 'artist': s[1]}, ignore_index=True)
songLyrics.head(5)
Out[4]:
title artist lyrics
0 All I Want for Christmas Is You Mariah Carey NaN
1 Last Christmas Wham! NaN
2 I Saw Mommy Kissing Santa Claus Jackson 5 NaN
3 Rudolph the Red-Nosed Reindeer Harry Connick Jr. NaN
4 It's Beginning to Look a Lot Like Christmas Michael Bublé NaN
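
The title/artist split above can be sketched without a browser. This is a minimal illustration, assuming headlines look like `"Title" by Artist` or `"Title" from Movie`; `parse_headline` and the sample strings are hypothetical and not part of the notebook. Unlike the loop above, it returns `None` for the artist instead of raising when neither separator is present.

```python
def parse_headline(text):
    """Split a headline into (title, artist); artist is None if no separator is found."""
    for sep in (" by ", " from "):
        # partition splits on the first occurrence, like split(sep)[0] and [1]
        title, found, artist = text.partition(sep)
        if found:
            return title.strip().strip('"'), artist.strip()
    return text.strip().strip('"'), None


# Illustrative inputs in the assumed headline format
print(parse_headline('"Last Christmas" by Wham!'))
print(parse_headline('"The Most Wonderful Day of the Year" from Rudolph the Red-Nosed Reindeer'))
print(parse_headline("Untitled"))
```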

Fetching the Lyrics

In [5]:
browser.get("https://genius.com/")
In [7]:
def fillLyrics(song):
    try:
        searchBar = browser.find_element_by_name("q")
        if(song.title[0] == "("):
            songTitle = song.title.replace("(", "")
        else:
            songTitle = song.title
        searchBar.send_keys(songTitle, " ", song.artist)
        searchBar.send_keys(Keys.RETURN)
        time.sleep(3)
        browser.find_element_by_class_name("mini_card").click()
        text = browser.find_element_by_tag_name("p").text
        # Some regex to clean up the lyrics
        text = re.sub(r"\[.+?\]", "", text)  # ignoring [Chorus], [Verse 1], etc.
        text = re.sub(r"[,.!]", "", text)  # stripping commas, periods and exclamation marks
        text = text.replace("\n", " ")
        songLyrics.loc[songLyrics.title == song.title, "lyrics"] = text
    except Exception:
        print(song.title, " by ", song.artist, " failed")
In [18]:
songLyrics.apply(lambda x: fillLyrics(x), axis=1)
The Most Wonderful Day of the Year  by  Rudolph the Red-Nosed Reindeer  failed
Out[18]:
0     None
1     None
2     None
3     None
4     None
      ...
64    None
65    None
66    None
67    None
68    None
Length: 69, dtype: object
In [116]:
songLyrics.sample(4)
Out[116]:
title artist lyrics
54 Blue Christmas Elvis Presley I'll have a blue Christmas without you (Ooh o...
68 Merry Christmas, Baby Otis Redding Merry Christmas baby Sure do treat me nice Me...
41 Merry Christmas, Happy Holidays *NSYNC Oooh-ooh Merry Christmas Happy holidays Merry...
10 Santa Claus Is Comin' to Town Bruce Springsteen It's all cold down on the beach The wind's wh...
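
The cleanup step can be tried on its own, away from the scraper. The sketch below is an assumption-laden variant rather than the notebook's exact cleaning: `clean_lyrics` is a hypothetical helper, and it also strips parentheses and question marks, collapses whitespace and lowercases, which the cells above do not do (so tokens like "(ding-dong-ding)" would come out differently).

```python
import re

SECTION_TAG = re.compile(r"\[[^\]]*\]")   # [Chorus], [Verse 1], ...
PUNCTUATION = re.compile(r"[,.!?()]")     # wider than the notebook's set (assumption)


def clean_lyrics(raw):
    """Remove section tags and punctuation, then normalise whitespace and case."""
    text = SECTION_TAG.sub("", raw)
    text = PUNCTUATION.sub("", text)
    # split() with no argument drops newlines and repeated spaces in one go
    return " ".join(text.split()).lower()


# Made-up sample in the shape of a scraped lyric block
sample = "[Verse 1]\nI'll have a blue Christmas, without you!\n(Ooh, ooh)"
print(clean_lyrics(sample))
```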

Fetching a 500-song playlist for "average" word usage

The link mentions Rolling Stone (the magazine), but the list covers various artists and is the most comprehensive playlist I could find.

In [33]:
browser.get("https://genius.com/Rolling-stone-the-500-greatest-songs-of-all-time-annotated")
In [43]:
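links = browser.find_element_by_tag_name("p").find_elements_by_tag_name("a")
# Store indices rather than elements: the elements go stale once we navigate away
songLinks = []
for i, link in enumerate(links):
    if "artists" not in link.get_attribute("href"):  # skip links to artist pages
        songLinks.append(i)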

curSong = 0
In [96]:
top500 = pd.DataFrame(columns = ["songName", "lyrics"])
In [98]:
for i in songLinks:
    try:
        links = browser.find_element_by_tag_name("p").find_elements_by_tag_name("a")
        songName = links[i].text
        links[i].click()
        text = browser.find_element_by_tag_name("p").text
        # Some regex to clean up the lyrics
        text = re.sub(r"\[.+?\]", "", text)  # ignoring [Chorus], [Verse 1], etc.
        text = re.sub(r"[,.!]", "", text)  # stripping commas, periods and exclamation marks
        text = text.replace("\n", " ")
        top500 = top500.append({'songName': songName, 'lyrics': text}, ignore_index=True)
        browser.back()
    except Exception:
        print(i, " failed")
18  failed
180  failed
241  failed
407  failed
423  failed
481  failed
505  failed
543  failed
575  failed
623  failed
643  failed
669  failed
776  failed
794  failed
843  failed
845  failed
847  failed
849  failed
851  failed
853  failed
855  failed
857  failed
859  failed
861  failed
863  failed
865  failed
867  failed
869  failed
871  failed
873  failed
875  failed
877  failed
879  failed
881  failed
883  failed
885  failed
887  failed
889  failed
891  failed
In [117]:
top500.sample(5)
Out[117]:
songName lyrics
416 Blue Suede Shoes Well it's one for the money two for the show ...
282 Help Me Help me I think I'm falling in love again Whe...
415 Piano Man It's nine o'clock on a Saturday The regular...
255 Wild Thing Wild thing you make my heart sing You make ...
482 Miss You I've been holding out so long I've been sle...
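
One possible way to cut down on failures like those listed above is a stricter link filter before clicking anything. The sketch below is only an illustration: `is_song_link` is a hypothetical helper, the example URLs are made up, and the "-lyrics" suffix is an assumption about how Genius song pages are named, not something the notebook checks.

```python
from urllib.parse import urlparse


def is_song_link(href):
    """Return True for hrefs that look like lyric pages (assumed to end in '-lyrics')."""
    if not href:  # anchors without an href come back as None
        return False
    path = urlparse(href).path
    return "/artists/" not in path and path.endswith("-lyrics")


# Illustrative hrefs only
print(is_song_link("https://genius.com/Bob-dylan-like-a-rolling-stone-lyrics"))
print(is_song_link("https://genius.com/artists/Bob-dylan"))
print(is_song_link(None))

# With the notebook's `links` list this could be used as:
# songLinks = [i for i, a in enumerate(links) if is_song_link(a.get_attribute("href"))]
```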

Breaking down lyrics to build the dictionaries

In [163]:
wordCount = pd.DataFrame(columns = ["word", "christmasCt", "regCt", "ratio"])
In [164]:
def countWords(lyrics, dictionary):
    global wordCount
    try:
        words = lyrics.split(" ")  # raises if lyrics is not a string, e.g. NaN for a song that was never filled in
        for curWord in words:
            curWord = curWord.lower()
            # Add a zero-count row the first time a word is seen
            if (wordCount.loc[wordCount.word == curWord].shape[0] == 0):
                wordCount = wordCount.append({'word': curWord, 'christmasCt': 0, 'regCt': 0}, ignore_index=True)
            wordCount.loc[wordCount.word == curWord, dictionary] += 1
    except Exception:
        print("Failure")
In [353]:
songLyrics.apply(lambda x: countWords(x.lyrics, "christmasCt"), axis=1) # Counting words in the christmas Playlist
Failure
Failure
Failure
Failure
Failure
Failure
Out[353]:
0     None
1     None
2     None
3     None
4     None
      ...
64    None
65    None
66    None
67    None
68    None
Length: 69, dtype: object
In [173]:
top500.apply(lambda x: countWords(x.lyrics, "regCt"), axis=1) # Counting words in top 500 playlist
Out[173]:
0      None
1      None
2      None
3      None
4      None
       ...
480    None
481    None
482    None
483    None
484    None
Length: 485, dtype: object
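
Appending one DataFrame row per new word gets slow as the vocabulary grows. As a hedged alternative, the same tallies can be built with `collections.Counter`; `count_words` and the toy lyrics below are hypothetical and not part of the notebook. Non-string entries (such as NaN for songs with no lyrics) are skipped instead of printing "Failure".

```python
from collections import Counter


def count_words(lyrics_list):
    """Count lowercase word occurrences across a list of lyric strings."""
    counts = Counter()
    for lyrics in lyrics_list:
        if not isinstance(lyrics, str):  # e.g. NaN where no lyrics were stored
            continue
        counts.update(lyrics.lower().split())
    return counts


# Toy input: two songs with lyrics and one missing entry
christmas = count_words(["Merry Christmas baby", float("nan"), "Blue Christmas"])
print(christmas.most_common(3))
```

A Counter returns 0 for words it has never seen, so looking up a word that only occurs in the other playlist needs no special casing.
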
In [346]:
wordCount["christmasPerSong"] = wordCount.christmasCt / songLyrics.shape[0]  # 69 songs
In [347]:
wordCount["regPerSong"] = wordCount.regCt / top500.shape[0]
In [338]:
wordCount.loc[wordCount.word == "christmas"]
Out[338]:
word christmasCt regCt ratio christmasPerSong regPerSong christmasPerReg christmasRtg
7 christmas 356 1 NaN 3.59596 0.00206186 1744.040404 620878
In [355]:
def div(a, b):
    if(b == 0): # Preventing division by zero: treat the word as if it occurred once in the regular playlist
        b = 1 / top500.shape[0]  # 1/485
    return a / b
In [349]:
wordCount["christmasPerReg"] = wordCount.apply(lambda x: div(x.christmasPerSong, x.regPerSong), axis=1)
In [350]:
wordCount["christmasRtg"] = wordCount.christmasPerReg * wordCount.christmasCt
In [351]:
wordCount.sort_values(by="christmasRtg", ascending=False).head(15)
Out[351]:
word christmasCt regCt ratio christmasPerSong regPerSong christmasPerReg christmasRtg
7 christmas 356 1 NaN 5.15942 0.00206186 2502.318841 890826
42 santa 115 0 NaN 1.66667 0 808.333333 92958.3
222 merry 96 0 NaN 1.3913 0 674.782609 64779.1
43 claus 72 0 NaN 1.04348 0 506.086957 36438.3
411 pum 45 0 NaN 0.652174 0 316.304348 14233.7
78 click 28 1 NaN 0.405797 0.00206186 196.811594 5510.72
58 snow 55 4 NaN 0.797101 0.00824742 96.648551 5315.67
410 ra 24 0 NaN 0.347826 0 168.695652 4048.7
2416 pear 24 0 NaN 0.347826 0 168.695652 4048.7
2415 partridge 24 0 NaN 0.347826 0 168.695652 4048.7
2418 turtle 22 1 NaN 0.318841 0.00206186 154.637681 3402.03
1023 laohu 22 0 NaN 0.318841 0 154.637681 3402.03
2805 navidad 21 0 NaN 0.304348 0 147.608696 3099.78
2804 feliz 21 0 NaN 0.304348 0 147.608696 3099.78
711 wonderland 20 0 NaN 0.289855 0 140.579710 2811.59
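
The ranking above can be reproduced by hand for a single word. The sketch below restates the arithmetic from the cells as one function; `christmas_rating` is a hypothetical name, and the playlist sizes (69 and 485) and the counts for 'santa' and 'snow' are taken from the outputs above.

```python
def christmas_rating(christmas_ct, reg_ct, n_christmas=69, n_regular=485):
    """Return (christmasPerReg, christmasRtg) for one word."""
    per_song_christmas = christmas_ct / n_christmas
    per_song_regular = reg_ct / n_regular
    if per_song_regular == 0:
        # Same fallback as div(): pretend the word occurred once in the regular playlist
        per_song_regular = 1 / n_regular
    ratio = per_song_christmas / per_song_regular
    return ratio, ratio * christmas_ct


santa_ratio, santa_rating = christmas_rating(115, 0)
snow_ratio, snow_rating = christmas_rating(55, 4)
print(round(santa_ratio, 2), round(santa_rating, 1))
print(round(snow_ratio, 2), round(snow_rating, 2))
```

One consequence of the zero fallback is that a word with regCt 0 and a word with regCt 1 get the same ratio for the same Christmas count, which is why 'turtle' and 'laohu' share a score in the table.
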
In [352]:
wordCount.loc[wordCount.regCt == 0]
Out[352]:
word christmasCt regCt ratio christmasPerSong regPerSong christmasPerReg christmasRtg
17 presents 14 0 NaN 0.202899 0 98.405797 1377.68
40 fireplace 2 0 NaN 0.0289855 0 14.057971 28.1159
42 santa 115 0 NaN 1.66667 0 808.333333 92958.3
43 claus 72 0 NaN 1.04348 0 506.086957 36438.3
63 mistletoe 15 0 NaN 0.217391 0 105.434783 1581.52
... ... ... ... ... ... ... ... ...
2988 mistletoeing 2 0 NaN 0.0289855 0 14.057971 28.1159
2989 glowing 2 0 NaN 0.0289855 0 14.057971 28.1159
2990 (ding-dong-ding) 11 0 NaN 0.15942 0 77.318841 850.507
2995 (ding-dong-ding-dong) 2 0 NaN 0.0289855 0 14.057971 28.1159
3004 christams 1 0 NaN 0.0144928 0 7.028986 7.02899

1199 rows × 8 columns

In [337]:
wordCount.to_csv("christmasVReg.csv")
In [18]:
browser.close()