How many mistletoes does a successful Christmas song need?

Hearing "mistletoe" in almost every other Christmas song felt a bit bizarre. While I was exposed to some Christmas culture growing up back home in Saudi Arabia, I didn't understand the relevance of mistletoe to Christmas. I eventually learned it and, like any sane person, my first instinct was to quantify it... okay, fine: like any sane data scientist.

What this is:

I scraped the lyrics of 69 popular Christmas songs and compared how often each word occurs in them with its "average" occurrence (scraped from a playlist of the 500 greatest songs) to find the words most unique to Christmas lyrics and how often they occur.

The results:

Turns out, 'mistletoe' was only the 14th most Christmas-unique word according to this analysis; you can hover just over the 70 mark on the x-axis and find it.

The Code:

Scraping the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Raw string so the backslashes in the Windows path are not read as escape sequences
PATH = r"D:\Program Files\Selenium"
browser = webdriver.Chrome()

Fetching the Christmas Playlist

browser.get("https://www.countryliving.com/life/entertainment/g29326536/best-christmas-songs/")
songs = browser.find_elements_by_class_name("listicle-slide-hed-text")

len(songs)

69

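rows = []
for song in songs:
    s = song.text.split(" by ")
    if len(s) == 1:  # some entries read "<title> from <film>" instead of "<title> by <artist>"
        s = song.text.split(" from ")
    rows.append({'title': s[0].replace("\"", ""), 'artist': s[1]})
# Build the frame in one go (DataFrame.append was removed in pandas 2.0);
# 'lyrics' starts out empty (NaN) and is filled in later
songLyrics = pd.DataFrame(rows, columns=['title', 'artist', 'lyrics'])
songLyrics.head(5)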
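
The loop above assumes every headline contains either " by " or " from "; if neither is present, `s[1]` raises an `IndexError`. Here is a minimal sketch of a more forgiving parser. The helper name `split_title_artist` and the example headlines are my own illustrations, not part of the original notebook, and it splits on the first separator only, which differs slightly from the `split` calls above.

```python
def split_title_artist(text):
    """Split a headline like '"Title" by Artist' into (title, artist)."""
    for sep in (" by ", " from "):
        title, found, artist = text.partition(sep)
        if found:  # the separator was present
            return title.replace('"', ""), artist
    # Neither separator present: keep the title, leave the artist unknown
    return text.replace('"', ""), None


# Made-up headlines in the shape the scraping loop expects
print(split_title_artist('"Last Christmas" by Wham!'))
print(split_title_artist("Jingle Bells"))
```

Returning `None` for the artist keeps the row in the table, so the problem shows up later as a missing value rather than stopping the scrape.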

Fetching the Lyrics

browser.get("https://genius.com/")

def fillLyrics(song):
    try:
        searchBar = browser.find_element_by_name("q")
        # Titles that start with "(" have their opening parentheses stripped before searching
        if song.title[0] == "(":
            songTitle = song.title.replace("(", "")
        else:
            songTitle = song.title
        searchBar.send_keys(songTitle, " ", song.artist)
        searchBar.send_keys(Keys.RETURN)
        time.sleep(3)  # give the search results time to load
        browser.find_element_by_class_name("mini_card").click()
        text = browser.find_element_by_tag_name("p").text
        # Some regex to clean up the lyrics
        text = re.sub(r"\[.+?\]", "", text)  # ignoring [Chorus], [Verse 1], etc.
        text = re.sub(r"[,.!]", "", text)  # dropping commas, periods and exclamation marks
        text = text.replace("\n", " ")
        songLyrics.loc[songLyrics.title == song.title, "lyrics"] = text
    except Exception:
        print(song.title, " by ", song.artist, " failed")

songLyrics.apply(lambda x: fillLyrics(x), axis=1)

The Most Wonderful Day of the Year  by  Rudolph the Red-Nosed Reindeer  failed

0     None
1     None
2     None
3     None
4     None
      ...
64    None
65    None
66    None
67    None
68    None
Length: 69, dtype: object
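
The same clean-up runs again later for the 500-song playlist, so it can be pulled into one small function. This is a minimal sketch, assuming a helper name (`clean_lyrics`) that is not in the original notebook. Unlike the code above, it also collapses runs of whitespace, so no empty "words" are left behind when the text is split on spaces.

```python
import re


def clean_lyrics(text):
    """Strip section tags and basic punctuation, then flatten to one line."""
    text = re.sub(r"\[.+?\]", "", text)  # drop [Chorus], [Verse 1], ...
    text = re.sub(r"[,.!]", "", text)    # drop commas, periods, exclamation marks
    return " ".join(text.split())        # newlines and repeated spaces become single spaces


raw = "[Chorus]\nMerry Christmas, baby!\nSure do treat me nice."
print(clean_lyrics(raw))
```

The non-greedy `.+?` matters when two tags share a line: a greedy `.+` would swallow everything between the first "[" and the last "]".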

songLyrics.sample(4)

Fetching a 500-song playlist for "average" word usage

The link mentions Rolling Stone (the magazine that compiled the list, not the band); the playlist spans various artists and was the most comprehensive one I could find.

browser.get("https://genius.com/Rolling-stone-the-500-greatest-songs-of-all-time-annotated")

links = browser.find_element_by_tag_name("p").find_elements_by_tag_name("a")
# Keep the positions of links that point at songs, skipping links to artist pages
songLinks = []
for i, link in enumerate(links):
    if "artists" not in link.get_attribute("href"):
        songLinks.append(i)

curSong = 0

rows = []  # collected as a list because DataFrame.append was removed in pandas 2.0

for i in songLinks:
    try:
        # Re-fetch the links on every pass: element references go stale after navigating away
        links = browser.find_element_by_tag_name("p").find_elements_by_tag_name("a")
        songName = links[i].text
        links[i].click()
        text = browser.find_element_by_tag_name("p").text
        # Some regex to clean up the lyrics
        text = re.sub(r"\[.+?\]", "", text)  # ignoring [Chorus], [Verse 1], etc.
        text = re.sub(r"[,.!]", "", text)  # dropping commas, periods and exclamation marks
        text = text.replace("\n", " ")
        rows.append({'songName': songName, 'lyrics': text})
        browser.back()
    except Exception:
        print(i, " failed")

top500 = pd.DataFrame(rows, columns=["songName", "lyrics"])

18  failed
180  failed
241  failed
407  failed
423  failed
481  failed
505  failed
543  failed
575  failed
623  failed
643  failed
669  failed
776  failed
794  failed
843  failed
845  failed
847  failed
849  failed
851  failed
853  failed
855  failed
857  failed
859  failed
861  failed
863  failed
865  failed
867  failed
869  failed
871  failed
873  failed
875  failed
877  failed
879  failed
881  failed
883  failed
885  failed
887  failed
889  failed
891  failed

top500.sample(5)

Breaking down lyrics to build the dictionaries

wordCount = pd.DataFrame(columns = ["word", "christmasCt", "regCt", "ratio"])

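def countWords(lyrics, dictionary):
    global wordCount
    try:
        words = lyrics.split(" ")
        for curWord in words:
            curWord = curWord.lower()
            # First time we see this word: add a row with both counters at zero
            if wordCount.loc[wordCount.word == curWord].shape[0] == 0:
                newRow = pd.DataFrame([{'word': curWord, 'christmasCt': 0, 'regCt': 0}])
                # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
                wordCount = pd.concat([wordCount, newRow], ignore_index=True)
            wordCount.loc[wordCount.word == curWord, dictionary] += 1
    except Exception:
        # Reached, for example, when lyrics is NaN because the scrape for that song failed
        print("Failure")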

songLyrics.apply(lambda x: countWords(x.lyrics, "christmasCt"), axis=1) # Counting words in the christmas Playlist

Failure
Failure
Failure
Failure
Failure
Failure

0     None
1     None
2     None
3     None
4     None
      ...
64    None
65    None
66    None
67    None
68    None
Length: 69, dtype: object

top500.apply(lambda x: countWords(x.lyrics, "regCt"), axis=1) # Counting words in top 500 playlist

0      None
1      None
2      None
3      None
4      None
       ...
480    None
481    None
482    None
483    None
484    None
Length: 485, dtype: object

wordCount["christmasPerSong"] = wordCount.christmasCt / songLyrics.shape[0]  # 69 Christmas songs

wordCount["regPerSong"] = wordCount.regCt / top500.shape[0]

wordCount.loc[wordCount.word == "christmas"]

def div(a, b):
    if b == 0:  # Preventing division by zero: treat the word as appearing once across the reference songs
        b = 1 / top500.shape[0]
    return a / b

wordCount["christmasPerReg"] = wordCount.apply(lambda x: div(x.christmasPerSong, x.regPerSong), axis=1)

wordCount["christmasRtg"] = wordCount.christmasPerReg * wordCount.christmasCt

wordCount.sort_values(by="christmasRtg", ascending=False).head(15)
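
The scoring steps above can also be written as one vectorized function. This is a minimal sketch under my own naming (`score_words`, `n_christmas`, `n_regular`); the three demo rows reuse the counts reported in the results table for 'christmas', 'santa' and 'snow'.

```python
import pandas as pd


def score_words(counts, n_christmas, n_regular):
    """Per-song rates, their ratio, and the ratio weighted by the raw Christmas count."""
    out = counts.copy()
    out["christmasPerSong"] = out["christmasCt"] / n_christmas
    out["regPerSong"] = out["regCt"] / n_regular
    # Words never seen in the reference playlist get a floor of one occurrence
    # across all reference songs, which avoids dividing by zero
    floor = 1 / n_regular
    denom = out["regPerSong"].where(out["regPerSong"] > 0, floor)
    out["christmasPerReg"] = out["christmasPerSong"] / denom
    out["christmasRtg"] = out["christmasPerReg"] * out["christmasCt"]
    return out.sort_values("christmasRtg", ascending=False)


demo = pd.DataFrame({
    "word": ["christmas", "santa", "snow"],
    "christmasCt": [356, 115, 55],
    "regCt": [1, 0, 4],
})
scored = score_words(demo, n_christmas=69, n_regular=485)
print(scored[["word", "christmasPerReg", "christmasRtg"]])
```

One side effect of the floor: a word with zero reference occurrences scores the same as a word with exactly one. That is why 'turtle' (regCt 1) and 'laohu' (regCt 0) end up with identical ratings in the results table.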

wordCount.loc[wordCount.regCt == 0]

wordCount.to_csv("christmasVReg.csv")

browser.close()

Output of songLyrics.head(5):
	title	artist	lyrics
0	All I Want for Christmas Is You	Mariah Carey	NaN
1	Last Christmas	Wham!	NaN
2	I Saw Mommy Kissing Santa Claus	Jackson 5	NaN
3	Rudolph the Red-Nosed Reindeer	Harry Connick Jr.	NaN
4	It's Beginning to Look a Lot Like Christmas	Michael Bublé	NaN

Output of songLyrics.sample(4):
	title	artist	lyrics
54	Blue Christmas	Elvis Presley	I'll have a blue Christmas without you (Ooh o...
68	Merry Christmas, Baby	Otis Redding	Merry Christmas baby Sure do treat me nice Me...
41	Merry Christmas, Happy Holidays	*NSYNC	Oooh-ooh Merry Christmas Happy holidays Merry...
10	Santa Claus Is Comin' to Town	Bruce Springsteen	It's all cold down on the beach The wind's wh...

Output of top500.sample(5):
	songName	lyrics
416	Blue Suede Shoes	Well it's one for the money two for the show ...
282	Help Me	Help me I think I'm falling in love again Whe...
415	Piano Man	It's nine o'clock on a Saturday The regular...
255	Wild Thing	Wild thing you make my heart sing You make ...
482	Miss You	I've been holding out so long I've been sle...

Output of wordCount.sort_values(by="christmasRtg", ascending=False).head(15):
	word	christmasCt	regCt	ratio	christmasPerSong	regPerSong	christmasPerReg	christmasRtg
7	christmas	356	1	NaN	5.15942	0.00206186	2502.318841	890826
42	santa	115	0	NaN	1.66667	0	808.333333	92958.3
222	merry	96	0	NaN	1.3913	0	674.782609	64779.1
43	claus	72	0	NaN	1.04348	0	506.086957	36438.3
411	pum	45	0	NaN	0.652174	0	316.304348	14233.7
78	click	28	1	NaN	0.405797	0.00206186	196.811594	5510.72
58	snow	55	4	NaN	0.797101	0.00824742	96.648551	5315.67
410	ra	24	0	NaN	0.347826	0	168.695652	4048.7
2416	pear	24	0	NaN	0.347826	0	168.695652	4048.7
2415	partridge	24	0	NaN	0.347826	0	168.695652	4048.7
2418	turtle	22	1	NaN	0.318841	0.00206186	154.637681	3402.03
1023	laohu	22	0	NaN	0.318841	0	154.637681	3402.03
2805	navidad	21	0	NaN	0.304348	0	147.608696	3099.78
2804	feliz	21	0	NaN	0.304348	0	147.608696	3099.78
711	wonderland	20	0	NaN	0.289855	0	140.579710	2811.59

Output of wordCount.loc[wordCount.regCt == 0]:
	word	christmasCt	regCt	ratio	christmasPerSong	regPerSong	christmasPerReg	christmasRtg
17	presents	14	0	NaN	0.202899	0	98.405797	1377.68
40	fireplace	2	0	NaN	0.0289855	0	14.057971	28.1159
42	santa	115	0	NaN	1.66667	0	808.333333	92958.3
43	claus	72	0	NaN	1.04348	0	506.086957	36438.3
63	mistletoe	15	0	NaN	0.217391	0	105.434783	1581.52
...	...	...	...	...	...	...	...	...
2988	mistletoeing	2	0	NaN	0.0289855	0	14.057971	28.1159
2989	glowing	2	0	NaN	0.0289855	0	14.057971	28.1159
2990	(ding-dong-ding)	11	0	NaN	0.15942	0	77.318841	850.507
2995	(ding-dong-ding-dong)	2	0	NaN	0.0289855	0	14.057971	28.1159
3004	christams	1	0	NaN	0.0144928	0	7.028986	7.02899