Analyzing data from POD's first survey of the iSchool

The Polling and Open Data initiative at UW (POD) sent its first poll to students pursuing a degree in Informatics and received 42 replies. I analyzed the responses on iSchool degree tracks, plans after college, and interest in grad school.

I used matplotlib for exploratory plots to look for trends; the visuals intended for publication by POD were made with Flourish.

Cleaning the data

In [398]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [399]:
poll = pd.read_excel("poll1.xlsx")
poll.head(8)
Out[399]:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Instruction Text Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 18 Unnamed: 19 Unnamed: 20 On a scale from 1 to 5, 1 being not important and 5 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN being most important, rate the following facto... NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN on their impact on your answer to the previous... NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN Question What is your graduation year? What gender do you identify with? Are you a transfer student? Which Informatics degree option are you pursuing? Are you double majoring? (If yes, please specify) Are you minoring? (If yes, please specify) ... If you answered “work in information technol... If you answered “pursue a Master's degree”... Favorite Informatics class? (format: INFO XXX) Professor/quality of teaching Amount of work Interesting peers Relevance to job and career opportunities Interest in course content Class length and schedule Check this box if you'd like to receive a one-...
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... question #13, which would you rather work at? "pursue a PhD" to question #13, what is your t... NaN NaN NaN NaN NaN NaN NaN with pictures of Rachel Kinkley's dog as our t...
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN graduate degree? NaN NaN NaN NaN NaN NaN NaN for completing this poll!
6 Participant ID Start Date Finish Date NaN GradYear Gender transfer track doubleMajor minor ... privacy gradSchool favClass Professor/quality of teaching Amount of work Interesting peers Relevance to job and career opportunities Interest in course content Class length and schedule NaN
7 20219549 2020-11-21 21:07:00 2020-11-21 21:10:00 NaN 2022 Female No, I started at UW Data Science No No ... No preference Not applicable INFO 340 3 3 2 4 5 3 NaN

8 rows × 28 columns

In [437]:
# Rows 0-5 hold the question text and row 6 the short column names; responses start at row 7.
# .copy() makes table an independent DataFrame, so the assignments below
# modify it directly instead of a slice of poll (avoids SettingWithCopyWarning).
table = poll[7:].copy()
table.columns = poll.iloc[6]
table.loc[:, "total"] = np.ones(table.shape[0])  # helper column of ones for counting rows
table.loc[:, "researchInt"] = 0
table.loc[table.research == "Yes", "researchInt"] = 1  # 1 = has research experience
table.head()
Out[437]:
6 Participant ID Start Date Finish Date NaN GradYear Gender transfer track doubleMajor minor ... favClass Professor/quality of teaching Amount of work Interesting peers Relevance to job and career opportunities Interest in course content Class length and schedule NaN total researchInt
7 20219549 2020-11-21 21:07:00 2020-11-21 21:10:00 NaN 2022 Female No, I started at UW Data Science No No ... INFO 340 3 3 2 4 5 3 NaN 1.0 0
8 20218830 2020-11-21 01:12:00 2020-11-21 01:15:00 NaN 2023 Male No, I started at UW Data Science No Undecided ... INFO 201 3 1 2 4 5 1 NaN 1.0 0
9 20218718 2020-11-20 22:14:00 2020-11-20 22:20:00 NaN 2021 Male No, I started at UW Custom No No ... INFO 441 3 3 4 5 4 3 999 1.0 0
10 20218435 2020-11-20 18:12:00 2020-11-20 18:14:00 NaN 2021 Male No, I started at UW Custom No No ... INFO 441 3 2 4 5 5 2 NaN 1.0 0
11 20217086 2020-11-20 10:44:00 2020-11-20 11:21:00 NaN 2021 Male Yes, I am a transfer student (transferred from... Information Assurance and Cybersecurity Yes: International Studies No ... INFO450 5 4 2 4 4 4 NaN 1.0 1

5 rows × 30 columns
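
The header promotion in the cell above can be tried on a small stand-in frame. This is only a sketch: the toy rows, the `header_row` position, and the name `clean` are made up for illustration and are not part of the poll data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw sheet (hypothetical values):
# junk rows, then a row of short column names, then the responses.
raw = pd.DataFrame([
    [np.nan, "Question", "What is your graduation year?"],
    [np.nan, np.nan, np.nan],
    ["Participant ID", "research", "GradYear"],
    [101, "Yes", 2021],
    [102, "No", 2023],
])

header_row = 2  # position of the row holding the short column names

# .copy() gives an independent frame, so later assignments change `clean`
# itself rather than a slice of `raw`.
clean = raw.iloc[header_row + 1:].copy()
clean.columns = raw.iloc[header_row].tolist()  # plain list, so no leftover columns name
clean = clean.reset_index(drop=True)

# 1 if the respondent answered "Yes" to the research question, else 0.
clean["researchInt"] = (clean["research"] == "Yes").astype(int)
print(clean)
```

Passing the names as a plain list also avoids the stray `6` label that shows up in the corner of the later outputs, which is the name of the row used as the header.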

In [424]:
# Adding separate 0/1 indicator variables for each option under "plans"
table.loc[:, "workInTech"] = table.plans.str.contains("Work in information technology").astype(int)
table.loc[:, "nonprofit"] = table.plans.str.contains("Work in a nonprofit organization").astype(int)
table.loc[:, "travel"] = table.plans.str.contains("Travel abroad").astype(int)
table.loc[:, "educate"] = table.plans.str.contains("Work in education").astype(int)
table.loc[:, "unrelated"] = table.plans.str.contains("Work in a field unrelated to informatics").astype(int)
table.loc[:, "masters"] = table.plans.str.contains("Pursue a Master").astype(int)
table.loc[:, "phd"] = table.plans.str.contains("Pursue a PhD").astype(int)
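
The same indicator columns can be built in a loop over a mapping from column name to answer text. The sketch below runs on made-up rows; the comma-separated format of `plans` and the handling of blank answers are assumptions on my part, not something the poll export confirms.

```python
import pandas as pd

# Hypothetical multi-select answers; None stands for a skipped question.
toy = pd.DataFrame({"plans": [
    "Work in information technology, Travel abroad",
    "Pursue a Master's degree",
    None,
]})

# Column name -> phrase to look for in the answer text.
options = {
    "workInTech": "Work in information technology",
    "travel": "Travel abroad",
    "masters": "Pursue a Master",
}

for column, phrase in options.items():
    # regex=False matches the phrase literally; na=False counts a blank answer as "not selected".
    toy[column] = toy["plans"].str.contains(phrase, regex=False, na=False).astype(int)

print(toy[list(options)])
```

Keeping the phrases in one dictionary makes it easier to see which answer text feeds which column.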

Graduation Year vs GradSchool Interest

In [402]:
table.masters.sum()
Out[402]:
19
In [403]:
table.phd.sum()
Out[403]:
3
In [404]:
# Nobody selected PhD without also selecting Master's, so "masters" can stand in for interest in grad school
table.loc[(table.phd == 1) & (table.masters == 0)]
Out[404]:
6 Participant ID Start Date Finish Date NaN GradYear Gender transfer track doubleMajor minor ... NaN total researchInt workInTech nonprofit travel educate unrelated masters phd

0 rows × 37 columns

In [405]:
mastersInterest = table.groupby("GradYear").masters.sum()
_ = plt.bar(mastersInterest.index.astype(str), mastersInterest)
In [406]:
table.groupby("GradYear").researchInt.sum()
Out[406]:
GradYear
2021    13
2022     2
2023     0
2024     0
Name: researchInt, dtype: int64
In [407]:
table.loc[table.track == "Custom"].groupby("GradYear").total.sum()
Out[407]:
GradYear
2021    11.0
2022     1.0
2023     3.0
2024     2.0
Name: total, dtype: float64

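As expected, upperclassmen report more research experience (13 of the 15 students with research experience graduate in 2021) and more interest in grad school than underclassmen. These are raw counts rather than rates, so they also reflect how many students from each class responded.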
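
Since the per-year numbers above are counts, one way to compare classes of different sizes is to look at the share of respondents in each class instead. A minimal sketch on made-up rows (the values below are not the poll data):

```python
import pandas as pd

# Hypothetical responses: one row per student, masters = 1 if interested.
toy = pd.DataFrame({
    "GradYear": [2021, 2021, 2021, 2021, 2022, 2022],
    "masters":  [1, 1, 0, 0, 1, 0],
})

# For a 0/1 column, sum is the count of interested students and
# mean is the share of respondents in that class who are interested.
by_year = toy.groupby("GradYear")["masters"].agg(
    interested="sum", respondents="size", share="mean"
)
print(by_year)
```

With only a handful of respondents in some classes, the shares would still be noisy, so they are best read alongside the respondent counts.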

Research Areas

In [408]:
researchers = table.loc[table.researchInt == 1]
In [409]:
researchAreas = researchers.groupby(researchers.researchField).total.sum()
pieLabels = researchAreas.index
_ = plt.pie(researchAreas, labels=pieLabels)
In [410]:
print(researchers.shape[0], "total students with research experience")
15 total students with research experience
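
The per-field counts can also be taken directly with `value_counts`, without the helper column of ones. A small sketch on made-up values (the field names are placeholders, not poll responses):

```python
import pandas as pd

# Hypothetical research fields; None stands for a blank answer.
fields = pd.Series(
    ["Field A", "Field B", "Field A", None, "Field A"],
    name="researchField",
)

# value_counts() counts each distinct value, skips missing answers by default,
# and sorts from most to least common.
counts = fields.value_counts()
print(counts)
```

The resulting Series can be passed to `plt.pie` with `labels=counts.index` in the same way as the grouped totals.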

Custom-track students

In [411]:
customTracked = researchers.loc[researchers.track == "Custom"]
researchAreasC = customTracked.groupby(customTracked.researchField).total.sum()
pieLabelsC = researchAreasC.index
_ = plt.pie(researchAreasC, labels=pieLabelsC)
In [412]:
researchAreasC
Out[412]:
researchField
Human-Computer Interaction    5.0
Information Architecture      1.0
Name: total, dtype: float64

Research vs Grad School Plans

In [422]:
# Select the two numeric columns before summing so text columns are never aggregated
researchVGrad = table.groupby("researchInt")[["total", "masters"]].sum()
researchVGrad
Out[422]:
6            total  masters
researchInt
0             27.0        8
1             15.0       11

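Interest in pursuing grad school appears to be correlated with research experience: 11 of the 15 students with research experience plan to pursue a Master's, compared with 8 of the 27 without. This does not necessarily imply causation in either direction.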
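
To put the two groups on the same scale, the counts in the table above can be turned into shares. The sketch below re-enters those four numbers by hand; the column name `mastersShare` is my own.

```python
import pandas as pd

# Counts copied from the researchVGrad output above.
research_v_grad = pd.DataFrame(
    {"total": [27, 15], "masters": [8, 11]},
    index=pd.Index([0, 1], name="researchInt"),
)

# Share of each group that plans to pursue a Master's degree.
research_v_grad["mastersShare"] = research_v_grad["masters"] / research_v_grad["total"]
print(research_v_grad.round(3))
# mastersShare is about 0.296 without research experience and 0.733 with it
```

With 42 responses in total, this difference is suggestive rather than conclusive.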

Plans after the iSchool

In [426]:
planSums = table[["workInTech", "nonprofit", "travel", "educate", "unrelated", "masters", "phd", "researchInt"]].sum()
planSums
Out[426]:
6
workInTech     39
nonprofit       8
travel         16
educate         8
unrelated       7
masters        19
phd             3
researchInt    15
dtype: int64
In [427]:
trackSums = table.groupby("track").total.sum()
trackSums
Out[427]:
track
Biomedical & Health Informatics             1.0
Custom                                     17.0
Data Science                                8.0
Human-Computer Interaction                  7.0
Information Assurance and Cybersecurity     6.0
Undecided                                   3.0
Name: total, dtype: float64
In [433]:
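# numeric_only=True keeps only the numeric count columns and skips the text columns
summaryTable = table.groupby("track").sum(numeric_only=True).T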
summaryTable["sums"] = planSums
summaryTable
Out[433]:
track        Biomedical & Health Informatics  Custom  Data Science  Human-Computer Interaction  Information Assurance and Cybersecurity  Undecided  sums
6
total                                    1.0    17.0           8.0                         7.0                                      6.0        3.0   NaN
researchInt                              0.0     6.0           2.0                         3.0                                      3.0        1.0  15.0
workInTech                               1.0    17.0           7.0                         6.0                                      6.0        2.0  39.0
nonprofit                                1.0     4.0           2.0                         0.0                                      1.0        0.0   8.0
travel                                   1.0     7.0           3.0                         1.0                                      2.0        2.0  16.0
educate                                  0.0     4.0           1.0                         2.0                                      0.0        1.0   8.0
unrelated                                1.0     2.0           1.0                         2.0                                      1.0        0.0   7.0
masters                                  1.0     6.0           4.0                         2.0                                      4.0        2.0  19.0
phd                                      0.0     1.0           1.0                         0.0                                      1.0        0.0   3.0
In [434]:
tracks = summaryTable.T
tracks["sums"] = trackSums
tracks
Out[434]:
6                                        total  researchInt  workInTech  nonprofit  travel  educate  unrelated  masters  phd  sums
track
Biomedical & Health Informatics            1.0          0.0         1.0        1.0     1.0      0.0        1.0      1.0  0.0   1.0
Custom                                    17.0          6.0        17.0        4.0     7.0      4.0        2.0      6.0  1.0  17.0
Data Science                               8.0          2.0         7.0        2.0     3.0      1.0        1.0      4.0  1.0   8.0
Human-Computer Interaction                 7.0          3.0         6.0        0.0     1.0      2.0        2.0      2.0  0.0   7.0
Information Assurance and Cybersecurity    6.0          3.0         6.0        1.0     2.0      0.0        1.0      4.0  1.0   6.0
Undecided                                  3.0          1.0         2.0        0.0     2.0      1.0        0.0      2.0  0.0   3.0
sums                                       NaN         15.0        39.0        8.0    16.0      8.0        7.0     19.0  3.0   NaN
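
A per-track table with a totals row can also be built in one pass, which avoids the NaN cells that appear when the two sets of sums are attached separately. This is a sketch on made-up rows; the names `per_track`, `respondents`, and `All tracks` are my own choices.

```python
import pandas as pd

# Hypothetical responses: one row per student with 0/1 plan indicators.
toy = pd.DataFrame({
    "track": ["Custom", "Custom", "Data Science"],
    "workInTech": [1, 1, 1],
    "masters": [0, 1, 1],
})

# Count selections per track, then add the number of respondents per track.
per_track = toy.groupby("track")[["workInTech", "masters"]].sum()
per_track["respondents"] = toy.groupby("track").size()

# Append a final row holding the column totals across all tracks.
per_track.loc["All tracks"] = per_track.sum()
print(per_track)
```

Because the totals row is computed from the finished table, every column gets a total, including the respondent count.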
In [436]:
table.to_csv("table.csv")
summaryTable.to_csv("trackPlansSummaryTable.csv")
tracks.to_csv("tracksSummary.csv")