Let's start by importing the libraries

In [1]:
# Import the NumPy and pandas libraries
import numpy as np
import pandas as pd
In [2]:
# Import the data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Read the dataset

In [3]:
result = pd.read_csv("StudentsPerformance.csv")

Let's start exploring the data

In [4]:
result.head()
Out[4]:
   gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score
0  female         group B            bachelor's degree      standard                     none          72             72             74
1  female         group C                 some college      standard                completed          69             90             88
2  female         group B              master's degree      standard                     none          90             95             93
3    male         group A           associate's degree  free/reduced                     none          47             57             44
4    male         group C                 some college      standard                     none          76             78             75

Let's convert the column names to upper case for convenience

In [5]:
result.columns = result.columns.str.upper()
result.head()
Out[5]:
   GENDER  RACE/ETHNICITY  PARENTAL LEVEL OF EDUCATION         LUNCH  TEST PREPARATION COURSE  MATH SCORE  READING SCORE  WRITING SCORE
0  female         group B            bachelor's degree      standard                     none          72             72             74
1  female         group C                 some college      standard                completed          69             90             88
2  female         group B              master's degree      standard                     none          90             95             93
3    male         group A           associate's degree  free/reduced                     none          47             57             44
4    male         group C                 some college      standard                     none          76             78             75

Checking the type of data in each column

In [6]:
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   GENDER                       1000 non-null   object
 1   RACE/ETHNICITY               1000 non-null   object
 2   PARENTAL LEVEL OF EDUCATION  1000 non-null   object
 3   LUNCH                        1000 non-null   object
 4   TEST PREPARATION COURSE      1000 non-null   object
 5   MATH SCORE                   1000 non-null   int64 
 6   READING SCORE                1000 non-null   int64 
 7   WRITING SCORE                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

Let's check for null values in each column

In [7]:
result.isna().sum()
Out[7]:
GENDER                         0
RACE/ETHNICITY                 0
PARENTAL LEVEL OF EDUCATION    0
LUNCH                          0
TEST PREPARATION COURSE        0
MATH SCORE                     0
READING SCORE                  0
WRITING SCORE                  0
dtype: int64

Statistical Analysis on the data

In [8]:
result.describe()
Out[8]:
       MATH SCORE  READING SCORE  WRITING SCORE
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.195657
min       0.00000      17.000000      10.000000
25%      57.00000      59.000000      57.750000
50%      66.00000      70.000000      69.000000
75%      77.00000      79.000000      79.000000
max     100.00000     100.000000     100.000000

The lowest score is 0 in Math, 17 in Reading, and 10 in Writing!

The highest score in all three subjects is 100!
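
To see which students the minimum values belong to, rather than only the values themselves, `idxmin()` can be used to look up the matching rows. The sketch below is a minimal illustration on a small sample built from the five rows shown by `result.head()`; the names `sample`, `score_cols`, `lowest_idx` and `lowest_math_row` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative sample: the five rows shown by result.head() above.
sample = pd.DataFrame({
    "GENDER": ["female", "female", "female", "male", "male"],
    "MATH SCORE": [72, 69, 90, 47, 76],
    "READING SCORE": [72, 90, 95, 57, 78],
    "WRITING SCORE": [74, 88, 93, 44, 75],
})

score_cols = ["MATH SCORE", "READING SCORE", "WRITING SCORE"]

# idxmin() returns, for each column, the index label of its first minimum.
lowest_idx = sample[score_cols].idxmin()

# Pull the full row of the student with the lowest math score.
lowest_math_row = sample.loc[lowest_idx["MATH SCORE"]]

print(lowest_idx)
print(lowest_math_row)
```

On the full dataset the same two calls would be applied to `result` instead of `sample`.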

Let's analyse the students' marks in Math, Reading and Writing

In [9]:
sns.pairplot(result, hue = 'GENDER', palette = 'coolwarm')
Out[9]:
<seaborn.axisgrid.PairGrid at 0x20eaa10aac8>

Correlation heatmap for the scores

In [10]:
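# Correlation matrix of the numeric score columns
# (the text columns are excluded explicitly; recent pandas versions raise an error otherwise)
result.select_dtypes('number').corr()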
Out[10]:
               MATH SCORE  READING SCORE  WRITING SCORE
MATH SCORE       1.000000       0.817580       0.802642
READING SCORE    0.817580       1.000000       0.954598
WRITING SCORE    0.802642       0.954598       1.000000
In [11]:
sns.heatmap(result.select_dtypes('number').corr(), cmap = 'PuRd', annot = True)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eaad2d288>

We can see that the Reading and Writing scores are highly correlated (0.95), while the Writing and Math scores have the weakest correlation of the three pairs (0.80)!
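
The ranking of the pairs can also be read off programmatically instead of by eye. The sketch below rebuilds the correlation matrix from the values printed in Out[10], keeps each pair once by masking everything except the upper triangle, and sorts the result. The names `corr`, `mask` and `pairs` are illustrative.

```python
import numpy as np
import pandas as pd

cols = ["MATH SCORE", "READING SCORE", "WRITING SCORE"]

# Correlation values copied from the Out[10] table above.
corr = pd.DataFrame(
    [[1.000000, 0.817580, 0.802642],
     [0.817580, 1.000000, 0.954598],
     [0.802642, 0.954598, 1.000000]],
    index=cols, columns=cols,
)

# Upper triangle without the diagonal (k=1), so each pair appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; drop them and sort from strongest to weakest.
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```

In the notebook itself, `corr` would simply be the correlation matrix computed from `result`.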

Let's add an "AVERAGE" column to the dataset

In [12]:
result["AVERAGE"] = (result["MATH SCORE"] + result["READING SCORE"] + result["WRITING SCORE"])/3
In [13]:
result.head()
Out[13]:
   GENDER  RACE/ETHNICITY  PARENTAL LEVEL OF EDUCATION         LUNCH  TEST PREPARATION COURSE  MATH SCORE  READING SCORE  WRITING SCORE    AVERAGE
0  female         group B            bachelor's degree      standard                     none          72             72             74  72.666667
1  female         group C                 some college      standard                completed          69             90             88  82.333333
2  female         group B              master's degree      standard                     none          90             95             93  92.666667
3    male         group A           associate's degree  free/reduced                     none          47             57             44  49.333333
4    male         group C                 some college      standard                     none          76             78             75  76.333333

Let's find out the relation between the Reading and Writing scores

In [14]:
sns.lmplot(x='READING SCORE',y='WRITING SCORE',data=result,hue='GENDER')
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x20eaa10ac88>

Let's find out how the average Math score varies with Gender and Race/Ethnicity

In [15]:
result.pivot_table(values='MATH SCORE',index='GENDER',columns='RACE/ETHNICITY')
Out[15]:
RACE/ETHNICITY    group A    group B    group C    group D    group E
GENDER
female          58.527778  61.403846  62.033333  65.248062  70.811594
male            63.735849  65.930233  67.611511  69.413534  76.746479
In [16]:
pvresult = result.pivot_table(values='MATH SCORE',index='GENDER',columns='RACE/ETHNICITY')
sns.heatmap(pvresult, annot = True)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eab1ec308>

We can see that Group E males have the highest average Math score (76.7), while Group A females have the lowest (58.5)!
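
The highest and lowest cells of a pivot table can be located without reading the heatmap. The sketch below rebuilds the Math pivot table from the values in Out[15], flattens it with `stack()`, and asks for the labels of the extremes. The names `pv`, `cells`, `highest` and `lowest` are illustrative.

```python
import pandas as pd

groups = ["group A", "group B", "group C", "group D", "group E"]

# Average math scores copied from the Out[15] pivot table above.
pv = pd.DataFrame(
    [[58.527778, 61.403846, 62.033333, 65.248062, 70.811594],
     [63.735849, 65.930233, 67.611511, 69.413534, 76.746479]],
    index=pd.Index(["female", "male"], name="GENDER"),
    columns=pd.Index(groups, name="RACE/ETHNICITY"),
)

# stack() turns the table into a Series indexed by (GENDER, RACE/ETHNICITY).
cells = pv.stack()

# idxmax()/idxmin() return the (gender, group) label of the extreme cell.
highest = cells.idxmax()
lowest = cells.idxmin()

print("highest:", highest, cells[highest])
print("lowest:", lowest, cells[lowest])
```

The same pattern applies to the Reading, Writing and AVERAGE pivot tables.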

Similarly, let's analyse the Reading scores

In [17]:
result.pivot_table(values='READING SCORE',index='GENDER',columns='RACE/ETHNICITY')
Out[17]:
RACE/ETHNICITY    group A    group B    group C    group D    group E
GENDER
female          69.000000  71.076923  71.944444  74.046512  75.840580
male            61.735849  62.848837  65.424460  66.135338  70.295775
In [18]:
pvresult = result.pivot_table(values='READING SCORE',index='GENDER',columns='RACE/ETHNICITY')
sns.heatmap(pvresult,cmap='YlOrRd',linecolor='white',linewidths=1, annot = True)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eab2e64c8>

We can see that Group E females, on average, fare better than the others in Reading (75.8)!

Now, for Writing Scores

In [19]:
result.pivot_table(values='WRITING SCORE',index='GENDER',columns='RACE/ETHNICITY')
Out[19]:
RACE/ETHNICITY    group A    group B    group C    group D    group E
GENDER
female          67.861111  70.048077  71.777778  75.023256  75.536232
male            59.150943  60.220930  62.712230  65.413534  67.394366
In [20]:
pvresult = result.pivot_table(values='WRITING SCORE',index='GENDER',columns='RACE/ETHNICITY')
sns.heatmap(pvresult, cmap = 'YlGnBu',linecolor='black',linewidths=1, annot = True)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eab3a1b88>

We can see that Group A males have the lowest average Writing score (59.2), while Group E females have the highest (75.5)!

Let's analyse the average scores now

In [21]:
result.pivot_table(values='AVERAGE',index='GENDER',columns='RACE/ETHNICITY')
Out[21]:
RACE/ETHNICITY    group A    group B    group C    group D    group E
GENDER
female          65.129630  67.509615  68.585185  71.439276  74.062802
male            61.540881  63.000000  65.249400  66.987469  71.478873
In [22]:
pvresult = result.pivot_table(values='AVERAGE',index='GENDER',columns='RACE/ETHNICITY')
sns.heatmap(pvresult, cmap = 'Reds', annot = True)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eab45b088>

On average, Group E females score the highest (74.1), while Group A males score the lowest (61.5)!

Let's see the percentage distribution of males and females in the dataset

In [23]:
(result.GENDER.value_counts()/len(result)) * 100
Out[23]:
female    51.8
male      48.2
Name: GENDER, dtype: float64
In [24]:
gender = result['GENDER'].value_counts()
# Take the labels from value_counts() itself so each label is guaranteed to match its slice
labels = gender.index
plt.pie(gender,labels=labels,autopct="%1.1f%%",shadow=True,explode=(0.04,0.04),startangle=90)
plt.title('GENDER DISTRIBUTION',fontsize=15)
plt.show()
In [25]:
result.GENDER.value_counts()
Out[25]:
female    518
male      482
Name: GENDER, dtype: int64
In [26]:
sns.countplot(x='GENDER', data=result, palette = 'magma')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eab5523c8>

We can see that the gender distribution is close to 50-50: 51.8% female and 48.2% male!
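
The percentages above are computed by dividing `value_counts()` by `len(result)`. pandas can also return the shares directly through the `normalize=True` argument. A minimal sketch on the GENDER values of the five rows shown by `result.head()`; the names `sample` and `gender_pct` are illustrative.

```python
import pandas as pd

# Illustrative sample: the GENDER values of the five rows shown by result.head().
sample = pd.DataFrame({"GENDER": ["female", "female", "female", "male", "male"]})

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
gender_pct = sample["GENDER"].value_counts(normalize=True).mul(100).round(1)

print(gender_pct)
```

The same call works for the TEST PREPARATION COURSE, RACE/ETHNICITY and LUNCH columns.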

In [27]:
gender = result.groupby("GENDER")
# numeric_only=True skips the text columns (needed on pandas 2.0 and later)
gender.mean(numeric_only=True)
Out[27]:
        MATH SCORE  READING SCORE  WRITING SCORE    AVERAGE
GENDER
female   63.633205      72.608108      72.467181  69.569498
male     68.728216      65.473029      63.311203  65.837483
In [28]:
gender.describe().transpose()
Out[28]:
GENDER                    female        male
MATH SCORE     count  518.000000  482.000000
               mean    63.633205   68.728216
               std     15.491453   14.356277
               min      0.000000   27.000000
               25%     54.000000   59.000000
               50%     65.000000   69.000000
               75%     74.000000   79.000000
               max    100.000000  100.000000
READING SCORE  count  518.000000  482.000000
               mean    72.608108   65.473029
               std     14.378245   13.931832
               min     17.000000   23.000000
               25%     63.250000   56.000000
               50%     73.000000   66.000000
               75%     83.000000   75.000000
               max    100.000000  100.000000
WRITING SCORE  count  518.000000  482.000000
               mean    72.467181   63.311203
               std     14.844842   14.113832
               min     10.000000   15.000000
               25%     64.000000   53.000000
               50%     74.000000   64.000000
               75%     82.000000   73.750000
               max    100.000000  100.000000
AVERAGE        count  518.000000  482.000000
               mean    69.569498   65.837483
               std     14.541809   13.698840
               min      9.000000   23.000000
               25%     60.666667   56.000000
               50%     70.333333   66.333333
               75%     78.666667   76.250000
               max    100.000000  100.000000

Therefore, on average, males score about 5 points higher than females in Math, while females score about 7 points higher in Reading and about 9 points higher in Writing!
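
The size of the gap in each subject follows directly from the group means. The sketch below copies the means from Out[27] and subtracts the female row from the male row, so a positive value means males score higher. The names `means` and `gap` are illustrative.

```python
import pandas as pd

# Mean scores by gender, copied from the Out[27] table above.
means = pd.DataFrame(
    {
        "MATH SCORE": [63.633205, 68.728216],
        "READING SCORE": [72.608108, 65.473029],
        "WRITING SCORE": [72.467181, 63.311203],
    },
    index=pd.Index(["female", "male"], name="GENDER"),
)

# Positive gap: males higher on average; negative gap: females higher.
gap = (means.loc["male"] - means.loc["female"]).round(2)

print(gap)
```

In the notebook, `means` would be the output of the grouped mean rather than hand-copied values.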

Finding out the percentage of students who have taken Test Preparation Course Prior to taking Tests

In [29]:
(result['TEST PREPARATION COURSE'].value_counts()/len(result)) * 100
Out[29]:
none         64.2
completed    35.8
Name: TEST PREPARATION COURSE, dtype: float64
In [30]:
test = result['TEST PREPARATION COURSE'].value_counts()
# Take the labels from value_counts() itself so each label is guaranteed to match its slice
labels = test.index
plt.pie(test,labels=labels,autopct="%1.1f%%",shadow=True,explode=(0.04,0.04),startangle=90)
plt.title('TEST PREPARATION COURSE',fontsize=15)
plt.show()
In [31]:
tpc = result.groupby("TEST PREPARATION COURSE")
# numeric_only=True skips the text columns (needed on pandas 2.0 and later)
tpc.mean(numeric_only=True)
Out[31]:
                         MATH SCORE  READING SCORE  WRITING SCORE    AVERAGE
TEST PREPARATION COURSE
completed                 69.695531      73.893855      74.418994  72.669460
none                      64.077882      66.534268      64.504673  65.038941

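We can say that students who completed the Test Preparation Course score higher on average in all three subjects (overall average 72.7 versus 65.0), although this comparison alone does not prove that the course caused the difference!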
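
A difference in group means can partly reflect other factors that differ between the two groups. One rough check is to compare the averages within the categories of another column, for example LUNCH. The sketch below shows the mechanics on the five rows from `result.head()`; with so few rows the numbers mean nothing, and the names `sample`, `score_cols` and `by_lunch_and_course` are illustrative.

```python
import pandas as pd

# Illustrative sample: the five rows shown by result.head() above.
sample = pd.DataFrame({
    "LUNCH": ["standard", "standard", "standard", "free/reduced", "standard"],
    "TEST PREPARATION COURSE": ["none", "completed", "none", "none", "none"],
    "MATH SCORE": [72, 69, 90, 47, 76],
    "READING SCORE": [72, 90, 95, 57, 78],
    "WRITING SCORE": [74, 88, 93, 44, 75],
})

score_cols = ["MATH SCORE", "READING SCORE", "WRITING SCORE"]
sample["AVERAGE"] = sample[score_cols].mean(axis=1)

# Mean AVERAGE for every (lunch, course) combination, laid out as a table.
# Combinations with no students show up as NaN.
by_lunch_and_course = (
    sample.groupby(["LUNCH", "TEST PREPARATION COURSE"])["AVERAGE"]
    .mean()
    .unstack()
)

print(by_lunch_and_course)
```

Run on the full `result` DataFrame, the table shows whether the gap between "completed" and "none" persists inside each lunch category.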

Now, let's see how the scores differ with the Test Preparation Course, gender-wise

In [32]:
fig, ax = plt.subplots(1, 3, figsize=(16,4))
sns.violinplot(x="TEST PREPARATION COURSE", y='MATH SCORE', data=result,hue='GENDER',split=True,palette='PuRd', ax = ax[0])
sns.violinplot(x="TEST PREPARATION COURSE", y='READING SCORE', data=result,hue='GENDER',split = True, 
               palette='Purples', ax = ax[1])
sns.violinplot(x="TEST PREPARATION COURSE", y='WRITING SCORE', data=result,hue='GENDER',split = True, 
               palette='RdPu', ax = ax[2])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eab67bac8>

Let's see how the average marks of the students differ with the Test Preparation Course

In [33]:
sns.boxplot(x="TEST PREPARATION COURSE", y="AVERAGE", hue = "GENDER", data = result)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eab8f09c8>

We can see that students who completed the Test Preparation Course tend to have higher average scores!

Does a Parent's Level of Education influence the student's performance? Let's find out!

In [34]:
p_edu = result.groupby("PARENTAL LEVEL OF EDUCATION")
# numeric_only=True skips the text columns (needed on pandas 2.0 and later)
p_edu.mean(numeric_only=True)
Out[34]:
                             MATH SCORE  READING SCORE  WRITING SCORE    AVERAGE
PARENTAL LEVEL OF EDUCATION
associate's degree            67.882883      70.927928      69.896396  69.569069
bachelor's degree             69.389831      73.000000      73.381356  71.923729
high school                   62.137755      64.704082      62.448980  63.096939
master's degree               69.745763      75.372881      75.677966  73.598870
some college                  67.128319      69.460177      68.840708  68.476401
some high school              63.497207      66.938547      64.888268  65.108007
In [35]:
fig, ax = plt.subplots(3, 1, figsize=(16,16))

sns.boxplot(x = 'PARENTAL LEVEL OF EDUCATION', y = 'MATH SCORE', data = result, ax = ax[0], palette = "magma")

sns.boxplot(x = 'PARENTAL LEVEL OF EDUCATION', y = 'READING SCORE', data = result, ax = ax[1], palette = "plasma")

sns.boxplot(x = 'PARENTAL LEVEL OF EDUCATION', y = 'WRITING SCORE', data = result, ax = ax[2], palette = "inferno")
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eab77f388>

Now, let's see how the average scores vary with Parental Level of Education

In [36]:
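# Plot against parental level of education (the original cell plotted the test preparation course again)
sns.boxplot(x="PARENTAL LEVEL OF EDUCATION", y='AVERAGE', data=result,hue='GENDER', palette='inferno')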
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eacf2f3c8>

Yeah! A higher Parental Level of Education goes along with higher scores: the group averages range from 63.1 (high school) to 73.6 (master's degree)!
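
The grouped table above lists the education levels alphabetically, which hides the trend. The sketch below copies the AVERAGE column from Out[34] and reorders it from lowest to highest level. The ordering in `edu_order` is an assumption about how the levels rank and is not something the dataset states; the names `avg_by_edu`, `edu_order` and `ordered` are illustrative.

```python
import pandas as pd

# AVERAGE by parental level of education, copied from the Out[34] table above.
avg_by_edu = pd.Series(
    {
        "associate's degree": 69.569069,
        "bachelor's degree": 71.923729,
        "high school": 63.096939,
        "master's degree": 73.598870,
        "some college": 68.476401,
        "some high school": 65.108007,
    },
    name="AVERAGE",
)

# Assumed ranking of the levels, lowest to highest (not given by the dataset).
edu_order = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

# reindex() rearranges the rows to follow the assumed ranking.
ordered = avg_by_edu.reindex(edu_order)

print(ordered)
```

With this ordering the averages rise from "high school" upward, though not strictly: "some high school" (65.1) sits slightly above "high school" (63.1).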

Let's find the count of students belonging to each Race/Ethnicity group

In [37]:
# Percentage distribution of the groups
(result["RACE/ETHNICITY"].value_counts()/len(result)) * 100
Out[37]:
group C    31.9
group D    26.2
group B    19.0
group E    14.0
group A     8.9
Name: RACE/ETHNICITY, dtype: float64
In [38]:
sns.countplot(x='RACE/ETHNICITY', data=result, palette = 'Reds')
sns.despine()

The largest share of students belongs to Group C (31.9%), while Group A has the fewest students (8.9%)!

In [39]:
sns.boxplot(x = 'RACE/ETHNICITY', y = 'AVERAGE', data = result, palette = "magma")
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eaca883c8>

We can see that Group E students have a higher average than the other groups!

Let's see how the distribution of Parental Level of Education varies with Race/Ethnicity

In [40]:
plt.figure(figsize = (16,5))
sns.countplot(x="PARENTAL LEVEL OF EDUCATION", hue="RACE/ETHNICITY", data=result, palette='viridis')
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eacb56248>

Let's find out the percentage of students who receive standard and free/reduced lunch

In [41]:
(result["LUNCH"].value_counts()/len(result)) * 100
Out[41]:
standard        64.5
free/reduced    35.5
Name: LUNCH, dtype: float64
In [47]:
lunch = result['LUNCH'].value_counts()
# Plot the lunch counts (the original cell plotted `test`) and label each slice from the same Series
labels = lunch.index
plt.pie(lunch,labels=labels,autopct="%1.1f%%",shadow=True,explode=(0.04,0.04),startangle=90)
plt.title('LUNCH DISTRIBUTION',fontsize=15)
plt.show()
In [43]:
# Plotting the figures
fig, ax = plt.subplots(3, 1, figsize=(16,16))
sns.swarmplot(x="RACE/ETHNICITY", y='MATH SCORE', data=result,hue='LUNCH',palette='Purples', ax = ax[0])
sns.swarmplot(x="RACE/ETHNICITY", y='READING SCORE', data=result,hue='LUNCH', palette='Blues', ax = ax[1])
sns.swarmplot(x="RACE/ETHNICITY", y='WRITING SCORE', data=result,hue='LUNCH', palette='Greens', ax = ax[2])
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eacc9bfc8>

Let's see whether the scores of students differ by type of Lunch

In [44]:
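lunch_grp = result.groupby("LUNCH")
# numeric_only=True skips the text columns (needed on pandas 2.0 and later)
lunch_grp.mean(numeric_only=True)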
Out[44]:
              MATH SCORE  READING SCORE  WRITING SCORE    AVERAGE
LUNCH
free/reduced   58.921127      64.653521      63.022535  62.199061
standard       70.034109      71.654264      70.823256  70.837209

Students with Standard Lunch score higher on average than those with Free/Reduced Lunch (70.8 versus 62.2)!

Let's see how the type of Lunch varies with Race/Ethnicity

In [45]:
sns.countplot(x="RACE/ETHNICITY", hue="LUNCH", data=result, palette='Oranges')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eacd7ef08>

Group C has the largest number of students receiving Free/Reduced Lunch, while Group A has the smallest
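
Raw counts depend on group size, and Group C is also the largest group overall, so a high count does not by itself mean a high rate. A row-normalised cross-tabulation gives the share of each lunch type within each group. The sketch below shows the call on the five rows from `result.head()`; the names `sample` and `lunch_share` are illustrative, and the tiny sample is only there to make the example runnable.

```python
import pandas as pd

# Illustrative sample: the five rows shown by result.head() above.
sample = pd.DataFrame({
    "RACE/ETHNICITY": ["group B", "group C", "group B", "group A", "group C"],
    "LUNCH": ["standard", "standard", "standard", "free/reduced", "standard"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
lunch_share = pd.crosstab(
    sample["RACE/ETHNICITY"], sample["LUNCH"], normalize="index"
)

print(lunch_share)
```

Passing GENDER instead of RACE/ETHNICITY as the first argument gives the equivalent breakdown by gender.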

Is Free/Reduced Lunch Gender Biased? Let's find out!

In [46]:
sns.countplot(x="LUNCH", data=result,hue = 'GENDER', palette='YlGnBu')
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eacd6f708>

In both lunch categories, Standard and Free/Reduced, females outnumber males!