---
title: "Exploratory Data Analysis (EDA) and Table Operations"
author: "Ic3fr0g"
date: '`r Sys.Date()`'
output:
  html_document:
    code_folding: show
    fig_caption: yes
    fig_height: 4.5
    fig_width: 7
    highlight: tango
    number_sections: yes
    theme: cosmo
    toc: yes
  html_notebook:
    code_folding: show
    fig_caption: yes
    fig_height: 4.5
    fig_width: 7
    highlight: tango
    number_sections: yes
    theme: cosmo
    toc: yes
---


# Installing Packages {.tabset .tabset-fade .tabset-pills}

## Installing data.table
Before we start working with tables, let's install the `data.table` package, which we will use heavily:
```{r}
# Installs data.table (uncomment and run once)
# install.packages("data.table")
# Loads the package
library(data.table)
```

## Installing dplyr
Before we start manipulating data, let's install the `dplyr` package:
```{r}
# Installs dplyr (uncomment and run once)
# install.packages("dplyr")
# Loads the package
library(dplyr)
```

## Installing ggplot2
We will use the `ggplot2` package for data visualisation:
```{r}
# Installs ggplot2 (uncomment and run once)
# install.packages("ggplot2")
# Loads the package
library(ggplot2)
```

# Import/Export Data {.tabset .tabset-fade .tabset-pills}

## Read Tables
In RStudio, you can import a dataset from the Environment pane in the top-right corner, or by navigating to `File > Import Dataset`. Besides RStudio's native UI, there are built-in functions and packages that let you import datasets into R. Let's import the data at a particular path and then inspect it with the `View(...)` function, which opens RStudio's data viewer:
```{r}
# Import CSV (relative path) into a dataframe
titanic_df <- read.csv("../Data/train.csv")
# Converts it to a data table
titanic_df <- data.table(titanic_df)
# Check the data
titanic_df
```
If you received an error, it's probably because you didn't download the GitHub repository. There are two likely causes:

- you didn't replicate the folder structure that I have, or
- you didn't set a working directory.

The following code will help you resolve the error:
```{r}
# Importing data using the absolute path
# titanic_df <- read.csv("your-path-here/train.csv")
# Setting a working directory
# setwd("your-path-here/BeginR/Code")
# View the data
# View(titanic_df)
```
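Before changing any paths, it can help to see where R is actually looking. This is only a small sketch using base R; `csv_path` is an illustrative name, and the path is the relative one assumed in the chunk above:

```{r}
# Relative path assumed from the import chunk above; adjust to your own layout
csv_path <- "../Data/train.csv"

# The directory that relative paths are resolved against
getwd()

# TRUE if the file can be found from the current working directory
file.exists(csv_path)

# The absolute path R would try to open (no error if the file is missing)
normalizePath(csv_path, mustWork = FALSE)
```

If `file.exists(...)` returns `FALSE`, either the working directory or the folder structure differs from what the workshop assumes.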

## Write tables
Now let's write a table to disk:
```{r}
# Create a table
df <- data.frame(names = c("X","Y","Z"),
                 gender = c("Male","Female","Female"),
                 score = c(67, 99, 85))

# Write the table to a CSV file without the row names
write.csv(x = df, file = "../Code/something.csv", row.names = FALSE)
```
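To check that a table survives the trip to disk and back, one option is to write it to a temporary file and read it in again. This is a sketch with made-up data; `toy_scores`, `tmp_file` and `toy_scores_back` are illustrative names, not part of the workshop files:

```{r}
# A small made-up table (not the workshop data)
toy_scores <- data.frame(names = c("X", "Y", "Z"),
                         score = c(67.5, 99, 85.25),
                         stringsAsFactors = FALSE)

# tempfile() gives a throwaway path, so nothing in the repository is touched
tmp_file <- tempfile(fileext = ".csv")
write.csv(x = toy_scores, file = tmp_file, row.names = FALSE)

# Read the file back and compare it with the original
toy_scores_back <- read.csv(tmp_file, stringsAsFactors = FALSE)
all.equal(toy_scores, toy_scores_back)
```

Leaving `row.names = FALSE` out would add an extra, unnamed column of row numbers to the file, and the comparison would no longer match.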


# Understanding the Data {.tabset .tabset-fade .tabset-pills}

## Structure of data
Let's explore the structure of the data:
```{r}
# Structure of the data
str(titanic_df)
```
Most variable names are not self-explanatory, so let's look at the data description from the source:

Variable Name | Description
--------------|-------------
Survived      | Survived (1) or died (0)
Pclass        | Passenger's class
Name          | Passenger's name
Sex           | Passenger's sex
Age           | Passenger's age
SibSp         | Number of siblings/spouses aboard
Parch         | Number of parents/children aboard
Ticket        | Ticket number
Fare          | Fare
Cabin         | Cabin
Embarked      | Port of embarkation

## Summary statistics of data
```{r}
# Displays the summary of all variables in the dataframe
summary(titanic_df)
```


# Data Manipulation and Viz {.tabset .tabset-fade .tabset-pills}

## Missing values
Note that `Age` has 177 `NA`s and `Embarked` has 2 observations with `""` as entries, so some data is missing. In practice, missing values are usually imputed, that is, filled in with estimates. First, let's learn to retrieve a subset of the data using the `filter(...)` function:
```{r}
# Keep only the rows where Embarked is ""
filter(titanic_df, Embarked == "")
```
Let's remove these two observations from our data:
```{r}
# Drop the rows where Embarked is ""
titanic_df <- filter(titanic_df, Embarked != "")
titanic_df
```
The row count decreased by 2, so we have removed the observations with a missing `Embarked` value. In the case of `Age`, however, we cannot simply drop 177 observations from our data. There also seem to be missing values in `Cabin`:
```{r}
# Displays the first 6 rows (the default for head())
head(titanic_df)
```
How many rows are missing a value for `Cabin`?
```{r}
filter(titanic_df, Cabin == "")
```
Wow, 687 rows have no information for `Cabin`, which means that the data in this column is extremely sparse.
Imputing missing values is out of scope for this workshop, so let's look at other insights our data can provide.
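
Counting the two kinds of missing value (`NA` and `""`) and splitting a table on them can be sketched on a tiny made-up table. `toy_passengers`, `cabin_missing` and `cabin_known` are illustrative names, and the values are invented:

```{r}
library(dplyr)

# Made-up rows: Age uses NA for missing, Cabin uses an empty string
toy_passengers <- data.frame(Age = c(22, NA, 35, NA),
                             Cabin = c("", "C85", "", ""),
                             stringsAsFactors = FALSE)

# NA values are counted with is.na(); empty strings with a comparison
sum(is.na(toy_passengers$Age))
sum(toy_passengers$Cabin == "")

# filter() keeps the rows where the condition is TRUE
cabin_missing <- filter(toy_passengers, Cabin == "")
cabin_known <- filter(toy_passengers, Cabin != "")

# The two subsets together account for every row
nrow(cabin_missing) + nrow(cabin_known) == nrow(toy_passengers)
```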

## Investigation I
**Are there several family names and do the ticket fares differ for them?**

Let's create another variable called `Surname`, using `strsplit(...)` to split `Name` and `sapply(...)` to apply that function to every row in the table:
```{r}
# Splits every Name based on ',' or '.'
titanic_df$Surname <- sapply(titanic_df$Name, 
                             function(x) {
                                 strsplit(as.character(x), 
                                          split = '[,.]')[[1]][1]
                                 }
                             )
# Display the table
titanic_df
```
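To see what the split does to a single name, here is a sketch on two example strings written in the same "Surname, Title. Given names" pattern. `example_names`, `name_parts` and `surnames` are illustrative names, and the `sub(...)` call is an alternative I am suggesting, not the workshop's method:

```{r}
example_names <- c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")

# strsplit() returns a list with one character vector per input string;
# [[1]] picks the pieces for the first name
name_parts <- strsplit(example_names[1], split = "[,.]")[[1]]
name_parts

# The surname is the first piece
name_parts[1]

# A vectorised alternative: delete everything from the first comma onwards
surnames <- sub(",.*$", "", example_names)
surnames
```

Because `sub(...)` works on the whole vector at once, it does not need `sapply(...)`.
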
Let's group the data by surname using the `group_by(...)` function and then use the `summarise(...)` function to see whether fares differ between families:
```{r}
# Group the data based on surnames
grouped_surnames <- group_by(titanic_df, Surname)
# Create two columns MeanFare and Total count
summarise(grouped_surnames, MeanFare = mean(Fare), Total = n())
```
Frankly, this investigation hasn't revealed anything useful. It's important to ask the right questions, so let's try something simpler.
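
Before moving on, the grouping step itself can be checked by hand on a made-up table. `toy_fares` and `toy_summary` are illustrative names and the fares are invented:

```{r}
library(dplyr)

# Three made-up passengers, two of whom share a surname
toy_fares <- data.frame(Surname = c("Smith", "Smith", "Jones"),
                        Fare = c(10, 20, 30),
                        stringsAsFactors = FALSE)

# group_by() marks the groups; summarise() collapses each group to one row
toy_summary <- summarise(group_by(toy_fares, Surname),
                         MeanFare = mean(Fare),
                         Total = n())
toy_summary
```

Each surname ends up with one row: the mean of its fares and the number of passengers carrying it.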

## Investigation II
**Do families sink or swim together?**

We're going to make a `family size` variable based on the number of siblings/spouse(s) (maybe someone has more than one spouse?) and the number of children/parents.
```{r}
# Create a family size variable including the passenger themselves
titanic_df <- mutate(titanic_df, Fsize = SibSp + Parch + 1)
# Display the result
titanic_df
```
What does our family size variable look like? To help us understand how it may relate to survival, let's plot it using the `geom_bar(...)` function:
```{r}
# Use ggplot2 to visualize the relationship between family size & survival
# geom_bar() plots a bar chart
ggplot(titanic_df, aes(x = Fsize, fill = factor(Survived))) + 
    geom_bar(position = 'dodge') + 
    scale_x_continuous(breaks = 1:11)
```
We can see that there’s a survival penalty to singletons and those with family sizes above 4.
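
The bar chart shows counts. To put a number on the survival rate per family size, one possible approach is to take the mean of the 0/1 `Survived` column within each group. This sketch uses invented rows; `toy_families` and `toy_rates` are illustrative names:

```{r}
library(dplyr)

# Made-up passengers (not the Titanic data)
toy_families <- data.frame(SibSp = c(0, 0, 1, 1, 4),
                           Parch = c(0, 0, 0, 0, 2),
                           Survived = c(0, 1, 1, 1, 0))

# Same family size definition as above: relatives aboard plus the passenger
toy_families <- mutate(toy_families, Fsize = SibSp + Parch + 1)

# The mean of a 0/1 column is the proportion of 1s, i.e. the survival rate
toy_rates <- summarise(group_by(toy_families, Fsize),
                       SurvivalRate = mean(Survived),
                       Count = n())
toy_rates
```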

## Investigation III
Before we investigate another phenomenon let's review what a box-whisker plot is.

![Box-Whisker plot](../Images/box_plot.png)

**Do the Fares of the passengers vary with Passenger Class and Port of Embarkation?**

Here I will also introduce the pipe operator, `%>%`. Think of it as a machine that takes the result of the expression on its left and passes it as the first argument to the function on its right.
```{r}
# Group the data on Embarked and Pclass
titanic_df %>%
    group_by(Embarked, Pclass) %>%
    summarise(meanF = mean(Fare), medianF = median(Fare), maxF = max(Fare), Count = n())
```
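The equivalence between nested calls and a pipeline can be sketched on a plain numeric vector. `toy_values`, `nested` and `piped` are illustrative names:

```{r}
library(dplyr)  # dplyr makes the %>% operator available

toy_values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(toy_values)), 1)

# The pipeline is read from left to right; each result becomes the
# first argument of the next function
piped <- toy_values %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Both forms compute the same value; the piped form simply lists the steps in the order they happen.
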
Visualizing numerical data is a powerful tool for understanding the spread of the data and, more importantly, the outliers. Let's plot these values using the `geom_boxplot(...)` function:
```{r}
# geom_boxplot() plots a boxplot
ggplot(titanic_df, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot()
```
We can see there are some clear outliers here. Moreover, passengers who embarked at `Q` in classes 2 and 3 seem to have paid the cheapest fares.
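
How a point comes to be drawn as an outlier can be sketched with base R. By default `geom_boxplot(...)` extends the whiskers to the most extreme value within 1.5 times the interquartile range of the box. The fares below are invented, and `toy_fares_vec`, `fare_quartiles`, `upper_fence` and `outlier_fares` are illustrative names; note that `quantile(...)` may place the quartiles slightly differently from the hinges a box plot uses:

```{r}
# Made-up fares, deliberately including two very large values
toy_fares_vec <- c(7, 8, 8, 10, 13, 15, 26, 31, 80, 512)

# Quartiles: the bottom, middle line and top of the box
fare_quartiles <- quantile(toy_fares_vec, probs = c(0.25, 0.5, 0.75))
fare_quartiles

# Interquartile range: the height of the box
fare_iqr <- IQR(toy_fares_vec)

# Values above this fence are plotted as individual outlier points
upper_fence <- fare_quartiles[["75%"]] + 1.5 * fare_iqr
outlier_fares <- toy_fares_vec[toy_fares_vec > upper_fence]
outlier_fares
```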

## Investigation IV
**What factors could possibly influence survival rates?**

We examine the correlation, which is the covariance standardised by the standard deviations of the two variables: $Cor(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}$, where $Cov(X,Y) = E(XY) - E(X)E(Y)$, for numerical vectors:
```{r}
# Convert categorical variables to numeric codes, select the numeric columns and calculate correlation
# factor() makes this work whether the columns were read in as factors or as character strings
cor_df <- titanic_df %>%
    mutate(SexN = as.numeric(factor(Sex)), EmbarkedN = as.numeric(factor(Embarked))) %>%
    select(Survived, EmbarkedN, Pclass, SexN, SibSp, Parch, Fsize, Fare) %>%
    cor()
# Display the data
cor_df
```
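The formula can be checked by hand against `cor(...)` on two short made-up vectors. `toy_x`, `toy_y` and `manual_cor` are illustrative names. The sketch uses population moments (dividing by n); the factor that distinguishes them from sample moments cancels in the ratio, so the result should agree with `cor(...)`:

```{r}
toy_x <- c(1, 2, 3, 4, 5)
toy_y <- c(2, 1, 4, 3, 5)

# Covariance: E(XY) - E(X)E(Y)
cov_xy <- mean(toy_x * toy_y) - mean(toy_x) * mean(toy_y)

# Standard deviations from the same kind of moments
sd_x <- sqrt(mean(toy_x^2) - mean(toy_x)^2)
sd_y <- sqrt(mean(toy_y^2) - mean(toy_y)^2)

# Correlation: covariance divided by the product of standard deviations
manual_cor <- cov_xy / (sd_x * sd_y)
manual_cor

# Compare with the built-in function
all.equal(manual_cor, cor(toy_x, toy_y))
```
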
Next we `melt(...)` the correlation matrix into long form, with one row per pair of variables:
```{r}
# Melts the correlation matrix into long form (columns Var1, Var2, value)
output <- melt(cor_df)
# Display
output
```
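As far as I know, `melt(...)` from `data.table` only has a method for data tables and, when handed a matrix, tries to hand the work over to the `reshape2` package, so the chunk above may need `reshape2` installed. If that is a problem, a base R sketch that produces the same three columns looks like this (`toy_cor` and `toy_long` are illustrative names and the numbers are invented):

```{r}
# A small made-up correlation matrix
toy_cor <- cor(data.frame(a = c(1, 2, 3, 4),
                          b = c(2, 4, 5, 9),
                          c = c(4, 3, 2, 1)))

# as.table() + as.data.frame() gives one row per cell of the matrix
toy_long <- as.data.frame(as.table(toy_cor))

# Rename the columns to match the names used in the plot below
names(toy_long) <- c("Var1", "Var2", "value")
toy_long
```
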
Let's visualise the correlation using the `geom_tile(...)` function:
```{r}
# geom_tile() draws the correlation matrix as a heatmap
ggplot(output, aes(x=Var1, y=Var2, fill=value)) + 
    geom_tile()
```

**Correlation is not causation**, but:

 - There seems to be a high positive correlation between `SibSp`, `Parch` and `Fsize`. What is the reason for this?
 - There is a negative correlation between `SexN` and `Survived`. It suggests that one sex was more likely to survive. But which one is it? Can you infer it from `str(titanic_df)`?
 - What other things can you see?

## Investigation V
**Does the `Fare` follow a normal distribution?**

Many random variables in nature are approximately normally distributed. Let's look at the density plot of the `Fare` variable using the `geom_density(...)` function:
```{r}
# geom_density() plots a density plot
ggplot(titanic_df, aes(x = Fare)) + geom_density()
```
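Alongside the plot, a couple of quick numerical checks can hint at whether a variable is far from normal. This sketch uses simulated right-skewed values rather than the Titanic fares; `toy_skewed` is an illustrative name, and the Shapiro-Wilk test is one possible choice that the workshop itself does not use:

```{r}
set.seed(42)

# Simulated right-skewed data (exponential), standing in for a fare-like variable
toy_skewed <- rexp(200, rate = 1 / 30)

# For a symmetric distribution the mean and median are close;
# a long right tail pulls the mean above the median
mean(toy_skewed)
median(toy_skewed)

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro.test(toy_skewed)$p.value
```
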
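Here's some code to get the density values, which uses all the table operations we have learnt so far. Here the density of a fare is simply its relative frequency, that is, the share of passengers who paid exactly that fare. Make sure to experiment with every line in the code below: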
```{r}
density_values <- titanic_df %>%
    group_by(Fare) %>%                               # Groups the data by Fare
    summarise(Frequency = n()) %>%                   # Generates the frequency of each Fare
    mutate(Density = Frequency/sum(Frequency)) %>%   # Creates a column Density
    select(Fare, Density)                            # Keeps only the Fare and Density columns
# Display density_values
density_values
```
