1 Recap

1.1 Vectors

A vector is an ordered sequence of values of a single type: numbers, characters or boolean values.

# Assignment
numeric_vector <- 1:10    # OR assign('numeric_vector', 1:10)
# Squared numbers
squared_numbers <- numeric_vector^2
# All squared numbers but the last 3
squared_numbers[-(8:10)]
## [1]  1  4  9 16 25 36 49

The length of a vector can be determined like so:

length(squared_numbers)
## [1] 10

1.2 Matrices

Matrices are two-dimensional arrays containing elements of the same data type:

# List of albums
albums <- c('Infinite', 'The Slim Shady LP', 'The Marshall Mathers LP', 
            'The Eminem Show', 'Encore', 'Relapse', 'Recovery', 
            'The Marshall Mathers LP 2', 'Revival')
# Year of Release
years <- c(1996, 1999, 2000, 2002, 2004, 2009, 2010, 2013, 2017)
# Matrix of Eminem albums and release years
eminem_album_releases <- matrix(c(albums,years), nrow = 9, ncol = 2)
# Colnames of the matrix
colnames(eminem_album_releases) <- c("Albums","Release Year")
# Display
eminem_album_releases
##       Albums                      Release Year
##  [1,] "Infinite"                  "1996"      
##  [2,] "The Slim Shady LP"         "1999"      
##  [3,] "The Marshall Mathers LP"   "2000"      
##  [4,] "The Eminem Show"           "2002"      
##  [5,] "Encore"                    "2004"      
##  [6,] "Relapse"                   "2009"      
##  [7,] "Recovery"                  "2010"      
##  [8,] "The Marshall Mathers LP 2" "2013"      
##  [9,] "Revival"                   "2017"

Let’s get the second album of Eminem:

# Will the Real Slim Shady please stand up ?
eminem_album_releases[2, 'Albums']
##              Albums 
## "The Slim Shady LP"
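The quotes around the years in the matrix output hint at a side effect: since a matrix holds a single data type, combining names and numbers with c(...) turns the numbers into character strings. Here is a minimal sketch of that coercion; the names mixed and release_years are made up for this illustration:

```r
# Hypothetical mini example: two albums and their release years
mixed <- matrix(c("Encore", "Relapse", 2004, 2009), nrow = 2, ncol = 2)
# Every element, including the years, is now a character string
typeof(mixed)
# Convert the second column back to numbers before doing arithmetic
release_years <- as.numeric(mixed[, 2])
# Years between the two releases
release_years[2] - release_years[1]
```

If you need columns of different types side by side, a data frame is usually the better fit.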

2 Summary Statistics

2.1 Normal Distribution

In R, you can generate a sequence of random numbers that are normally distributed with a mean of 0 and standard deviation of 1 by default:

# Generate 20 random numbers
x.norm <- rnorm(n = 20)
# Display 'x.norm'
x.norm
##  [1] -0.48996676 -0.04776829  0.46557437 -2.32318211  0.31044294
##  [6] -0.61539061  0.56106372  0.96918444  0.72457129 -0.88147999
## [11] -0.29552758 -1.07698244 -1.66163702  0.27533420 -0.71154931
## [16]  1.12751870  0.29776926  1.11767235 -0.16258690  0.79116556
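Since rnorm(...) draws fresh random numbers on every call, your values will differ from the ones shown here. If you want repeatable draws, one option is to fix the seed with set.seed(...) first. A small sketch follows; the seed value 42 and the variable names are arbitrary choices for this illustration:

```r
# Fix the state of the random number generator (42 is an arbitrary seed)
set.seed(42)
first_draw <- rnorm(n = 5)
# Resetting the same seed reproduces exactly the same numbers
set.seed(42)
second_draw <- rnorm(n = 5)
identical(first_draw, second_draw)
# The defaults can be overridden through the 'mean' and 'sd' arguments
shifted_draw <- rnorm(n = 5, mean = 10, sd = 2)
```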

Let’s sort this data, using sort(...):

# Sorting
x.norm <- sort(x.norm)
# Display
x.norm
##  [1] -2.32318211 -1.66163702 -1.07698244 -0.88147999 -0.71154931
##  [6] -0.61539061 -0.48996676 -0.29552758 -0.16258690 -0.04776829
## [11]  0.27533420  0.29776926  0.31044294  0.46557437  0.56106372
## [16]  0.72457129  0.79116556  0.96918444  1.11767235  1.12751870

2.2 Mean, Median and Standard Deviation

The mean or the average value of x.norm can be calculated by:

# Average
mean(x.norm)
## [1] -0.08128871

The median is the middle value of the sorted observations:

# Median
median(x.norm)
## [1] 0.113783
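The mean and the median can differ quite a bit when the data contains extreme values. As a quick illustration with made-up numbers (the vector skewed is not part of the data above):

```r
# Hypothetical sample with one very large observation
skewed <- c(1, 2, 3, 4, 100)
# The mean is pulled towards the outlier
mean(skewed)    # 22
# The median stays at the middle observation
median(skewed)  # 3
```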

The standard deviation, which measures how far the observations stray from the mean, can also be calculated using the formula: \[\sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N-1}}\] The same can be done in R like so:

# Standard deviation formula in R
sqrt(sum((x.norm - mean(x.norm))^2)/(length(x.norm) - 1))
## [1] 0.9349146

Mom’s spaghetti!!! There’s always a better way to do things: we can use the sd(...) function:

# SD from the function
sd(x.norm)
## [1] 0.9349146
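To convince yourself that the formula and sd(...) agree, it can help to work through a small example by hand. The vector scores below is made up for this sketch: its mean is 5 and its squared deviations sum to 32, so the variance is 32/7.

```r
# Hypothetical sample of 8 observations with mean 5
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)
# Sum of squared deviations from the mean: 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32
squared_deviations <- sum((scores - mean(scores))^2)
# Apply the formula with N - 1 in the denominator
manual_sd <- sqrt(squared_deviations / (length(scores) - 1))
# Compare with the built-in function, allowing for rounding error
all.equal(manual_sd, sd(scores))
# The variance is the square of the standard deviation
all.equal(var(scores), sd(scores)^2)
```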

2.3 Minimum, Maximum, Quantiles and Summary

Let’s see the functions min(...) and max(...) in action:

# Returns the smallest and biggest element
print(c(min(x.norm), max(x.norm)))
## [1] -2.323182  1.127519

Quantiles are cutpoints that split the sorted observations into equal-sized parts. There are several quantile systems, but we generally refer to the 4-quantile system, whose cutpoints are called quartiles:

# Find the quantiles
q <- quantile(x.norm)
# Display
q
##         0%        25%        50%        75%       100% 
## -2.3231821 -0.6394303  0.1137830  0.6019406  1.1275187
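Other quantile systems can be requested through the probs argument of quantile(...). The sketch below uses a made-up vector, small_sample, so that the cutpoints are easy to check by eye:

```r
# Hypothetical sample: the integers 1 to 9
small_sample <- 1:9
# Default call returns the quartile cutpoints: 1 3 5 7 9
quartiles <- quantile(small_sample)
quartiles
# Ask for the 10% and 90% cutpoints instead
outer_deciles <- quantile(small_sample, probs = c(0.1, 0.9))
outer_deciles
```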

Using what we learnt from vector logic, let’s see how the quantiles split x.norm:

# Observations below the first quartile (the 25% cutpoint)
x.norm[x.norm < q[2]]
## [1] -2.3231821 -1.6616370 -1.0769824 -0.8814800 -0.7115493
# Observations between the first quartile and the median
x.norm[x.norm > q[2] & x.norm < q[3]]
## [1] -0.61539061 -0.48996676 -0.29552758 -0.16258690 -0.04776829
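Writing one comparison per group gets tedious, and the strict < and > would skip any observation that is exactly equal to a cutpoint. One possible alternative is cut(...), which assigns every observation to a group in a single call. This is only a sketch on made-up data; the names values and quarters and the labels Q1 to Q4 are assumptions for the illustration:

```r
# Hypothetical sample: the integers 1 to 20
values <- 1:20
# Use the quartile cutpoints as breaks; include.lowest keeps the minimum in Q1
quarters <- cut(values, breaks = quantile(values), include.lowest = TRUE,
                labels = c("Q1", "Q2", "Q3", "Q4"))
# Each quarter holds 5 of the 20 observations
table(quarters)
# Observations that fall into the second quarter
values[quarters == "Q2"]
```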

Well, in this case as well there is a better way to calculate the basic statistics, using summary(...):

# Generic function to describe an object
summary(x.norm)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.32318 -0.63943  0.11378 -0.08129  0.60194  1.12752

2.4 Plotting Statistics

Let’s plot 20 random points to understand the normal distribution:

plot(rnorm(20))

Hmm, I think it will be easier to appreciate the properties of the normal distribution (0 mean and 1 standard deviation) with more points:

plot(rnorm(1000))

For people interested in seeing the density function for these points:

w <- rnorm(1000) 
hist(w, col = "red", freq = FALSE, xlim = c(-5,5))
curve(dnorm, -5, 5, add = TRUE, col = "blue")

Let’s introduce a function abline(...) which enables us to add horizontal and vertical lines to an existing plot:

plot(x.norm)
# 'h' means horizontal line
# 'col' means color
abline(h = mean(x.norm), col = 'red')
abline(h = median(x.norm), col = 'blue')
abline(h = mean(x.norm) + sd(x.norm), col = 'green')
abline(h = mean(x.norm) - sd(x.norm), col = 'green')

Let’s try to plot the quantiles from before:

plot(x.norm)
abline(h = median(x.norm), col = 'blue')
abline(h = summary(x.norm)[2], col = 'red')
abline(h = summary(x.norm)[5], col = 'red')

3 Factors

3.1 Categorical Variables

The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values. Let’s invoke the factor(...) function:

# Language Codes
lang <- c("en","fr","hi","hi","ru","ru","ru","ru","fr","hi","en","cn")
# Factor of Language codes
langf <- factor(lang)
langf
##  [1] en fr hi hi ru ru ru ru fr hi en cn
## Levels: cn en fr hi ru

There are two types of categorical variables: nominal and ordinal. A nominal variable is a categorical variable without an implied order, which means that it is impossible to say that ‘one is worth more than the other’. An ordinal variable, by contrast, has an inherent ordering. We’ve just seen an example of a nominal variable; let’s see an example of an ordinal variable:

# Speed traps measuring car speeds
speed <- c("high","med","high","med","low")
# The ordering is given by specifying the levels
speedf <- factor(speed, ordered = TRUE, levels = c("low","med","high"))
speedf
## [1] high med  high med  low 
## Levels: low < med < high

3.2 Levels

Notice also that there are no quotes around the values. That’s because they’re not strings; they’re actually integer references to one of the factor’s levels. But what is a level? The levels are simply the unique values in the vector:

levels(speedf)
## [1] "low"  "med"  "high"
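You can peek at those integer references with as.integer(...). The sketch below builds a fresh factor, speed_demo (a made-up name), so that it does not interfere with speedf:

```r
# Hypothetical ordered factor with the same levels as before
speed_demo <- factor(c("high", "med", "low"), ordered = TRUE,
                     levels = c("low", "med", "high"))
# The underlying integer codes point into the levels: 3 2 1
speed_codes <- as.integer(speed_demo)
speed_codes
# Looking the codes up in the levels recovers the original strings
levels(speed_demo)[speed_codes]
```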

Let’s try comparing ordered factors. In our previous example, we want to test whether the second car is going slower than the third car:

# Ordered factors are compared by the order of their levels
speedf[2] < speedf[3]
## [1] TRUE

3.3 Frequency tables

Try calling summary(...) on every object from now on:

# Returns frequencies
summary(langf)
## cn en fr hi ru 
##  1  2  2  3  4
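The same counts can also be obtained from a plain character vector with table(...), and prop.table(...) turns them into relative frequencies. A small sketch on made-up data (codes and code_counts are names chosen for this illustration):

```r
# Hypothetical language codes
codes <- c("en", "fr", "hi", "hi", "ru")
# Frequency of each distinct value
code_counts <- table(codes)
code_counts
# Share of each value in the total
code_shares <- prop.table(code_counts)
code_shares
```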

3.4 Tests for Factors

By now you will have noticed that similar built-in functions exist for nearly every data structure, so don’t be surprised by is.factor(...):

# Performs a check for factor
is.factor(langf)
## [1] TRUE

Or, something like this:

# Returns a vector
as.vector(langf)
##  [1] "en" "fr" "hi" "hi" "ru" "ru" "ru" "ru" "fr" "hi" "en" "cn"

Or, this:

# Returns a one-column character matrix
as.matrix(langf)
##       [,1]
##  [1,] "en"
##  [2,] "fr"
##  [3,] "hi"
##  [4,] "hi"
##  [5,] "ru"
##  [6,] "ru"
##  [7,] "ru"
##  [8,] "ru"
##  [9,] "fr"
## [10,] "hi"
## [11,] "en"
## [12,] "cn"

4 Dataframes

4.1 Explore a Dataset

There are several built-in datasets within R. Arguably, the ‘Hello World’ example of these datasets is mtcars:

# Motor Trend Car Road Tests
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Let’s have a look at the top rows from this dataset using head(...):

# Top rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s have a look at the bottom rows from this dataset using tail(...):

# Bottom rows
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Invoking class(mtcars) will tell us it’s an object of type data.frame. But let’s go a step further and explore the structure of the data frame using str(...):

# Returns the structure
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
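A few more helpers are handy when you only need one piece of that information. The calls below are a short sketch using the same mtcars dataset:

```r
# Number of rows and columns: 32 11
dim(mtcars)
# The same information, one dimension at a time
nrow(mtcars)
ncol(mtcars)
# Column names only
names(mtcars)
# Summary statistics for a single column
summary(mtcars$mpg)
```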

4.2 Creating Dataframes

Calling the data.frame(...) function enables us to create a dataframe. Notice the vector names become the column names automatically. Each data argument to the data.frame(...) function takes the form of either value or tag = value:

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
planets_df
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE

Let’s explore the structure once again:

# Returns the structure of the planets
str(planets_df)
## 'data.frame':    8 obs. of  5 variables:
##  $ name    : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
##  $ type    : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
##  $ diameter: num  0.382 0.949 1 0.532 11.209 ...
##  $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
##  $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
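The name and type columns show up as factors because this output comes from an R version older than 4.0.0, where data.frame(...) converted strings to factors by default. From R 4.0.0 onwards the default keeps them as character vectors. To get the same result on any version, you can set the stringsAsFactors argument explicitly, as in this sketch (the data frame names are made up):

```r
# Keep strings as character vectors
planets_text <- data.frame(name = c("Earth", "Mars"), stringsAsFactors = FALSE)
class(planets_text$name)     # "character"
# Convert strings to factors
planets_factor <- data.frame(name = c("Earth", "Mars"), stringsAsFactors = TRUE)
class(planets_factor$name)   # "factor"
```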

4.3 Dataframe Access

Dataframes share a lot in common with matrices. Think of dataframes as extended matrices, without the necessity to contain elements of the same data type. However, there are a few quirks. Let’s try accessing some columns:

# First column by index
planets_df[[1]]
## [1] Mercury Venus   Earth   Mars    Jupiter Saturn  Uranus  Neptune
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
# First column by name
planets_df[['type']]
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## [4] Terrestrial planet Gas giant          Gas giant         
## [7] Gas giant          Gas giant         
## Levels: Gas giant Terrestrial planet
# By using the '$' operator
planets_df$diameter
## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883
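One of the quirks is the difference between single and double square brackets: [...] returns a smaller data frame, while [[...]] and $ return the column itself. A minimal sketch on a made-up data frame, toy_df:

```r
# Hypothetical data frame with two columns
toy_df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
# Single brackets keep the data frame wrapper
class(toy_df[1])       # "data.frame"
# Double brackets extract the column as a vector
class(toy_df[[1]])     # "integer"
# '$' and [[...]] by name give the same result
identical(toy_df[["score"]], toy_df$score)
```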

4.4 Tests for Dataframes

TLDR, is.data.frame(...):

# Checks if 'planets_df' is a dataframe
is.data.frame(planets_df)
## [1] TRUE

You can also coerce other objects into dataframes:

# Coerces matrix to dataframe
as.data.frame(matrix(1:10, nrow = 5, ncol = 2, dimnames = list(NULL, c("X","Y"))))
##   X  Y
## 1 1  6
## 2 2  7
## 3 3  8
## 4 4  9
## 5 5 10

5 Practice! Practice! Practice!

5.1 Access by Indices

What’s the diameter of Mercury?

# Print out diameter of Mercury (row 1, column 3)
planets_df[1,3]
## [1] 0.382

Show me all the data for Mars!

# Print out data for Mars (entire fourth row)
planets_df[4, ]
##   name               type diameter rotation rings
## 4 Mars Terrestrial planet    0.532     1.03 FALSE

5.2 Access by names

What are the diameters of all planets?

# planets_df[['diameter']] is also valid
planets_df$diameter
## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883

What type of planets are the first 3 planets?

# planets_df[1:3,"type"] is also valid
planets_df$type[1:3]
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet

5.3 Access by logical vectors

Which planets have positive rotation?

# planets_df[,'rotation'] > 0 is also valid
positive_rotation <- planets_df$rotation > 0
# Planets with positive rotation
planets_df[positive_rotation,]
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE

Names of the planets that have positive rotation and rings:

# Give yourself a cookie if you got this right!!!
planets_df[planets_df$rotation > 0 & planets_df$rings, 'name']
## [1] Jupiter Saturn  Neptune
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
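If repeating the data frame name inside the brackets feels noisy, subset(...) is one possible alternative: it lets you refer to the columns directly. The sketch below rebuilds three of the planets in a made-up data frame, toy_planets, so that it runs on its own:

```r
# Hypothetical subset of the planets data
toy_planets <- data.frame(name = c("Venus", "Earth", "Jupiter"),
                          rotation = c(-243.02, 1, 0.41),
                          rings = c(FALSE, FALSE, TRUE),
                          stringsAsFactors = FALSE)
# Rows with positive rotation and rings, keeping only the 'name' column
ringed <- subset(toy_planets, rotation > 0 & rings, select = name)
ringed
# The same filter with logical indexing, returning the names as a vector
ringed_names <- toy_planets[toy_planets$rotation > 0 & toy_planets$rings, "name"]
ringed_names   # "Jupiter"
```

Note that subset(...) returns a data frame here, while the bracket version with a single column name returns a vector.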

6 Lists

6.1 Creating Lists

Lists, as opposed to vectors, can hold components of different types. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.

# Create some vectors
heroes <- c('Ironman', 'Thor', 'Captain America', 'Wonder Woman', 'Batman')
actors <- c('Sir Robert Downey Jr', 'Chris Hemsworth', 'Chris Evans', 'Gal Gadot', 'Christian Bale')
level <- c(100, 1000, 10, 10000, 100000)
# Initialise the list
super_heroes <- list(power_level = data.frame(heroes,level),
                     serial_no = 1:5, 
                     actor_names = actors)
super_heroes
## $power_level
##            heroes level
## 1         Ironman 1e+02
## 2            Thor 1e+03
## 3 Captain America 1e+01
## 4    Wonder Woman 1e+04
## 5          Batman 1e+05
## 
## $serial_no
## [1] 1 2 3 4 5
## 
## $actor_names
## [1] "Sir Robert Downey Jr" "Chris Hemsworth"      "Chris Evans"         
## [4] "Gal Gadot"            "Christian Bale"

6.2 Access for lists

We can access the elements in a list by indexing with double square brackets [[...]]:

# Displays the second element in the list
super_heroes[[2]]
## [1] 1 2 3 4 5

The list can be accessed through the named elements as well. Let’s use names(...) to see the named elements of the list:

# Named elements of the list
names(super_heroes)
## [1] "power_level" "serial_no"   "actor_names"

There are two fundamental ways to access a list through names. Here we use the element name:

# Access using element names
super_heroes[["serial_no"]]
## [1] 1 2 3 4 5

And here we use the $ operator:

# Access with the '$' operator
super_heroes$power_level
##            heroes level
## 1         Ironman 1e+02
## 2            Thor 1e+03
## 3 Captain America 1e+01
## 4    Wonder Woman 1e+04
## 5          Batman 1e+05
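As with dataframes, single brackets behave differently from double brackets: [...] returns a shorter list, while [[...]] and $ return the element itself. The accessors can also be chained to reach into nested objects. A minimal sketch on a made-up list, toy_list:

```r
# Hypothetical list holding a data frame and an integer vector
toy_list <- list(scores = data.frame(hero = c("Thor", "Batman"),
                                     level = c(1000, 100000)),
                 serial_no = 1:2)
# Single brackets return a list of length 1
class(toy_list[2])      # "list"
# Double brackets return the element itself
class(toy_list[[2]])    # "integer"
# Chained access: the second value of the 'level' column
toy_list$scores$level[2]
# The same lookup written with double brackets
toy_list[["scores"]][["level"]][2]
```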

6.3 Test for list

The function is.list(...) tests for lists and returns a boolean value:

# Returns a boolean value
is.list(super_heroes)
## [1] TRUE

One can also coerce an object into a list using the function as.list(...):

# Create a dataframe
The_Simpsons <- data.frame(first_name = c("Bart","Lisa","Homer","Marge","Baby"), 
                           last_name = rep("Simpson", 5))
The_Simpsons
##   first_name last_name
## 1       Bart   Simpson
## 2       Lisa   Simpson
## 3      Homer   Simpson
## 4      Marge   Simpson
## 5       Baby   Simpson
# Coercing a dataframe into a list
as.list(The_Simpsons)
## $first_name
## [1] Bart  Lisa  Homer Marge Baby 
## Levels: Baby Bart Homer Lisa Marge
## 
## $last_name
## [1] Simpson Simpson Simpson Simpson Simpson
## Levels: Simpson

But more often we will use str(...) to describe objects like lists:

# Displays the entire structure of the list
str(super_heroes)
## List of 3
##  $ power_level:'data.frame': 5 obs. of  2 variables:
##   ..$ heroes: Factor w/ 5 levels "Batman","Captain America",..: 3 4 2 5 1
##   ..$ level : num [1:5] 1e+02 1e+03 1e+01 1e+04 1e+05
##  $ serial_no  : int [1:5] 1 2 3 4 5
##  $ actor_names: chr [1:5] "Sir Robert Downey Jr" "Chris Hemsworth" "Chris Evans" "Gal Gadot" ...