A vector is a sequence of values of a single type: numbers, characters or boolean values.
# Assignment
numeric_vector <- 1:10 # OR assign('numeric_vector', 1:10)
# Squared numbers
squared_numbers <- numeric_vector^2
# All squared numbers but the last 3
squared_numbers[-(8:10)]
[1] 1 4 9 16 25 36 49
The length of a vector can be determined like so:
length(squared_numbers)
[1] 10
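As a small sketch of the indexing used above, negative indices drop positions and a logical condition keeps matching elements. The vector `v` is a stand-in introduced for illustration only; it is built the same way as `squared_numbers`.

```r
# Hypothetical stand-in for squared_numbers
v <- (1:10)^2

# Negative indices drop positions; head() with a negative n drops from the end
dropped_by_index <- v[-(8:10)]
dropped_by_head  <- head(v, -3)

# Both approaches keep the first seven squares
identical(dropped_by_index, dropped_by_head)

# A logical condition can also be used as an index
big_squares <- v[v > 50]
big_squares
```

Both forms are equivalent here; `head(v, -3)` may read more clearly when the number of dropped elements matters more than their positions.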
Matrices are two-dimensional arrays containing elements of the same data type:
# List of albums
albums <- c('Infinite', 'The Slim Shady LP', 'The Marshall Mathers LP',
'The Eminem Show', 'Encore', 'Relapse', 'Recovery',
'The Marshall Mathers LP 2', 'Revival')
# Year of Release
years <- c(1996, 1999, 2000, 2002, 2004, 2009, 2010, 2013, 2017)
# Matrix of Eminem's albums
eminem_album_releases <- matrix(c(albums,years), nrow = 9, ncol = 2)
# Colnames of the matrix
colnames(eminem_album_releases) <- c("Albums","Release Year")
# Display
eminem_album_releases
     Albums                      Release Year
[1,] "Infinite"                  "1996"
[2,] "The Slim Shady LP"         "1999"
[3,] "The Marshall Mathers LP"   "2000"
[4,] "The Eminem Show"           "2002"
[5,] "Encore"                    "2004"
[6,] "Relapse"                   "2009"
[7,] "Recovery"                  "2010"
[8,] "The Marshall Mathers LP 2" "2013"
[9,] "Revival"                   "2017"
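The release years appear in quotes because a matrix holds a single data type, so combining text and numbers turns everything into character. The sketch below illustrates this with two hypothetical vectors, `albums_demo` and `years_demo`, which are introduced here only to keep the example short.

```r
# Hypothetical two-album example
albums_demo <- c("Infinite", "Encore")
years_demo  <- c(1996, 2004)

# c() coerces the numbers to character before matrix() ever sees them
m <- matrix(c(albums_demo, years_demo), nrow = 2, ncol = 2)
typeof(m)

# Converting a column back to numbers has to be done explicitly
release_year <- as.numeric(m[, 2])
release_year + 1
```

When columns of different types need to live side by side, a data frame is usually the better fit.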
Let’s get the second album of Eminem:
# Will the Real Slim Shady please stand up ?
eminem_album_releases[2, 'Albums']
             Albums
"The Slim Shady LP"
In R, rnorm(...) generates random numbers that are normally distributed, with a mean of 0 and a standard deviation of 1 by default:
# Generate 20 random numbers
x.norm <- rnorm(n = 20)
# Display 'x.norm'
x.norm
 [1] -0.75269984 0.16712765 -0.68080459 -0.21067393 0.86269931 -1.05480882 0.69760359 -0.60681224 -0.11318004
[10] 0.81313333 -0.98551445 2.27346538 -2.26660936 -0.12874622 0.61728827 -1.14924134 1.27515620 0.90347140
[19] 0.42464996 -0.02620376
Let’s sort this data, using sort(...):
# Sorting
x.norm <- sort(x.norm)
# Display
x.norm
 [1] -2.26660936 -1.14924134 -1.05480882 -0.98551445 -0.75269984 -0.68080459 -0.60681224 -0.21067393 -0.12874622
[10] -0.11318004 -0.02620376 0.16712765 0.42464996 0.61728827 0.69760359 0.81313333 0.86269931 0.90347140
[19] 1.27515620 2.27346538
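Because rnorm(...) draws new values on every call, the numbers above will differ from run to run. A minimal sketch of how a seed makes the draws repeatable follows; the seed value and the variable names are arbitrary choices made for illustration.

```r
# Fix the seed so the same "random" numbers come back on every run
set.seed(42)
first_draw <- rnorm(n = 5)

set.seed(42)
second_draw <- rnorm(n = 5)

# The two draws are identical because the seed was reset in between
identical(first_draw, second_draw)

# mean and sd can be set explicitly; the defaults are mean = 0 and sd = 1
shifted <- rnorm(n = 5, mean = 100, sd = 15)
shifted
```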
The mean or the average value of x.norm can be calculated by:
# Average
mean(x.norm)
[1] 0.002965025
The median is the middle value of the sorted observations:
# Median
median(x.norm)
[1] -0.0696919
The sample standard deviation, which measures how far the observations stray from the mean, can be calculated using the formula:
\[\sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N-1}}\]
The same can be done in R like so:
# Standard deviation formula in R
sqrt(sum((x.norm - mean(x.norm))^2)/(length(x.norm) - 1))
[1] 1.028719
Mom’s spaghetti!!! There’s always a better way to do things. We can use the sd(...) function:
# SD from the function
sd(x.norm)
[1] 1.028719
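The N - 1 in the denominator means sd(...) computes the sample standard deviation. The sketch below, on a hypothetical vector `x`, checks that sd(...) agrees with the square root of var(...) and shows how a version that divides by N relates to it.

```r
set.seed(1)
x <- rnorm(20)   # hypothetical data, not the x.norm from the text
n <- length(x)

# sd() divides by n - 1 (sample standard deviation)
sample_sd <- sd(x)

# Dividing by n instead gives the population version
population_sd <- sqrt(sum((x - mean(x))^2) / n)

# sd() is the square root of var()
all.equal(sample_sd, sqrt(var(x)))

# The two versions differ only by the factor sqrt((n - 1) / n)
all.equal(population_sd, sample_sd * sqrt((n - 1) / n))
```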
Let’s see the functions min(...) and max(...) in action:
# Returns the smallest and biggest element
print(c(min(x.norm), max(x.norm)))
[1] -2.266609 2.273465
Quantiles are cut points that split the observations into equal-sized groups. There are several quantile systems, but we generally refer to the 4-quantile system (quartiles):
# Find the quantiles
q <- quantile(x.norm)
# Display
q
        0%        25%        50%        75%       100%
-2.2666094 -0.6987784 -0.0696919  0.7264860  2.2734654
Using what we learnt from vector logic, let’s see how the quantiles split x.norm:
# First quarter: values below the 25% quantile
x.norm[x.norm < q[2]]
[1] -2.2666094 -1.1492413 -1.0548088 -0.9855144 -0.7526998
# Second quarter: values between the 25% quantile and the median
x.norm[x.norm > q[2] & x.norm < q[3]]
[1] -0.6808046 -0.6068122 -0.2106739 -0.1287462 -0.1131800
Well, in this case too there is a better way to calculate the basic statistics, using summary(...):
# Generic function to describe an object
summary(x.norm)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-2.266609 -0.698778 -0.069692  0.002965  0.726486  2.273465
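The manual splitting by quantiles can also be sketched with cut(...), which assigns every observation to a quarter in one step. The vector `x` and the labels `Q1` to `Q4` are assumptions made for this illustration.

```r
set.seed(7)
x <- rnorm(20)   # hypothetical data, not the x.norm from the text

# Quartile cut points: 0%, 25%, 50%, 75%, 100%
breaks <- quantile(x)

# Assign each value to its quarter; include.lowest keeps the minimum in Q1
quarter <- cut(x, breaks = breaks, include.lowest = TRUE,
               labels = c("Q1", "Q2", "Q3", "Q4"))

# With 20 distinct values, each quarter holds five observations
table(quarter)
```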
Let’s plot 20 random points to understand the normal distribution:
plot(rnorm(20))
Hmm, I think it will be easier to appreciate the properties of the normal distribution (0 mean and 1 standard deviation) with more points:
plot(rnorm(1000))
For people interested in seeing the density function for these points:
w <- rnorm(1000)
hist(w, col = "red", freq = FALSE, xlim = c(-5, 5))
curve(dnorm, -5, 5, add = TRUE, col = "blue")
Let’s introduce a function abline(...) which enables us to add horizontal and vertical lines to an existing plot:
plot(x.norm)
# 'h' means horizontal line
# 'col' means color
abline(h = mean(x.norm), col = 'red')
abline(h = median(x.norm), col = 'blue')
abline(h = mean(x.norm) + sd(x.norm), col = 'green')
abline(h = mean(x.norm) - sd(x.norm), col = 'green')
Let’s try to plot the quantiles from before:
plot(x.norm)
abline(h = median(x.norm), col = 'blue')
abline(h = summary(x.norm)[2], col = 'red')
abline(h = summary(x.norm)[5], col = 'red')
The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can take only a limited number of categories. A continuous variable, on the other hand, can take any of an infinite number of values. Let’s invoke the factor(...) function:
# Language Codes
lang <- c("en","fr","hi","hi","ru","ru","ru","ru","fr","hi","en","cn")
# Factor of Language codes
langf <- factor(lang)
langf
 [1] en fr hi hi ru ru ru ru fr hi en cn
Levels: cn en fr hi ru
There are two types of categorical variables: nominal and ordinal. A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that ‘one is worth more than the other’. An ordinal variable, in contrast, has an inherent ordering. We’ve just seen an example of a nominal variable, so let’s see an example of an ordinal variable:
# Speed traps measuring car speeds
speed <- c("high","med","high","med","low")
# The ordering is given by specifying the levels
speedf <- factor(speed, ordered = TRUE, levels = c("low","med","high"))
speedf
[1] high med high med low
Levels: low < med < high
Notice also that there are no quotes around the values. That’s because they’re not strings; they’re actually integer references to one of the factor’s levels. But what are levels? They are simply the unique values in the vector:
levels(speedf)
[1] "low" "med" "high"
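To make the "integer references" idea concrete, here is a small sketch that exposes the codes behind an ordered factor. The name `speed_demo` is introduced for illustration and mirrors the speedf factor above.

```r
# Same values and level order as speedf, under a hypothetical name
speed_demo <- factor(c("high", "med", "high", "med", "low"),
                     ordered = TRUE, levels = c("low", "med", "high"))

# Each element is stored as the position of its level: low = 1, med = 2, high = 3
codes <- as.integer(speed_demo)
codes

# Looking the codes up in the levels recovers the original labels
labels_again <- levels(speed_demo)[codes]
labels_again
```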
Let’s try comparing ordered factors. In our previous example, we want to test whether the second car is going slower than the third car:
# Ordered factors can be compared by level
speedf[2] < speedf[3]
[1] TRUE
Try calling summary(...) on every object from now on:
# Returns frequencies
summary(langf)
cn en fr hi ru
 1  2  2  3  4
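The same frequencies can be obtained from the raw character vector with table(...). This is a sketch on a copy of the language codes, named `lang_demo` here for illustration.

```r
# Copy of the language codes under a hypothetical name
lang_demo <- c("en", "fr", "hi", "hi", "ru", "ru", "ru", "ru", "fr", "hi", "en", "cn")

# table() counts occurrences without creating a factor first
counts <- table(lang_demo)
counts

# Frequency of a single code, and the most frequent code
ru_count <- counts[["ru"]]
most_common <- names(counts)[which.max(counts)]

# The counts match what summary() reports for the factor
all(summary(factor(lang_demo)) == as.vector(counts))
```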
At this point, it is worth noting that similar built-in functions exist for nearly every data structure, so don’t be surprised by is.factor(...):
# Performs a check for factor
is.factor(langf)
[1] TRUE
Or, something like this:
# Returns a vector
as.vector(langf)
 [1] "en" "fr" "hi" "hi" "ru" "ru" "ru" "ru" "fr" "hi" "en" "cn"
Or, this:
# Returns a one-column matrix
as.matrix(langf)
      [,1]
[1,] "en"
[2,] "fr"
[3,] "hi"
[4,] "hi"
[5,] "ru"
[6,] "ru"
[7,] "ru"
[8,] "ru"
[9,] "fr"
[10,] "hi"
[11,] "en"
[12,] "cn"
There are several built-in datasets within R. Arguably, the ‘Hello World’ example of these datasets is mtcars:
# Motor Trend Car Road Tests
mtcars
Let’s have a look at the top rows from this dataset using head(...):
# Top rows
head(mtcars)
Let’s have a look at the bottom rows from this dataset using tail(...):
# Bottom rows
tail(mtcars)
Invoking class(mtcars) will tell us it’s an object of class data.frame. But let’s go a step further and explore the structure of the data frame using str(...):
# Returns the structure
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
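A few other base functions answer the same structural questions one at a time. The sketch below uses only the built-in mtcars dataset; the variable names are chosen for illustration.

```r
# Number of rows and columns
mtcars_dim <- dim(mtcars)
mtcars_dim

# Column names
names(mtcars)

# Class of every column, one entry per column
column_classes <- sapply(mtcars, class)
column_classes
```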
Calling the data.frame(...) function enables us to create a dataframe. Notice the vector names become the column names automatically. Each data argument to the data.frame(...) function takes the form of either value or tag = value:
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
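# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0,
# where character columns are no longer converted to factors by default
planets_df <- data.frame(name, type, diameter, rotation, rings, stringsAsFactors = TRUE)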
planets_df
Let’s explore the structure once again:
# Returns the structure of the planets
str(planets_df)
'data.frame': 8 obs. of 5 variables:
$ name : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
$ type : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
$ diameter: num 0.382 0.949 1 0.532 11.209 ...
$ rotation: num 58.64 -243.02 1 1.03 0.41 ...
$ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
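The tag = value form mentioned above sets the column names explicitly instead of taking them from the vector names. Here is a minimal sketch; the data frame `moons_df` and its columns are hypothetical and not part of the planets example.

```r
# tag = value: each tag becomes the column name
moons_df <- data.frame(
  planet = c("Earth", "Mars"),
  moons  = c(1, 2),
  stringsAsFactors = FALSE   # keep text as character rather than factor
)

names(moons_df)
str(moons_df)
```

Whether text columns become factors depends on stringsAsFactors; its default changed to FALSE in R 4.0.0, so setting it explicitly keeps results consistent across versions.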
Dataframes share a lot in common with matrices. Think of dataframes as extended matrices, without the necessity to contain elements of the same data type. However, there are a few quirks. Let’s try accessing some columns:
# First column by index
planets_df[[1]]
[1] Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune
Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
# First column by name
planets_df[['type']]
[1] Terrestrial planet Terrestrial planet Terrestrial planet Terrestrial planet Gas giant
[6] Gas giant Gas giant Gas giant
Levels: Gas giant Terrestrial planet
# By using the '$' operator
planets_df$diameter
[1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
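One of the quirks worth seeing directly is that single and double brackets return different kinds of objects. The sketch below uses a small hypothetical data frame, `df_demo`, to compare them.

```r
# Hypothetical two-row data frame
df_demo <- data.frame(name = c("Earth", "Mars"), diameter = c(1, 0.532))

# Single brackets with one index keep the data frame wrapper
single_class <- class(df_demo["diameter"])

# Double brackets (and $) extract the column itself
double_class <- class(df_demo[["diameter"]])

# Matrix-style indexing drops to a vector unless drop = FALSE is given
dropped_class <- class(df_demo[, "diameter"])
kept_class    <- class(df_demo[, "diameter", drop = FALSE])

c(single_class, double_class, dropped_class, kept_class)
```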
As with factors, there is a check function, is.data.frame(...):
# Checks if 'planets_df' is a dataframe
is.data.frame(planets_df)
[1] TRUE
You can also coerce other objects into dataframes:
# Coerces matrix to dataframe
as.data.frame(matrix(1:10, nrow = 5, ncol = 2, dimnames = list(NULL, c("X","Y"))))
What’s the diameter of Mercury?
# Print out diameter of Mercury (row 1, column 3)
planets_df[1,3]
[1] 0.382
Show me all the data for Mars!
# Print out data for Mars (entire fourth row)
planets_df[4, ]
What are the diameters of all planets?
# planets_df[['diameter']] is also valid
planets_df$diameter
[1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
What type of planets are the first 3 planets?
# planets_df[1:3,"type"] is also valid
planets_df$type[1:3]
[1] Terrestrial planet Terrestrial planet Terrestrial planet
Levels: Gas giant Terrestrial planet
Which planets have positive rotation?
# planets_df[,'rotation'] > 0 is also valid
positive_rotation <- planets_df$rotation > 0
# Planets with positive rotation
planets_df[positive_rotation,]
Names of the planets that have positive rotation and rings:
# Give yourself a cookie if you got this right!!!
planets_df[planets_df$rotation > 0 & planets_df$rings, 'name']
[1] Jupiter Saturn Neptune
Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
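The same kind of row filtering can be sketched with subset(...), which lets the condition refer to columns by name. The data frame `planets_demo` below is a reduced, hypothetical copy of four rows from the planets data.

```r
# Reduced copy of the planets data, for illustration only
planets_demo <- data.frame(
  name     = c("Venus", "Jupiter", "Uranus", "Neptune"),
  rotation = c(-243.02, 0.41, -0.72, 0.67),
  rings    = c(FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Logical indexing, as in the text
by_index <- planets_demo[planets_demo$rotation > 0 & planets_demo$rings, "name"]

# subset() evaluates the condition inside the data frame
by_subset <- subset(planets_demo, rotation > 0 & rings)$name

by_subset
identical(by_index, by_subset)
```

subset(...) is convenient for interactive work; explicit logical indexing tends to be the safer choice inside functions.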
Lists, as opposed to vectors, can hold components of different types. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.
heroes <- c('Ironman', 'Thor', 'Captain America', 'Wonder Woman', 'Batman')
actors <- c('Sir Robert Downey Jr', 'Chris Hemsworth', 'Chris Evans', 'Gal Gadot', 'Christian Bale')
level <- c(100, 1000, 10, 10000, 100000)
# Initialise the list (stringsAsFactors = TRUE keeps 'heroes' a factor on R >= 4.0)
super_heroes <- list(power_level = data.frame(heroes, level, stringsAsFactors = TRUE),
serial_no = 1:5,
actor_names = actors)
super_heroes
$power_level
$serial_no
[1] 1 2 3 4 5
$actor_names
[1] "Sir Robert Downey Jr" "Chris Hemsworth" "Chris Evans" "Gal Gadot"
[5] "Christian Bale"
We can access the elements of a list by indexing with double square brackets [[...]]:
# Displays the second element in the list
super_heroes[[2]]
[1] 1 2 3 4 5
The list can be accessed through the named elements as well. Let’s use names(...) to see the named elements of the list:
# Named elements of the list
names(super_heroes)
[1] "power_level" "serial_no" "actor_names"
There are two fundamental ways to access a list element by name. Here we use the element name inside double square brackets:
# Through element names
super_heroes[["serial_no"]]
[1] 1 2 3 4 5
And here we use the $ operator:
super_heroes$power_level
The function is.list(...) tests for lists and returns a boolean value:
is.list(super_heroes)
[1] TRUE
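As with data frames, single and double brackets behave differently on lists. The sketch below uses a hypothetical list, `team`, to show the difference and to add a new named element.

```r
# Hypothetical list with two named elements
team <- list(serial_no = 1:3, actor_names = c("A", "B", "C"))

# Single brackets return a smaller list
sub_list_class <- class(team["serial_no"])

# Double brackets (or $) return the element itself
element_class <- class(team[["serial_no"]])

# Elements can be indexed further once extracted
second_serial <- team$serial_no[2]

# Assigning to a new name appends an element
team$active <- TRUE
names(team)
```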
One can also coerce an object into a list using the function as.list(...):
# Create a dataframe
The_Simpsons <- data.frame(first_name = c("Bart","Lisa","Homer","Marge","Baby"),
last_name = rep("Simpson", 5),
stringsAsFactors = TRUE) # keeps the factor output shown below on R >= 4.0
The_Simpsons
# Coercing a dataframe into a list
as.list(The_Simpsons)
$first_name
[1] Bart Lisa Homer Marge Baby
Levels: Baby Bart Homer Lisa Marge
$last_name
[1] Simpson Simpson Simpson Simpson Simpson
Levels: Simpson
But more often we will use str(...) to describe objects like lists:
# Displays the entire structure of the list
str(super_heroes)
List of 3
$ power_level:'data.frame': 5 obs. of 2 variables:
..$ heroes: Factor w/ 5 levels "Batman","Captain America",..: 3 4 2 5 1
..$ level : num [1:5] 1e+02 1e+03 1e+01 1e+04 1e+05
$ serial_no : int [1:5] 1 2 3 4 5
$ actor_names: chr [1:5] "Sir Robert Downey Jr" "Chris Hemsworth" "Chris Evans" "Gal Gadot" ...
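When only one aspect of the structure is needed, sapply(...) can apply a function to every element of a list. This sketch uses a hypothetical two-hero list, `demo_list`, shaped like super_heroes, and also shows nested access into the data frame element.

```r
# Hypothetical list shaped like super_heroes, with two heroes only
demo_list <- list(
  power_level = data.frame(hero = c("Thor", "Batman"), level = c(1000, 100000)),
  serial_no   = 1:2,
  actor_names = c("Chris Hemsworth", "Christian Bale")
)

# Class of each element, returned as a named character vector
element_classes <- sapply(demo_list, class)
element_classes

# Nested access: list element, then column, then position
batman_level <- demo_list$power_level$level[2]
batman_level
```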