A vector is an ordered series of values of a single type: numbers, characters or boolean values.
# Assignment
numeric_vector <- 1:10 # OR assign('numeric_vector', 1:10)
# Squared numbers
squared_numbers <- numeric_vector^2
# All squared numbers but the last 3
squared_numbers[-(8:10)]
## [1] 1 4 9 16 25 36 49
The length of a vector can be determined like so:
length(squared_numbers)
## [1] 10
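Since a vector holds only one type, it may help to see what happens when types are mixed. The sketch below is a small illustration; `mixed` and `all_but_last_three` are names made up for this example.

```r
# Mixing types inside c() coerces every element to the most general type
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"

# Arithmetic on a vector is applied element by element
numeric_vector <- 1:10
squared_numbers <- numeric_vector^2

# head() with a negative n drops that many elements from the end,
# which is another way of writing squared_numbers[-(8:10)]
all_but_last_three <- head(squared_numbers, -3)
all_but_last_three
```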
Matrices are two-dimensional arrays containing elements of the same data type:
# List of albums
albums <- c('Infinite', 'The Slim Shady LP', 'The Marshall Mathers LP',
'The Eminem Show', 'Encore', 'Relapse', 'Recovery',
'The Marshall Mathers LP 2', 'Revival')
# Year of Release
years <- c(1996, 1999, 2000, 2002, 2004, 2009, 2010, 2013, 2017)
# Matrix of Eminem albums
eminem_album_releases <- matrix(c(albums, years), nrow = 9, ncol = 2)
# Colnames of the matrix
colnames(eminem_album_releases) <- c("Albums","Release Year")
# Display
eminem_album_releases
## Albums Release Year
## [1,] "Infinite" "1996"
## [2,] "The Slim Shady LP" "1999"
## [3,] "The Marshall Mathers LP" "2000"
## [4,] "The Eminem Show" "2002"
## [5,] "Encore" "2004"
## [6,] "Relapse" "2009"
## [7,] "Recovery" "2010"
## [8,] "The Marshall Mathers LP 2" "2013"
## [9,] "Revival" "2017"
Let’s get Eminem’s second album:
# Will the Real Slim Shady please stand up?
eminem_album_releases[2, 'Albums']
## Albums
## "The Slim Shady LP"
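Notice the quotes around the release years in the matrix output: because a matrix holds a single data type, the numeric years were coerced to strings. Here is a minimal sketch of that behaviour on a two-row version of the data; `m` and `release_year` are illustrative names.

```r
albums <- c("Infinite", "The Slim Shady LP")
years  <- c(1996, 1999)

# c() coerces the numbers to character before matrix() ever sees them
m <- matrix(c(albums, years), nrow = 2, ncol = 2)
typeof(m)  # "character"

# The numbers can be recovered by converting the column back
release_year <- as.numeric(m[, 2])
release_year
```

If the columns need to keep different types, a data frame is usually the better fit.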
In R, you can generate a sequence of random numbers that are normally distributed with a mean of 0 and standard deviation of 1 by default:
# Generate 20 random numbers
x.norm <- rnorm(n = 20)
# Display 'x.norm'
x.norm
## [1] -0.48996676 -0.04776829 0.46557437 -2.32318211 0.31044294
## [6] -0.61539061 0.56106372 0.96918444 0.72457129 -0.88147999
## [11] -0.29552758 -1.07698244 -1.66163702 0.27533420 -0.71154931
## [16] 1.12751870 0.29776926 1.11767235 -0.16258690 0.79116556
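Your numbers will differ from the ones above on every run, since they are random. If you want reproducible draws, one common approach is to fix the seed first. In this sketch the seed value 42 is arbitrary, and `first_draw`, `second_draw` and `shifted` are names chosen for illustration.

```r
# Fixing the seed makes the random draws repeatable
set.seed(42)  # 42 is an arbitrary choice
first_draw <- rnorm(n = 20)

set.seed(42)
second_draw <- rnorm(n = 20)

identical(first_draw, second_draw)  # TRUE

# The defaults can be overridden with the 'mean' and 'sd' arguments
shifted <- rnorm(n = 20, mean = 10, sd = 2)
```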
Let’s sort this data, using sort(...):
# Sorting
x.norm <- sort(x.norm)
# Display
x.norm
## [1] -2.32318211 -1.66163702 -1.07698244 -0.88147999 -0.71154931
## [6] -0.61539061 -0.48996676 -0.29552758 -0.16258690 -0.04776829
## [11] 0.27533420 0.29776926 0.31044294 0.46557437 0.56106372
## [16] 0.72457129 0.79116556 0.96918444 1.11767235 1.12751870
The mean or the average value of x.norm can be calculated by:
# Average
mean(x.norm)
## [1] -0.08128871
The median refers to the middle value of all the observations:
# Median
median(x.norm)
## [1] 0.113783
The standard deviation, which measures how far the observations stray from the mean, can also be calculated using the formula:
\[\sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N-1}}\]
The same can be done in R like so:
# Standard deviation formula in R
sqrt(sum((x.norm - mean(x.norm))^2)/(length(x.norm) - 1))
## [1] 0.9349146
Mom’s spaghetti!!! There’s always a better way to do things: we can use the sd(...) function:
# SD from the function
sd(x.norm)
## [1] 0.9349146
Let’s see the functions min(...) and max(...) in action:
# Returns the smallest and biggest element
print(c(min(x.norm), max(x.norm)))
## [1] -2.323182 1.127519
Quantiles are cut points that split the observations into equal-sized parts. There are several quantile systems, but we generally refer to the 4-quantile system (quartiles):
# Find the quantiles
q <- quantile(x.norm)
# Display
q
## 0% 25% 50% 75% 100%
## -2.3231821 -0.6394303 0.1137830 0.6019406 1.1275187
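Other quantile systems can be requested through the `probs` argument of quantile(...). The sketch below uses a small hand-made vector `x` (not the tutorial's x.norm) so the values are easy to check by eye; `quartiles` and `deciles` are illustrative names.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# With no 'probs' argument, quantile() returns the quartiles
quartiles <- quantile(x)
quartiles

# Any other cut points can be asked for explicitly
deciles <- quantile(x, probs = c(0.1, 0.5, 0.9))
deciles
```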
Using what we learnt from vector logic, let’s see how the quartiles split x.norm:
# First quartile
x.norm[x.norm < q[2]]
## [1] -2.3231821 -1.6616370 -1.0769824 -0.8814800 -0.7115493
# Second quartile
x.norm[x.norm > q[2] & x.norm < q[3]]
## [1] -0.61539061 -0.48996676 -0.29552758 -0.16258690 -0.04776829
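Rather than writing one comparison per quartile, all observations can be labelled in one go with cut(...). This is just one possible approach, shown on a small hand-made vector; the labels "Q1" to "Q4" and the name `groups` are made up for this sketch.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
q <- quantile(x)

# cut() assigns each value to the interval it falls in;
# include.lowest = TRUE keeps the minimum inside the first interval
groups <- cut(x, breaks = q, include.lowest = TRUE,
              labels = c("Q1", "Q2", "Q3", "Q4"))

# Number of observations in each quartile
table(groups)

# The values belonging to the second quartile
x[groups == "Q2"]
```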
Well, in this case as well there is a better way to calculate the basic statistics, using summary(...):
# Generic function to describe an object
summary(x.norm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.32318 -0.63943 0.11378 -0.08129 0.60194 1.12752
Let’s plot 20 random points to understand the normal distribution:
plot(rnorm(20))
Hmm, I think it will be easier to appreciate the properties of the normal distribution (0 mean and 1 standard deviation) with more points:
plot(rnorm(1000))
For people interested in seeing the density function for these points:
w <- rnorm(1000)
hist(w, col = "red", freq = FALSE, xlim = c(-5, 5))
curve(dnorm, -5, 5, add = TRUE, col = "blue")
Let’s introduce a function abline(...) which enables us to add horizontal and vertical lines to an existing plot:
plot(x.norm)
# 'h' means horizontal line
# 'col' means color
abline(h = mean(x.norm), col = 'red')
abline(h = median(x.norm), col = 'blue')
abline(h = mean(x.norm) + sd(x.norm), col = 'green')
abline(h = mean(x.norm) - sd(x.norm), col = 'green')
Let’s try to plot the quantiles from before:
plot(x.norm)
abline(h = median(x.norm), col = 'blue')
abline(h = summary(x.norm)[2], col = 'red')
abline(h = summary(x.norm)[5], col = 'red')
The term factor refers to a statistical data type used to store categorical variables. A categorical variable can only take one of a limited number of categories, whereas a continuous variable can take any of an infinite number of values. Let’s invoke the factor(...) function:
# Language Codes
lang <- c("en","fr","hi","hi","ru","ru","ru","ru","fr","hi","en","cn")
# Factor of Language codes
langf <- factor(lang)
langf
## [1] en fr hi hi ru ru ru ru fr hi en cn
## Levels: cn en fr hi ru
There are two types of categorical variables: nominal and ordinal. A nominal variable is a categorical variable without an implied order, which means that it is impossible to say that ‘one is worth more than the other’. An ordinal variable, on the other hand, has an inherent ordering. We’ve just seen an example of a nominal variable; let’s see an example of an ordinal variable:
# Speed traps measuring car speeds
speed <- c("high","med","high","med","low")
# The ordering is given by specifying the levels
speedf <- factor(speed, ordered = TRUE, levels = c("low","med","high"))
speedf
## [1] high med high med low
## Levels: low < med < high
Notice also that there are no quotes around the values. That’s because they’re not strings; they’re actually integer references to one of the factor’s levels. But what is a level? The levels are simply the unique values in the vector:
levels(speedf)
## [1] "low" "med" "high"
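To make the "integer references" idea concrete, the sketch below pulls out the integer codes behind the factor and maps them back to their labels. The name `codes` is made up for this example.

```r
speed  <- c("high", "med", "high", "med", "low")
speedf <- factor(speed, ordered = TRUE, levels = c("low", "med", "high"))

# Each element is stored as the position of its level
codes <- as.integer(speedf)
codes  # 3 2 3 2 1

# Indexing the levels with those codes gives the labels back
levels(speedf)[codes]
```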
Let’s try comparing ordered factors. In our previous example, we want to test whether the second car is going slower than the third car:
# Ordered factors are compared by the order of their levels
speedf[2] < speedf[3]
## [1] TRUE
Try calling summary(...) on every object from now on:
# Returns frequencies
summary(langf)
## cn en fr hi ru
## 1 2 2 3 4
By now you’ll have noticed that similar built-in functions exist for nearly every data structure, so don’t be surprised by is.factor(...):
# Performs a check for factor
is.factor(langf)
## [1] TRUE
Or, something like this:
# Returns a vector
as.vector(langf)
## [1] "en" "fr" "hi" "hi" "ru" "ru" "ru" "ru" "fr" "hi" "en" "cn"
Or, this:
# Returns a column
as.matrix(langf)
## [,1]
## [1,] "en"
## [2,] "fr"
## [3,] "hi"
## [4,] "hi"
## [5,] "ru"
## [6,] "ru"
## [7,] "ru"
## [8,] "ru"
## [9,] "fr"
## [10,] "hi"
## [11,] "en"
## [12,] "cn"
There are several built-in datasets within R. Arguably, the ‘Hello World’ example of these datasets is mtcars:
# Motor Trend Car Road Tests
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Let’s have a look at the top rows from this dataset using head(...):
# Top rows
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Let’s have a look at the bottom rows from this dataset using tail(...):
# Bottom rows
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
Invoking class(mtcars) will tell us it’s an object of class data.frame. But let’s go a step further and explore the structure of the data frame using str(...):
# Returns the structure
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Calling the data.frame(...) function enables us to create a dataframe. Notice the vector names become the column names automatically. Each data argument to the data.frame(...) function takes the form of either value or tag = value:
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
planets_df
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
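Let’s explore the structure once again. The output shown here comes from a version of R older than 4.0.0, where data.frame(...) converted strings to factors by default; from R 4.0.0 onwards, name and type show up as chr unless stringsAsFactors = TRUE is passed:
# Returns the structure of the planets
str(planets_df)
## 'data.frame': 8 obs. of 5 variables: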
## $ name : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
## $ type : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
## $ diameter: num 0.382 0.949 1 0.532 11.209 ...
## $ rotation: num 58.64 -243.02 1 1.03 0.41 ...
## $ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
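Whether string columns end up as factors is controlled by the `stringsAsFactors` argument of data.frame(...). Setting it explicitly, as in this small sketch, gives the same result regardless of the R version in use; `as_factors` and `as_strings` are illustrative names.

```r
name     <- c("Mercury", "Venus")
diameter <- c(0.382, 0.949)

# Strings converted to factors
as_factors <- data.frame(name, diameter, stringsAsFactors = TRUE)
class(as_factors$name)  # "factor"

# Strings kept as plain character vectors
as_strings <- data.frame(name, diameter, stringsAsFactors = FALSE)
class(as_strings$name)  # "character"
```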
Dataframes have a lot in common with matrices. Think of dataframes as extended matrices, without the requirement that all elements share the same data type. However, there are a few quirks. Let’s try accessing some columns:
# First column by index
planets_df[[1]]
## [1] Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
# Second column by name
planets_df[['type']]
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## [4] Terrestrial planet Gas giant Gas giant
## [7] Gas giant Gas giant
## Levels: Gas giant Terrestrial planet
# By using the '$' operator
planets_df$diameter
## [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
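One of the quirks worth knowing is the difference between single and double square brackets on a dataframe. The sketch below uses a tiny made-up dataframe `df` to show it; `single`, `double` and `dollar` are illustrative names.

```r
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

# Single brackets with a column name return a one-column dataframe
single <- df["x"]
class(single)  # "data.frame"

# Double brackets and '$' both return the column itself as a vector
double <- df[["x"]]
dollar <- df$x
identical(double, dollar)  # TRUE
```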
As before, there’s a check for this, is.data.frame(...):
# Checks if 'planets_df' is a dataframe
is.data.frame(planets_df)
## [1] TRUE
You can also coerce other objects into dataframes:
# Coerces matrix to dataframe
as.data.frame(matrix(1:10, nrow = 5, ncol = 2, dimnames = list(NULL, c("X","Y"))))
## X Y
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
What’s the diameter of Mercury?
# Print out diameter of Mercury (row 1, column 3)
planets_df[1, 3]
## [1] 0.382
Show me all the data for Mars!
# Print out data for Mars (entire fourth row)
planets_df[4, ]
## name type diameter rotation rings
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
What are the diameters of all planets?
# planets_df[['diameter']] is also valid
planets_df$diameter
## [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
What type of planets are the first 3 planets?
# planets_df[1:3,"type"] is also valid
planets_df$type[1:3]
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet
Which planets have positive rotation?
# planets_df[,'rotation'] > 0 is also valid
positive_rotation <- planets_df$rotation > 0
# Planets with positive rotation
planets_df[positive_rotation, ]
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Names of the planets that have positive rotation and rings:
# Give yourself a cookie if you got this right!!!
planets_df[planets_df$rotation > 0 & planets_df$rings, 'name']
## [1] Jupiter Saturn Neptune
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
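The same kind of filter can also be written with base R's subset(...), which lets you refer to columns by name without repeating the dataframe. This is a sketch on a three-row stand-in called `planets` (not the full planets_df); `by_index` and `by_subset` are illustrative names.

```r
planets <- data.frame(name     = c("Earth", "Jupiter", "Uranus"),
                      rotation = c(1, 0.41, -0.72),
                      rings    = c(FALSE, TRUE, TRUE),
                      stringsAsFactors = FALSE)

# Logical indexing, as used above
by_index <- planets[planets$rotation > 0 & planets$rings, "name"]

# subset() evaluates the condition inside the dataframe
by_subset <- subset(planets, rotation > 0 & rings, select = name)$name

by_index
by_subset
```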
Lists, as opposed to vectors, can hold components of different types. These components can be matrices, vectors, data frames, even other lists. They don’t even need to be related to each other in any way.
# Create some vectors
heroes <- c('Ironman', 'Thor', 'Captain America', 'Wonder Woman', 'Batman')
actors <- c('Sir Robert Downey Jr', 'Chris Hemsworth', 'Chris Evans', 'Gal Gadot', 'Christian Bale')
level <- c(100, 1000, 10, 10000, 100000)
# Initialise the list
super_heroes <- list(power_level = data.frame(heroes,level),
serial_no = 1:5,
actor_names = actors)
super_heroes
## $power_level
## heroes level
## 1 Ironman 1e+02
## 2 Thor 1e+03
## 3 Captain America 1e+01
## 4 Wonder Woman 1e+04
## 5 Batman 1e+05
##
## $serial_no
## [1] 1 2 3 4 5
##
## $actor_names
## [1] "Sir Robert Downey Jr" "Chris Hemsworth" "Chris Evans"
## [4] "Gal Gadot" "Christian Bale"
We can access the elements of a list by indexing with double square brackets [[...]]:
# Displays the second element in the list
super_heroes[[2]]
## [1] 1 2 3 4 5
The list can be accessed through the named elements as well. Let’s use names(...) to see the named elements of the list:
# Named elements of the list
names(super_heroes)
## [1] "power_level" "serial_no" "actor_names"
There are two fundamental ways to access a list element by name. Here we use the element name inside double square brackets:
# Access using element names
super_heroes[["serial_no"]]
## [1] 1 2 3 4 5
And here we use the $ operator:
# Access with the '$' operator
super_heroes$power_level
## heroes level
## 1 Ironman 1e+02
## 2 Thor 1e+03
## 3 Captain America 1e+01
## 4 Wonder Woman 1e+04
## 5 Batman 1e+05
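Because list elements can themselves be dataframes, the access operators can be chained. The sketch below rebuilds a two-hero version of the list to show this; `thor_level`, `second_hero` and `still_a_list` are names made up for the example.

```r
heroes <- c("Ironman", "Thor")
level  <- c(100, 1000)
super_heroes <- list(power_level = data.frame(heroes, level,
                                              stringsAsFactors = FALSE),
                     serial_no   = 1:2)

# '$' twice: first the list element, then the dataframe column
thor_level <- super_heroes$power_level$level[2]

# Double brackets followed by ordinary dataframe indexing
second_hero <- super_heroes[["power_level"]][2, "heroes"]

# Single brackets return a shorter list, not the element itself
still_a_list <- super_heroes["serial_no"]
class(still_a_list)  # "list"
```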
The function is.list(...) tests for lists and returns a boolean value:
# Returns a boolean value
is.list(super_heroes)
## [1] TRUE
One can also coerce an object into a list using the function as.list(...):
# Create a dataframe
The_Simpsons <- data.frame(first_name = c("Bart","Lisa","Homer","Marge","Baby"),
last_name = rep("Simpson", 5))
The_Simpsons
## first_name last_name
## 1 Bart Simpson
## 2 Lisa Simpson
## 3 Homer Simpson
## 4 Marge Simpson
## 5 Baby Simpson
# Coercing a dataframe into a list
as.list(The_Simpsons)
## $first_name
## [1] Bart Lisa Homer Marge Baby
## Levels: Baby Bart Homer Lisa Marge
##
## $last_name
## [1] Simpson Simpson Simpson Simpson Simpson
## Levels: Simpson
But more often we will use str(...) to describe objects like lists:
# Displays the entire structure of the list
str(super_heroes)
## List of 3
## $ power_level:'data.frame': 5 obs. of 2 variables:
## ..$ heroes: Factor w/ 5 levels "Batman","Captain America",..: 3 4 2 5 1
## ..$ level : num [1:5] 1e+02 1e+03 1e+01 1e+04 1e+05
## $ serial_no : int [1:5] 1 2 3 4 5
## $ actor_names: chr [1:5] "Sir Robert Downey Jr" "Chris Hemsworth" "Chris Evans" "Gal Gadot" ...