We briefly discussed scalars, variable assignment, and helper functions. We will now start exploring the building blocks of R. Vectors, matrices, and data frames (the basic blocks), together with lists and structures (slightly more complex), form the core of most data structures in R. Before we start, let’s do a quick recap:
# Assigning a variable "x" the value 3
x <- 3
# Assigning a variable "sword" the value katana
sword <- "katana"
# Displays the help page (documentation) for the function
help(sum)
# Displays the class (data type) of the variable
class(sword)
## [1] "character"
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. In R, you create a vector with the combine function c(...), placing the elements between the parentheses, separated by commas. For example:
# "x" is a numerical vector of the first five odd numbers
x <- c(1,3,5,7,9)
# Show x
x
## [1] 1 3 5 7 9
Here’s an example of a character vector:
# "sword" is a character vector of swords
sword <- c("katana", "brisingr", "broadsword", "zar'roc")
# Print sword
sword
## [1] "katana"     "brisingr"   "broadsword" "zar'roc"
Vectors cannot hold values with different modes (types). Try mixing modes and see what happens; R coerces every element to a common type, here character:
# What happens to the vector in this particular case ?
c(1.57,TRUE,"three")
## [1] "1.57"  "TRUE"  "three"
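To make the coercion rule a little more concrete, here is a minimal sketch that checks the class of a few mixed vectors. When c(...) combines atomic values, it picks the most flexible type present (logical, then integer, then numeric, then character). The variable names are only illustrative.

```r
# Mixing logical and integer values gives an integer vector
mixed_int <- c(TRUE, 2L)
class(mixed_int)

# Mixing integer and double values gives a numeric vector
mixed_num <- c(1L, 2.5)
class(mixed_num)

# A single string turns every element into character
mixed_chr <- c(1.57, TRUE, "three")
class(mixed_chr)
```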
It would be tedious to type out every number from 1 to 100 if you wanted to create a vector of length 100. Luckily, there are generators for this sort of thing. The colon operator (:) is one example:
# ":" operator
1:10
## [1]  1  2  3  4  5  6  7  8  9 10
Another versatile way is to use the function seq(...):
seq(1,10)
## [1]  1  2  3  4  5  6  7  8  9 10
Generally, every function takes in arguments and performs some operations using those arguments. The seq(...) function, for example, is most often called with three arguments:
# seq(from = ,to = , by = )
# from, to - the starting and (maximal) end values of the sequence.
# by - number: increment of the sequence.
seq(from = 3, to = -3, by = -0.5)
## [1]  3.0  2.5  2.0  1.5  1.0  0.5  0.0 -0.5 -1.0 -1.5 -2.0 -2.5 -3.0
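Besides from, to and by, a few related generators come in handy. The sketch below shows the length.out argument of seq(...) along with the base functions seq_len(...) and seq_along(...); the vector blades is a made-up example.

```r
# Ask for a fixed number of evenly spaced points instead of a step size
quarters <- seq(from = 0, to = 1, length.out = 5)
quarters

# seq_len(n) generates 1, 2, ..., n
first_five <- seq_len(5)
first_five

# seq_along(x) generates one index per element of x
blades <- c("katana", "brisingr", "broadsword")
blade_index <- seq_along(blades)
blade_index
```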
Vector elements can be accessed by using a numerical index within square brackets:
# Initialise a sentence
sentence <- c("the","black","dog")
# Obtain the 3rd element of the vector
sentence[3]
## [1] "dog"
Indices in R start from 1:
# First element in vector, index starts with 1
sentence[1]
## [1] "the"
Replacing can be done very easily:
# Replacing "black" with "brown"
sentence[2] <- "brown"
# Display sentence
sentence
## [1] "the"   "brown" "dog"
Adding values to a vector:
# Adding "barks" to the vector
sentence[4] <- "barks"
# Display sentence
sentence
## [1] "the"   "brown" "dog"   "barks"
Using a vector within the square brackets enables you to access multiple elements:
# Fetches the 1st and 3rd elements
sentence[c(1,3)]
## [1] "the" "dog"
Adding or replacing multiple elements in a vector can be done as shown below:
# Adds c("at","the","man") at the 5th,6th and 7th positions
sentence[5:7] <- c("at","the","man")
# Display sentence
sentence
## [1] "the"   "brown" "dog"   "barks" "at"    "the"   "man"
Selecting everything but a particular index can be done with a negative index:
# Select everything but the first element
sentence[-1]
## [1] "brown" "dog"   "barks" "at"    "the"   "man"
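A few edge cases of indexing are worth trying out for yourself. The sketch below uses a small made-up vector called words to show what happens with several negative indices and with indices past the end of the vector.

```r
# A small illustrative vector
words <- c("the", "brown", "dog")

# Drop several elements at once with a negative index vector
without_first_two <- words[-c(1, 2)]
without_first_two

# Asking for an index past the end returns NA rather than an error
words[5]

# Assigning past the end pads the gap with NA
words[5] <- "barks"
words
length(words)
```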
In some cases, it makes sense to give each element in a vector a name. If you’re familiar with Excel, you will know that tables have named columns. When dealing with tables, it helps to picture a vector as a single row of the table, with the vector’s names as the column names.
# Initialise a vector
ranks <- 1:3
# Names of the elements of the vector
names(ranks) <- c("first","runner-up","second runner-up")
# Display ranks
ranks
##            first        runner-up second runner-up
##                1                2                3
It becomes easier to access named elements, as you do not have to remember the indices of said elements:
ranks['runner-up']
## runner-up
##         2
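Names can also be attached at creation time and used to pick out several elements at once. The following is a small sketch; the vector medals and its labels are invented for illustration.

```r
# Names can be supplied directly when the vector is created
medals <- c(gold = 1, silver = 2, bronze = 3)

# Select several elements by name
podium_ends <- medals[c("gold", "bronze")]
podium_ends

# names(...) returns the labels, unname(...) strips them
names(medals)
unname(medals)
```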
Most arithmetic operations apply to every element in the vector:
# Initialise x
x <- c(1,3,5,7,9)
# Subtract 1 from x
x - 1
## [1] 0 2 4 6 8
Multiplying by 4 gives us:
x * 4
## [1]  4 12 20 28 36
Adding two vectors:
# Two vectors
a <- 1:3
b <- c(10,100,1000)
# Adding them
a + b
## [1]   11  102 1003
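In vector arithmetic, the operation (+, -, *, /) is applied element by element: each element of a is paired with the element of b at the same position. The two vectors should therefore normally have the same length; if they differ, R recycles the shorter vector, and warns when the longer length is not a multiple of the shorter one. To check the length of a vector, we use the function length(...):
# Number of elements in "a"
length(a)
## [1] 3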
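It is worth seeing what R actually does with vectors of different lengths. The sketch below contrasts element-by-element arithmetic on equal-length vectors with recycling of a shorter vector; the names long and short are illustrative.

```r
# Equal lengths: elements are paired up position by position
a <- 1:3
b <- c(10, 100, 1000)
product <- a * b
product

# A length-1 vector (a scalar) is recycled across every element
a * 2

# A shorter vector is recycled when the longer length is a multiple of it
long  <- 1:6
short <- c(10, 20)
recycled <- long + short
recycled

# A cheap guard before doing arithmetic on two vectors
length(a) == length(b)
```

Recycling is convenient for scalars, but it can hide bugs when two vectors were meant to line up, so checking the lengths first is a reasonable habit.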
As in the example above, vectors can also be passed to functions:
# What is square root of "a" ?
sqrt(a)
## [1] 1.000000 1.414214 1.732051
Vectors can be compared to a scalar:
# Initialise "a"
a <- c(12,24,36,48)
# Check which elements of "a" are less than 30
a < 30
## [1]  TRUE  TRUE FALSE FALSE
Vectors can be compared to other vectors:
# Are the corresponding elements equal ?
a == c(12,18,30,48)
## [1]  TRUE FALSE FALSE  TRUE
We have already seen how vector elements can be accessed with numeric and character indices, but elements can also be selected with a logical vector. Each of the comparisons above returns a logical vector, which marks the positions where the expression holds. Let’s pass such a logical vector back to the vector itself:
# 'a < 30' is a logical vector; indexing with it returns the matching elements
a[a < 30]
## [1] 12 24
Let’s see another example:
# 'a %% 24 == 0' checks whether the
# remainder is 0 when 'a' divided by 24.
# Elements of 'a' that are divisible by 24
a[a %% 24 == 0]
## [1] 24 48
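Logical indexing combines well with a few other base functions. Here is a short sketch using the same values of a; which(...) and the element-wise operator & are standard R, while the result names are illustrative.

```r
a <- c(12, 24, 36, 48)

# which() turns a logical vector into the positions that are TRUE
small_positions <- which(a < 30)
small_positions

# Combine conditions element by element with & (and) or | (or)
middle <- a[a > 12 & a < 48]
middle

# Count how many elements satisfy a condition (TRUE counts as 1)
n_small <- sum(a < 30)
n_small
```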
There are two sorts of plots that are commonplace. The first kind is a scatterplot:
# Let's first initialise the points for the equation y = sin(x)
x <- seq(1,20,0.1)
y <- sin(x)
The plot(...) function takes in two arguments here - the x and y coordinates:
# Let's plot x and y
plot(x,y)
The next kind is a bar plot:
swords <- c(14,1,5,2)
names(swords) <- c("katana", "brisingr", "broadsword", "zar'roc")
barplot(swords)
We shall cover data visualisation in detail later.
I’m assuming that we have all encountered missing and sparse data from time to time. R has a value that explicitly indicates a sample was not available, NA. Many functions that work with vectors treat this value specially.
# Vector with missing value
a <- c(1,2,3,NA,5)
# Let's find the sum of "a"
sum(a)
## [1] NA
Well, remember when I said that many functions treat NA specially? The sum(...) function is one of them: it returns NA by default, but it can be told to ignore missing values:
# "na.rm" is an argument that removes missing values; the default is FALSE
sum(a, na.rm = TRUE)
## [1] 11
One important function for identifying whether there are missing values in a vector is is.na(...):
# Gives us a logical vector that shows the position of the NA
is.na(a)
## [1] FALSE FALSE FALSE  TRUE FALSE
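Since is.na(...) returns a logical vector, it can be combined with the indexing tricks from earlier. The sketch below counts and drops missing values and shows that mean(...) accepts na.rm as well; the result names are illustrative.

```r
a <- c(1, 2, 3, NA, 5)

# Count the missing values
n_missing <- sum(is.na(a))
n_missing

# Keep only the observed values
observed <- a[!is.na(a)]
observed

# mean(...) accepts na.rm in the same way as sum(...)
avg <- mean(a, na.rm = TRUE)
avg
```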
You can test whether an object is a vector by using is.vector(...):
# Returns a logical value
is.vector(c(1,2,3))
## [1] TRUE
A matrix is nothing more than a two-dimensional array. In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. There are multiple ways of creating matrices. The first and least common way is to create a vector and then change its dimensions:
vec <- 1:8
# Change dimension of vec to (2,4)
dim(vec) <- c(2,4)
# Changes to an array representation
vec
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
The most common method is to call the matrix(...) function:
# Bring up help(matrix) to understand the arguments
matrix(1:8, nrow = 2, ncol = 4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
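Notice that the values run down the columns. The following sketch contrasts this default with the byrow argument of matrix(...), which fills row by row instead; the names by_col and by_row are illustrative.

```r
# Default: values fill the matrix column by column
by_col <- matrix(1:8, nrow = 2, ncol = 4)
by_col

# byrow = TRUE fills row by row instead
by_row <- matrix(1:8, nrow = 2, ncol = 4, byrow = TRUE)
by_row

# nrow(...) and ncol(...) report the shape
nrow(by_row)
ncol(by_row)
```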
To access elements within data structures in R, it is essential to place the indices within square brackets. If I wanted to access the element in the 1st row and 3rd column in the matrix vec, I’d do:
# x[i, j] - i denotes the row and j denotes the column
vec[1, 3]
## [1] 5
To select the entire 2nd row:
vec[2, ]
## [1] 2 4 6 8
To select the entire 3rd column:
vec[, 3]
## [1] 5 6
A vector passed in the row or column place in the square brackets provides a slice of the matrix:
vec[, 1:3]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
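One subtlety: selecting a single row or column returns a plain vector, not a matrix. The sketch below shows the standard drop = FALSE argument of the [ operator, which keeps the matrix shape; m is just a fresh copy of the 2 x 4 matrix used above.

```r
m <- matrix(1:8, nrow = 2, ncol = 4)

# Selecting a single column returns a plain vector by default
col_vec <- m[, 3]
dim(col_vec)        # NULL, because the result is no longer a matrix

# drop = FALSE keeps the result as a 2 x 1 matrix
col_mat <- m[, 3, drop = FALSE]
dim(col_mat)

# Negative indices work here too: everything except column 1
rest <- m[, -1]
rest
```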
Reassigning the elements of the 4th column works exactly as it did with vectors:
vec[, 4] <- 0
vec
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    0
## [2,]    2    4    6    0
The basic rules of linear algebra apply to matrix operations. However, there are some special operators we need to look at to understand matrix multiplication. Let’s start with understanding the shape of the matrix vec using dim(...):
# Displays the dimensions of the matrix
dim(vec)
## [1] 2 4
Let’s see what the * operator does when we try to multiply vec with itself:
# Elementwise operations
vec * vec
##      [,1] [,2] [,3] [,4]
## [1,]    1    9   25    0
## [2,]    4   16   36    0
Clearly, this isn’t quite matrix multiplication. Remember that when multiplying matrices, the dimensions (the subscripts in the formula) are very important. The number of columns of the first matrix must equal the number of rows of the second: \[A_{m \times n} \times B_{n \times o} = C_{m \times o}\] So it’s worth revisiting what a transpose is: if \(A\) is a matrix with dimensions \({m \times n}\), then its transpose \(A^T\) has dimensions \({n \times m}\). In R:
# Transpose of 'vec'
t(vec)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    0    0
Finally, the new operator that we shall use is %*% for matrix multiplication:
# Multiply 'vec' with its transpose
vec %*% t(vec)
##      [,1] [,2]
## [1,]   35   44
## [2,]   44   56
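The dimension rule can be checked directly in code. Here is a minimal sketch with two made-up matrices, A (2 x 3) and B (3 x 2), showing that the product takes its row count from the first matrix and its column count from the second.

```r
# A is 2 x 3 and B is 3 x 2, so the inner dimensions (3 and 3) match
A <- matrix(1:6, nrow = 2, ncol = 3)
B <- matrix(1:6, nrow = 3, ncol = 2)

# The product takes its row count from A and its column count from B
C <- A %*% B
dim(C)
C

# Check conformability before multiplying
ncol(A) == nrow(B)

# A %*% A would fail, because ncol(A) is 3 but nrow(A) is 2
ncol(A) == nrow(A)
```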
Let’s use the US and non-US box office sales of the Star Wars movies to play around with matrices further. Let’s first define the box office sales for each movie in the original trilogy:
# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
Now, we construct a matrix:
# Construct the matrix
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
star_wars_matrix
##         [,1]  [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8
Hmm, this just looks like a bunch of numbers. To make clear what the matrix is about, it is important to add row names and column names:
# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
# Name the columns with region
colnames(star_wars_matrix) <- region
# Name the rows with titles
rownames(star_wars_matrix) <- titles
# Print out star_wars_matrix
star_wars_matrix
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
Adding names to the matrix makes indexing a great deal more readable:
# This example illustrates why it's important to name matrices
star_wars_matrix['A New Hope', 'US'] == star_wars_matrix[1, 1]
## [1] TRUE
There is a more concise way to name a matrix: pass a list of the form list(row_names, col_names) to the dimnames argument of the matrix(...) function. These are the names of the dimensions:
sales_new_movies <- c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5)
new_titles <- c("The Phantom Menace", "Attack of the Clones", "Revenge of the Sith")
# Construct matrix2
star_wars_matrix2 <- matrix(sales_new_movies, nrow = 3, byrow = TRUE, dimnames = list(new_titles, region))
# Display star_wars_matrix2
star_wars_matrix2
##                         US non-US
## The Phantom Menace   474.5  552.5
## Attack of the Clones 310.7  338.7
## Revenge of the Sith  380.3  468.5
Let’s try to combine the two matrices we just made by appending the rows together. This can be done using the rbind(...) function:
# Combining the matrices
star_wars <- rbind(star_wars_matrix,star_wars_matrix2)
star_wars
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8
## The Phantom Menace      474.500  552.5
## Attack of the Clones    310.700  338.7
## Revenge of the Sith     380.300  468.5
It’s about time to introduce another function. Let’s find the total sales for every movie and add that as a column to the complete table. The function to calculate the sum of each row is, well, rowSums(...):
# Total Sales for every movie
total_sales <- rowSums(star_wars)
total_sales
##              A New Hope The Empire Strikes Back      Return of the Jedi
##                 775.398                 538.375                 475.106
##      The Phantom Menace    Attack of the Clones     Revenge of the Sith
##                1027.000                 649.400                 848.800
Now that we have the total_sales vector, let’s add it to the star_wars table as a column using the cbind(...) function:
# Adding a column
star_wars <- cbind(star_wars, total_sales)
# Display
star_wars
##                              US non-US total_sales
## A New Hope              460.998  314.4     775.398
## The Empire Strikes Back 290.475  247.9     538.375
## Return of the Jedi      309.306  165.8     475.106
## The Phantom Menace      474.500  552.5    1027.000
## Attack of the Clones    310.700  338.7     649.400
## Revenge of the Sith     380.300  468.5     848.800
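The same idea works along the other dimension. As a rough sketch, the block below rebuilds a two-movie stand-in for the table (the object names sales, region_totals and sales_with_total are illustrative) and uses the base functions colSums(...) and rowMeans(...), then appends the regional totals as a named row.

```r
# A small stand-in for the sales table (values in millions)
sales <- matrix(c(460.998, 314.4,
                  290.475, 247.9),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("A New Hope", "The Empire Strikes Back"),
                                c("US", "non-US")))

# Total per region (one value per column)
region_totals <- colSums(sales)
region_totals

# Average per movie (one value per row)
rowMeans(sales)

# Append the regional totals as an extra row named "Total"
sales_with_total <- rbind(sales, Total = region_totals)
sales_with_total
```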
Using the same logic as we did with vectors, let’s find all the Star Wars movies that grossed more than 350 million in the US:
# Returns a logical vector where this expression holds
star_wars[, 'US'] > 350
##              A New Hope The Empire Strikes Back      Return of the Jedi
##                    TRUE                   FALSE                   FALSE
##      The Phantom Menace    Attack of the Clones     Revenge of the Sith
##                    TRUE                   FALSE                    TRUE
Aping what we did before, let’s pass the logical vector to the matrix:
# Values of all rows where US sales > 350mil
star_wars[star_wars[, 'US'] > 350,]
##                          US non-US total_sales
## A New Hope          460.998  314.4     775.398
## The Phantom Menace  474.500  552.5    1027.000
## Revenge of the Sith 380.300  468.5     848.800
Let’s explore another example and find in which region ‘A New Hope’ made more than 400 million:
# Regions where 'A New Hope' made more than 400mil
star_wars['A New Hope',] > 400
##          US      non-US total_sales
##        TRUE       FALSE        TRUE
Let’s get the sales figures for the columns that passed the test (‘US’ and ‘total_sales’):
# Values
star_wars['A New Hope', star_wars['A New Hope',] > 400]
##          US total_sales
##     460.998     775.398
R includes powerful visualisations for matrix data. The following plots are used very often in exploratory data analysis (EDA). For this example, we shall be using the built-in dataset volcano. The first plot is a contour map of the matrix:
# Contour Map
contour(volcano)
Here’s a perspective plot:
# Perspective plot; expand is an expansion factor applied to the z coordinates,
# so a value below 1 (here 0.3) shrinks the plotting box in the z direction
persp(volcano, expand = 0.3)
Finally, here’s a heat map:
image(volcano)
While we may never know whether we are living in a simulation or not, there’s a simple way to determine whether an object is a matrix within R:
# Returns TRUE if the object is a matrix
is.matrix(star_wars)
## [1] TRUE
There’s a way to convert a matrix back to a vector using the as.vector(...) function:
# Returns a vector
as.vector(star_wars)
##  [1]  460.998  290.475  309.306  474.500  310.700  380.300  314.400
##  [8]  247.900  165.800  552.500  338.700  468.500  775.398  538.375
## [15]  475.106 1027.000  649.400  848.800
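Note the order of the values in the output: the matrix is read column by column, and the row and column names are dropped. The sketch below illustrates this on a small made-up matrix m and shows how to rebuild the matrix or flatten it row by row instead.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

# as.vector(...) reads the matrix column by column
flat <- as.vector(m)
flat

# Knowing the original number of rows, the matrix can be rebuilt
rebuilt <- matrix(flat, nrow = 2)
identical(rebuilt, m)

# To flatten row by row instead, transpose first
flat_by_row <- as.vector(t(m))
flat_by_row
```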