R

Basics

Vectors

Matrices

Factors

Lists

Data frames

Graphics

Basics

 

Language for statistical computing

- an open-source implementation of S

- visual capabilities, statistical techniques, highly extensible

 

Advantages

Open source, free, R packages, command-line interface, reproducilbility c R scripts

 

Disadvantages

Easy to learn, hard to master, command-line interface hard to master, poorly written code hard to read and maintain, and poorly written code is slow

 

Variables

Used to store a variable to reuse later

<-

 

ex:

# Assign a value to the variables my_apples and my_oranges

my_apples <- 5

my_oranges <- 6

 

# Add these two variables together and print the result

my_apples + my_oranges

 

# Create the variable my_fruit

my_fruit <- my_apples + my_oranges

 

To access variables in the workspace: ls()

 

R script

Text file c R commands, allowing automation of work

 

Comments  #

Use to make your code easy to understand

 

rm(variable) # will remove variable from workspace

 

# Clear the entire workspace

rm(list = ls())

 

Arithmetic in R

In its most basic form R can be used as a simple calculator. Consider the following arithmetic operators:

 

Addition: +

Subtraction: -

Multiplication: *

Division: /

Exponentiation: ^

Modulo: %%

The last two might need some explaining:

 

The ^ operator raises the number to its left to the power of the number to its right: for example 3^2 equals 9.

The modulo returns the remainder of the division of the number to the left by the number on its right, for example 5 modulo 3 or 5 %% 3 equals 2.

 

Calculate the volume of a donut

# Create the variables r and R

r<- 2

R <- 6

 

# Calculate the volume of the donut: vol_donut

vol_donut <- 2 * (pi^2) * (r^2) * R

 

# Remove all intermediary variables that you've used with rm()

rm(R, r)

 

# List the elements in your workspace

ls()

 

Basic Data Types

class() to reveal type

 

class (TRUE)

[1] "logical"

 

class (2)

[1] "numeric"

 

class(NA)

[1] "logical" #denotes missing values

 

class (2L)

[1] "integer"

 

is.numeric(2)
[1] TURE

 

integer is numeric, but numeric is not always integer

is.*() used to see whether variables are of certain type

as.*() used to transform the type of variable to another type

 

class( "I love data science!")

[1] "character"

 

Other atomic types:
double: higher precision

complex: complex numbers

raw: store raw bytes

 

Coercion - changing one variable type to another

- converting data type of character to interger (ie make "Hello" an integer) not possible

> as.numeric(TRUE)

[1] 1

> as.numeric(FALSE)

[1] 0

> as.character(4)

[1] "4"

> as.numeric("4.5")

[1] 4.5

> as.integer("4.5")

[1] 4

> as.numeric("Hello")

[1] NA

Warning message:

NAs introduced by coercion

 

> # Create variables var1, var2 and var3

> var1 <- TRUE

> var2 <- 0.3

> var3 <- "i"

>

> # Convert var1 to a character: var1_char

> var1_char <- as.character(var1)

>

> # See whether var1_char is a character

> is.character(var1_char)

[1] TRUE

>

> # Convert var2 to a logical: var2_log

> var2_log <- as.logical(var2)

>

> # Inspect the class of var2_log

> class(var2_log)

[1] "logical"

>

> # Coerce var3 to a numeric: var3_num

>   var3_num <- as.numeric(var3)

Warning message: NAs introduced by coercion

>

Vectors

 

Creating and naming Vectors
Vectors are sequences of data elements c same basic type

- can be characters, numeric or logical

 

Creating vectors c(); naming vectors names()

> remain <- c(11, 12, 11, 13)

> remain

[1] 11 12 11 13

> suits <- c("spades", "hearts", "diamonds", "clubs")

> names(remain) <- suits

> remain

spades hearts diamonds clubs

11 12 11 13

 

> remain <- c(spades = 11, hearts = 12,

diamonds = 11, clubs = 13)

 

> remain <- c("spades" = 11, "hearts" = 12,

"diamonds" = 11, "clubs" = 13)

 

Option 1

> remain <- c(11, 12, 11, 13)

> suits <- c("spades", "hearts", "diamonds", "clubs")

> names(remain) <- suits

 

Option 2

> remain <- c(spades = 11, hearts = 12,

diamonds = 11, clubs = 13)

 

Option 3

> remain <- c("spades" = 11, "hearts" = 12,

"diamonds" = 11, "clubs" = 13)

 

> str(remain)

Named num [1:4] 11 12 11 13

- attr(*, "names")= chr [1:4] "spades" "hearts"

"diamonds" "clubs"

 

Single value = vector

> my_apples <- 5

> my_oranges <- "six"

 

> is.vector(my_apples)

[1] TRUE

> is.vector(my_oranges)

[1] TRUE

 

> length(my_apples)

[1] 1

> length(my_oranges)

[1] 1

> length(drawn_suits)

[1] 5

 

Vectors are homogeneous

Only elements of the same type; atomic vectors < > lists

- auto coercion if necessary

 

> drawn_ranks <- c(7, 4, "A", 10, "K", 3, 2, "Q")

> drawn_ranks

[1] "7" "4" "A" "10" "K" "3" "2" "Q"

> class(drawn_ranks)

[1] "character"

 

Vector Arithmetic

> my_apples <- 5

> my_oranges <- 6

> my_apples + my_oranges

[1] 11

 

Computations are performed element-wise

> earnings <- c(50, 100, 30)

> earnings * 3

[1] 150 300 90

 

#Mathematics naturally extended

> earnings/10

[1] 5 10 3

> earnings - 20

[1] 30 80 10

> earnings + 100

[1] 150 200 130

> earnings^2

[1] 2500 10000 900

 

Element-wise

> earnings <- c(50, 100, 30)

> expenses <- c(30, 40, 80)

> earnings - expenses

[1] 20 60 -50

> earnings + c(10, 20, 30)

[1] 60 120 60

> earnings * c(1, 2, 3)

[1] 50 200 90

> earnings / c(1, 2, 3)

[1] 50 50 10

 

sum() and >

> earnings <- c(50, 100, 30)

> expenses <- c(30, 40, 80)

> bank <- earnings - expenses

> bank

[1] 20 60 -50

> sum(bank)

[1] 30

> earnings > expenses

[1] TRUE TRUE FALSE

Subsetting Vectors

Subset by index

> remain <- c(spades = 11, hearts = 12,

diamonds = 11, clubs = 13)

> remain[1]       [1] -> take element at index 1

spades              result is spades a (named) vector too!

11

> remain[3]

diamonds

11

 

Subset by name

> remain <- c(spades = 11, hearts = 12,

diamonds = 11, clubs = 13)

> remain["spades"]

spades

11

> remain["diamonds"]

diamonds

11

 

Subset multiple elements

> remain <- c(spades = 11, hearts = 12,

diamonds = 11, clubs = 13)

> remain_black <- remain[c(1, 4)]

> remain_black

spades clubs

11 13

> remain[c(4, 1)]     #order in selection vector matters!

clubs spades

13       11

> remain[c("clubs", "spades")]

clubs spades

13 11

 

Subset all but some

> remain <- c(spades = 11, hearts = 12,

diamonds = 11, clubs = 13)

> remain[-c(1, 2)]

diamonds clubs

11              13

> remain[-"spades"]

Error in -"spades" : invalid argument to unary operator

hearts diamonds clubs All but index 1 are returned

12 11 13

> remain[-1]

 

Subset using logical vector

> remain <- c(spades = 11, hearts = 12,

diamonds = 11, clubs = 13)

> remain[c(FALSE, TRUE, FALSE, TRUE)]

hearts clubs

12         13

> selection_vector <- c(FALSE, TRUE, FALSE, TRUE)

> remain[selection_vector]

hearts clubs

12 13

 

Subset using logical vector

> remain <- c(spades = 11, hearts = 12,

diamonds = 11, clubs = 13)

> remain[c(TRUE, FALSE)]     #Recycles to c(T,F,T,F)

spades diamonds

11 11

> remain[c(TRUE, FALSE, TRUE, FALSE)]

spades diamonds

11 11

> remain[c(TRUE, FALSE, TRUE)]

spades diamonds clubs

11 11 13

> remain[c(TRUE, FALSE, TRUE, TRUE)]

spades diamonds clubs

11 11 13

Lab

 

#1

Creating a vector

numeric_vector <- c(1, 10, 49)

character_vector <- c("a", "b", "c")

 

# Create boolean_vector

boolean_vector <- c(TRUE, FALSE, TRUE)

 

#2

# Poker winnings from Monday to Friday

poker_vector <- c(140, -50, 20, -120, 240)

 

# Roulette winnings from Monday to Friday

roulette_vector <- c(-24, -50, 100, -350, 10)

 

# Create the variable days_vector

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

 

# Assign the names of the day to roulette_vector and poker_vector

names(poker_vector) <- days_vector

names(roulette_vector) <- days_vector

Vector Arithmetic Labs

> # A_vector and B_vector have already been defined for you

> A_vector <- c(1, 2, 3)

> B_vector <- c(4, 5, 6)

>

> # Take the sum of A_vector and B_vector: total_vector

> total_vector <- A_vector + B_vector

>

> # Print total_vector

> total_vector

[1] 5 7 9

>

> # Calculate the difference between A_vector and B_vector: diff_vector

> diff_vector <- A_vector - B_vector

>

> # Print diff_vector

> diff_vector

[1] -3 -3 -3

 

#Gambling total exercise

> # Casino winnings from Monday to Friday

> poker_vector <- c(140, -50, 20, -120, 240)

> roulette_vector <- c(-24, -50, 100, -350, 10)

> days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

> names(poker_vector) <- days_vector

> names(roulette_vector) <- days_vector

>

> # Total winnings with poker: total_poker

> total_poker <- sum(poker_vector)

>

> # Total winnings with roulette: total_roulette

> total_roulette <- sum(roulette_vector)

>

> # Total winnings overall: total_week

> total_week <- total_poker + total_roulette

>

> # Print total_week

> total_week

[1] -84

 

# Poker or roulette?

> # Casino winnings from Monday to Friday

> poker_vector <- c(140, -50, 20, -120, 240)

> roulette_vector <- c(-24, -50, 100, -350, 10)

> days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

> names(poker_vector) <- days_vector

> names(roulette_vector) <- days_vector

>

> # Calculate poker_better

> poker_better <- poker_vector > roulette_vector

>

> # Calculate total_poker and total_roulette, as before

> total_poker <- sum(poker_vector)

> total_roulette <- sum(roulette_vector)

>

> # Calculate choose_poker

> choose_poker <- total_poker > total_roulette

>

> # Print choose_poker

> choose_poker

[1] TRUE

>

Subsetting vectors lab

> # Casino winnings from Monday to Friday

> poker_vector <- c(140, -50, 20, -120, 240)

> roulette_vector <- c(-24, -50, 100, -350, 10)

> days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

> names(poker_vector) <- days_vector

> names(roulette_vector) <- days_vector

>

> # Poker results of Wednesday: poker_wednesday

> poker_wednesday <- poker_vector[3]

>

> # Roulette results of Friday: roulette_friday

> roulette_friday <- roulette_vector[5]

 

> # Casino winnings from Monday to Friday

> poker_vector <- c(140, -50, 20, -120, 240)

> roulette_vector <- c(-24, -50, 100, -350, 10)

> days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

> names(poker_vector) <- days_vector

> names(roulette_vector) <- days_vector

>

> # Mid-week poker results: poker_midweek

> poker_midweek <- poker_vector[c(2,3,4)]

>

> # End-of-week roulette results: roulette_endweek

> roulette_endweek <- roulette_vector[c(4,5)]

>

> # Roulette results for Tuesday to Friday inclusive: roulette_subset

> roulette_subset <- roulette_vector[c(2:5)]

>

> # Print roulette_subset

> roulette_subset

  Tuesday Wednesday  Thursday    Friday

      -50       100      -350        10

 

> # Select Thursday's roulette gains: roulette_thursday

> roulette_thursday <- roulette_vector["Thursday"]

>

> # Select Tuesday's poker gains: poker_tuesday

> poker_tuesday <- poker_vector["Tuesday"]

 

> # Select the first three elements from poker_vector: poker_start

> poker_start <- poker_vector[c(1:3)]

>

> # Calculate the average poker gains during the first three days: avg_poker_start

> avg_poker_start <- mean(poker_start)

 

> # Roulette results for day 1, 3 and 5: roulette_subset

> roulette_subset <- roulette_vector[c(1,3,5)]

>

> # Poker results for first three days: poker_start

> poker_start <- poker_vector[c(TRUE, TRUE, TRUE, FALSE, FALSE)]

 

# Create logical vector corresponding to profitable poker days: selection_vector

selection_vector <- poker_vector > 0

 

# Select amounts for profitable poker days: poker_profits

poker_profits <- poker_vector[c(selection_vector)]

 

> # Select amounts for profitable roulette days: roulette_profits

> roulette_profits <- roulette_vector[c(roulette_vector > 0)]

>

> # Sum of the profitable roulette days: roulette_total_profit

> roulette_total_profit <- sum(roulette_profits)

>

> # Number of profitable roulette days: num_profitable_days

> num_profitable_days <- sum(roulette_vector > 0)

Matrices

 

Creating and Naming Matrices

● Vector: 1D array of data elements

● Matrix: 2D array of data elements

● Rows and columns

● One atomic vector type

 

Create a matrix: matrix()

> matrix(1:6, nrow = 2)

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

> matrix(1:6, ncol = 3)

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

> matrix(1:6, nrow = 2, byrow = TRUE)

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

 

Create a matrix: recycling

> matrix(1:3, nrow = 2, ncol = 3)

[,1] [,2] [,3]

[1,] 1 3 2

[2,] 2 1 3

> matrix(1:4, nrow = 2, ncol = 3)

[,1] [,2] [,3]

[1,] 1 3 1

[2,] 2 4 2

Warning message:

In matrix(1:4, nrow = 2, ncol = 3) :

data length [4] is not a sub-multiple or multiple of the

number of columns [3]

 

rbind() and cbind()

> cbind(1:3, 1:3)

[,1] [,2]

[1,] 1 1

[2,] 2 2

[3,] 3 3

> rbind(1:3, 1:3)

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 1 2 3

 

> m <- matrix(1:6, byrow = TRUE, nrow = 2)

> rbind(m, 7:9)

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 7 8 9

> cbind(m, c(10, 11))

[,1] [,2] [,3] [,4]

[1,] 1 2 3 10

[2,] 4 5 6 11

 

Naming a matrix with rownames() and colnames()

> m <- matrix(1:6, byrow = TRUE, nrow = 2)

> rownames(m) <- c("row1", "row2")

[,1] [,2] [,3]

row1 1 2 3

row2 4 5 6

> m

> colnames(m) <- c("col1", "col2", "col3")

> m

col1 col2 col3

row1 1 2 3

row2 4 5 6

 

Naming a matrix

> m <- matrix(1:6, byrow = TRUE, nrow = 2,

dimnames = list(c("row1", "row2"),

c("col1", "col2", "col3")))

> m

col1 col2 col3

row1 1 2 3

row2 4 5 6

 

Coercion

> num <- matrix(1:8, ncol = 2)

> num

[,1] [,2]

[1,] 1 5

[2,] 2 6

[3,] 3 7

[4,] 4 8

> char <- matrix(LETTERS[1:6], nrow = 4, ncol = 3)

> char

[,1] [,2] [,3]

[1,] "A" "E" "C"

[2,] "B" "F" "D"

[3,] "C" "A" "E"

[4,] "D" "B" "F"

 

> num <- matrix(1:8, ncol = 2)

> char <- matrix(LETTERS[1:6], nrow = 4, ncol = 3)

> cbind(num, char)

[,1] [,2] [,3] [,4] [,5]

[1,] "1" "5" "A" "E" "C"

[2,] "2" "6" "B" "F" "D"

[3,] "3" "7" "C" "A" "E"

[4,] "4" "8" "D" "B" "F"

 

If it contains different types, list or data.frame

 

Matrix Subsetting

 

Subset element

> m <- matrix(sample(1:15, 12), nrow = 3)

> m

[,1] [,2] [,3] [,4]

[1,] 5 11 15 3

[2,] 12 14 8 9

[3,] 6 1 4 2

> m[1,3]

[1] 15

> m[3,2]

[1] 1

 

Subset column or row

> m

[,1] [,2] [,3] [,4]

[1,] 5 11 15 3

[2,] 12 14 8 9

[3,] 6 1 4 2

 

> m[3,]

[1] 6 1 4 2

> m[,3]

[1] 15 8 4

> m[4]

[1] 11

> m[9]

[1] 4

 

Subset multiple elements

> m

[,1] [,2] [,3] [,4]

[1,] 5 11 15 3

[2,] 12 14 8 9

[3,] 6 1 4 2

 

> m[2, c(2, 3)]

[1] 14 8

> m[c(1, 2), c(2, 3)]

> m[c(1, 3), c(1, 3, 4)]

[,1] [,2]

[1,] 11 15

[2,] 14 8

[,1] [,2] [,3]

[1,] 5 15 3

[2,] 6 4 2

 

Subset by name

> rownames(m) <- c("r1", "r2", "r3")

> colnames(m) <- c("a", "b", "c", "d")

> m

    a  b   c  d

r1 5 11 15 3

r2 12 14 8 9

r3 6 1 4 2

> m[2,3]

[1] 8

> m["r2","c"]

[1] 8

> m[2,"c"]

[1] 8

> m[3, c("c", "d")]

c d

4 2

 

Subset with logical vector

> m

a b c d

r1 5 11 15 3

r2 12 14 8 9

r3 6 1 4 2

 

> m[c(FALSE, FALSE, TRUE),

c(FALSE, FALSE, TRUE, TRUE)]

c d

4 2

> m[c(FALSE, FALSE, TRUE),

c(FALSE, TRUE)]

b d

1 2

> m[c(FALSE, FALSE, TRUE),

c(FALSE, TRUE, FALSE, TRUE)]

b d

1 2

 

Matrix Arithmetic

 

● colSums(), rowSums()

● Standard arithmetic possible

● Element-wise computation

 

lotr_matrix

> the_fellowship <- c(316, 556)

> two_towers <- c(343, 584)

> return_king <- c(378, 742)

> lotr_matrix <- rbind(the_fellowship, two_towers, return_king)

> colnames(lotr_matrix) <- c("US", "non-US")

> rownames(lotr_matrix) <- c("Fellowship", "Two Towers",

"Return King")

> lotr_matrix

US non-US

Fellowship 316 556

Two Towers 343 584

Return King 378 742

 

Matrix Scalar

> lotr_matrix

US non-US

Fellowship 316 556

Two Towers 343 584

Return King 378 742

 

> lotr_matrix / 1.12

US non-US

Fellowship 282.1429 496.4286

Two Towers 306.2500 521.4286

Return King 337.5000 662.5000

> lotr_matrix - 50

US non-US

Fellowship 266 506

Two Towers 293 534

Return King 328 692

 

Matrix - Matrix

> # Definition of theater_cut omitted

> theater_cut

[,1] [,2]

[1,] 50 50

[2,] 80 80

[3,] 100 100

> lotr_matrix - theater_cut

US non-US

Fellowship 266 506

Two Towers 263 504

Return King 278 642

 

Recycling

> lotr_matrix - c(50, 80, 100)

US non-US

Fellowship 266 506

Two Towers 263 504

Return King 278 642

> matrix(c(50, 80, 100), nrow = 3, ncol = 2)

[,1] [,2]

[1,] 50 50

[2,] 80 80

[3,] 100 100

 

Matrix Multiplication

> # Definition of rates omitted

> rates

[,1] [,2]

[1,] 1.11 1.11

[2,] 0.99 0.99

[3,] 0.82 0.82

> lotr_matrix * rates

US non-US

Fellowship 350.76 617.16

Two Towers 339.57 578.16

Return King 309.96 608.44

 

Matrices and Vectors

● Very similar

● Vector = 1D, matrix = 2D

● Coercion if necessary

● Recycling if necessary

● Element-wise calculations

Matrices lab

 

# Star Wars box office in millions (!)

new_hope <- c(460.998, 314.4)

empire_strikes <- c(290.475, 247.900)

return_jedi <- c(309.306, 165.8)

 

# Create star_wars_matrix, each movie per row

star_wars_matrix <- rbind(new_hope, empire_strikes, return_jedi)

 

# Name the columns and rows of star_wars_matrix

colnames(star_wars_matrix) <- c("US", "non-US")

rownames(star_wars_matrix) <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

 

# Calculate the worldwide box office:

worldwide_vector <- rowSums(star_wars_matrix)

 

# Bind the new variable worldwide_vector as a column to star_wars_matrix: star_wars_ext

star_wars_ext <- cbind(star_wars_matrix, worldwide_vector)

 

# Combine both Star Wars trilogies in one matrix: all_wars_matrix

all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)

 

# Total revenue for US and non-US:

total_revenue_vector <- colSums(all_wars_matrix)

Subsetting Labs

 

# star_wars_matrix is already defined in your workspace

 

# US box office revenue for "The Empire Strikes Back"

star_wars_matrix[2,1]

 

# non-US box office revenue for "A New Hope"

star_wars_matrix[1,2]

 

# Select all US box office revenue

star_wars_matrix[,1]

 

# Select revenue for "A New Hope"

star_wars_matrix[1,]

 

# Average non-US revenue per movie: non_us_all

non_us_all <- mean(star_wars_matrix[,2])

 

# Average non-US revenue of first two movies:

non_us_some <- mean(star_wars_matrix[c(1,2),2])

 

# All figures for "A New Hope" and "Return of the Jedi"

star_wars_matrix[c(1,3), c(1,2)]

 

# Select the US revenues for "A New Hope" and "The Empire Strikes Back"

star_wars_matrix[c("A New Hope", "The Empire Strikes Back"),1]

 

# Select the last two rows and both columns

star_wars_matrix[c(FALSE, TRUE, TRUE),]

 

# Select the non-US revenue for "The Empire Strikes Back"

star_wars_matrix[2,2]

 

# Combine view_count_1 and view_count_2 in a new matrix: view_count_all

view_count_all <- cbind(view_count_1, view_count_2)

 

# Subset view counts for three loudest debaters: view_count_loud

view_count_loud <- view_count_all[,c(3,6,7)]

 

# Use colSums() to calculate the number of views: total_views_loud

total_views_loud <- colSums(view_count_loud)

Matrix Arithmetic Lab

 

# Star Wars box office in millions (!)

new_hope <- c(460.998, 314.4)

empire_strikes <- c(290.475, 247.900)

return_jedi <- c(309.306, 165.8)

star_wars_matrix <- rbind(new_hope, empire_strikes, return_jedi)

colnames(star_wars_matrix) <- c("US", "non-US")

rownames(star_wars_matrix) <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

 

# Estimation of visitors ($5 per ticket)

visitors <- star_wars_matrix/5

 

# Print the estimate to the console

visitors

Factors

 

Categorical Variables

● Limited number of different values

● Belong to category

● In R: factor

 

Create a factor: factor()

> blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")

> blood

[1] "B" "AB" "O" "A" "O" "O" "A" "B"

factor()

> blood_factor <- factor(blood)

> blood_factor

[1] B AB O A O O A B   #sorts alphabetically

Levels: A AB B O

> str(blood_factor)

Factor w/ 4 levels "A","AB","B","O": 3 2 4 1 4 4 1 3

 

Order levels differently

> blood_factor2 <- factor(blood,

levels = c("O", "A", "B", "AB"))

> blood_factor2

[1] B AB O A O O A B

Levels: O A B AB

> str(blood_factor2)

Factor w/ 4 levels "O","A","B","AB": 3 4 1 2 1 1 2 3

> str(blood_factor)

Factor w/ 4 levels "A","AB","B","O": 3 2 4 1 4 4 1 3

 

Rename factor levels

> blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")

> blood_factor <- factor(blood)

> levels(blood_factor) <- c("BT_A", "BT_AB", "BT_B", "BT_O")

> blood_factor

[1] BT_B BT_AB BT_O BT_A BT_O BT_O BT_A BT_B

Levels: BT_A BT_AB BT_B BT_O

> factor(blood, labels = c("BT_A", "BT_AB", "BT_B", "BT_O"))

[1] BT_B BT_AB BT_O BT_A BT_O BT_O BT_A BT_B

Levels: BT_A BT_AB BT_B BT_O

 

> blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")

> blood_factor <- factor(blood)

> factor(blood,

levels = c("O", "A", "B", "AB"),

labels = c("BT_O", "BT_A", "BT_B", "BT_AB"))

[1] BT_B BT_AB BT_O BT_A BT_O BT_O BT_A BT_B

Levels: BT_O BT_A BT_B BT_AB

 

Nominal vs Ordinal

> blood <- c("B", "AB", "O", "A", "O", "O", "A", "B")

> blood_factor <- factor(blood)

> blood_factor[1] < blood_factor[2]

[1] NA

Warning message:

In Ops.factor(blood_factor[1], blood_factor[2]) :

‘<’ not meaningful for factors

> tshirt <- c("M", "L", "S", "S", "L", "M", "L", "M")

> tshirt_factor <- factor(tshirt, ordered = TRUE,

levels = c("S", "M", "L"))

> tshirt_factor

[1] M L S S L M L M

Levels: S < M < L

 

Ordered factor

> tshirt <- c("M", "L", "S", "S", "L", "M", "L", "M")

> tshirt_factor <- factor(tshirt, ordered = TRUE,

levels = c("S", "M", "L"))

> tshirt_factor

[1] M L S S L M L M

Levels: S < M < L

> tshirt_factor[1] < tshirt_factor[2]

[1] TRUE

 

Wrap up

● Factors for categorical variables

● Factors are integer vectors

● Change factor levels:

levels() function or labels argument

● Ordered factors: ordered = TRUE

 

 

Factors Lab

 

# Definition of hand_vector

hand_vector <- c("Right", "Left", "Left", "Right", "Left")

 

# Convert hand_vector to a factor: hand_factor

hand_factor <- factor(hand_vector)

 

# Display the structure of hand_factor

str(hand_factor)

 

# Encode survey_vector as a factor with the correct names: survey_factor

survey_factor <- factor(survey_vector, levels = c("L", "R"), labels = c("Left", "Right"))

 

# Print survey_factor

survey_factor

 

# Summarize survey_vector

summary(survey_vector)

 

# Summarize survey_factor

summary(survey_factor)

 

Animals and Temperature

# Definition of animal_vector and temperature_vector

animal_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")

temperature_vector <- c("High", "Low", "High", "Low", "Medium")

 

# Convert animal_vector to a factor: animal_factor

animal_factor <- factor(animal_vector)

 

# Encode temperature_vector as a factor: temperature_factor

temperature_factor <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))

 

# Print out animal_factor and temperature_factor

animal_factor

temperature_factor

 

Speed of Data Analysts

# Convert speed_vector to ordered speed_factor

speed_factor <- factor(speed_vector, ordered = TRUE, levels = c("Slow", "OK", "Fast"))

 

# Print speed_factor

speed_factor

 

# Summarize speed_factor

summary(speed_factor)

 

 

# Definition of speed_vector and speed_factor

speed_vector <- c("Fast", "Slow", "Slow", "Fast", "Ultra-fast")

factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("Slow", "Fast", "Ultra-fast"))

 

# Compare DA2 with DA5: compare_them

compare_them <- speed_vector[2] > speed_vector[5]

 

# Print compare_them: Is DA2 faster than DA5?

compare_them

 

 

Lists

 

Creating Names and Lists

● Vector: 1D, same type

● Matrix: 2D, same type

● List

● Different R objects

● No coercion

● Loss of some functionality

 

Create list: list()

> c("Rsome times", 190, 5)

[1] "Rsome times" "190" "5"

 

> list("Rsome times", 190, 5)

[[1]]

[1] "Rsome times"

 

[[2]]

[1] 190

 

[[3]]

[1] 5

 

> song <- list("Rsome times", 190, 5)

> is.list(song)

[1] TRUE

 

Name list

> song <- list("Rsome times", 190, 5)

> names(song) <- c("title", "duration", "track")

 

> song

$title

[1] "Rsome times"

 

$duration

[1] 190

 

$track

[1] 5

 

> song <- list(title = "Rsome times",

duration = 190, track = 5)

 

> str(song)

List of 3

$ title : chr "Rsome times"

$ duration: num 190

$ track : num 5

 

List in List

> similar_song <- list(title = "R you on time?",

duration = 230)

 

> song <- list(title = "Rsome times",

duration = 190, track = 5, similar = similar_song)

 

> str(song)

List of 4

$ title : chr "Rsome times"

$ duration: num 190

$ track : num 5

$ similar :List of 2

..$ title : chr "R you on time?"

..$ duration: num 230

 

Subset and Extend Lists

 

The song list

> similar_song <- list(title = "R you on time?",

duration = 230)

> song <- list(title = "Rsome times",

duration = 190, track = 5,

similar = similar_song)

 

> song

List of 4

$ title : chr "Rsome times"

$ duration: num 190

$ track : num 5

$ similar :List of 2

..$ title : chr "R you on time?"

..$ duration: num 230

 

[ versus [[

> song

List of 4

$ title : chr "Rsome times"

$ duration: num 190

$ track : num 5

$ similar :List of 2

..$ title : chr "R you on time?"

..$ duration: num 230

 

> song[1]

List of 1

   $ title: chr "Rsome times"

 

> song[[1]]

[1] "Rsome times"

 

> song[c(1, 3)]

List of 2

  $ title: chr "Rsome times"

  $ track: num 5

 

> song[[c(1, 3)]]

Error in song[[c(1, 3)]] :

subscript out of bounds

 

> song[[1]][[3]]

Error in song[[1]][[3]] :

subscript out of bounds

 

> song[[4]][[1]]

[1] "R you on time?"

 

> song[[c(4, 1)]]

[1] "R you on time?"

 

Subset by names

> song[["duration"]]

[1] 190

 

> song["duration"]

List of 1

$ duration: num 190

 

> song[c("duration", "similar")]

List of 2

$ duration: num 190

$ similar :List of 2

..$ title : chr "R you on time?"

..$ duration: num 230

 

Subset by logicals

> song[c(FALSE, TRUE, TRUE, FALSE)]

List of 2

$ duration: num 190

$ track : num 5

 

> song[[c(FALSE, TRUE, TRUE, FALSE)]]

Error : attempt to select less than one element

 

> song[[F]][[T]][[T]][[F]]

Error : attempt to select less than one element

 

$ and extending

> song$duration

[1] 190

 

> friends <- c("Kurt", "Florence", "Patti", "Dave")

> song$sent <- friends

> song

List of 5

$ title : chr "Rsome times"

$ duration: num 190

$ track : num 5

$ similar :List of 2

..$ title : chr "R you on time?"

..$ duration: num 230

$ sent : chr [1:4] "Kurt" "Florence" "Patti" "Dave"

 

Extending lists

> song[["sent"]] <- friends

 

> song$similar$reason <- "too long"

 

> song

List of 5

$ title : chr "Rsome times"

$ duration: num 190

$ track : num 5

$ similar :List of 3

..$ title : chr "R you on time?"

..$ duration: num 230

..$ reason : chr "too long"

$ sent : chr [1:4] "Kurt" "Florence" "Patti" "Dave"

 

Wrap up

● [[ or [ ?

● [[ to select list element

● [ results in sublist

● [[ and $ to subset and extend lists

List Lab

 

# Numeric vector: 1 up to 10

my_vector <- 1:10

 

# Numeric matrix: 1 up to 9

my_matrix <- matrix(1:9, ncol = 3)

 

# Factor of sizes

my_factor <- factor(c("M","S","L","L","M"), ordered = TRUE, levels = c("S","M","L"))

 

# Construct my_list with these different elements

my_list <- list(my_vector, my_matrix, my_factor)

 

# Construct my_super_list with the four data structures above

my_super_list <- list(my_vector, my_matrix, my_factor, my_list)

 

# Display structure of my_super_list

str(my_super_list)

 

# Construct my_list with these different elements

my_list <- list(vec = my_vector, mat = my_matrix, fac = my_factor)

 

# Print my_list to the console

my_list

 

The Shining List

# Create actors and reviews

actors_vector <- c("Jack Nicholson","Shelley Duvall","Danny Lloyd","Scatman Crothers","Barry Nelson")

reviews_factor <- factor(c("Good", "OK", "Good", "Perfect", "Bad", "Perfect", "Good"),

                  ordered = TRUE, levels = c("Bad", "OK", "Good", "Perfect"))

 

# Create shining_list

shining_list <- list(title = "The Shining", actors = actors_vector, reviews = reviews_factor)

 

Using list and vector stuff

# Create the list lst

lst = list(top[5], prop[,4])

 

# Create the list skills

skills <- list(topics = top, context = cont, properties = prop, list_info = lst)

 

# Display the structure of skills

str(skills)

Labs from Lists Part 2

 

# shining_list is already defined in the workspace

 

# Actors from shining_list: act

act <- shining_list[["actors"]]

 

# List containing title and reviews from shining_list: sublist

sublist <- shining_list[c(1,3)]

 

# Display structure of sublist

str(sublist)

 

# Select the last actor: last_actor

last_actor <- shining_list[[2]][5]

 

# Select the second review: second_review

second_review <- shining_list[[3]][2]

 

# Add the release year to shining_list

shining_list$year <- 1980

 

# Add the director to shining_list

shining_list$director <- "Stanley Kubrick"

 

# Inspect the structure of shining_list

str(shining_list)

Data Frames

 

Exploring the Data Frame

 

Datasets...

● Observations

● Variables

● Example: people

● each person = observation

● properties (name, age …) = variables

● Matrix?   Need different types

● List?   Not very practical

 

Data Frame!

● Specifically for datasets

● Rows = observations (persons)

● Columns = variables (age, name, …)

● Contain elements of different types

● Elements in same column: same type

 

Creating a Data Frame

● Import from data source

● CSV file

● Relational Database (e.g. SQL)

● Software packages (Excel, SPSS …)

 

Creating a Data Frame: data.frame()

> name <- c("Anne", "Pete", "Frank", "Julia", "Cath")

> age <- c(28, 30, 21, 39, 35)

> child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

> df <- data.frame(name, age, child)

> df   # column names match variable names

name age child

1 Anne 28 FALSE

2 Pete 30 TRUE

3 Frank 21 TRUE

4 Julia 39 FALSE

5 Cath 35 TRUE

 

Name Data Frame

> names(df) <- c("Name", "Age", "Child")

> df

Name Age Child

1 Anne 28 FALSE

2 Pete 30 TRUE

...

5 Cath 35 TRUE

 

> df <- data.frame(Name = name, Age = age, Child = child)

> df

Name Age Child

1 Anne 28 FALSE

2 Pete 30 TRUE

...

5 Cath 35 TRUE

 

Data Frame Structure

> str(df)

'data.frame': 5 obs. of 3 variables:

$ Name : Factor w/ 5 levels "Anne","Cath",..: 1 5 3 4 2

$ Age : num 28 30 21 39 35

$ Child: logi FALSE TRUE TRUE FALSE TRUE

 

> data.frame(name[-1], age, child)

Error : arguments imply differing number of rows: 4, 5

 

> df <- data.frame(name, age, child,

stringsAsFactors = FALSE)

> str(df)

'data.frame': 5 obs. of 3 variables:

$ name : chr "Anne" "Pete" "Frank" "Julia" ...

$ age : num 28 30 21 39 35

$ child: logi FALSE TRUE TRUE FALSE TRUE

Data Frames Lab #1

# Print the first observations of mtcars

head(mtcars)

 

# Print the last observations of mtcars

tail(mtcars)

 

# Print the dimensions of mtcars

dim(mtcars)

 

# Investigate the structure of the mtcars data set

str(mtcars)

 

Creating a data frame

# Definition of vectors

planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")

type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",

          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")

diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)

rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)

rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

 

# Create a data frame: planets_df

planets_df <- data.frame(planets, type, diameter, rotation, rings)

 

# Display the structure of planets_df

str(planets_df)

 

Stopping the Coercion of Chr to Factors

# Encode type as a factor: type_factor

type_factor <- factor(type)

 

# Construct planets_df: strings are not converted to factors!

planets_df <- data.frame(planets, type_factor, diameter, rotation, rings, stringsAsFactors=FALSE)

 

# Display the structure of planets_df

str(planets_df)

 

Renaming Data Frame vectors after creation

# Improve the names of planets_df

names(planets_df) <- c("name", "type", "diameter", "rotation", "has_rings")

planets_df

 

Making a countries data frame

# Convert continents to factor: continents_factor

continents_factor <- factor(continents)

 

# Create countries_df with the appropriate column names

countries_df <- data.frame(Countries = countries, Continent = continents_factor, GDP = gdp, HDI = hdi, President = president, stringsAsFactors = FALSE)

 

# Display the structure of countries_df

str(countries_df)

Data Frames: Extend, Subset, Sort

 

Subset Data Frame

● Subsetting syntax from matrices and lists

● [ from matrices

● [[ and $ from lists

 

people

> name <- c("Anne", "Pete", "Frank", "Julia", "Cath")

> age <- c(28, 30, 21, 39, 35)

> child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

> people <- data.frame(name, age, child,

stringsAsFactors = FALSE)

 

> people

name age child

1 Anne 28 FALSE

2 Pete 30 TRUE

3 Frank 21 TRUE

4 Julia 39 FALSE

5 Cath 35 TRUE

 

> people[3,2]

[1] 21

 

> people[3,"age"]

[1] 21

 

> people[3,]

name age child

3 Frank 21 TRUE

 

> people[,"age"]

[1] 28 30 21 39 35

 

> people[c(3, 5), c("age", "child")]

age child

3 21 TRUE

5 35 TRUE

 

> people[2]

age

1 28

2 30

3 21

4 39

5 35

 

List

> people$age

[1] 28 30 21 39 35

 

> people[["age"]]

[1] 28 30 21 39 35

 

> people[[2]]

[1] 28 30 21 39 35

 

> people["age"]

age

1 28

2 30

3 21

4 39

5 35

 

> people[2]

age

1 28

2 30

3 21

4 39

5 35

 

Extend Data Frame

● Add columns = add variables

● Add rows = add observations

 

Add Column

> height <- c(163, 177, 163, 162, 157)

> people$height <- height

> people[["height"]] <- height

 

> people

name age child height

1 Anne 28 FALSE 163

2 Pete 30 TRUE 177

3 Frank 21 TRUE 163

4 Julia 39 FALSE 162

5 Cath 35 TRUE 157

 

> weight <- c(74, 63, 68, 55, 56)

 

> cbind(people, weight)

name age child height weight

1 Anne 28 FALSE 163 74

2 Pete 30 TRUE 177 63

3 Frank 21 TRUE 163 68

4 Julia 39 FALSE 162 55

5 Cath 35 TRUE 157 56

 

Add row

> tom <- data.frame("Tom", 37, FALSE, 183)

 

> rbind(people, tom)

Error : names do not match previous names

 

> tom <- data.frame(name = "Tom", age = 37,

child = FALSE, height = 183)

 

> rbind(people, tom)

name age child height

1 Anne 28 FALSE 163

2 Pete 30 TRUE 177

3 Frank 21 TRUE 163

4 Julia 39 FALSE 162

5 Cath 35 TRUE 157

6 Tom 37 FALSE 183

 

Sorting

> sort(people$age)

[1] 21 28 30 35 39

 

> ranks <- order(people$age)

> ranks

[1] 3 1 2 5 4

 

> people$age

[1] 28 30 21 39 35

 

#21 is lowest: its index, 3, comes first in ranks

#28 is second lowest: its index, 1, comes second in ranks

#39 is highest: its index, 4, comes last in ranks

 

> sort(people$age)

[1] 21 28 30 35 39

 

> ranks <- order(people$age)

> ranks

[1] 3 1 2 5 4

 

> people[ranks, ]

name age child height

3 Frank 21 TRUE 163

1 Anne 28 FALSE 163

2 Pete 30 TRUE 177

5 Cath 35 TRUE 157

4 Julia 39 FALSE 162

 

> sort(people$age)

[1] 21 28 30 35 39

 

> ranks <- order(people$age)

> ranks

[1] 3 1 2 5 4

 

> people[order(people$age, decreasing = TRUE), ]

name age child height

4 Julia 39 FALSE 162

5 Cath 35 TRUE 157

2 Pete 30 TRUE 177

1 Anne 28 FALSE 163

3 Frank 21 TRUE 163

Labs

# first row, second column

my_df[1,2]

 

# rows 1, 2 and 3

# columns 2, 3 and 4

my_df[1:3,2:4]

 

# Entire first row

my_df[1, ]

 

Planet selection

# planets_df is pre-loaded

 

# The type of Mars: mars_type

mars_type <- planets_df[4, 2]

 

# Entire rotation column: rotation

rotation <- planets_df[, 4]

 

# First three planets: closest_planets_df

closest_planets_df <- planets_df[1:3,]

 

# Last three planets: furthest_planets_df

furthest_planets_df <- planets_df[6:8,]

 

# Diameter and rotation for Earth: earth_data

earth_data <- planets_df[3, 3:4]

 

# Diameter for the last six rows: furthest_planets_diameter

furthest_planets_diameter <- planets_df[3:8, "diameter"]

 

# Print furthest_planets_diameter

furthest_planets_diameter

 

# Create rings_vector

rings_vector <- planets_df$has_rings

 

# Print rings_vector

rings_vector

 

# Create rings_vector

rings_vector <- planets_df$has_rings

 

# Select the information on planets with rings: planets_with_rings_df

planets_with_rings_df <- planets_df[rings_vector,]

 

# Print planets_with_rings_df

planets_with_rings_df

 

# Planets that are smaller than planet Earth: small_planets_df

small_planets_df <- subset(planets_df, subset = diameter < 1)

 

# Planets that rotate slower than planet Earth: slow_planets_df

slow_planets_df <- subset(planets_df, subset = abs(rotation) > 1)

 

# Definition of moons and masses

moons <- c(0, 0, 1, 2, 67, 62, 27, 14)

masses <- c(0.06, 0.82, 1.00, 0.11, 317.8, 95.2, 14.6, 17.2)

 

# Add moons to planets_df under the name "moon"

planets_df$moon <- moons

 

# Add masses to planets_df under the name "mass"

planets_df[["mass"]] <- masses

 

Adding a new observation

# Name pluto correctly

pluto <- data.frame(name = "Pluto", type = "Terrestrial planet", diameter = 0.18, rotation = -6.38, has_rings =  FALSE)

 

# Bind planets_df and pluto together: planets_df_ext

planets_df_ext <- rbind(planets_df, pluto)

 

# Print out planets_df_ext

planets_df_ext

 

Sorting

# Create a desired ordering for planets_df: positions

positions <- order(planets_df$diameter, decreasing = TRUE)

 

# Create a new, ordered data frame: largest_first_df

largest_first_df <- planets_df[positions, ]

 

# Print largest_first_df

largest_first_df

 

Countries sorting and data frame changing

# Remove economic variables and add population

countries_df <- countries_df[c("name", "continent", "has_president")]

countries_df_dem <- cbind(countries_df, population)

 

# Add brazil

brazil = data.frame(name = "Brazil", continent = "South-America", has_president = TRUE, population = 202768562)

countries_df2 <- rbind(countries_df_dem, brazil)

countries_df2

 

# Sort by population

countries_df2[order(countries_df2$population, decreasing = TRUE), ]

Graphics

 

Basic Graphics in R

● Create plots with code

● Replication and modification easy

● Reproducibility!

● graphics package

● ggplot2, ggvis, lattice

 

Graphics package

● Many functions

● plot() and hist()

● plot()

● Generic

● Different inputs -> Different plots

● Vectors, linear models, kernel densities …

 

countries

> str(countries)

'data.frame': 194 obs. of 5 variables:

$ name : chr "Afghanistan" "Albania" "Algeria" ...

$ continent : Factor w/ 6 levels "Africa","Asia", ...

$ area : int 648 29 2388 0 0 1247 0 0 2777 2777 ...

$ population: int 16 3 20 0 0 7 0 0 28 28 ...

$ religion : Factor w/ 6 levels "Buddhist","Catholic" ...

 

plot() (categorical)

> plot(countries$continent)

# plots a bar chart bc continent is factor

 

plot() (numerical)

> plot(countries$population)

 

plot() (2x numerical)

> plot(countries$area, countries$population)

 

plot() (2x numerical)

> plot(log(countries$area), log(countries$population))

#log used on both variables

 

plot () (2x categorical)

> plot(countries$continent, countries$religion)

 

plot() (2x categorical)

# first is the x axis (horizontal) then the y-axis (vertical)

> plot(countries$religion, countries$continent)

 

hist()

● Short for histogram

● Visual representation of distribution

● Bin all values

● Plot frequency of bins

 

> africa_obs <- countries$continent == "Africa"

> africa <- countries[africa_obs, ]

 

> hist(africa$population)

 

> hist(africa$population, breaks = 10)

# breaks argument changes the bin numbers

 

Other graphics functions

● barplot()

● boxplot()

● pairs()

Graphics Lab

# movies is already pre-loaded

 

# Display the structure of movies

str(movies)

 

# Plot the genre variable of movies

plot(movies$genre)

 

# Plot the genre variable against the rating variable

plot(movies$genre, movies$rating)

 

# Plot the runtime variable of movies

plot(movies$runtime)

 

# Plot rating (x) against runtime (y)

plot(movies$rating, movies$runtime)

 

# Create a histogram for rating

hist(movies$rating)

 

# Create a histogram for rating, with 20 bins

hist(movies$rating, breaks = 20)

 

# Create a boxplot of the runtime variable

boxplot(movies$runtime)

 

# Subset the dateframe and plot it entirely

plot(movies[, 3:5])

 

# Create a pie chart of the table of counts of the genres

pie(table(movies$genre))

 

Salaries per education histogram

# Subset salaries: salaries_educ

salaries_educ <- subset(salaries, subset = degree == 3)

 

# Create a histogram of the salary column

hist(salaries_educ$salary, breaks = 10)

Customizing Plots

 

mercury

> mercury

temperature pressure

1 0 0.0002

2 20 0.0012

3 40 0.0060

4 60 0.0300

5 80 0.0900

6 100 0.2700

7 120 0.7500

8 140 1.8500

9 160 4.2000

10 180 8.8000

11 200 17.3000

...

19 360 806.0000

 

Basic Plot

> plot(mercury$temperature, mercury$pressure)

 

Fancy Plot

> plot(mercury$temperature, mercury$pressure,

xlab = "Temperature",    #horizontal axis label

ylab = "Pressure",  #vertical axis label

main = "T vs P for Mercury",   #plot title

type = "o",   #plot type

col = "orange")   #plot color

 

Graphical Parameters

> plot(mercury$temperature, mercury$pressure, col = "darkgreen")

 

> plot(mercury$temperature, mercury$pressure)

 

par()

> ?par    #opens up to par documentation

> par()

List of 72

$ xlog : logi FALSE

$ ylog : logi FALSE

$ adj : num 0.5

...

$ fin : num [1:2] 8.31 6.89

$ font : int 1

$ font.axis: int 1

$ font.lab : int 1

...

$ yaxs : chr "r"

$ yaxt : chr "s"

$ ylbias : num 0.2

 

> par(col = "blue")

> plot(mercury$temperature, mercury$pressure)

> plot(mercury$pressure, mercury$temperature)

> par()$col

[1] "blue"    #stays "blue" here, unless changed

 

More Graphical Parameters

> plot(mercury$temperature, mercury$pressure,

xlab = "Temperature",

ylab = "Pressure",

main = "T vs P for Mercury",

type = "o",

col = "orange",

col.main = "darkgray",

cex.axis = 0.6,    # font size of the labels

lty = 5,       # lty = Line TYpe (from 1-6)

pch = 4)     #pch = Plot symbol

 

Customizing plots lab

 

# movies is pre-loaded in your workspace

 

# Create a customized plot

plot(movies$votes, movies$runtime, main = "Votes versus Runtime", xlab = "Number of votes [-]", ylab = "Runtime [s]", sub = "No clear correlation")

 

# Customize the plot further

plot(movies$votes, movies$runtime,

     main = "Votes versus Runtime",

     xlab = "Number of votes [-]",

     ylab = "Runtime [s]",

     sub = "No clear correlation",

     pch = 9,

     col = "#dd2d2d",

     col.main = 604)

 

# Customize the plot further

plot(movies$votes, movies$year, main = "Are recent movies voted more on?",

xlab = "Number of votes [-]", ylab = "Year [-]",

pch = 19, col = "orange", cex.axis = .8)

 

# Build a customized histogram

hist(movies$runtime, breaks = 20, xlim = c(90,220), main = "Distribution of Runtime",

xlab = "Runtime [-]", col = "cyan", border = "red")

 

Work experience and salary

# Add the exp vector as a column experience to salaries

salaries$exp <- exp

 

# Filter salaries: only keep degree == 3: salaries_educ

salaries_educ <- subset(salaries, subset = degree == 3)

 

# Create plot with many customizations

plot(salaries_educ$exp, salaries_educ$salary, main = "Does experience matter?",

xlab = "Work experience",

ylab = "Salary", col = "blue", col.main = "red", cex.axis = 1.2)

Multiple Plots

 

Graphics so far

● Plot single source of data

● No combinations of plots

● No different layers

 

shop

> str(shop)

'data.frame': 27 obs. of 5 variables:

$ sales : num 231 156 10 519 437 487 299 195 20 ...

$ ads : num 8.2 6.9 3 12 10.6 ...

$ comp : int 11 12 15 1 5 4 10 12 15 8 ...

$ inv : int 294 232 149 600 567 571 512 347 212 ...

$ size_dist: num 8.2 4.1 4.3 16.1 14.1 ...

 

mfrow parameter in par()

> par()     #par can set graphical parameters as well

List of 72

$ xlog : logi FALSE

$ ylog : logi FALSE

$ adj : num 0.5

...

$ fin : num [1:2] 8.31 6.89

$ font : int 1

$ font.axis: int 1

$ font.lab : int 1

...

$ yaxs : chr "r"

$ yaxt : chr "s"

$ ylbias : num 0.2

 

> par(mfrow = c(2,2))

> plot(shop$ads, shop$sales)

> plot(shop$comp, shop$sales)

> plot(shop$inv, shop$sales)

> plot(shop$size_dist, shop$sales)

 

mfcol parameter

> par(mfcol = c(2,2))  # 2 rows by 2 columns

> plot(shop$ads, shop$sales)

> plot(shop$comp, shop$sales)

> plot(shop$inv, shop$sales)

> plot(shop$size_dist, shop$sales)

 

Reset the grid

> par(mfrow = c(1,1))

> plot(shop$sales, shop$ads)

 

layout()

> grid <- matrix(c(1, 1, 2, 3), nrow = 2,

ncol = 2, byrow = TRUE)

> grid  # defines the grid with one on top, 2 on bottom

[,1] [,2]

[1,] 1 1

[2,] 2 3

> layout(grid)

> plot(shop$ads, shop$sales)

> plot(shop$comp, shop$sales)

> plot(shop$inv, shop$sales)

 

Reset the grid

> layout(1)

> par(mfcol = c(1,1))

 

Reset all parameters

> old_par <- par()

> par(col = "red")

> plot(shop$ads, shop$sales)

> par(old_par)

> plot(shop$ads, shop$sales)

 

Stack graphical elements

> plot(shop$ads, shop$sales,

pch = 16, col = 2,

xlab = "advertisement",

ylab = "net sales")

> lm_sales <- lm(shop$sales ~ shop$ads)

> abline(coef(lm_sales), lwd = 2)

> lines(shop$ads, shop$sales)

 

Stack graphical elements

> ranks <- order(shop$ads)

> plot(shop$ads, shop$sales,

pch = 16, col = 2,

xlab = "advertisement",

ylab = "net sales")

> abline(coef(lm_sales), lwd = 2)

> lines(shop$ads[ranks], shop$sales[ranks])

Multiple plots lab

 

# movies is pre-loaded in your workspace

 

# List all the graphical parameters

par()

 

# Specify the mfrow parameter

par(mfrow = c(2,1))

 

# Build two plots

plot(movies$votes, movies$rating)

hist(movies$votes)

 

# Build the grid matrix, with 2 scatterplots on the left and a boxplot on the right

grid <- matrix(c(1, 3, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)

 

# Specify the layout

layout(grid)

 

# Build three plots

plot(movies$rating, movies$runtime)

plot(movies$votes, movies$runtime)

boxplot(movies$runtime)

 

# Customize the three plots

plot(movies$rating, movies$runtime, xlab = "Rating", ylab = "Runtime", pch = 4)

plot(movies$votes, movies$runtime, xlab = "Number of Votes", ylab = "Runtime", col = "blue")

boxplot(movies$runtime, border = "darkgray", main = "Boxplot of Runtime")

 

Plot a linear regression

 

# Fit a linear regression: movies_lm

movies_lm <- lm(movies$rating ~ movies$votes)

 

# Build a scatterplot: rating versus votes

plot(movies$votes, movies$rating)

 

# Add straight line to scatterplot

abline(coef(movies_lm), lwd = 2)

 

# Customize scatterplot

plot(movies$votes, movies$rating, main = "Analysis of IMDb data", xlab = "Number of Votes",

ylab = "Rating", col = "darkorange", pch = 15, cex = 0.7)

 

# Customize straight line

abline(coef(movies_lm), lwd = 2, col = "red")

 

# Add text

xco <- 7e5

yco <- 7

text(xco, yco, label = "More votes? Higher rating!")

References

 

1.