Teachers = c("Pastore", "Kiesner", "Granziol", "Toffalini",
             "Calignano", "Epifania", "Bastianelli")
Data structures, such as vectors, matrices, dataframes, and lists, are fundamental tools that allow you to organize and store complex information, so that it can be easily processed by functions (e.g., the lm() function can fit a linear model on variables stored in a dataframe).
Most operations you will perform in R (e.g., processing data, fitting models, plotting outputs) are performed on these data structures.
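Vectors are simple one-dimensional structures. Different vectors can store data of different types (e.g., numeric, character, logical), but each vector holds a single type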
Here is an actual example (of a numerical vector):
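A minimal sketch of a numeric vector, reusing the `Hours` values that appear in later examples of this section. `class()` is used here only to confirm the type:

```r
# Numeric vector (values borrowed from the Hours examples further down)
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)
Hours          # print the whole vector
class(Hours)   # "numeric"
```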
Vectors are just special cases of arrays
c()
Vectors can easily be created using the c() base function, with a sequence of elements separated by commas ","
Vectors can be of different types. The following example shows a character vector (note the quotes " " around objects):
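As a small illustration of why the quotes matter, the sketch below builds the same values with and without quotes. The names `Quoted` and `Unquoted` are hypothetical and used only for this comparison:

```r
Quoted   = c("10", "15", "20")   # quotes make each element a string
Unquoted = c(10, 15, 20)         # no quotes: plain numbers

class(Quoted)     # "character"
class(Unquoted)   # "numeric"
```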
Vectors must contain elements of the same type. If you mix types, R will automatically coerce the elements to a single type, which may lead to undesired results.
Therefore, avoid mixing data types! Example:
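A minimal sketch of such a mix, assuming a placeholder string like "tbd" was typed among the numbers:

```r
# One string among numbers is enough to change the type of the whole vector
Hours = c(10, 15, 20, 10, 15, "tbd", 15, 5)
Hours          # every element is now printed in quotes
class(Hours)   # "character"
```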
Everything was coerced to character!
If needed, use NA (Not Available):
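For instance, writing `NA` (without quotes) in place of the unknown value keeps the vector numeric. This sketch reuses the `Hours` values from the other examples:

```r
# NA marks the unknown value without changing the type of the vector
Hours = c(10, 15, 20, 10, 15, NA, 15, 5)
class(Hours)   # "numeric"
is.na(Hours)   # TRUE only at the position of the missing value
```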
You may coerce a vector to be a particular type if needed
[1] "10" "15" "20" "10" "15" "tbd" "15" "5"
Warning: NAs introduced by coercion
[1] 10 15 20 10 15 NA 15 5
But be careful! Elements that cannot be coerced to the target type will be replaced with NA
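One way to obtain output like the above is with `as.numeric()`, sketched here on a character vector containing "tbd". The name `HoursNum` is hypothetical:

```r
Hours = c("10", "15", "20", "10", "15", "tbd", "15", "5")

# "tbd" has no numeric meaning, so R emits a warning and inserts NA
HoursNum = as.numeric(Hours)
HoursNum

# Coercion in the other direction never loses information
as.character(c(1, 2, 3))
```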
Select/extract elements with INDEXING using square brackets []:
[1] 10
[1] 15 5 15
[1] 10 20 5
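Calls along the following lines would print values like those above, assuming the `Hours` vector used elsewhere in this section. The exact indices are a guess:

```r
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)

Hours[1]            # a single position: 10
Hours[5:7]          # a range of positions: 15 5 15
Hours[c(1, 3, 6)]   # arbitrary positions: 10 20 5
```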
Get the length of a vector using the length() function, and use it in indexing:
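A short sketch of how `length()` can be combined with indexing, again assuming the `Hours` vector from this section:

```r
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)

length(Hours)          # 8
Hours[length(Hours)]   # last element: 5

# Last two elements; the parentheses are needed because ":" binds tighter than "-"
Hours[(length(Hours) - 1):length(Hours)]   # 15 5
```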
Negative indexing
You can use the minus sign - to select all elements except some from a vector. (This method is also applicable to dataframes)
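A minimal sketch of negative indexing on the `Hours` vector used in this section:

```r
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)

Hours[-1]               # everything except the first element
Hours[-c(1, 2)]         # everything except the first two elements
Hours[-length(Hours)]   # everything except the last element
```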
Often, you’ll need to extract values from a vector based on specific logical conditions. Here’s an example:
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)
Hours[Hours >= 15] # extract only values greater than or equal to 15
[1] 15 20 15 15
This is called logical indexing because you are selecting elements based on a logical vector (i.e., a sequence of TRUE, FALSE):
[1] FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
Also, you can use a vector to extract values from another vector:
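As a sketch, a condition on one vector can select elements of a second vector of the same length, and a vector of positions can be stored and reused as an index. The names `Course` and `Positions` and their contents are hypothetical:

```r
Hours  = c(10, 15, 20, 10, 15, 5, 15, 5)
Course = c("A", "B", "C", "D", "E", "F", "G", "H")   # hypothetical labels, one per element of Hours

# Logical vector built from Hours, used to index Course
Course[Hours >= 15]   # "B" "C" "E" "G"

# Numeric vector of positions, used to index Hours
Positions = c(2, 3)
Hours[Positions]      # 15 20
```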
With indexing, you can not only select, but also assign or modify elements in a vector:
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)
Hours[1] = 0 # assign a new value
Hours[3] = Hours[3]+50 # modify an existing element
Hours
[1] 0 15 70 10 15 5 15 5
You can even assign values outside the current range of the vector. But what happens?
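A quick way to find out is to try it. In this sketch the position 11 is an arbitrary choice beyond the 8 existing elements:

```r
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)

Hours[11] = 30   # position 11 does not exist yet
Hours            # positions 9 and 10 are filled with NA
length(Hours)    # the vector has grown to 11 elements
```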
You can apply an operation to a whole vector at once, for example:
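A minimal sketch of element-wise arithmetic on the `Hours` vector. The specific operations are arbitrary illustrations:

```r
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)

Hours + 5   # adds 5 to every element
Hours / 2   # halves every element
```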
Of course, this is useful when you want to save the result as a new vector:
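For example, a conversion to minutes could be stored under a new name (`Minutes` is hypothetical), leaving the original vector untouched:

```r
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)

Minutes = Hours * 60   # result saved as a new vector
Minutes
Hours                  # the original vector is unchanged
```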
Similarly, you can apply functions to all elements of a vector:
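A few base R functions that work element by element, shown on the `Hours` vector as a sketch:

```r
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)

sqrt(Hours)            # square root of each element
log(Hours)             # natural logarithm of each element
round(Hours / 3, 1)    # each element divided by 3, rounded to one decimal
```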
A whole vector may serve to compute summary statistics, for example using functions such as mean(), sd(), median(), quantile(), max(), min():
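These functions can be sketched on the `Hours` vector as follows. The `probs` values mirror the quantile call that appears in the error message later in this section:

```r
Hours = c(10, 15, 20, 10, 15, 5, 15, 5)

mean(Hours)     # 11.875
median(Hours)   # 12.5
sd(Hours)       # sample standard deviation
quantile(Hours, probs = c(0.25, 0.75))   # first and third quartiles
max(Hours)      # 20
min(Hours)      # 5
```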
Missing (NA) Values
All of the previous summary statistics will fail (returning NA or raising an error) if there is even a single NA value:
Hours = c(10, 15, 20, 10, 15, NA, 15, 5)
mean(Hours) # a single NA value implies that the average is impossible to determine
[1] NA
Error in quantile.default(Hours, probs = c(0.25, 0.75)): missing values and NaN's not allowed if 'na.rm' is FALSE
You can easily manage missing values by adding the na.rm=TRUE argument:
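A minimal sketch with the same `Hours` vector containing one NA:

```r
Hours = c(10, 15, 20, 10, 15, NA, 15, 5)

mean(Hours, na.rm = TRUE)   # 12.85714, computed on the 7 valid values
quantile(Hours, probs = c(0.25, 0.75), na.rm = TRUE)
sum(is.na(Hours))           # number of missing values: 1
```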
Replacing NA With the Average Value
Replacing a missing value with the average across valid values is risky, as it may alter many other summary statistics, but it is a good example for understanding different concepts seen so far:
Hours = c(10, 15, 20, 10, 15, NA, 15, 5)
# compute the average value ignoring NAs, and put it wherever
# there is a NA value in the vector
Hours[is.na(Hours)] = mean(Hours, na.rm=TRUE)
# now let's inspect the updated content of the vector
Hours
[1] 10.00000 15.00000 20.00000 10.00000 15.00000 12.85714 15.00000 5.00000
[1] 12.85714
Another useful summary statistic is the frequency count, which shows how often each unique value appears in a vector. You can use the table() function to calculate frequencies easily:
type = c("METHODOLOGY", "METHODOLOGY", "PROGRAMMING", "SOFT SKILLS", "SOFT SKILLS",
"METHODOLOGY", "SOFT SKILLS", "METHODOLOGY", "PROGRAMMING")
table(type)
type
METHODOLOGY PROGRAMMING SOFT SKILLS
4 2 3
Be careful: R is case sensitive!
type = c("METHODOLOGY", "methodology", "PROGRAMMING", "SOFT SKILLS", "SOFT SKILLS",
"METHODOLOGY", "SOFT SKILLS", "METHODOLOGY", "Programming")
table(type)
type
methodology METHODOLOGY Programming PROGRAMMING SOFT SKILLS
1 3 1 1 3
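One possible remedy, offered here as a suggestion rather than as part of the original example, is to harmonize the case with `toupper()` before counting. The name `counts` is hypothetical:

```r
type = c("METHODOLOGY", "methodology", "PROGRAMMING", "SOFT SKILLS", "SOFT SKILLS",
         "METHODOLOGY", "SOFT SKILLS", "METHODOLOGY", "Programming")

# Convert every label to upper case, then count
counts = table(toupper(type))
counts   # METHODOLOGY 4, PROGRAMMING 2, SOFT SKILLS 3
```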