As a data scientist, you will probably spend most of your time working with dataframes and vectors (see Part 1; remember that a dataframe is essentially a collection of vectors, possibly of different types).
However, other data structures that you will encounter are:
Factors: store categorical data; factors are both a data type and a data structure
Lists: collections of objects of different types, flexible and indexable
Matrices: two-dimensional structures, essentially vectors organized into rows and columns, all elements must be of the same type
Arrays: generalization of vectors and matrices to multi-dimensional data (e.g., 3D, 4D arrays), all elements must be of the same type
Factors
Factors are a special type of data used to represent categorical variables. They may look like simple character vectors, but they work differently:
Internally, they consist of vectors of integers associated with “levels”
Levels are unique categories, labelled for readability
Ordered factors include a hierarchical relationship between levels (e.g., "low" < "medium" < "high"; or a Likert scale like "Strongly disagree" < "Disagree" < "Neutral" < "Agree" < "Strongly agree"). Using ordered factors may be especially important for certain data analysis, e.g., Structural Equation Modeling (SEM) with ordinal data (e.g., using the lavaan package)
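The points above can be sketched as follows. This is only an illustration: the vector answers and its values are made up, and the object names are arbitrary.

```r
# Hypothetical survey answers, stored as plain character strings
answers <- c("low", "high", "medium", "low", "high")

# Unordered factor: by default the levels are the sorted unique values
f <- factor(answers)
levels(f)       # "high" "low" "medium"
as.integer(f)   # 2 1 3 2 1  (the integer codes stored internally)

# Ordered factor: the level order is stated explicitly
f_ord <- factor(answers, levels = c("low", "medium", "high"), ordered = TRUE)
f_ord[1] < f_ord[2]   # TRUE, because "low" < "high"
```

The same comparison on the unordered factor f would not be meaningful: R returns NA with a warning.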
Factors
Why use factors?
In many cases, you can ignore them. However, factors:
Help ensure consistency when data is actually categorical
Many functions for statistical modeling (e.g., lm()) automatically treat character variables as factors and dummy-code their levels (by default, one dummy variable for each level except the reference level); visualization tools like ggplot2 also use factors for grouping and for labeling axes
Ensure efficient storage of information as compared to characters, thanks to their internal structure
Ordered data: see previous slide
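A small sketch of how lm() handles a character predictor. The data are simulated and the names d, group, and y are hypothetical; the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated data with a character (not factor) grouping variable
d <- data.frame(
  group = rep(c("a", "b", "c"), each = 10),
  stringsAsFactors = FALSE
)
d$y <- rnorm(30)

fit <- lm(y ~ group, data = d)

# The character column was treated as a factor with three levels...
fit$xlevels$group   # "a" "b" "c"

# ...and dummy-coded against the reference level "a"
names(coef(fit))    # "(Intercept)" "groupb" "groupc"
```

Converting group to a factor yourself lets you choose the reference level (e.g., with relevel()) instead of relying on alphabetical order.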
Lists
Lists are flexible structures that contain objects of different types and different lengths (including other lists… potentially creating an infinite Inception…)
If you name the elements of a list, you can also access them using the $ operator, just as with a dataframe
myLittleList = list(name = "Enrico", sector = "m-psi/01", hours = c(42, 40, 10, 10), school = c("psychology", "amv", "psychology", "psychology"))
myLittleList$sector
[1] "m-psi/01"
myLittleList$hours
[1] 42 40 10 10
That’s not surprising… a dataframe is actually a special kind of list, with just two key constraints: 1) all elements are vectors of the same length; 2) the vectors are named.
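You can check this relationship directly. A minimal sketch, with a made-up dataframe df and list lst:

```r
# A small, made-up dataframe
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)

is.list(df)     # TRUE: a dataframe is a list
typeof(df)      # "list"
length(df)      # 2: one list element per column
df[["label"]]   # list-style access to a column: "a" "b" "c"

# A named list of equal-length vectors converts cleanly to a dataframe
lst <- list(id = 1:3, label = c("a", "b", "c"))
df2 <- as.data.frame(lst, stringsAsFactors = FALSE)
dim(df2)        # 3 2
```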
Lists
Why use lists?
Provide very flexible storage (for example, in a complex Monte Carlo simulation you might want to store not just a single result from each iteration, but multiple objects, such as each simulated dataframe, or whole model outputs)
Common in R: many functions (e.g., lm()) return their summaries and results as lists (even dataframes themselves are special cases of lists), so get familiar with them!
Are used in many contexts for handling nested data (e.g., JSON-formatted data)
Lists
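Example with a power simulation
# simulation settings: sample size, intercept, slope, residual SD
N = 30; b0 = 0; b1 = 0.3; sigma = 1
niter = 1000
# empty list that will hold one fitted model per iteration
results = list()
for (i in 1:niter) {
  x = rnorm(N, 0, 1)
  y = b0 + b1*x + rnorm(N, 0, sigma)
  results[[i]] = lm(y ~ x)
}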
This is an example of using a list in a power simulation. Typically, you store only one or a few values (e.g., p-values), but lists allow storing all fitted objects if needed.
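Once the fitted models are stored in a list, any quantity can be extracted afterwards. The sketch below is self-contained and therefore reruns a shorter version of the simulation; the seed, the reduced number of iterations, and the 0.05 threshold are assumptions made for illustration.

```r
set.seed(123)   # arbitrary seed, for reproducibility only
N <- 30; b0 <- 0; b1 <- 0.3; sigma <- 1
niter <- 200    # fewer iterations than above, to keep the sketch quick

# Preallocate the list, then store one fitted model per iteration
results <- vector("list", niter)
for (i in seq_len(niter)) {
  x <- rnorm(N, 0, 1)
  y <- b0 + b1 * x + rnorm(N, 0, sigma)
  results[[i]] <- lm(y ~ x)
}

# Pull the p-value of the slope out of every stored model
pvals <- sapply(results, function(fit) {
  summary(fit)$coefficients["x", "Pr(>|t|)"]
})

# Estimated power: proportion of iterations with p < .05
power <- mean(pvals < 0.05)
power
```

Because the whole model objects were kept, you could extract coefficients, residuals, or R-squared later without rerunning the simulation, at the cost of more memory.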
Matrices
In R, a matrix is a 2-dimensional structure that contains only elements of the same type. Essentially, it can be thought of as a 2D vector.
You can create a matrix easily using the matrix() function:
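A minimal sketch of matrix() in use; the values and the object name m are made up for illustration. Note that R fills a matrix column by column unless byrow = TRUE is set.

```r
# 2 x 3 matrix filled column by column with the integers 1 to 6
m <- matrix(1:6, nrow = 2, ncol = 3)

dim(m)       # 2 3
m[2, 3]      # 6: element in row 2, column 3

# Basic linear algebra: transpose and matrix product
t(m)         # 3 x 2 transpose
m %*% t(m)   # 2 x 2 product: rows (35, 44) and (44, 56)
```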
Mathematical operations: matrices are fundamental for many tasks of linear algebra
Essential in modeling: many methods for statistical modeling and machine learning actually operate on matrices internally (even though this usually remains hidden from you)
Computational efficiency: typically much faster than dataframes for numeric computations
Arrays
Arrays are multi-dimensional structures in R, generalizing vectors (1-dimensional) and matrices (2-dimensional) to the n-dimensional case
It’s easy to create an array using the array() function:
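A minimal sketch of array() in use; the dimensions, values, and the object name a are made up for illustration.

```r
# 3-dimensional array: 2 rows x 3 columns x 4 "slices", filled with 1 to 24
a <- array(1:24, dim = c(2, 3, 4))

dim(a)             # 2 3 4
a[1, 2, 3]         # 15: row 1, column 2, slice 3
a[, , 1]           # the first slice is an ordinary 2 x 3 matrix

# Apply a function over one dimension, here the sum of each slice
apply(a, 3, sum)   # 21 57 93 129
```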
Useful for storing and efficiently manipulating multi-dimensional data
Generally used in advanced topics and machine learning, for example image/video processing and spatial data
Arrays in R are conceptually similar to n-dimensional arrays and tensors in Python (e.g., NumPy arrays, TensorFlow tensors), which play a fundamental role in machine learning and deep learning, as they allow researchers to manage large amounts of data with complex structures