Conditional statements like if, if...else, and ifelse() in R are essential tools for automating tasks and supporting decision making in data science. What follows are a few simple “toy examples”; focus on the underlying logic, which will be very useful in more advanced applications
if statement
The if statement performs an action only if a condition is met
[Figure: basic flowchart showing the logic of the if statement]
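A minimal sketch of the if logic, assuming a numeric variable age (the value 20 and the object name label are illustrative choices, not taken from the original slide):

```r
age = 20                    # example value
label = "not classified"    # hypothetical default, kept when the condition is FALSE

# the code inside the braces runs only if the condition is TRUE
if (age >= 18) {
  label = "Adult"
}
label
```

If age were below 18, the body would be skipped and label would keep its default value.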
if...else statement
Sometimes, however, you may need to perform alternative actions
Here is a practical example of the if...else statement
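A minimal sketch of such an example, assuming a single numeric value stored in age (20 is used here for illustration):

```r
age = 20  # example value

# exactly one of the two branches is executed
if (age >= 18) {
  print("Adult")
} else {
  print("Minor")
}
```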
In the above example:
if age is 18 or older, R will print "Adult";
otherwise, R will print "Minor"
if...else if...else statement
if...else statements can be chained with else if to test several conditions in sequence
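A chain of conditions with else if could be sketched as follows; the cut-off of 65 and the label "Senior" are hypothetical additions used only for illustration:

```r
age = 70  # hypothetical value

# conditions are checked from top to bottom; the first TRUE one wins
if (age < 18) {
  group = "Minor"
} else if (age < 65) {
  group = "Adult"
} else {
  group = "Senior"
}
group
```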
if...else statement
A possible practical use of if...else in a preplanned analysis for a hypothetical preregistered study: automating the decision to conduct additional analyses based on the result of a preliminary test. This helps create a reproducible analysis pipeline with a clear set of decisions
## PREPLANNED ANALYSIS
# preliminary test: paired t-test on two columns of df
tt1 = t.test(df$x1, df$x2, paired = TRUE)
# based on the p-value of the preliminary t-test, choose the next step
if (tt1$p.value < 0.05) {
  # if significant, perform an additional analysis with a linear model (lm)
  print("Significant result: proceeding with follow-up analysis")
  fit = lm(outcome ~ pred1 + pred2 * moder1, data = df)
  print(summary(fit))
} else {
  # else, report only the preliminary test
  print("No significant result: reporting preliminary results only")
  print(tt1)
}
All previous examples evaluated a single condition that might be TRUE or FALSE. However, you often want to apply this type of operation to an entire vector
Using if and if...else directly on a vector will NOT work as intended:
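A small sketch of what goes wrong, using a hypothetical vector of ages. The call is wrapped in tryCatch() so that the script keeps running: in R 4.2.0 and later a condition of length greater than one is an error, while older versions only warned and used the first element.

```r
ages = c(20, 15, 42)  # hypothetical vector of ages

# if() expects a single TRUE/FALSE, not one value per element
outcome = tryCatch(
  {
    if (ages >= 18) "Adult" else "Minor"
  },
  error = function(e) "error: the condition has length > 1",
  warning = function(w) "warning: only the first element was used"
)
outcome
```

Either way, the result is a single message rather than one label per element of ages.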
ifelse()
To handle such cases you can use the base ifelse() function, which evaluates each element of a vector individually:
Best practice: store the result and set its type as appropriate!
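A minimal sketch of ifelse() on a vector, assuming hypothetical ages that include one missing value; the result is stored and converted to a factor:

```r
ages = c(20, 15, 42, NA)  # hypothetical ages, including a missing value

# ifelse(test, yes, no) checks the condition element by element
age_group = ifelse(ages >= 18, "Adult", "Minor")

# store the result and set its type: here a factor with explicit levels
age_group = factor(age_group, levels = c("Minor", "Adult"))
age_group
```

Note that a missing age stays missing in the result.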
The ifelse() function can also be nested to manage multiple conditions, such as in the following example:
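A sketch of nested ifelse() calls with three categories; the ages and the cut-off of 65 are hypothetical:

```r
ages = c(10, 30, 70)  # hypothetical ages

# the "no" branch of the outer ifelse() is itself an ifelse()
age_group = ifelse(ages < 18, "Minor",
                   ifelse(ages < 65, "Adult", "Senior"))
age_group
```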
dplyr::case_when()
While nested ifelse() calls work, the structure becomes cumbersome and less readable as the number of conditions increases. In such cases, the case_when() function from the dplyr package (part of the tidyverse collection) offers a more readable alternative
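A sketch of the same kind of three-way classification with case_when(), assuming the dplyr package is installed (the check with requireNamespace() simply skips the example otherwise); the ages and cut-offs are hypothetical:

```r
ages = c(10, 30, 70, NA)  # hypothetical ages

if (requireNamespace("dplyr", quietly = TRUE)) {
  # conditions are evaluated in order; the first TRUE one sets the value
  age_group = dplyr::case_when(
    ages < 18  ~ "Minor",
    ages < 65  ~ "Adult",
    ages >= 65 ~ "Senior"
  )
  print(age_group)
}
```

Elements that match no condition, such as the missing value here, are returned as NA.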
Iterative programming allows you to repeat one or a series of actions automatically, for a predetermined number of times or until a condition is met
for loop
Here are the basics of iterative programming with the for loop:
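A minimal sketch of a for loop, using a hypothetical running total to show that the body is executed once per value of i:

```r
total = 0

# i takes the values 1, 2, 3, 4, 5 in turn
for (i in 1:5) {
  total = total + i
  print(total)
}
```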
Here’s a more interesting example of a for loop with practical usefulness: we want to repeat a data simulation a predetermined number of times (5 iterations), each time drawing \(n = 30\) values from a standard normal distribution, then computing and displaying the average …
This is actually the starting point of a Monte Carlo simulation! 😃
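The simulation described above could be sketched like this; the seed value is an arbitrary choice for reproducibility:

```r
set.seed(0)  # arbitrary seed, for reproducibility
niter = 5    # number of iterations

for (i in 1:niter) {
  x = rnorm(n = 30, mean = 0, sd = 1)  # draw 30 values from N(0, 1)
  print(mean(x))                       # display the average, without storing it
}
```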
In the previous example, the for loop displayed the results but didn’t store them. For more effective use, you can combine the for loop with indexing via [] to save each result:
set.seed(0) # set a seed for reproducibility: best practice!
niter = 5 # set the desired number of iterations: best practice!
# initialize a results vector with NAs: best practice!
results = rep(NA, niter)
# now run the for loop! :-)
for (i in 1:niter) {
  x = rnorm(n = 30, mean = 0, sd = 1)
  results[i] = mean(x)
}
results # display results
[1]  0.021950789 -0.025771530 -0.009581231  0.032123159 -0.294644080
sd(results) # standard deviation of the stored means
[1] 0.1358843
→ Enjoy it! This is a proper estimation of the Standard Error of the Mean via Monte Carlo simulation! 😊
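As a rough check, the simulated value can be compared with the theoretical standard error \(\sigma / \sqrt{n}\). The sketch below assumes a larger number of iterations (10000, an illustrative choice) so that the estimate is more stable:

```r
set.seed(0)     # arbitrary seed
niter = 10000   # more iterations give a more stable estimate
n = 30          # sample size per iteration

results = rep(NA, niter)
for (i in 1:niter) {
  results[i] = mean(rnorm(n = n, mean = 0, sd = 1))
}

sd(results)   # Monte Carlo estimate of the standard error of the mean
1 / sqrt(n)   # theoretical value: sigma / sqrt(n), with sigma = 1
```

The two values should be close, and the gap tends to shrink as niter grows.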
You don’t necessarily have to iterate over a sequence of integers (e.g., “i in 1:10000”; “j in 1:ncol(df)”), although this is the most common practice. You can iterate over other objects too, for example directly over the elements of a vector or of another data structure
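Two short sketches of iterating directly over elements rather than over integer indices; the object names and values are hypothetical:

```r
# iterate over the elements of a character vector
groups = c("control", "treatment", "placebo")
for (g in groups) {
  print(toupper(g))
}

# iterate over the elements of a list: each v is a whole vector
scores = list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))
total = 0
for (v in scores) {
  total = total + sum(v)
}
total
```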
while loop
The while loop is another type of iterative structure in R. It may be useful when the precise number of iterations is not predetermined, but depends on a target being reached
amount = 1000
month = 0
interest_rate = 0.001 # 0.1% monthly interest rate
while (amount < 1500) {
  month = month + 1
  amount = amount + amount * interest_rate
}
month
[1] 406
Interpretation: it takes 406 months to reach an amount of \(€1,500\) when starting with an amount of \(€1,000\) with a \(0.1\)% monthly interest rate
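The loop result can be cross-checked with the compound-interest formula: the amount after \(m\) months is \(1000 \times 1.001^m\), so the target is reached at the smallest integer \(m\) with \(m \ge \log(1.5) / \log(1.001)\). A short sketch (the name months_needed is illustrative):

```r
# smallest whole number of months with 1000 * (1 + 0.001)^m >= 1500
months_needed = ceiling(log(1500 / 1000) / log(1 + 0.001))
months_needed
```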
repeat loop
The repeat loop has a logic similar to the while loop, but 1) it always runs at least one iteration, and 2) it explicitly emphasizes repetition until a condition (not necessarily a target) is met, using a break statement to terminate
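As an illustration, the savings example from the while loop can be rewritten with repeat; this is a sketch of one possible use, not necessarily the example from the original slide:

```r
amount = 1000
month = 0
interest_rate = 0.001  # 0.1% monthly interest rate

repeat {
  month = month + 1
  amount = amount + amount * interest_rate
  # the exit condition is checked explicitly, after the update
  if (amount >= 1500) break
}
month
```

Here the body always runs at least once, because the condition is tested only at the end of each iteration.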
apply family
apply is a family of base functions that provide efficient tools for running iterations on structures like dataframes, vectors, matrices, lists
Traditional loops provide a straightforward, intuitive way to compute sequences of operations, but the apply family allows more compact code and often faster computations… this may become particularly important when you need to parallelize computationally intensive tasks
The following is not a computationally heavy task, but as an example, let’s say we want to compute the mean value per column in this dataframe:
   BD SI DS PCn CD VC LN MR CO SS
1  13 10  7  10 15  7 10 16  8 13
2   7 11  6   8 13 10  9  5  9 14
3  12  6  5   7  9  7  7  6  9  8
4   8  7  9  11  1  5  7  6  8  4
5  12 13  8  10 10 10  9 11 13 12
6  13 17 13   7 10 19 13 10 15 13
7  12 10  9   5 10  8  7  9 11 11
8   9 12 15  14  7 11 14 13  8 14
9  11 14  8  11  8 12 14 12 10  9
10 13 12 14  11  5 15 17 14 14  8
11  7  7  7   6  6  6  4  7 12  9
12 10 11  8   8 10  7  7  8  8 15
Here is how you could use the base apply function for computing the mean value by column:
      BD       SI       DS      PCn       CD       VC       LN       MR
9.824121 9.856423 9.706767 9.889169 9.781955 9.867168 9.987437 9.904762
       CO        SS
 9.889447 10.005051
In fact, for such a simple task, even colMeans() could be sufficient:
      BD       SI       DS      PCn       CD       VC       LN       MR
9.824121 9.856423 9.706767 9.889169 9.781955 9.867168 9.987437 9.904762
       CO        SS
 9.889447 10.005051
But let’s consider slightly more complex cases …
Let’s say you need to compute the standard deviation per column …
      BD       SI       DS      PCn       CD       VC       LN       MR
2.941790 3.137753 3.072283 2.855583 3.022541 3.167819 2.951726 2.989253
      CO       SS
2.999217 2.896523
… or to count the number of NA occurrences per column
 BD  SI  DS PCn  CD  VC  LN  MR  CO  SS
  2   3   1   3   1   1   2   1   2   4
In the latter case, we had to define a custom function, but that’s relatively simple to do!
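The calls behind outputs like those above might look as follows. This is a sketch on a small hypothetical data frame (three of the column names are reused, the values are invented), assuming missing values are dropped with na.rm = TRUE:

```r
# hypothetical small data frame with a few missing values
df = data.frame(
  BD = c(13, 7, NA, 8),
  SI = c(10, 11, 6, NA),
  DS = c(7, NA, NA, 9)
)

# MARGIN = 2 means "by column" (1 would mean "by row")
col_means  = apply(df, 2, mean, na.rm = TRUE)
col_means2 = colMeans(df, na.rm = TRUE)        # shortcut for the same task
col_sds    = apply(df, 2, sd, na.rm = TRUE)

# counting NAs needs a small custom function
na_counts = apply(df, 2, function(x) sum(is.na(x)))
na_counts
```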
Although any of such tasks could be done using a for loop, the code would be more cumbersome and less efficient. For example, here’s how the exact same result as the latter apply example could be obtained using a for loop:
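A sketch of the for loop version of the NA count, shown on a small hypothetical data frame (invented values) so that the block is self-contained:

```r
# hypothetical small data frame with a few missing values
df = data.frame(
  BD = c(13, 7, NA, 8),
  SI = c(10, 11, 6, NA),
  DS = c(7, NA, NA, 9)
)

# initialize a named results vector, one slot per column
na_counts = rep(NA, ncol(df))
names(na_counts) = colnames(df)

# loop over the column indices and fill the slots one by one
for (j in 1:ncol(df)) {
  na_counts[j] = sum(is.na(df[, j]))
}
na_counts
```

The apply() version needs one line for what takes an initialization step plus a loop here.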
Other members of the apply family:
tapply(): applies a function to subsets of a vector grouped by a factor, example tapply(df$values, df$group, FUN=mean) (note that aggregate() may be more convenient in some cases)
lapply(): applies a function to each element of a list, returning results in a list format, example lapply(my_list, length)
sapply(): the same as lapply(), but returns the results as a vector if possible, example sapply(my_list, length)
mapply(): a multivariate version of sapply() that iterates over several lists or vectors in parallel
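The four functions listed above can be tried on toy data. In this sketch, df, my_list and their contents are hypothetical:

```r
# hypothetical toy data
df = data.frame(
  values = c(1, 2, 3, 10, 20, 30),
  group  = c("a", "a", "a", "b", "b", "b")
)
my_list = list(x = 1:3, y = letters[1:5], z = c(TRUE, FALSE))

group_means  = tapply(df$values, df$group, FUN = mean)  # one mean per group
list_lengths = lapply(my_list, length)                  # result is a list
vec_lengths  = sapply(my_list, length)                  # result is a named vector

# mapply() walks along two vectors at once: 1 + 10, 2 + 20, 3 + 30
sums = mapply(function(a, b) a + b, 1:3, c(10, 20, 30))
```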
sapply
Here’s how sapply() can be used to help compute the standard error of the mean via Monte Carlo simulation:
a for loop is used to fill each slot in the list with a vector of 30 randomly generated numbers;
sapply() is used to compute the mean of each vector in the list;
sd() is then applied to the vector of the means
Saving all generated data allows for more flexibility in analysis, although it uses more memory
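The steps above can be sketched as follows, assuming 10000 iterations and a standard normal distribution as in the earlier examples; the object names are illustrative:

```r
set.seed(0)    # arbitrary seed
niter = 10000

# empty list with one slot per iteration
data_list = vector("list", niter)

# the for loop fills each slot with 30 random values
for (i in 1:niter) {
  data_list[[i]] = rnorm(n = 30, mean = 0, sd = 1)
}

# sapply() computes the mean of each stored vector
means = sapply(data_list, mean)

# the sd of the means estimates the standard error of the mean
sem = sd(means)
sem
```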
lapply and sapply
Here’s another, even more compact way of doing the same, without using any for loop, but this requires defining a small custom function:
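A minimal sketch of this more compact approach, assuming 10000 iterations and samples of size 30 as in the earlier examples:

```r
set.seed(0)  # arbitrary seed

# lapply() calls the custom function once per element of 1:10000
data_list = lapply(1:10000, function(i) rnorm(n = 30, mean = 0, sd = 1))

# sapply() reduces each simulated vector to its mean
means = sapply(data_list, mean)

sd(means)  # Monte Carlo estimate of the standard error of the mean
```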
"i" represents the current element of 1:10000 over which lapply iterates. Even though it is practically useless here, because only random numbers are generated at each iteration, it must be included because lapply passes each element as the first argument to the function by default
function()
Previously, we saw a few cases of custom functions, for example for counting NAs in vectors, or computing the mean of a randomly generated vector
Here is the schema for defining custom functions with input (argument[s]), body, and output (return):
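The schema could be sketched as follows; my_function, argument1 and argument2 are placeholder names, and the addition in the body is only a stand-in for real operations:

```r
my_function = function(argument1, argument2) {
  # body: operations performed on the arguments
  output = argument1 + argument2
  # output: the value handed back to the caller
  return(output)
}
```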
Here’s a full example:
After creating it, a custom function can be used like any other function:
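As an illustration of both steps, here is a sketch that defines a function for counting missing values and then calls it; the name count_na is hypothetical and this is not necessarily the example from the original slide:

```r
# definition: input x, body, and explicit output
count_na = function(x) {
  n_missing = sum(is.na(x))
  return(n_missing)
}

# use: called like any built-in function
n_na = count_na(c(1, NA, 3, NA))
n_na
```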
Here’s a slightly more sophisticated example:
Let’s use it:
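One way such an example might look is a function with several arguments and default values. The sketch below wraps the earlier Monte Carlo simulation; the function name simulate_sem and its arguments niter, n, mu and sigma are all hypothetical:

```r
# estimate the standard error of the mean by simulation
simulate_sem = function(niter = 1000, n = 30, mu = 0, sigma = 1) {
  means = rep(NA, niter)
  for (i in 1:niter) {
    means[i] = mean(rnorm(n = n, mean = mu, sd = sigma))
  }
  return(sd(means))
}

set.seed(0)  # arbitrary seed
sem_30  = simulate_sem(niter = 5000, n = 30)
sem_100 = simulate_sem(niter = 5000, n = 100)  # larger samples, smaller error
```

Arguments with defaults can be left out of the call, so simulate_sem() alone would also run.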
In some previous examples, custom functions were used directly in combination with apply(), without curly brackets {} or return(), yet they worked!
Why?
Generally, curly brackets {} and return() enhance clarity and are best practice, but in some cases they can be omitted for more compact code:
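Curly brackets {} can be skipped if the body is a single expression (typically written on one line)
return() can be omitted if the last (or only) expression in the body is the output, because a function returns the value of its last evaluated expression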
To clarify the previous slide, these are four alternative and increasingly compact ways of writing the same function:
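A sketch of four equivalent versions, using a hypothetical NA-counting function as the running example (the original slide may have used a different function):

```r
# 1) curly brackets, intermediate object, explicit return()
count_na_1 = function(x) {
  n_missing = sum(is.na(x))
  return(n_missing)
}

# 2) curly brackets, explicit return(), no intermediate object
count_na_2 = function(x) {
  return(sum(is.na(x)))
}

# 3) curly brackets, no return(): the last expression is the output
count_na_3 = function(x) {
  sum(is.na(x))
}

# 4) no curly brackets, no return(): everything on one line
count_na_4 = function(x) sum(is.na(x))
```

All four return the same value for the same input.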