Exercises - Dataframes, The nightmare
Basics of R for Data Science
Like a data scientist data cleaner
The following set of exercises is a simulation of the challenges you will actually encounter in real life as a data scientist. Know that a large portion of your time as a data scientist will likely be spent on data management and cleaning! In the following scenario, imagine that students (or colleagues) have brought you messy dataframes, full of inconsistencies, varying formats, and other issues. Some guidance is provided here, but your goal is to think critically and figure out the most effective solutions. And don’t worry: the real world out there is more nightmarish! 🙂
Scenario
Some students (or colleagues) have provided you with data collected from various scales in a sample of participants (Wechsler, INVALSI [Italian National Assessment of Students], Personality Questionnaires). Your task is to perform some computations like a data scientist. However, before this, you will need to fix some inconsistencies, correct possible errors, and merge and combine different dataframes, as the data is scattered across multiple documents.
Your Final Goals
- Produce a clean dataframe with one row per participant including only the total/aggregate scores for each type of data (e.g., “
InvalsiTot
” for INVALSI items data, “WechslerTot
” for Wechsler subtests data, “meanAcc
for the lab-based trials data,”OpennessTot
” and “AgreabTot
” for the personality-questionnaires data); - Produce a readable correlation matrix between all aggregate scores;
- Produce some descriptive statistics for the aggregate scores (e.g., medians, means, standard deviations, skewness and kurtosis coefficients, counts of missing values);
- T-test comparison on INVALSI data or other variables (for males vs females);
- Some basic plots (histograms, scatter plots, boxplots) for distributions and pairs of variables.
Datasets:
- INVALSI items data:
ExerData_Invalsi_1.csv
andExerData_Invalsi_2.csv
, participant responses to INVALSI tasks splitted in two different datasets (need to be combined first); - Wechsler subtests data:
ExerData_Wechsler.xlsx
; - Lab-based trials data on an attentional task, includes accuracies to repeated trials:
ExerData_LabTrials.csv
; - Personality questionnaires:
ExerData_Questionnaires.csv
, includes item-by-item scores for Openness and Agreeableness.
Recommended functions and methods
Task | Subtask | Function / Method |
---|---|---|
Data Import | Load datasets from CSV/Excel | read.csv(...) readxl::read_excel(...) |
Data Cleaning | In a data frame column, replace character with another (e.g., “,” with “.”) | gsub(",", ".", df$colName) |
Coerce a string column to numbers, handling errors as NA |
as.numeric(df$colName) |
|
Remove invalid values (e.g., implausibly large scores) replacing them with missing values | df$colName[...] = NA |
|
Data Wrangling | Sum repeated observations by participant | aggregate(. ~ idName, data=df, FUN=sum) |
Combine DataFrames that have the same columns | rbind(df1, df2) |
|
Merge info across different DataFrames by the same participant ID, keeping all information | merge(df1, df2, by = "idName", all = TRUE) |
|
Variable Creation | Compute a new “total score” column that includes the sum of values across other columns | df$totScore = rowSums(df[ , c(...)] |
Descriptive Stats | Quickly compute some summary statistics from a DataFrame | summary(df) |
Correlation matrix from a selection of DataFrame columns | cor(df[ , c(...)], use="complete.obs) (or use="pairwise.complete") |
|
Skewness and kurtosis of DataFrame column(s) | moments::skewness(...) moments::kurtosis() |
|
Inferential Stats | Compare genders using t-test, dropping missing values | t.test() |
Visualization | Histograms, scatterplots, boxplots | hist(...) plot(...) boxplot(...) |
Notes:
- Some aspects of data cleaning could be addressed externally (e.g., using MS Excel); however, try to solve all issues within R, using code instead of manual corrections. Imagine one day you might be working with tens of thousands of rows: manual fixes aren’t feasible!
- This whole exercise focuses mostly on data structures and dataframes, but it may also involve other concepts, such as using the
for
loop (programming) to streamline operations across several variables. - While brute-force methods may get the job done, aim to write clean, efficient, and concise R code. Solving the problem is the priority, but an optimal solution balances correctness with minimal code complexity and minimal redundancy.