Exercises - Dataframes: The nightmare… in R

Basics of R for Data Science

Like a data scientist (or rather, a data cleaner)

The following exercises simulate the challenges you will actually encounter in real life as a data scientist. Expect more than 70% of your time as a data scientist to be spent on data management and cleaning! (Source: experience.)

In the following scenario, imagine that you have received messy datasets, full of inconsistencies, obvious errors, varying formats, data scattered across different files, and other issues. Some guidance is provided here, but your goal is to think critically and figure out the most effective solutions yourself. And don’t worry: the real world out there is even more nightmarish! 🙂

Scenario

You have received messy datasets from students or colleagues who have been collecting data from different tests: INVALSI [Italian National Assessment of Students], Wechsler [Intelligence], Big Five [Personality], and an experimental attention task. Before you can do any meaningful analysis, you need to clean, merge, combine, and analyze the data.

Datasets:

Your Final Goals

  • Produce a single clean dataframe with one row per participant including only the total/aggregate scores for each type of data (e.g., “InvalsiTot” for INVALSI items data, “WechslerTot” for Wechsler subtests data, “meanAcc” for the lab-based trials data, “OpennessTot” and “AgreabTot” for the personality-questionnaires data);
  • Produce a readable correlation matrix between all aggregate scores;
  • Produce some descriptive statistics for the aggregate scores (e.g., medians, means, standard deviations, skewness and kurtosis coefficients, counts of missing values);
  • Conduct a t-test comparison on INVALSI data (for males vs females);
  • Create some basic visualizations (histograms, scatter plots, boxplots) to explore distributions and relationships among variables.
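
To make the goals above concrete, here is a minimal base-R sketch of the overall workflow on toy data. It is not the solution to the exercises: the data frames `invalsi` and `wechsler`, the columns `id`, `sex`, `item1` to `item3`, `sub1` and `sub2`, and all values are invented for illustration, and the real files will need cleaning steps that are skipped here. Only the names “InvalsiTot” and “WechslerTot” come from the goals. Skewness and kurtosis are left out because base R has no built-in functions for them; an add-on package would be needed.

```r
# Toy data: every name and value here is hypothetical, not the course data.
set.seed(1)
n <- 40
invalsi <- data.frame(
  id    = 1:n,
  sex   = factor(rep(c("F", "M"), each = n / 2)),
  item1 = rbinom(n, 1, 0.6),
  item2 = rbinom(n, 1, 0.5),
  item3 = rbinom(n, 1, 0.7)
)
wechsler <- data.frame(
  id   = sample(1:n),                 # same participants, different row order
  sub1 = round(rnorm(n, 10, 3)),
  sub2 = round(rnorm(n, 10, 3))
)
wechsler$sub2[3] <- NA                # simulate one missing subtest score

# 1. Aggregate scores (an NA in any subtest makes the total NA)
invalsi$InvalsiTot   <- rowSums(invalsi[, c("item1", "item2", "item3")])
wechsler$WechslerTot <- rowSums(wechsler[, c("sub1", "sub2")])

# 2. One row per participant, keeping only the aggregate columns
clean <- merge(
  invalsi[, c("id", "sex", "InvalsiTot")],
  wechsler[, c("id", "WechslerTot")],
  by = "id", all = TRUE
)

# 3. Readable correlation matrix between aggregate scores
scores  <- clean[, c("InvalsiTot", "WechslerTot")]
cor_mat <- round(cor(scores, use = "pairwise.complete.obs"), 2)
print(cor_mat)

# 4. Descriptive statistics, one row per aggregate score
descr <- data.frame(
  mean      = sapply(scores, mean, na.rm = TRUE),
  median    = sapply(scores, median, na.rm = TRUE),
  sd        = sapply(scores, sd, na.rm = TRUE),
  n_missing = sapply(scores, function(x) sum(is.na(x)))
)
print(descr)

# 5. t-test on INVALSI totals by sex (t.test defaults to Welch's version)
tt <- t.test(InvalsiTot ~ sex, data = clean)
print(tt)

# 6. Basic plots
hist(clean$InvalsiTot, main = "InvalsiTot", xlab = "Total score")
boxplot(InvalsiTot ~ sex, data = clean, ylab = "InvalsiTot")
plot(clean$InvalsiTot, clean$WechslerTot,
     xlab = "InvalsiTot", ylab = "WechslerTot")
```

Using `all = TRUE` in `merge()` keeps participants who appear in only one file, so missing data stay visible as `NA` instead of silently dropping rows; whether that is the right choice for the real datasets is something to decide while cleaning.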