Exercises - Dataframes, The nightmare

Basics of R for Data Science

Like a data scientist (or rather, a data cleaner)

The following set of exercises simulates the challenges you will actually encounter in real life as a data scientist. Keep in mind that a large portion of your time as a data scientist will likely be spent on data management and cleaning! In the following scenario, imagine that students (or colleagues) have brought you messy dataframes, full of inconsistencies, varying formats, and other issues. Some guidance is provided here, but your goal is to think critically and figure out the most effective solutions. And don’t worry: the real world out there is even more nightmarish! 🙂

Scenario

Some students (or colleagues) have provided you with data collected from various scales in a sample of participants (Wechsler, INVALSI [Italian National Assessment of Students], Personality Questionnaires). Your task is to perform some computations like a data scientist. However, before this, you will need to fix some inconsistencies, correct possible errors, and merge and combine different dataframes, as the data is scattered across multiple documents.
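As a rough illustration of the "fix, then merge" step, here is a minimal base R sketch. The two toy dataframes, their column names (ID, id) and all values are invented for this example and are not the exercise datasets; the real files will likely need more (and different) cleaning.

```r
# Toy data: every name and value here is made up for illustration only.
wechsler <- data.frame(ID = c("s01", "s02", "s03"),
                       WechslerTot = c(102, 95, 110))
invalsi  <- data.frame(id = c("S02", "s03 ", "s04"),
                       InvalsiTot = c(31, 28, 35))

# Harmonise the key before merging: same column name, same case, no stray spaces.
names(invalsi)[names(invalsi) == "id"] <- "ID"
invalsi$ID  <- tolower(trimws(invalsi$ID))
wechsler$ID <- tolower(trimws(wechsler$ID))

# all = TRUE keeps participants found in only one source (missing scores become NA).
merged <- merge(wechsler, invalsi, by = "ID", all = TRUE)
print(merged)
```

Keeping unmatched participants (all = TRUE) is a choice made here so that mismatched IDs stay visible instead of silently disappearing; whether that is appropriate depends on the analysis.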

Your Final Goals

  • Produce a clean dataframe with one row per participant including only the total/aggregate scores for each type of data (e.g., “InvalsiTot” for INVALSI items data, “WechslerTot” for Wechsler subtests data, “meanAcc” for the lab-based trials data, “OpennessTot” and “AgreabTot” for the personality-questionnaires data);
  • Produce a readable correlation matrix between all aggregate scores;
  • Produce some descriptive statistics for the aggregate scores (e.g., medians, means, standard deviations, skewness and kurtosis coefficients, counts of missing values);
  • Run a t-test comparing males and females on INVALSI data or other variables;
  • Produce some basic plots (histograms, scatter plots, boxplots) for distributions and pairs of variables.
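To give an idea of what the end products might look like, here is a hedged sketch in base R run on simulated data. The dataframe clean, the gender column, and all values are assumptions made for this example; only the score names come from the goals above. The skewness and kurtosis helpers use simple moment-based formulas, which is one of several conventions.

```r
# Simulated stand-in for the final one-row-per-participant dataframe.
# "gender" and all values are hypothetical; only the score names come from the goals.
set.seed(1)
n <- 40
clean <- data.frame(
  ID          = sprintf("s%02d", 1:n),
  gender      = rep(c("F", "M"), each = n / 2),
  InvalsiTot  = rnorm(n, mean = 30, sd = 5),
  WechslerTot = rnorm(n, mean = 100, sd = 15),
  meanAcc     = runif(n, min = 0.5, max = 1)
)
clean$InvalsiTot[3] <- NA  # simulate one missing score

scores <- c("InvalsiTot", "WechslerTot", "meanAcc")

# Correlation matrix on pairwise-complete observations, rounded for readability.
cor_mat <- round(cor(clean[, scores], use = "pairwise.complete.obs"), 2)

# Moment-based skewness and excess kurtosis (missing values dropped first).
skew <- function(x) { x <- x[!is.na(x)]; d <- x - mean(x); mean(d^3) / mean(d^2)^1.5 }
kurt <- function(x) { x <- x[!is.na(x)]; d <- x - mean(x); mean(d^4) / mean(d^2)^2 - 3 }

# One row of descriptive statistics per aggregate score.
descr <- data.frame(
  median    = sapply(clean[scores], median, na.rm = TRUE),
  mean      = sapply(clean[scores], mean, na.rm = TRUE),
  sd        = sapply(clean[scores], sd, na.rm = TRUE),
  skewness  = sapply(clean[scores], skew),
  kurtosis  = sapply(clean[scores], kurt),
  n_missing = sapply(clean[scores], function(x) sum(is.na(x)))
)

# Welch two-sample t-test (the default in t.test) comparing the two groups.
tt <- t.test(InvalsiTot ~ gender, data = clean)

# Basic plots: a distribution, a pair of variables, and a group comparison.
hist(clean$InvalsiTot, main = "InvalsiTot", xlab = "Score")
plot(clean$WechslerTot, clean$InvalsiTot, xlab = "WechslerTot", ylab = "InvalsiTot")
boxplot(InvalsiTot ~ gender, data = clean)

print(cor_mat)
print(round(descr, 2))
print(tt)
```

Using use = "pairwise.complete.obs" keeps as many observations as possible for each pair of scores, at the cost of correlations being based on slightly different subsets; use = "complete.obs" is the stricter alternative.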

Datasets: