Exercises - Dataframes: The nightmare… in Python

Basics of Python for Data Science

Like a data scientist data cleaner

This is the 🐍 version of the exercise that you (as students of the PhD program in Psychological Sciences) previously encountered for the course on Basics of R for Data Science

Scenario

You have received messy datasets from students or colleagues who have been collecting data from different tests: INVALSI, Wechsler, an experimental attention task, and personality questionnaires. Before you can do any meaningful analysis, you need to clean, merge, combine, and analyze the data

Your Final Goals

Produce a clean dataframe with one row per participant including only the total/aggregate scores for each type of data (e.g., “InvalsiTot” for INVALSI items data, “WechslerTot” for Wechsler subtests data, “meanAcc for the lab-based trials data,”OpennessTot” and “AgreabTot” for the personality-questionnaires data);
Produce a readable correlation matrix between all aggregate scores;
Produce some descriptive statistics for the aggregate scores (e.g., means, standard deviations, skewness coefficients, counts of missing values);
T-test comparison on INVALSI data (for males vs females);
Some histograms and scatter plots for distributions and pairs of variables.

Datasets:

INVALSI items data: ExerData_Invalsi_1.csv and ExerData_Invalsi_2.csv, participant responses to INVALSI tasks splitted in two different datasets (need to be combined first);
Wechsler subtests data: ExerData_Wechsler.xlsx;
Lab-based trials data on an attentional task, includes accuracies to repeated trials: ExerData_LabTrials.csv;
Personality questionnaires: ExerData_Questionnaires.csv, includes item-by-item scores for Openness and Agreeableness.

Recommended functions and methods

Task	Subtask	Function / Method
Data Import	Load datasets from CSV/Excel	`pd.read_csv(...)` `pd.read_excel(...)`
Data Cleaning	In a DataFrame column, replace character with another (e.g., “,” with “.”)	`.astype(str).str.replace(",",".")`
	Coerce a string column to numbers, handling errors as `NaN`	`pd.to_numeric(..., errors="coerce")`
	Remove invalid values (e.g., implausibly large scores) replacing them with missing values	`df.loc[...] = np.nan`
Data Wrangling	Sum repeated observations by participant	`.groupby(...).sum().reset_index()`
	Combine DataFrames that have the same columns	`pd.concat([df1, df2])`
	Merge info across different DataFrames by the same participant ID, keeping all information	`df.merge(..., on="idName", how="outer")`
Variable Creation	Compute a new “total score” column that includes the sum of values across other columns	`df[...] = df[...].sum(1)`
Descriptive Stats	Quickly compute some summary statistics from a DataFrame	`df.describe()`
	Correlation matrix from a selection of DataFrame columns	`df[[...]].corr()`
	Skewness and kurtosis of DataFrame column(s)	`.skew()` `.kurt()`
Inferential Stats	Compare genders using t-test, dropping missing values	`ttest_ind(...).dropna()` (from module `scipy.stats`)
Visualization	Histograms (create, show, clear)	`plt.hist(...)` `plt.show()` `plt.clf()`
	Other plots from the `seaborn` package (as `sns`); remember to render them with `plt.show()` after creating them	`sns.boxplot(...)` `sns.scatterplot(x=...,y=...)` `sns.regplot(x=...,y=...,scatter=True,ci=95)`