Exercises - Dataframes: The nightmare… in Python
Basics of Python for Data Science
Like a data scientist data cleaner
This is the 🐍 version of the exercise that you (as students of the PhD program in Psychological Sciences) previously encountered for the course on Basics of R for Data Science
Scenario
You have received messy datasets from students or colleagues who have been collecting data from different tests: INVALSI, Wechsler, an experimental attention task, and personality questionnaires. Before you can do any meaningful analysis, you need to clean, merge, combine, and analyze the data
Your Final Goals
- Produce a clean dataframe with one row per participant including only the total/aggregate scores for each type of data (e.g., “
InvalsiTot
” for INVALSI items data, “WechslerTot
” for Wechsler subtests data, “meanAcc
for the lab-based trials data,”OpennessTot
” and “AgreabTot
” for the personality-questionnaires data); - Produce a readable correlation matrix between all aggregate scores;
- Produce some descriptive statistics for the aggregate scores (e.g., means, standard deviations, skewness coefficients, counts of missing values);
- T-test comparison on INVALSI data (for males vs females);
- Some histograms and scatter plots for distributions and pairs of variables.
Datasets:
- INVALSI items data:
ExerData_Invalsi_1.csv
andExerData_Invalsi_2.csv
, participant responses to INVALSI tasks splitted in two different datasets (need to be combined first); - Wechsler subtests data:
ExerData_Wechsler.xlsx
; - Lab-based trials data on an attentional task, includes accuracies to repeated trials:
ExerData_LabTrials.csv
; - Personality questionnaires:
ExerData_Questionnaires.csv
, includes item-by-item scores for Openness and Agreeableness.
Recommended functions and methods
Task | Subtask | Function / Method |
---|---|---|
Data Import | Load datasets from CSV/Excel | pd.read_csv(...) pd.read_excel(...) |
Data Cleaning | In a DataFrame column, replace character with another (e.g., “,” with “.”) | .astype(str).str.replace(",",".") |
Coerce a string column to numbers, handling errors as NaN |
pd.to_numeric(..., errors="coerce") |
|
Remove invalid values (e.g., implausibly large scores) replacing them with missing values | df.loc[...] = np.nan |
|
Data Wrangling | Sum repeated observations by participant | .groupby(...).sum().reset_index() |
Combine DataFrames that have the same columns | pd.concat([df1, df2]) |
|
Merge info across different DataFrames by the same participant ID, keeping all information | df.merge(..., on="idName", how="outer") |
|
Variable Creation | Compute a new “total score” column that includes the sum of values across other columns | df[...] = df[...].sum(1) |
Descriptive Stats | Quickly compute some summary statistics from a DataFrame | df.describe() |
Correlation matrix from a selection of DataFrame columns | df[[...]].corr() |
|
Skewness and kurtosis of DataFrame column(s) | .skew() .kurt() |
|
Inferential Stats | Compare genders using t-test, dropping missing values | ttest_ind(...).dropna() (from module scipy.stats ) |
Visualization | Histograms (create, show, clear) | plt.hist(...) plt.show() plt.clf() |
Other plots from the seaborn package (as sns ); remember to render them with plt.show() after creating them |
sns.boxplot(...) sns.scatterplot(x=...,y=...) sns.regplot(x=...,y=...,scatter=True,ci=95) |