Exercises - Programming Like a Data Scientist
Basics of Python for Data Science
This set of exercises requires practicing and combining various skills: creating a project folder with a virtual environment, working with arrays and dataframes, and using conditional statements and loops. Some of these exercises are a bit advanced for the purposes of the present course!
The first part (in blue) concerns managing a virtual environment for reproducibility. It is not strictly required for completing the rest of the exercises, so you could skip it, but it is good practice, so you are encouraged to try!
- Create a new project folder somewhere on your computer. In it, create a new Python virtual environment (with this command in your bash/terminal: `python -m venv nameOfYourEnv`), and activate it as shown in the slides depending on your operating system and whether you work in an IDE. The whole project folder should be organized as follows:
/
yourProjectFolder/
├── venv/ ← virtual environment
├── data/ ← .csv (see below)
├── scripts/ ← .py scripts
├── results/ ← output files, figures
├── requirements.txt ← list of installed packages for reproducibility (see below)
└── README.md ← brief description of the project (with something related to this course)
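The subfolders in this layout can also be created from Python rather than by hand. The following is only a minimal sketch: the function name `make_project_skeleton` and the placeholder README text are illustrative assumptions, not part of the course material. The virtual environment folder and `requirements.txt` are deliberately left out, since `python -m venv` and `pip freeze` create them.

```python
from pathlib import Path


def make_project_skeleton(root):
    """Create data/, scripts/ and results/ plus a placeholder README.md.

    Hypothetical helper for illustration only. The venv folder and
    requirements.txt are not created here.
    """
    root = Path(root)
    for sub in ("data", "scripts", "results"):
        # parents=True also creates the project folder itself;
        # exist_ok=True makes the function safe to run more than once
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        # Placeholder text, to be replaced by a real project description
        readme.write_text("# Project description\n", encoding="utf-8")
    return root
```

Calling `make_project_skeleton("yourProjectFolder")` from the parent directory would create the three subfolders and the README without touching files that already exist.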
- In the newly created virtual environment, install `numpy`, `pandas`, `matplotlib`, and `seaborn` (even if you had already installed them in the main Python installation, you should re-install them in your local virtual environment).
- After having successfully installed the above packages, generate the `requirements.txt` file for reproducibility, using this command in your bash/terminal (after having activated the virtual environment): `pip freeze > requirements.txt`
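Both `pip install` and `pip freeze` act on whichever Python interpreter is currently active, so it can be worth confirming that the virtual environment really is activated before running them. One possible check, sketched here with a hypothetical helper name (`in_virtual_environment`), compares `sys.prefix` with `sys.base_prefix`:

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
print("Interpreter in use:", sys.executable)
```

If the first line prints `False`, the packages would be installed into (and frozen from) the main Python installation instead of the project environment.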
- Import the Python packages listed above using appropriate aliases (e.g., `import matplotlib as mpl`).
- Download the following dataframes: `df1.csv`, `dfWide1.csv`, `dfWide2.csv`, and import them in Python using `pd.read_csv()`.
- If the imported dataframes present a strange initial column like `"Unnamed: 0"`, know that it is an artifact and should be removed. You could try removing it with the method `.drop(columns=["Unnamed: 0"])`.
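As a rough illustration of where the `"Unnamed: 0"` column comes from and how it can be dropped, the sketch below reads a small made-up CSV from a string instead of from one of the course files. The column names `id` and `score` are invented for the example.

```python
import io

import pandas as pd

# Stand-in for a file such as data/df1.csv; the contents are made up.
# The empty first header field is what pandas turns into "Unnamed: 0".
csv_text = ",id,score\n0,a,1.5\n1,b,2.0\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())

# errors="ignore" keeps the call safe even when the column is absent
df = df.drop(columns=["Unnamed: 0"], errors="ignore")
print(df.columns.tolist())
```

An alternative would be `pd.read_csv(..., index_col=0)`, which reads that first column as the row index instead of as a data column.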
- The dataframe `df1` has all its columns stored as strings, but many are mostly numeric, and should be treated as such. Your task is to determine which columns can be “acceptably” coerced to numeric. Use a combination of iterative and conditional programming to complete the following steps:
  - Iterate over all columns of `df1` (using a `for` loop is recommended);
  - For each column, attempt coercion to numeric using `pd.to_numeric(..., errors='coerce')`, and calculate the percentage of valid (i.e., non-`NaN`) values (you can use the methods `.isna().sum()`);
  - If the percentage of valid values exceeds 80%, actually convert/coerce the column to numeric;
  - Keep track of which columns were coerced by storing their names in a separate list or array;
  - After processing all columns, compute a correlation matrix including only the coerced-to-numeric columns (`df1.corr()` with `method="pearson"`);
  - Redo the previous step using the method `.dropna()` before computing `.corr()` in order to remove listwise any row with any missing values, and visually inspect whether the correlation coefficients have changed a lot;
  - Inspect the `.shape` attribute to see how many (complete) rows are left in the DataFrame after the listwise deletion with `.dropna()`;
  - Optional: Use `seaborn.heatmap()` to produce a colorful correlation matrix plot.
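The steps above can be sketched on a toy dataframe as follows. This is one possible approach under stated assumptions, not the reference solution: the column names and values are made up, and `df`, `THRESHOLD`, and `coerced_cols` are illustrative names. Note that `.corr()` on its own excludes missing values pair by pair, which is why the result can differ from the one obtained after `.dropna()`.

```python
import pandas as pd

# Toy stand-in for df1: every column is stored as strings (made-up values)
df = pd.DataFrame({
    "age":   ["23", "31", "n/a", "45", "52", "38"],
    "score": ["1.5", "2.0", "2.5", "3.5", "3.0", "4.0"],
    "rt":    ["0.5", "0.7", "0.6", "missing", "0.9", "0.8"],
    "city":  ["Rome", "Padova", "Milan", "Turin", "Bari", "Pisa"],
})

THRESHOLD = 80  # minimum percentage of valid values, as in the exercise
coerced_cols = []

for col in df.columns:
    # Values that cannot be parsed as numbers become NaN
    converted = pd.to_numeric(df[col], errors="coerce")
    n_missing = converted.isna().sum()
    pct_valid = 100 * (1 - n_missing / len(converted))
    if pct_valid > THRESHOLD:
        df[col] = converted       # actually coerce the column
        coerced_cols.append(col)  # keep track of its name

print(coerced_cols)

# Default behaviour: missing values are excluded pair by pair
corr_pairwise = df[coerced_cols].corr(method="pearson")

# Listwise deletion: drop every row with at least one missing value first
complete = df[coerced_cols].dropna()
corr_listwise = complete.corr(method="pearson")

print(complete.shape)  # (number of complete rows, number of coerced columns)
print(corr_pairwise.round(2))
print(corr_listwise.round(2))
```

Either correlation matrix could then be passed to `seaborn.heatmap()` for the optional plot.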
- The dataframe `dfWide1` contains data for treated subjects at three times (`"T0"`, `"T1"`, and `"T2"`) in wide format, but it needs to be reshaped to long format for data analysis. Use `pd.melt()` (or `pd.wide_to_long()`) for reshaping (hint: first look at the help/documentation and/or the examples in the slides).
- Extra for advanced users: The dataframe `dfWide2` is even more complex: it contains two variables, `Acc` (for accuracy) and `RT` (for response times), measured at three time points. Reshape `dfWide2` to long format, keeping `Acc` and `RT` as separate columns. You could “melt” each variable separately, and then use `pd.merge()` to combine the two dataframes… but mind the names of the values!
- Extra: enjoy some data visualization on the long-form dataframe using for example:
  - “spaghetti plot” for individual trajectories: `sns.lineplot(data=..., x="Time", y="Score", hue="id", units="id", estimator=None)` (seaborn raises an error if `units` is given without `estimator=None`)
  - more classical plot showing average scores and their CIs: `sns.pointplot(data=..., x="Time", y="Score", hue="group", errorbar="ci", dodge=True)` (the points are joined by lines by default)
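Returning to the two reshaping tasks above, the sketch below shows one way they might be approached on toy data. The actual column names in `dfWide1.csv` and `dfWide2.csv` may differ: the `id` column, the `Acc_T0`/`RT_T0` naming pattern, the values, and the helper `melt_one` are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for dfWide1 (made-up values, assumed column names)
wide1 = pd.DataFrame({
    "id": [1, 2],
    "T0": [10, 12],
    "T1": [11, 15],
    "T2": [13, 18],
})
long1 = pd.melt(wide1, id_vars=["id"], value_vars=["T0", "T1", "T2"],
                var_name="Time", value_name="Score")

# Toy stand-in for dfWide2, assuming columns named like "Acc_T0" and "RT_T0"
wide2 = pd.DataFrame({
    "id": [1, 2],
    "Acc_T0": [0.80, 0.70], "Acc_T1": [0.85, 0.75], "Acc_T2": [0.90, 0.80],
    "RT_T0": [650, 720], "RT_T1": [600, 700], "RT_T2": [580, 640],
})


def melt_one(df, prefix):
    """Melt the columns starting with `prefix` into long format (hypothetical helper)."""
    cols = [c for c in df.columns if c.startswith(prefix + "_")]
    # value_name=prefix gives each melted table its own value column,
    # so the two tables do not clash when merged
    out = pd.melt(df, id_vars=["id"], value_vars=cols,
                  var_name="Time", value_name=prefix)
    # "Acc_T0" -> "T0", so that both melted tables share the same Time labels
    out["Time"] = out["Time"].str.replace(prefix + "_", "", regex=False)
    return out


# One row per subject and time point, with Acc and RT as separate columns
long2 = pd.merge(melt_one(wide2, "Acc"), melt_one(wide2, "RT"), on=["id", "Time"])

print(long1)
print(long2)
```

Giving each melted table a distinct `value_name` and merging on both `id` and `Time` is what keeps `Acc` and `RT` aligned; with the default value name, the two value columns would collide in the merge.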
- Finally, tidy up! Place all documents, code, results, figures, and related files into the appropriate subfolders within your project folder.