In this notebook, you'll
- load the raw data into an R or Pandas data frame from the ./data/raw directory, using the functions and classes (written during the data gathering phase) in the ./script directory
- perform visual exploration on your data variables to detect anomalies, errors, outliers, interesting features, etc.
- clean categorical and quantitative variables (by removing observations and variables with too many missing values, by consolidating categorical variables, by selecting an appropriate subset of variables and observations for later analysis, etc.)
- save the cleaned data into csv files in the ./data/cleaned directory
- save interesting graphics you obtained during visual exploration into the ./visualiation directory
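To make the cleaning and saving steps more concrete, here is a minimal Pandas sketch. Everything in it is an assumption made for illustration: the helper names (drop_sparse, consolidate_rare, save_cleaned), the 50% missing-value threshold, and the toy columns are hypothetical and are not part of the project scripts. Adapt the thresholds and rules to your own data.

```python
from pathlib import Path

import pandas as pd


def drop_sparse(df, max_missing=0.5):
    """Drop columns, then rows, whose fraction of missing values exceeds max_missing."""
    df = df.loc[:, df.isna().mean(axis=0) <= max_missing]
    return df.loc[df.isna().mean(axis=1) <= max_missing]


def consolidate_rare(series, min_count=2, other="other"):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)


def save_cleaned(df, name, out_dir="./data/cleaned"):
    """Write df to <out_dir>/<name>.csv, creating the directory if needed."""
    out_path = Path(out_dir) / f"{name}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return out_path


# Toy data standing in for the real raw data (made up for illustration).
raw = pd.DataFrame({
    "species": ["oak", "oak", "pine", "fir", "oak"],
    "height": [12.0, 15.5, None, 9.0, 14.0],
    "notes": [None, None, None, "tall", None],
})

# The "notes" column is 80% missing, so it is dropped; all rows survive.
cleaned = drop_sparse(raw, max_missing=0.5)
# "pine" and "fir" each appear once, so they are merged into "other".
cleaned = cleaned.assign(species=consolidate_rare(cleaned["species"], min_count=2))
```

Calling save_cleaned(cleaned, "plant") would then write the result to ./data/cleaned/plant.csv (the file name "plant" is again only an example).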
Team members responsible for this notebook:
List the team members contributing to this notebook, along with their responsibilities:
- team member 1 name: team member 1 responsibilities
- team member 2 name: team member 2 responsibilities
- etc.
I advise you to work at least in pairs for each project notebook, as you did for the homework assignments. Of course, all team members may participate in each notebook.
Example
Here I'll load the plant data in XML format into an R data frame using the create_df_from_plant_xml(file) function contained in the R script ./script/plant_df-R
%load_ext rmagic
To load the function into an R cell, one needs to use the source(R_script_file) command in R, which works in a similar way to the import module command in Python:
%%R
source('./script/plant_df-R')
Now we can create a data frame directly from the XML file using the function contained in the script above.
If you wish to perform the cleaning using Pandas data frames instead of R data frames, you can make the R data frame available to Python cells by using the R magic command:
%%R -d df_name
To learn more about passing variables back and forth between R and Python cells, please have a look at the notebook here.
%%R -d data
library(XML)
data = create_df_from_plant_xml('./data/raw/plant.xml')
Now let's load our data into a Pandas data frame:
from pandas import DataFrame
df = DataFrame(data)
df.head()
The actual data cleaning can now begin!
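As a possible first step before plotting anything, you might quantify missing values and flag suspicious numeric values. The sketch below is only one way to do it: the helper names (missing_fraction, iqr_outliers), the 1.5 x IQR rule, and the toy columns are assumptions chosen for illustration, not something prescribed by the project. In the notebook you would pass the df built above instead of the toy data.

```python
import pandas as pd


def missing_fraction(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def iqr_outliers(series, k=1.5):
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "height": [10.0, 11.0, 12.0, 13.0, 120.0],
    "width": [1.0, None, 1.2, None, 1.1],
})

# Columns with the most missing values come first.
print(missing_fraction(toy))
# Rows whose height falls outside the IQR fences.
print(toy[iqr_outliers(toy["height"])])
```

A flagged value is not necessarily an error; it is a candidate to inspect visually (for example with a box plot or histogram) before deciding whether to remove it.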