In this notebook, you'll:

load the raw data from the ./data/raw directory into an R or Pandas data frame, using the functions and classes (written during the data gathering phase) in the ./script directory
perform visual exploration of your data variables to detect anomalies, errors, outliers, interesting features, etc.
clean categorical and quantitative variables (by removing observations and variables with too many missing values, by consolidating categorical variables, by selecting an appropriate subset of variables and observations for later analysis, etc.)
save the cleaned data into csv files in the ./data/cleaned directory
save interesting graphics you obtained during visual exploration into the ./visualiation directory
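
As a rough illustration of the cleaning steps listed above, here is a minimal Pandas sketch. Everything in it is an assumption made for the example: the toy data frame, the helper names drop_sparse and consolidate, the 50% missing-value threshold, and the output file name are all hypothetical, and your own data will likely need different choices.

```python
import pandas as pd


def drop_sparse(df, max_missing=0.5):
    """Drop columns, then rows, whose share of missing values exceeds max_missing."""
    # Keep only columns whose fraction of missing values is at most max_missing
    df = df.loc[:, df.isna().mean() <= max_missing]
    # Among the remaining columns, keep only rows that are complete enough
    return df.loc[df.isna().mean(axis=1) <= max_missing]


def consolidate(series, mapping):
    """Merge category labels according to mapping; unmapped labels are kept as-is."""
    return series.replace(mapping)


# Toy data (hypothetical), standing in for a raw data frame
raw = pd.DataFrame({
    "light": ["Shade", "Mostly Shady", "Sun", None],
    "zone": [4, 3, None, None],
    "notes": [None, None, None, "x"],
})

cleaned = drop_sparse(raw, max_missing=0.5)
cleaned = cleaned.assign(
    light=consolidate(cleaned["light"], {"Shade": "Shady", "Mostly Shady": "Shady"})
)

# Saving would then look like this (hypothetical file name):
# cleaned.to_csv("./data/cleaned/toy.csv", index=False)
```

Here the mostly empty notes column and the last, fully empty row are removed, and two light labels are merged into one. Whether a threshold like 50% is reasonable depends on your data, so treat it as a starting point.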

Team members responsible for this notebook:

List the team members contributing to this notebook, along with their responsibilities:

team member 1 name: team member 1 responsibilities
team member 2 name: team member 2 responsibilities
etc.

I advise you to work at least in pairs on each project notebook, as you did for the homework assignments. Of course, all team members may participate in each notebook.

Example

Here I'll load the plant data in XML format into an R data frame using the

create_df_from_plant_xml(file)

function contained in the R script

    ./script/plant_df-R

In [4]:

%load_ext rmagic

The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

To load the function into an R cell, one needs to use the

source(R_script_file)

command in R, which works in a similar way to the

import module

command in Python:

In [5]:

%%R
source('./script/plant_df-R')

Now, we can create a data frame directly from the XML file using the function contained in the script above.

If you wish to perform the cleaning using Pandas data frames instead of R data frames, you can make the R data frame available to Python cells by using the R magic command:

%%R -d df_name

To learn more about how to pass variables back and forth between R and Python cells, please have a look at the notebook here.

In [7]:

%%R -d data

library(XML)

data = create_df_from_plant_xml('./data/raw/plant.xml')

Now let's load our data into a Pandas data frame:

In [8]:

from pandas import DataFrame

df = DataFrame(data)
df.head()

Out[8]:

	COMMON	BOTANICAL	ZONE	LIGHT	PRICE	AVAILABILITY
0	Bloodroot	Sanguinaria canadensis	4	Mostly Shady	$2.44	031599
1	Columbine	Aquilegia canadensis	3	Mostly Shady	$9.37	030699
2	Marsh Marigold	Caltha palustris	4	Mostly Sunny	$6.81	051799
3	Cowslip	Caltha palustris	4	Mostly Shady	$9.90	030699
4	Dutchman's-Breeches	Dicentra cucullaria	3	Mostly Shady	$6.44	012099

The actual data cleaning can now begin!
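
As a possible first cleaning step, the sketch below converts the PRICE and AVAILABILITY columns to proper types. It rebuilds a few rows from the df.head() output above so that it runs on its own. Two things are assumptions, not facts from the data source: that both columns are stored as text, and that AVAILABILITY is encoded as MMDDYY (so 031599 would mean March 15, 1999). Check both against your own data before relying on this.

```python
import pandas as pd

# A few rows copied from the df.head() output, with values stored as text
# (assumption: this is how they come out of the XML file)
df = pd.DataFrame({
    "COMMON": ["Bloodroot", "Columbine", "Marsh Marigold"],
    "PRICE": ["$2.44", "$9.37", "$6.81"],
    "AVAILABILITY": ["031599", "030699", "051799"],
})

clean = df.copy()

# Strip the literal dollar sign, then convert the price to a float
clean["PRICE"] = clean["PRICE"].str.replace("$", "", regex=False).astype(float)

# Parse the availability code as a date (assumption: MMDDYY format)
clean["AVAILABILITY"] = pd.to_datetime(clean["AVAILABILITY"], format="%m%d%y")

print(clean.dtypes)
```

With numeric prices and real dates, summary statistics and plots (for example, price against availability date) become straightforward during the visual exploration.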

In []: