Date

In this notebook, you'll

  • load the raw data from the ./data/raw directory into an R or Pandas data frame, using the functions and classes in the ./script directory (written during the data gathering phase)

  • explore your variables visually to detect anomalies, errors, outliers, interesting features, etc.

  • clean categorical and quantitative variables (by removing observations and variables with too many missing values, by consolidating categorical variables, by selecting an appropriate subset of variables and observations for later analysis, etc.)

  • save the cleaned data into csv files in the ./data/cleaned directory

  • save interesting graphics you obtained during visual exploration into the ./visualization directory
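
As a rough illustration of the cleaning and saving steps listed above, here is a minimal Pandas sketch. The helper name clean_and_save, the 50% missing-value threshold, and the toy data are all assumptions made for this example; pick the threshold and the variables that make sense for your own data.

```python
import os
import tempfile

import pandas as pd


def clean_and_save(df, out_dir, filename, max_missing_frac=0.5):
    """Drop sparse columns, then incomplete rows, and write the result to CSV.

    Hypothetical helper: the threshold is an arbitrary choice, not a rule.
    """
    # Keep only the columns whose fraction of missing values is small enough.
    keep = df.columns[df.isna().mean() <= max_missing_frac]
    # Among the remaining columns, drop every row that still has a missing value.
    cleaned = df[keep].dropna()
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    cleaned.to_csv(path, index=False)
    return cleaned, path


# Toy data, invented for illustration only.
raw = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "height": [1.0, None, 3.0, 4.0],
    "notes": [None, None, None, "ok"],  # 75% missing, so this column is dropped
})

# A temporary directory stands in for ./data/cleaned here.
with tempfile.TemporaryDirectory() as tmp:
    cleaned, path = clean_and_save(raw, tmp, "toy_cleaned.csv")
    reloaded = pd.read_csv(path)
```

In a real notebook you would pass './data/cleaned' as out_dir instead of a temporary directory.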

Team members responsible for this notebook:

List the team members contributing to this notebook, along with their responsibilities:

  • team member 1 name: team member 1 responsibilities
  • team member 2 name: team member 2 responsibilities
  • etc.

I advise you to work at least in pairs on each project notebook, as you did for the homework assignments. Of course, all team members may participate in each notebook.

Example

Here I'll load the plant data in XML format into an R data frame using the

create_df_from_plant_xml(file)

function contained in the R script

    ./script/plant_df-R
In [4]:
%load_ext rmagic
The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

To load the function into an R cell, you need to use the

source(R_script_file)

command in R, which works in a similar way to the

import module

command in Python:

In [5]:
%%R
source('./script/plant_df-R')

Now we can create a data frame directly from the XML file using the function contained in the script above.

If you wish to perform the cleaning using Pandas data frames instead of R data frames, you can make the R data frame available to Python cells by using the R magic command:

%%R -d df_name

To know more on how to pass variables back and forth between R and Python cells, please have a look at the notebook here.

In [7]:
%%R -d data

library(XML)

data = create_df_from_plant_xml('./data/raw/plant.xml')
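
If you would rather stay in Python for the loading step, the same idea can be sketched with the standard library. This is not the create_df_from_plant_xml function from the R script, whose source is not shown here; the function name records_from_plant_xml and the XML layout (a CATALOG root with one PLANT element per row) are assumptions for the sake of the example, and the real ./data/raw/plant.xml may be organized differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element names CATALOG and PLANT are assumed.
SAMPLE_XML = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>"""


def records_from_plant_xml(xml_text):
    """Turn each child of the root into a dict mapping tag name to text."""
    root = ET.fromstring(xml_text)
    return [
        {field.tag: (field.text or "").strip() for field in plant}
        for plant in root
    ]


records = records_from_plant_xml(SAMPLE_XML)
```

A list of dictionaries like records can be passed straight to pandas.DataFrame to obtain one row per plant. All values are still strings at this point, so types have to be fixed during cleaning.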

Now let's load our data into a Pandas data frame:

In [8]:
from pandas import DataFrame

df = DataFrame(data)
df.head()
Out[8]:
   COMMON               BOTANICAL               ZONE  LIGHT         PRICE  AVAILABILITY
0  Bloodroot            Sanguinaria canadensis  4     Mostly Shady  $2.44  031599
1  Columbine            Aquilegia canadensis    3     Mostly Shady  $9.37  030699
2  Marsh Marigold       Caltha palustris        4     Mostly Sunny  $6.81  051799
3  Cowslip              Caltha palustris        4     Mostly Shady  $9.90  030699
4  Dutchman's-Breeches  Dicentra cucullaria     3     Mostly Shady  $6.44  012099

The actual data cleaning can now begin!
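
A possible first step, sketched here on a few rows retyped from the output above, is to turn the PRICE and AVAILABILITY strings into proper numeric and date types. Reading AVAILABILITY as a date in MMDDYY form is an assumption based on how the values look; check it against the data source before relying on it.

```python
import pandas as pd

# A few rows copied by hand from the output above, kept as strings.
df = pd.DataFrame({
    "COMMON": ["Bloodroot", "Columbine", "Marsh Marigold"],
    "PRICE": ["$2.44", "$9.37", "$6.81"],
    "AVAILABILITY": ["031599", "030699", "051799"],
})

# Strip the leading dollar sign and convert the prices to floats.
df["PRICE"] = df["PRICE"].str.lstrip("$").astype(float)

# Assumption: AVAILABILITY encodes a date as MMDDYY (month, day, 2-digit year).
df["AVAILABILITY"] = pd.to_datetime(df["AVAILABILITY"], format="%m%d%y")
```

Once the columns have real types, sorting, filtering by date, and summary statistics behave as expected instead of treating the values as text.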
