Date

In this notebook, you'll

  • create the directory structure for your project
  • download the raw data for your project
  • write functions and classes to transform raw data (e.g in XML, jason, etc.) into data frames
  • explain what your are doing

Team members responsible for this notebook:

List the team members contributing to this notebook, along with their responsabilities:

  • team member 1 name: team member 1 responsabilities
  • team member 2 name: team member 2 responsabilities
  • etc.

I advise you to work at least in pairs for each project notebook, as you did for the homework assignments. Of course, all team members may participate to each notebook.

Instructions:

In markdown cells, you'll

  • list and describe the data sources for your class project
    • provide the data urls
    • describe the file formats
    • explain the data content of each file

Remark: If your data is not directly available through the web (for instance, you requested it from a company, or got it from Inna or Henry), place you data into your personal bdrive, and provide a link to it, that you'll use in this notebook to download it into the data folder, as explained below).

In code cells, you'll

  • create the following directories:

    ./data
    ./data/raw        # to store your raw data
    ./data/cleaned    # to store the cleanned data
    ./data/simulated  # to store simulated data
    ./visualizations  # to store your plots
  • display samples of your data files (e.g. for a csv files, use data frames and the head method)

  • Write scripts into the ./script directory using the he magic command

    %%file ./script/file_name

Your scripts should contain functions and objects allowing you to

  • download the actual raw data in the directory ./data/raw

  • load the downloaded the raw data files into data frames

This way, the functions and objects contained in your scripts will be accessible in other notebooks through the import command in Python (see example below).

Remark 1: In the code cells, you may use Python, R, or Bash, as you find more convenient.

Remark 2: Try to make your notebook as readable and usable (by others) as possible.

Example

Here is an example of how to package your code into scripts reusable in other notebook.

You'll need to do something similar to that. You'll also need to write explanations as outlined above in markdown cells.

Creating the directory structure

In [2]:
%%bash
mkdir ../script ../data ../data/raw ../data/cleaned ../data/simulated ../visualizations
ls -r
visualizations
script
README.md
NB4_project_report.ipynb
NB3_data_analysis.ipynb
NB2_data_cleaning.ipynb
NB1_data_gathering.ipynb
data

mkdir: cannot create directory ‘../data’: File exists

Downloading and displaying raw data

In this example, I am using R. You want to use Python instead.

In [4]:
%load_ext rmagic

TO DO: list sources, display data sample, explain data content, etc.

We now download an XML source containing plant data, and display an typical plant entry, represented as a XML node:

In [6]:
%%R
library(XML)
plant_url  =  'http://www.stat.berkeley.edu/classes/s133/data/plant_catalog.xml'
plant_file = './data/raw/plant.xml'

download.file(plant_url, plant_file, method="curl")

xml_doc   = xmlParse(plant_file)
root_node = xmlRoot(xml_doc)

plant_nodes = xmlChildren(root_node)
print(plant_nodes[[1]])

Attaching package: ‘XML’

The following object(s) are masked from ‘package:tools’:

    toHTML

<PLANT>
  <COMMON>Bloodroot</COMMON>
  <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
  <ZONE>4</ZONE>
  <LIGHT>Mostly Shady</LIGHT>
  <PRICE>$2.44</PRICE>
  <AVAILABILITY>031599</AVAILABILITY>
</PLANT> 

Loading raw data into data frames

The dataframe column labels are the tags of the XML nodes, contained in the plant XML nodes:

In [7]:
%%R
column_names = names(plant_nodes[[1]])
cat(column_names)
COMMON BOTANICAL ZONE LIGHT PRICE AVAILABILITY

Given a column name, we create a data frame column containing the column values:

In [11]:
%%R
common_tag = column_names[1]

get_value = function(plant_node, tag) xmlValue(plant_node[[tag]])
    
common_column = sapply(plant_nodes, get_value, common_tag)
    
cat(common_column)
Bloodroot Columbine Marsh Marigold Cowslip Dutchman's-Breeches Ginger, Wild Hepatica Liverleaf Jack-In-The-Pulpit Mayapple Phlox, Woodland Phlox, Blue Spring-Beauty Trillium Wake Robin Violet, Dog-Tooth Trout Lily Adder's-Tongue Anemone Grecian Windflower Bee Balm Bergamot Black-Eyed Susan Buttercup Crowfoot Butterfly Weed Cinquefoil Primrose Gentian Blue Gentian Jacob's Ladder Greek Valerian California Poppy Shooting Star Snakeroot Cardinal Flower

We package the code above into a function that creates the data frame column corresponding from a tag name and a list of plant nodes:

In [12]:
%%R
get_column = function(tag, plants) sapply(plants, get_value, tag)
    
cat((get_column('COMMON', plant_nodes)))
Bloodroot Columbine Marsh Marigold Cowslip Dutchman's-Breeches Ginger, Wild Hepatica Liverleaf Jack-In-The-Pulpit Mayapple Phlox, Woodland Phlox, Blue Spring-Beauty Trillium Wake Robin Violet, Dog-Tooth Trout Lily Adder's-Tongue Anemone Grecian Windflower Bee Balm Bergamot Black-Eyed Susan Buttercup Crowfoot Butterfly Weed Cinquefoil Primrose Gentian Blue Gentian Jacob's Ladder Greek Valerian California Poppy Shooting Star Snakeroot Cardinal Flower

We are now ready to retrieve all the columns of our data frame into a list of vectors, which we will use to construct our data frame:

In [13]:
%%R
data = lapply(column_names, get_column, plant_nodes)

plant_df = data.frame(data, stringsAsFactors = FALSE)

print(head(plant_df))
               COMMON              BOTANICAL ZONE        LIGHT PRICE
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady $2.44
2           Columbine   Aquilegia canadensis    3 Mostly Shady $9.37
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny $6.81
4             Cowslip       Caltha palustris    4 Mostly Shady $9.90
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady $6.44
6        Ginger, Wild       Asarum canadense    3 Mostly Shady $9.03
  AVAILABILITY
1       031599
2       030699
3       051799
4       030699
5       012099
6       041899

Using the data frame we just created, we save the plant data into a csv file into the raw data directory:

In [14]:
%%R
write.csv(plant_df, './data/raw/plants_to_be_cleaned.csv')
In []:
!head ./data/raw/plants_to_be_cleaned.csv

Packaging the gathering code into reusable functions

In [15]:
%%file ./script/plant_df-R

get_value  = function(plant_node, tag) xmlValue(plant_node[[tag]])
    
get_column = function(tag, plants) sapply(plants, get_value, tag)
    
create_df_from_plant_xml = function(plant_file){
    require(XML)
    xml_doc      = xmlParse(plant_file)
    root_node    = xmlRoot(xml_doc)
    plant_nodes  = xmlChildren(root_node)
    column_names = names(plant_nodes[[1]])

    data     = lapply(column_names, get_column, plant_nodes)
    return(data.frame(data, stringsAsFactors = FALSE))
}
Writing ./script/plant_df-R

Now, our function converting a raw plant XML file into a R data frame can be used in other notebooks, using the source command:

In [16]:
%%R
source('./script/plant_df-R')

print(head(create_df_from_plant_xml(plant_file)))
               COMMON              BOTANICAL ZONE        LIGHT PRICE
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady $2.44
2           Columbine   Aquilegia canadensis    3 Mostly Shady $9.37
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny $6.81
4             Cowslip       Caltha palustris    4 Mostly Shady $9.90
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady $6.44
6        Ginger, Wild       Asarum canadense    3 Mostly Shady $9.03
  AVAILABILITY
1       031599
2       030699
3       051799
4       030699
5       012099
6       041899

In []: