Data Science in the Open

John Taylor

on 21 June 2011


Transcript of Data Science in the Open

Data Science in the Open

Products

John L. Taylor
Senior Data Analyst, iovation
johnnylogic.org

Data Acquisition
Data Understanding
Data Preparation
Modeling
CRISP-DM
Evaluation

Collect
Store
Analyze
Deploy

Chefs
Ingredients
Recipes
Tools

Data Science

Data Scientists Input Data Analysis Tools Store



General Purpose
Python

Obtain
iNterpret

The Big Data Exploratorium: Data Mining, from Patents to Memes
Wednesday, June 22, 2011 from 2:30 – 3:15pm in B302/03

Data Warehousing 101
Thursday, June 23, 2011 from 2:30 – 3:15pm in B201

Here's a plot of the fraction of the mass concentrated in the corners as a function of dimension. For a 7-dimensional cube, about 96% of the mass is concentrated in one of its 128 "corners".

Data Description
Descriptive statistics
Data Quality Assessment
Missing values
Outliers

Data cleansing: Removing outliers, placing things in standard form, and otherwise reducing the noisiness of data.
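As a rough illustration of the cleansing step, the sketch below drops outliers with an interquartile-range rule and puts free-text labels in a standard form. The 1.5 × IQR cutoff, the function names, and the sample readings are assumptions made for this example; the talk does not prescribe a particular method.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Drop points more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def standardize_label(text):
    """Standard form for a label: trimmed, single-spaced, lowercase."""
    return " ".join(text.split()).lower()


# Hypothetical sensor readings; 55.0 is an obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0]
cleaned = remove_outliers_iqr(readings)
print(cleaned)
print(standardize_label("  New   YORK "))
```

The IQR rule is only one convention; whether a flagged point is really noise is a judgment about the data, not something the code can decide.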

Data transformation: application of a deterministic mathematical function to each point in a data set.
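A minimal sketch of two common transformations, assuming numeric data held in a plain list. The choice of a log transform and of z-scores, and the function names, are illustrative assumptions rather than anything specified in the talk.

```python
import math
import statistics


def log_transform(values):
    """Apply log(1 + x) to each point; compresses large values, defined at x = 0."""
    return [math.log1p(v) for v in values]


def z_scores(values):
    """Rescale each point to zero mean and unit (population) standard deviation.

    Assumes the values are not all identical, otherwise the deviation is zero.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


print(log_transform([0, 9, 99]))
print(z_scores([1.0, 2.0, 3.0]))
```

Once the mean and standard deviation are fixed, the z-score is the same deterministic function applied to every point, which is what makes it a transformation in the sense used here.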

Data imputation: the substitution of some value for a missing data point or a missing component of a data point. Once all missing values have been imputed, the dataset can then be analyzed using standard techniques for complete data.
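One of the simplest imputation strategies replaces each missing value with the mean of the observed values. The sketch below assumes missing entries are encoded as `None`; mean imputation and the function name are choices made for illustration and are not necessarily what the speaker had in mind.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


ages = [31.0, None, 45.0, 38.0, None]  # hypothetical column with gaps
print(impute_mean(ages))
```

Mean imputation keeps the column mean unchanged but shrinks its variance, so it is usually a starting point rather than a final answer.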

Data weighting and balancing: should cases be treated the same, or somehow normalized?

Data filtering: high and low pass filters can be used to further cleanse data.
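As a sketch of what such filters can look like, a centered moving average acts as a simple low-pass filter, and subtracting it from the signal leaves the high-frequency part. The window size, the edge handling, and the function names are assumptions for this example.

```python
def moving_average(signal, window=3):
    """Low-pass filter: centered moving average (odd window assumed).

    Near the edges the window shrinks to the points that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def high_pass(signal, window=3):
    """High-pass filter: what remains after removing the moving average."""
    smooth = moving_average(signal, window)
    return [x - s for x, s in zip(signal, smooth)]


noisy = [1, 1, 1, 10, 1, 1, 1]  # hypothetical series with one spike
print(moving_average(noisy))
print(high_pass(noisy))
```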

Data abstraction: Should data be re-categorized or coarse-grained differently?

Data reduction
Data sampling
Data dimensionality reduction
Data discretization
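One common motivation for dimensionality reduction is the concentration of volume in the corners of a high-dimensional cube, as in the 7-dimensional figure quoted earlier. That figure can be checked numerically. The sketch below assumes "corners" means the part of the cube [-1, 1]^d lying outside its inscribed unit ball and that mass is spread uniformly; the function names are made up for this example.

```python
import math


def ball_volume(d, r=1.0):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d


def corner_fraction(d):
    """Fraction of the cube [-1, 1]^d lying outside the inscribed unit ball."""
    cube_volume = 2.0 ** d
    return 1.0 - ball_volume(d) / cube_volume


for d in range(1, 11):
    print(d, round(corner_fraction(d), 3))
# d = 7 gives roughly 0.963, matching the "about 96%" on the slide.
```

Under these assumptions the fraction climbs quickly with dimension, and a d-dimensional cube has 2^d corners, hence 128 for d = 7.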

Data derivation: Should new variables be created?

Select modeling technique
Choose modeling algorithms
Choose modeling architecture
Specify modeling assumptions

Create an experimental design

Build the model
Set parameters (if not automatic)
Build various types of models

Performance Measures
Confusion Matrix
Estimating Performance
Standard Errors
Confidence Intervals
Comparing Performance
Cross Validation

Google
Etc.

Social
Personal
Commercial
Scientific

Mike Loukides:
"…merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products."

(from "What is Data Science?")

Challenges
The Curse of Dimensionality
Bias/Variance Tradeoff
Underdetermination

Problem Understanding

Instances: (rows)
Attributes: predefined set of features (columns)

Rudimentary Rules
Statistical Modeling
Decision Trees
Covering Rules
Association Rules
Linear Models
Instance-based Learning
Clustering: no class values are provided

Concepts: The thing to be learned (e.g. the cluster, classification, association, etc.)

Business Objective

Modeling Environment

Deployment Environment

Data Mining Goals and Objectives

Open Data!

According to Hilary Mason
According to Drew Conway
According to Steve Miller
According to Indeed.com
According to Some Cynics

Data Mining Methods
Output
Surveys

Data Warehousing
NoSQL
ETL
Data Marts
Cubes

Open Source and Open Access Rules the Data Science Realm!

These are all services which leverage data to make the invisible visible through automated intelligence, so we may better explain, predict, and decide.

C4.5
SVM: Support Vector Machines
kNN: k-Nearest Neighbors
Naïve Bayes
CART: Classification and Regression Trees

(from Wu, Kumar, et al. 2008. "Top Ten Data Mining Algorithms." Knowl Inf Syst (2008) 14:1–37. DOI 10.1007/s10115-007-0114-2)

Learning Theory

Algorithmic Information Theory
Computational Learning Theory
Formal Learning Theory

Machine Learning

APIs
Websites
Transducers
Data scraping

Data Access
Data Integration

Regular Expressions
Measurement Theory
Machine Learning

Maths!
Prob. Theory
Numeric Analysis
Linear Algebra
Etc.

You Get The Picture!

Thank You!

Measurement Theory
Data Description
Data Preparation