# Data Science in the Open

by John Taylor on 21 June 2011

#### Transcript of Data Science in the Open

Data Science in the Open

Products

John L. Taylor

Senior Data Analyst, iovation

johnnylogic@gmail.com

johnnylogic.org

Data Acquisition

Data Understanding

Data Preparation

Modeling

CRISP-DM

Evaluation

Collect

Store

Analyze

Deploy

Chefs

Ingredients

Recipes

Tools

Data Science

Data Scientists

Input

Data Analysis Tools

Store

Cassandra

Hadoop

Analyze

KNIME

R

RapidMiner

Weka

Present

GGobi

Protovis

General Purpose

Perl

Python

Obtain

Scrub

Explore

Model

iNterpret

The Big Data Exploratorium: Data Mining, from Patents to Memes

Wednesday, June 22, 2011 from 2:30 – 3:15pm in B302/03

Data Warehousing 101

Thursday, June 23, 2011 from 2:30 – 3:15pm in B201

Here's a plot of the fraction of the mass concentrated in the corners as a function of dimension. For a 7-dimensional cube, about 96% of the mass is concentrated in one of its 128 "corners".

Data Description
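The corner-mass figure quoted for the 7-dimensional cube can be checked with a short calculation. This is a minimal sketch, assuming "mass in the corners" means the share of the cube's volume lying outside the inscribed ball (an interpretation, though it does reproduce the quoted figure). The function name `corner_fraction` is hypothetical.

```python
import math


def corner_fraction(d):
    """Share of the cube [-1, 1]^d that lies outside the inscribed unit ball.

    Assumption: "corner mass" is read as cube volume minus ball volume,
    divided by cube volume.
    """
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube_volume = 2.0 ** d
    return 1.0 - ball_volume / cube_volume


if __name__ == "__main__":
    # One row per dimension: number of corners (2^d) and corner share.
    for d in range(1, 11):
        print(f"d={d:2d}  corners={2 ** d:5d}  fraction={corner_fraction(d):.3f}")
```

Under this reading the share grows quickly with dimension, which is one way to picture the curse of dimensionality: in high dimensions almost none of the cube is near its centre.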

Variables

Cases

Descriptive statistics

Data Quality Assessment

Missing values

Outliers

Data cleansing: removing outliers, placing things in standard form, and otherwise reducing the noisiness of data.

Data transformation: the application of a deterministic mathematical function to each point in a data set.

Data imputation: the substitution of some value for a missing data point or a missing component of a data point. Once all missing values have been imputed, the dataset can then be analyzed using standard techniques for complete data.
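The transformation and imputation steps defined above can be sketched in a few lines of Python. Mean imputation and the `log1p` transform are illustrative choices, not techniques named in the talk, and `impute_mean` and `transform` are hypothetical helper names.

```python
import math
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def transform(values, fn=math.log1p):
    """Apply one deterministic function to every point in the data set."""
    return [fn(v) for v in values]


raw = [1.0, None, 3.0, 8.0]    # one missing data point
complete = impute_mean(raw)    # the gap is filled with the mean of 1, 3 and 8
scaled = transform(complete)   # log(1 + x) applied to each value
```

Once `complete` has no gaps, any standard complete-data technique can be applied to it, which is the point the imputation definition makes.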

Data weighting and balancing: should cases be treated the same, or somehow normalized?

Data filtering: high- and low-pass filters can be used to further cleanse data.

Data abstraction: Should data be re-categorized or coarse-grained differently?

Data reduction

Data sampling

Data dimensionality reduction

Data discretization

Data derivation: Should new variables be created?

Select modeling technique

Choose modeling algorithms

Choose modeling architecture

Specify modeling assumptions

Create an experimental design

Build the model

Set parameters (if not automatic)

Build various types of models

Performance Measures

Confusion Matrix

Estimating Performance

Standard Errors

Confidence Intervals

Comparing Performance

Cross Validation

Google

Facebook

LinkedIn

Amazon

Etc.

Social

Personal

Commercial

Scientific

Mike Loukides:

"…merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products."

(from "What is Data Science?")

Challenges

The Curse of Dimensionality

Bias/Variance Tradeoff

Underdetermination

Problem Understanding

Instances: (rows)

Attributes: predefined set of features (columns)

Rudimentary Rules

Statistical Modeling

Decision Trees

Covering Rules

Association Rules

Linear Models

Instance-based Learning

Clustering: no class values are provided

Concepts: The thing to be learned (e.g. the cluster, classification, association, etc.)

Business Objective

Modeling Environment

Deployment Environment

Data Mining Goals and Objectives

Open Data!

According to Hilary Mason

According to Drew Conway

According to Steve Miller

According to Indeed.com

According to Some Cynics

Data Mining Methods

Output

Surveys

Data Warehousing

NoSQL

ETL

Data Marts

Cubes

Open Source and Open Access Rules the Data Science Realm!

These are all services which leverage data to make the invisible visible through automated intelligence, so we may better explain, predict and decide.

C4.5

K-Means

SVM: Support Vector Machines

Apriori

EM

PageRank

AdaBoost

kNN: k-Nearest Neighbors

Naïve Bayes

CART: Classification and Regression Trees

(from Wu, Kumar, et al. 2008. "Top Ten Data Mining Algorithms" Knowl Inf Syst (2008) 14:1–37, DOI 10.1007/s10115-007-0114-2)

Learning Theory

Algorithmic Information Theory

Computational Learning Theory

Formal Learning Theory

Machine Learning

APIs

Websites

Transducers

Data scraping

Data Access

Data Integration

Regular Expressions

Measurement Theory

Machine Learning

Maths!

Prob. Theory

Statistics

Numeric Analysis

Linear Algebra

Etc.

You Get The Picture!

Thank You!

Measurement Theory

Data Description

Data Preparation
