
These are all services that leverage data to make the invisible visible through automated intelligence, so that we may better explain, predict, and decide.

Machine Learning

The Big Data Exploratorium: Data Mining, from Patents to Memes

Wednesday, June 22, 2011 from 2:30 – 3:15pm in B302/03

Data Science in the Open

John L. Taylor

Senior Data Analyst, iovation

johnnylogic@gmail.com

johnnylogic.org

Ingredients

Chefs

Challenges

  • Input

According to Some Cynics

According to Drew Conway

According to Hilary Mason

According to Indeed.com

Open Data!

According to Steve Miller

Instances: (rows)

Attributes: predefined set of features (columns)
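
To make the two terms concrete, here is a minimal sketch of a dataset as rows and columns. The attribute names and values are hypothetical, invented only for illustration.

```python
# Hypothetical toy dataset: each dict is one instance (row),
# each key is one attribute (column). Names and values are invented.
dataset = [
    {"age": 34, "country": "US", "purchases": 3},
    {"age": 27, "country": "DE", "purchases": 0},
    {"age": 45, "country": "US", "purchases": 7},
]

def attributes(rows):
    """Return the attribute (column) names shared by all instances."""
    return sorted(set.intersection(*(set(r) for r in rows)))

def column(rows, name):
    """Return one attribute's values across all instances."""
    return [r[name] for r in rows]

print(len(dataset))            # number of instances
print(attributes(dataset))     # attribute names
print(column(dataset, "age"))  # one column
```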

Underdetermination

  • Data Science

  • Data Scientists
  • Data Mining Methods

Maths!

Prob. Theory

Statistics

Numeric Analysis

Linear Algebra

Etc.

  • Products

Commercial

Personal

  • C4.5
  • K-Means
  • SVM: Support Vector Machines
  • Apriori
  • EM
  • PageRank
  • AdaBoost
  • kNN: k-Nearest Neighbors
  • Naïve Bayes
  • CART: Classification and Regression Trees

Bias/Variance Tradeoff

  • Google
  • Facebook
  • LinkedIn
  • Amazon
  • Etc.

Mike Loukides:

"…merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products."

(from "What is Data Science?")

Scientific

Social

(from Wu, Kumar, et al. 2008. "Top 10 Algorithms in Data Mining." Knowl Inf Syst 14:1–37. DOI 10.1007/s10115-007-0114-2)

  • Output

Concepts: The thing to be learned (e.g. the cluster, classification, association, etc.)

The Curse of

Dimensionality

Learning Theory

  • Algorithmic Information Theory
  • Computational Learning Theory
  • Formal Learning Theory
  • Rudimentary Rules
  • Statistical Modeling
  • Decision Trees
  • Covering Rules
  • Association Rules
  • Linear Models
  • Instance-based Learning
  • Clustering: no class values are provided

Tools

Here's a plot of the fraction of the mass concentrated in the corners as a function of dimension. For a 7-dimensional cube, about 96% of the mass is concentrated in one of its 128 "corners."
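
The 96% figure can be reproduced with a short calculation, under one assumption that the slide does not state: "mass in the corners" is read here as the share of a unit hypercube's volume lying outside its inscribed ball. The function names are illustrative.

```python
import math

def ball_volume(dim, radius):
    """Volume of a dim-dimensional ball of the given radius."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

def corner_fraction(dim):
    """Fraction of a unit hypercube lying outside its inscribed ball.

    Assumption: this is what "mass in the corners" means on the slide.
    """
    return 1.0 - ball_volume(dim, 0.5)  # the unit cube has volume 1

for d in (2, 3, 7, 10):
    # dimension, number of corners (2**d), fraction of volume in corners
    print(d, 2 ** d, round(corner_fraction(d), 3))
```

Under this reading, the value for 7 dimensions comes out near 0.963, and the fraction keeps growing with dimension.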

Data Understanding

Measurement Theory

Data Preparation

Recipes

Data Analysis Tools

Open Source and Open Access Rule the Data Science Realm!

Problem Understanding

Business Objective

Modeling Environment

Deployment Environment

Data Mining Goals and Objectives

Store

  • Cassandra
  • Hadoop

Analyze

  • KNIME
  • R
  • RapidMiner
  • Weka

Present

  • GGobi
  • Protovis

General Purpose

  • Perl
  • Python

Data Acquisition

Regular Expressions

  • Data Access
  • Data Integration

Data Warehousing

Data Description

NoSQL

Measurement Theory

Data Description

  • Variables
  • Cases
  • Descriptive statistics
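
A minimal sketch of descriptive statistics for a single variable, using only Python's standard library. The function name and the sample values are hypothetical.

```python
import statistics

def describe(values):
    """Summarize one numeric variable across all cases."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

summary = describe([2, 4, 4, 4, 5, 5, 7, 9])  # invented values
print(summary)
```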

Data Quality Assessment

  • Missing values
  • Outliers

APIs

ETL

Websites

Data Marts

Data Preparation

Cubes

Surveys

Transducers

Data scraping

Data cleansing: Removing outliers, placing things in standard form, and otherwise reducing the noisiness of data.
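
One possible sketch of the two cleansing steps named above. The outlier rule (distance from the median in units of the median absolute deviation) and its threshold are assumptions chosen for illustration, not a rule given in the talk.

```python
import statistics

def standardize(label):
    """Put a text label in a standard form: trimmed and lowercase."""
    return label.strip().lower()

def drop_outliers(values, threshold=5.0):
    """Drop values far from the median, measured in units of the
    median absolute deviation (MAD). The threshold is an assumption."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # no spread to judge by; keep everything
    return [v for v in values if abs(v - med) / mad <= threshold]

print(standardize("  Portland "))
print(drop_outliers([10, 12, 11, 13, 12, 95]))
```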

Data transformation: application of a deterministic mathematical function to each point in a data set.
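
As an illustration, a base-10 logarithm is one common choice of such a function. The helper name and the input values below are hypothetical.

```python
import math

def transform(values, func):
    """Apply the same deterministic function to every data point."""
    return [func(v) for v in values]

incomes = [1_000, 10_000, 100_000]  # hypothetical values
log_incomes = transform(incomes, math.log10)
print(log_incomes)
```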

Data imputation: the substitution of some value for a missing data point or a missing component of a data point. Once all missing values have been imputed, the dataset can then be analyzed using standard techniques for complete data.
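
A minimal sketch of one simple strategy, mean imputation, assuming missing values are represented as None. Other strategies exist; this one is chosen only because it is short.

```python
import statistics

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0, None, 5.0]))
```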

Data weighting and balancing: should cases be treated the same, or somehow normalized?

Data filtering: high-pass and low-pass filters can be used to further cleanse data.
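
A rough sketch of both filter types, assuming a simple moving average as the low-pass filter and its residual as the high-pass filter. The window size is an assumption and should be odd so the residual lines up with the window centre.

```python
def low_pass(signal, window=3):
    """Moving average: keeps slow trends, smooths fast fluctuations."""
    return [sum(signal[i:i + window]) / window
            for i in range(len(signal) - window + 1)]

def high_pass(signal, window=3):
    """Residual after subtracting the moving average (odd window assumed)."""
    smooth = low_pass(signal, window)
    offset = window // 2  # index of the window centre
    return [signal[i + offset] - s for i, s in enumerate(smooth)]

print(low_pass([1, 2, 3, 4, 5]))
print(high_pass([1, 2, 3, 4, 5]))
```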

Data abstraction: Should data be re-categorized or coarse-grained differently?

Data reduction

  • Data sampling
  • Data dimensionality reduction
  • Data discretization
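
A sketch of two of the reduction steps listed above: random sampling and equal-width discretization. Dimensionality reduction is not shown. Function names, the seed, and the bin count are illustrative assumptions.

```python
import random

def sample_rows(rows, n, seed=0):
    """Draw n rows without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def discretize(values, bins):
    """Map each value to an equal-width bin index from 0 to bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]  # all values identical
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]

print(sample_rows(list(range(100)), 5))
print(discretize([0, 2.5, 5, 7.5, 10], 4))
```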

Data derivation: Should new variables be created?

You Get The Picture!

Thank You!

Data Warehousing 101

Thursday, June 23, 2011 from 2:30 – 3:15pm in B201

Modeling

Evaluation

Machine Learning

Performance Measures

  • Confusion Matrix
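
A minimal sketch of a confusion matrix for a binary classifier, counting (actual, predicted) pairs. The labels below are invented for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels: 1 = positive class, 0 = negative class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
tp, fn = cm[(1, 1)], cm[(1, 0)]  # actual positives: hit, miss
fp, tn = cm[(0, 1)], cm[(0, 0)]  # actual negatives: false alarm, correct
accuracy = (tp + tn) / len(actual)
print(tp, fn, fp, tn, accuracy)
```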

Estimating Performance

  • Standard Errors
  • Confidence Intervals
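
One way to estimate both quantities for a classifier's accuracy, assuming the normal approximation to the binomial (reasonable only for fairly large test sets). The function name and the example counts are hypothetical.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Return accuracy, its standard error, and an approximate 95% CI.

    Assumes the normal approximation to the binomial; z=1.96 gives ~95%.
    """
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return p, se, (p - z * se, p + z * se)

p, se, (low, high) = accuracy_interval(75, 100)
print(round(p, 3), round(se, 4), round(low, 3), round(high, 3))
```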

Comparing Performance

  • Cross Validation
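
A minimal sketch of how k-fold cross validation splits the data: each case is held out for testing exactly once. It assumes the rows are already in random order, since no shuffling is done here.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 3):
    print(train, test)
```

A model would be fit on each train split and scored on the matching test split, and the k scores averaged to compare models.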

Select modeling technique

  • Choose modeling algorithms
  • Choose modeling architecture
  • Specify modeling assumptions

Create an experimental design

Build the model

  • Set parameters (if not automatic)
  • Build various types of models

Analyze

Collect

Deploy

Store

Obtain

Scrub

Explore

Model

iNterpret

CRISP-DM
