Transcript

Data Mining Funnel

ensembling

aka voting
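A minimal sketch of ensembling by majority vote. The three toy classifiers and the field names (links, caps_ratio, length) are hypothetical and only stand in for real trained models.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among one prediction per model."""
    counts = Counter(predictions)
    # most_common(1) returns [(label, count)]; on a tie the first-seen label wins
    return counts.most_common(1)[0][0]

def ensemble_predict(models, row):
    """Ask every model for a label, then let the labels vote."""
    return majority_vote([model(row) for model in models])

# Hypothetical stand-ins for trained classifiers
models = [
    lambda row: "spam" if row["links"] > 3 else "ham",
    lambda row: "spam" if row["caps_ratio"] > 0.5 else "ham",
    lambda row: "spam" if row["length"] < 20 else "ham",
]

row = {"links": 5, "caps_ratio": 0.2, "length": 10}
print(ensemble_predict(models, row))  # two of three models vote "spam"
```

Voting tends to help when the individual models make different mistakes; three copies of the same model would gain nothing.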

overfit

underfit

tradeoff

R demo

Random Forest Demo

Guiding Themes

K.I.S.S

validation

more than just pretty graphs

Data Analytics,

and so should you!

by Billy Hill

SEMMA (SAS via Enterprise Miner)

Sample - enough rows to discover patterns w/out overwhelming

Explore - sort, max, min, describe, plot
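The Sample and Explore steps could look roughly like this in plain Python. The rows, the amount field, and the helper names are made up for illustration; SAS Enterprise Miner does this through its own nodes.

```python
import random
import statistics

def sample_rows(rows, n, seed=0):
    """Draw up to n rows without replacement; the seed keeps the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

def describe(values):
    """Sort once, then report the basic summary statistics."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
    }

# Hypothetical data set: 1000 rows with a made-up numeric field
rows = [{"id": i, "amount": (i * 37) % 101} for i in range(1000)]
subset = sample_rows(rows, 50)
print(describe([r["amount"] for r in subset]))
```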

CRISP-DM (Cross Industry Standard Process for Data Mining)

Business Understanding

understanding objectives and requirements, converting this knowledge into a data mining problem

Data Understanding

data collection, get familiar with data, identify data quality problems, discover insights into data, detect interesting subsets to form hypotheses for hidden information

Explore Data

regression

neural net

RapidMiner Demo

DS: Polynomial

SHUT UP! and show me cool stuff

  • decision tree (C4.5)
  • regression and neural networks

decision tree

  • very powerful
  • easy to interpret

RapidMiner Demo

Feature Selection, Dimensionality Reduction

Arguably the most important phase that will be repeated

Lots of machine learning algorithms and statistical tools

RapidMiner Demo, DS: Sonar

Social Media Tips

Select Features

Data Warehousing

- Start with the most successful sites

- Place a key focus on quality content

- Always respond to comments

- Join conversations and share your thoughts

- Use promotions and giveaways

- Don't make selling your main focus

- Be consistent with posting times, schedule updates and posts on evenings and weekends

Hughes Phenomenon

With a fixed number of training samples, the predictive power reduces as the dimensionality increases
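A small simulation sketch of the effect, under assumptions that are not in the slides: synthetic Gaussian data where only the first feature carries signal, a 1-nearest-neighbor classifier, and 20 training rows per class. Accuracy should tend to drop as pure-noise features are added while the training set stays the same size.

```python
import random

def make_point(label, n_dims, rng):
    # Feature 0 carries the signal (class means 0 and 3); the rest is pure noise.
    signal = rng.gauss(3.0 * label, 1.0)
    return [signal] + [rng.gauss(0.0, 1.0) for _ in range(n_dims - 1)]

def make_dataset(n_per_class, n_dims, rng):
    return [(make_point(label, n_dims, rng), label)
            for label in (0, 1) for _ in range(n_per_class)]

def nearest_neighbor_label(train, point):
    """Label of the training point with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda pair: sq_dist(pair[0], point))[1]

def accuracy(n_dims, n_train_per_class=20, n_test_per_class=100, seed=0):
    """Train-set size is fixed; only the number of dimensions changes."""
    rng = random.Random(seed)
    train = make_dataset(n_train_per_class, n_dims, rng)
    test = make_dataset(n_test_per_class, n_dims, rng)
    hits = sum(nearest_neighbor_label(train, p) == y for p, y in test)
    return hits / len(test)

for d in (1, 10, 100):
    print(d, accuracy(d))
```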

Dimension Reduction

aka, less is more

Curse of Dimensionality

more fields = more sparsity
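A back-of-the-envelope sketch of the sparsity claim, assuming each field is cut into 10 bins (an arbitrary choice): the number of grid cells grows as 10 to the power of the field count, so a fixed number of rows can fill only a shrinking share of them.

```python
def occupied_fraction_upper_bound(n_rows, n_fields, bins_per_field=10):
    """Upper bound on the share of grid cells that can contain at least one row."""
    n_cells = bins_per_field ** n_fields
    return min(1.0, n_rows / n_cells)

# 1000 rows, increasing number of fields
for n_fields in (1, 2, 3, 6, 9):
    print(n_fields, occupied_fraction_upper_bound(1000, n_fields))
```

With 1000 rows and 6 fields, at most one cell in a thousand can hold any data at all.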

easy to get started,

aka K.I.S.S. principle

Very Simple Classification Rules Perform Well on Most Commonly Used Datasets

-Robert C. Holte, 1993, Computer Science Department, University of Ottawa

K-Fold Cross Validation
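One way to sketch the k-fold split in plain Python; the function name is hypothetical. Each row lands in the test fold exactly once and in the training set the other k - 1 times. Rows are not shuffled here, so shuffle the data first if its order is meaningful.

```python
def k_fold_splits(n_rows, k):
    """Yield (train_idx, test_idx) pairs covering all rows in k folds."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_splits(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

Fit the model on each training set, score it on the matching test fold, and average the k scores.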

One Rule (1R) Algorithm

i.e.,

use most predictive feature
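A simplified sketch of the 1R idea for categorical features only (Holte's paper also covers discretizing numeric features, which is left out here). For each feature, predict the majority class for every value, count the training errors, and keep the feature with the fewest. The weather-style rows are made up.

```python
from collections import Counter, defaultdict

def one_rule(rows, features, target):
    """Return (feature, rule, errors) for the single most predictive feature."""
    best = None
    for feature in features:
        # Class counts for each value of this feature
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[feature]][row[target]] += 1
        # Rule: each value predicts its majority class
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in by_value.items()}
        # Errors: rows that are not in the majority class of their value
        errors = sum(sum(counts.values()) - max(counts.values())
                     for counts in by_value.values())
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best

# Hypothetical toy data
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

feature, rule, errors = one_rule(rows, ["outlook", "windy"], "play")
print(feature, rule, errors)
```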

Like so many buzzwords,

I thirst for explanations

data science, data mining, machine learning, statistical inference, supervised learning, unsupervised learning, big data, clustering, predictive analysis, big science, business intelligence, analytics, prescriptive analysis, text mining, text analysis, unstructured analysis, pattern recognition

ETL,

As DW experts and BI Engineers,

you know more about this than me

Prepare Data

not too much math

Data Scientist: The Sexiest Job of the 21st Century, Davenport, Patil, HBR, 2012

...

Data scientists’ most basic, universal skill is the ability to write code.

...

A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed.

  • [Infer] a function from labeled training data
  • Foundations of Machine Learning, 2012, Mohri

Model, Validate

ensembling

Data Science

classification and regression

  • classification: predicting a type/label/boolean
      • survival on Titanic
      • customer renewal
  • regression: predicting a number
      • quarterly sales
      • stock price

Operationalize

can be done via

  • proprietary systems like SAS
  • SQL
  • custom mid tier
  • drools or other biz rules system
  • reporting
  • dashboards
  • ETL into data mart(s)