Data Science in the Open
Transcript of Data Science in the Open
Senior Data Analyst, iovation
johnnylogic.org
Data Acquisition
Data Understanding
Data Preparation
Modeling
CRISP-DM
Evaluation
Collect
Store
Analyze
Deploy
Chefs
Ingredients
Recipes
Tools
Data Science
Data Scientists Input Data Analysis Tools Store
iNterpret
The Big Data Exploratorium: Data Mining, from Patents to Memes
Wednesday, June 22, 2011 from 2:30 – 3:15pm in B302/03
Data Warehousing 101
Thursday, June 23, 2011 from 2:30 – 3:15pm in B201
Here's a plot of the fraction of the mass concentrated in the corners as a function of dimension. For a 7-dimensional cube, about 96% of the mass is concentrated in one of its 128 "corners."
Data Description
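The 96% figure can be checked with a short calculation. The sketch below assumes that "mass in the corners" means the share of a uniformly filled cube [-1, 1]^d that lies outside its inscribed unit ball; that reading is an assumption rather than something stated on the slide, and the function names are illustrative.

```python
import math


def ball_fraction(d: int) -> float:
    """Fraction of the cube [-1, 1]^d occupied by the inscribed unit ball."""
    # Volume of the unit d-ball: pi^(d/2) / Gamma(d/2 + 1)
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    # Volume of the cube with side length 2
    cube_volume = 2.0 ** d
    return ball_volume / cube_volume


def corner_fraction(d: int) -> float:
    """Fraction of the cube's volume outside the inscribed ball (assumed 'corner' mass)."""
    return 1.0 - ball_fraction(d)


# Tabulate the corner share for a few dimensions; a d-cube has 2^d corners.
for d in (1, 2, 3, 7, 10):
    print(f"d={d:2d}  corners={2 ** d:5d}  mass in corners={corner_fraction(d):.3f}")
```

Under this assumption, d = 7 gives roughly 0.963, which agrees with the "about 96%" quoted for the 128 corners.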
Data Quality Assessment
Outliers
Data cleansing: Removing outliers, placing things in standard form, and otherwise reducing the noisiness of data.
Data transformation: application of a deterministic mathematical function to each point in a data set.
Data imputation: the substitution of some value for a missing data point or a missing component of a data point. Once all missing values have been imputed, the dataset can then be analyzed using standard techniques for complete data.
Data weighting and balancing: should cases be treated the same, or somehow normalized?
Data filtering: high-pass and low-pass filters can be used to further cleanse data.
Data abstraction: Should data be re-categorized or coarse-grained differently?
Data dimensionality reduction
Data derivation: Should new variables be created?
Select modeling technique
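A few of the data preparation steps listed above (cleansing, imputation, transformation) can be sketched in plain Python. The sample values, the function names, the 1.5 × IQR outlier rule, mean imputation, and the log(1 + x) transform are all illustrative assumptions; the slides do not prescribe any particular method.

```python
import math
import statistics


def drop_outliers(values, k=1.5):
    """Cleansing: drop points outside k * IQR of the quartiles; missing (None) entries are kept."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v is None or low <= v <= high]


def impute_mean(values):
    """Imputation: substitute the mean of the observed values for each missing entry."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def log_transform(values):
    """Transformation: apply one deterministic function, log(1 + x), to every point."""
    return [math.log1p(v) for v in values]


# Hypothetical sample with one missing value and one obvious outlier.
raw = [2.0, 3.0, None, 4.0, 3.5, 250.0, 2.5, 3.0, 2.8]

cleaned = drop_outliers(raw)        # removes 250.0, keeps the None placeholder
imputed = impute_mean(cleaned)      # fills the None with the mean of the rest
transformed = log_transform(imputed)

print(cleaned)
print(imputed)
print(transformed)
```

Outliers are removed before imputing here so that the extreme value does not distort the mean used to fill the gap; a different ordering may suit other data sets.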
Choose modeling algorithms
Choose modeling architecture
Specify modeling assumptions
Create an experimental design
Build the model
Set parameters (if not automatic)
Build various types of models
Performance Measures
Cross Validation
Google
Etc.
Social
Personal
Commercial
Scientific
Mike Loukides:
"…merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products."
(from "What is Data Science?")
Challenges
The Curse of Dimensionality
Bias/Variance Tradeoff
Underdetermination Problem Understanding
Instances: (rows)
Attributes: predefined set of features (columns)
Rudimentary Rules
Clustering: no class values are provided
Concepts: The thing to be learned (e.g. the cluster, classification, association, etc.)
Business Objective
Data Mining Goals and Objectives
Open Data!
According to Hilary Mason
According to Drew Conway
According to Steve Miller
According to Indeed.com
According to Some Cynics
Data Mining Methods
Output
Surveys
Data Warehousing
NoSQL
ETL
Data Marts
Cubes
Open Source and Open Access Rule the Data Science Realm!
These are all services that leverage data to make the invisible visible through automated intelligence, so we may better explain, predict and decide.
C4.5
SVM: Support Vector Machines
kNN: k-Nearest Neighbors
CART: Classification and Regression Trees
(from Wu, Kumar, et al. 2008. "Top 10 Algorithms in Data Mining." Knowl Inf Syst 14:1–37. DOI 10.1007/s10115-007-0114-2)
Learning Theory
Algorithmic Information Theory
Computational Learning Theory
Formal Learning Theory
Machine Learning
APIs
Websites
Transducers
Data scraping
Data Access
Data Integration
Regular Expressions
Measurement Theory
Machine Learning
Maths!
Prob. Theory
Etc.
You Get The Picture!
Thank You!
Measurement Theory
Data Description
Data Preparation