Machine Learning Presentation
Transcript: Machine Learning

What is Machine Learning?
  “Machine Learning is a field of computer science that applies statistical techniques to give the system the ability to “learn”.” (from Wiki)
  The model “learns” from those data points. → If the model knows x, it will predict y.

Example

Define Project Objectives
  Specify business problem
  Acquire subject matter expertise
  Define unit of analysis and prediction target; find and remove any target leakage
  Prioritize modeling criteria
  Consider risks and success criteria
  Decide whether to continue

Acquire & Find Appropriate Data
  1. Find appropriate data
  2. Merge data into single table
  3. Conduct exploratory data analysis
  4. Feature engineering

Step 1 & 2
  Find Appropriate Data
    Source: internal, external, public
  Merge Data
    Source: engineering and BI team

Step 3: Conduct Exploratory Data Analysis
  View raw data: missing values, wrong content (table importing), strange data input
  Study the distribution of each variable/column
    Plots
      Single variable
        Discrete: bar chart, box plot, etc.
        Continuous: histogram, etc.
      Multiple variables → scatter plots
  Study descriptive statistics: min, max, mean, variance, median for each variable
  Note: a bar chart is NOT the same as a histogram.

Step 4: Feature Engineering
  Feature: column/field/variable
  Feature engineering: create features to make machine learning work
  Also known as data transformation
  Numeric
    Skewed data → normally distributed to meet model requirements
    Non-linear relationship between X and Y → linear relationship between them for modeling
    Fill in missing values
    Differences of features (time-series data)
  Categorical
    Categorize variables for modeling (classification model)
  Text
    For comment analysis: sentiment analysis, etc.
  Date
    Day difference

Model Data
  Variable selection
  Build candidate models
  Model validation and selection

Linear Regression

Limitation of Linear & Logistic Regression

Decision Trees

Over vs Under Fitting

Cross Validation

Interpret & Communicate Model
  Can we believe the model in practice?
    Unlock holdout dataset after we decide on the model
    Measure model performance again
      Confusion Matrix
      Lift Chart
      ROC Curve
      Others: model performance vs sample size; model speed vs model accuracy, etc.
  Communicate
    Depends on audience
      Management team: top level (effects, impacts, etc.)
      Data team: very detailed information (models, parameters, etc.)

Confusion Matrix
  Often used for measuring accuracy of classification models
  Pro: easy to interpret (easy to map to business concepts)
  Con: imprecise
  Conclusion: quick view of our model performance

Lift Chart
  Pros:
    Straightforward
    Accuracy prediction & model’s overall behavior
  Cons:
    Tends to focus on prediction only

ROC (Receiver Operating Characteristic) Curve
  Pro: precise measure of model’s performance
  Cons:
    Not easy to interpret
    Rank-ordered probability only
  Compare models

New vs Old Data
  Set target column of old data as one
  Run model with old data only
  Predict target column of combined data using existing model coefficients
  Examine performance metrics of new target column

Documentation
  Auto-document all modeling steps

Demo w. real data
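
The transcript lists the confusion matrix and the ROC curve as ways to measure a classification model on the holdout data. As a rough illustration only (not the presenter's code), the sketch below computes both in plain Python. The labels, probabilities, the 0.5 threshold, and the function names `confusion_matrix` and `roc_auc` are all assumptions made for this example.

```python
# Toy holdout labels and predicted probabilities.
# Hypothetical values for illustration only; not taken from the presentation.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]


def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count TP, FP, FN, TN after cutting probabilities at `threshold`."""
    tp = fp = fn = tn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def roc_auc(y_true, y_prob):
    """Area under the ROC curve, computed as the share of
    (positive, negative) pairs in which the positive example
    receives the higher score; ties count as half a win."""
    positives = [p for label, p in zip(y_true, y_prob) if label == 1]
    negatives = [p for label, p in zip(y_true, y_prob) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


cm = confusion_matrix(y_true, y_prob)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
auc = roc_auc(y_true, y_prob)
print(cm, accuracy, auc)
```

The confusion matrix depends on the chosen threshold, which is one reason it gives only a quick view of performance. The pairwise AUC uses only the ordering of the predicted probabilities, which matches the "rank-ordered probability only" remark on the ROC slide.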