Transcript: Data Warehousing
• A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes (1).
How does data mining work?
Some of the data mining techniques
Oracle
• 98% of Fortune 500 companies use Oracle
• Oracle generated nearly 40% more revenue in 2011 than in 2010 (Yahoo Finance)
Issues related to data mining
Network Setting & Cost
Privacy and Security
Classical Statistics
5. Forecasting (known as Predictive Analytics)
3. Classification
Time efficiency
Data Mining
2. Sequence (or Path Analysis)
i2 technologies
Examining the features of a newly presented object and assigning it to one of a predefined set of classes
Not all data are independent and identically distributed.
Integration of data mining and knowledge inference.
Make confident decisions based on fact, not feeling or guessing.
Better predictions about future outcomes and scenarios (3)
Data Mining History
Strong relationship with customers
Time efficiency
High-quality decision making
- Data mining is a threat to an individual's privacy
- Companies should inform customers about how they will use any data collected from them.
1. "What is a Data Warehouse?" W.H. Inmon, Prism, Volume 1, Number 1, 1995.
Define the Problem
Dividing a population into a number of subgroups or clusters
The process of collecting large quantities of data and then summarizing and analyzing it to produce previously unknown useful information
Data Gathering
Background
There are many applications
Data Mining
4. Clustering (called Segmentation)
Machine Learning
2. Berson, Alex, Stephen Smith, and Kurt Thearling. Building Data Mining Applications for CRM. New York: McGraw-Hill, 2000. Print.
SAP
Citation
• A technique, used in large retail chains, that studies every purchase made by customers to find out which items are most commonly sold together
Business impacts (3)
Example of a data mining application
Artificial neural networks (neural networks):
Mining Complex Knowledge from Complex Data:
• The term itself was introduced relatively recently (in the 1990s), but its roots are traced back along three family lines:
Salam Almahdi
Example:
- Finding profitable information
- Managers make decisions in a short time
The organization can seek many competitive advantages by using data mining, such as:
Statistics vs. Data Mining
How does data mining impact business?
3. Lange, Kathy. "Differences Between Statistics and Data Mining." Information Management Dec. 2006. Web. 30 Sept. 2011. <http://www.information-management.com/issues/20061201/1069947-1.html>.
Model Building & Evaluation
Hussain Alqatari
High-quality decision making:
Emptoris
4. Noton, Adriana. "Data Mining and Its Impact on Business." Web. 1 Oct. 2011.
Short time of activities
• What they need and when they need it.
• Learn how to serve them better
Consisting of patterns where one event leads to another event (such as the birth of a child and purchasing diapers)
Ariba
Knowledge Deployment
Artificial Intelligence
• Statistics is the part of data mining whose job is to differentiate between random noise and significant findings. It also helps to estimate the probability of outcomes (2).
• Data mining is the entire process of data analysis. It includes statistics, forecasting, and operations research (2).
Iveta Guneva
(3)
Discovering patterns in data that can lead to reasonable predictions about the future
Distinguish between profitable and unprofitable customers
Which customers are likely to switch to an alternative supplier in the near future
• Decision trees:
Wasan Alameer
Labor
Installation
Maintenance
The Main Data Mining Tasks
Nearest neighbor:
Customers
1. Association (Market Basket Analysis)
Transcript: Data mining in computer science is the process of discovering interesting and useful patterns and relationships in large volumes of data. Data mining commonly involves four classes of tasks: * Clustering - is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data. * Classification - is the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines. * Regression - Attempts to find a function which models the data with the least error. * Association rule learning - Searches for relationships between variables. For example a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis. For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays. 
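The market basket idea described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method the grocery chain or the Oracle software used: the baskets are invented for the example, and `pair_counts` is a hypothetical helper name. It simply counts how many baskets contain each pair of items, which is the raw co-occurrence signal behind a finding like "diapers and beer".

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk"},
    {"beer", "diapers", "milk"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

In this made-up sample the pair bought together most often is beer and diapers. A real analysis would also weigh how common each item is on its own, which is what the support and confidence measures of association rule learning are for.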
Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data.
DATA MINING
Mathematical algorithms, equations, and clustering of data are the backbone of how it functions.
COMMERCE AND BUSINESS
When data is being generated: for example, the swiping of a MasterCard.
Transcript: Each time a user issues a query, we get ads in positions 1, 2, or 3.
The depth is at most 3 and it is >= position.
These 3 rows (or fewer) constitute a search session.
Approach #2
Managing sample data: There is too much information and queries take hours, so we will use sample data.
Each sample will be cleaned (e.g. removing duplicates or rows with too many missing values).
Sample validation: Is this a good sample? Is it random enough? (How can we make sure of this?) Does it have many different values to work with?
At this point, we don't believe that we should disregard any of the columns we have, but it is an important point to consider later.
"Who do we work for?"
If we use different perspectives to look at the information, we hope to find interesting and meaningful information for different stakeholders.
Q & A
It also brings up several questions: Is it useful to identify sessions? Should we stick to this data as given and start working with it as it is?
Understanding our Data
What is this?
Companies might be interested in: Ads that are popular among a specific age or gender group. Keywords, titles, or descriptions with a high click rate.
Approach #1
Things to verify (preprocess)
Actual Data: Setting (depth, position) may vary in each search session. Click registry.
Example: Our user clicked the ad from the second time until the 4th time he issued this query; the 5th time, he did not click.
Data Mining Final Project
Introduction
soso.com might be interested in: Keywords that trigger unexpected ads in good positions. Two ads that are not from related items or companies, but appear in the same query. Different queries that trigger the same ads.
Data verification and cleansing
Rising number of impressions each time the ad was shown.
Different types of "treasure" for different types of people...
If this is a log, then we should be able to verify it like this: For User=1, Query=4 and AdiD=6 (with the same title and description), we will retrieve this log:
Miners: Dereck Davis Feng-Ren Tsai Gibril Lowe Intan Maghfirah Ruben Berrios Omar
The sessions are broken up and then aggregated from different perspectives, so it is hard to rebuild the original session records.
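The checks discussed in this transcript (depth at most 3, position no greater than depth) could be sketched roughly as follows. The record layout is an assumption made for illustration: `AdRow` and its field names are hypothetical and do not come from the presentation, and the rule that clicks cannot exceed impressions is an extra assumption. The first row loosely mirrors the example of user 1, query 4, ad 6, read here as five impressions with clicks on the second through fourth.

```python
from dataclasses import dataclass


@dataclass
class AdRow:
    # Hypothetical field names; the real log layout may differ.
    user_id: int
    query_id: int
    ad_id: int
    depth: int        # number of ads shown in the session
    position: int     # slot of this ad within the session
    impressions: int  # times this row's ad was shown
    clicks: int       # times it was clicked


def is_valid(row, max_depth=3):
    """Check 1 <= position <= depth <= max_depth and clicks <= impressions."""
    return (1 <= row.position <= row.depth <= max_depth
            and 0 <= row.clicks <= row.impressions)


def click_rate(rows):
    """Total clicks divided by total impressions (0.0 if nothing was shown)."""
    impressions = sum(r.impressions for r in rows)
    clicks = sum(r.clicks for r in rows)
    return clicks / impressions if impressions else 0.0


rows = [
    AdRow(1, 4, 6, depth=3, position=1, impressions=5, clicks=3),
    AdRow(1, 4, 7, depth=3, position=2, impressions=5, clicks=0),
    AdRow(2, 9, 6, depth=2, position=3, impressions=1, clicks=0),  # position > depth
]

valid_rows = [r for r in rows if is_valid(r)]
print(len(valid_rows), click_rate(valid_rows))
```

Rows that fail the check are dropped here; in a cleansing step it may be better to log them first, since a high failure count could mean the assumptions about the data are wrong.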
Transcript: What it is: Data mining is the analysis and summarization of very large amounts of data to form a useful picture from it. For example, car insurers have used data mining and statistical analysis to determine that drivers of red cars are more likely to commit moving violations than drivers of cars of any other color. Data mining is used mostly by scientists and multinational corporations to research patterns hidden in masses of data. It is better than sampling since it takes all data into account, without omissions. Increasingly, techniques involve artificial neural networks to provide better results. Unlike the algorithms typical of most computer programming, neural networks have a capacity to learn. Data mining is an emerging field, only recently made possible by today's increasingly powerful computers. The source of the data is usually no more than simple programs (Java or other) that aggregate and acquire the data for processing.
Data Mining
Transcript: Nobody knows babies like we do! Quality products. Good customer service. Every kid really loves this store.
BABYLOU
About Us
BabyLou was established in 2004. In the more than a decade since we started, we have taken care of every need and want of every child and infant under one roof, true to the caption “NO BODY KNOWS BABIES LIKE WE DO”. Our benchmark is to provide 100% customer service and satisfaction, and we continue to deliver it with a wide range of toys, garments and baby products.
Play and Create
We Are Best
01 02 03
Block games
Building blocks help kids to use their brains.
PLAY TO LEARN in Cruising Adventures
Our Discoveries
Enjoy a sunny vacation aboard a luxury yacht with the LEGO® Creator 3in1 31083 Cruising Adventures set. This ship has all the comforts you need, including a well-equipped cabin and a toilet. Sail away to a sunny bay and take the cool water scooter to the beach. Build a sandcastle, enjoy a picnic, go surfing or check out the cute sea creatures before you head back to the yacht for a spot of fishing.
Escape into the mountains
Disney Little Princes in
Also available for your Babies.....
Also...
Out of The World…
Our Responsibility
All children have the right to fun, creative and engaging play experiences. Play is essential because when children play, they learn. As a provider of play experiences, we must ensure that our behaviour and actions are responsible towards all children and towards our stakeholders, society and the environment. We are committed to continue earning the trust our stakeholders place in us, and we are always inspired by children to be the best we can be.
Innovate for children
We aim to inspire children through our unique playful learning experiences and to play an active role in making a global difference on product safety while being dedicated promoters of responsibility towards children.
Transcript: As is common in association rule mining, given a set of item sets, the algorithm attempts to find subsets which are common to at least a minimum number of the item sets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time.
The purpose of the Apriori algorithm is to find associations between different sets of data. Each set of data has a number of items and is called a transaction. The output of Apriori is a set of rules that tell us how often items are contained in sets of data.
Apriori is the best-known algorithm for mining association rules. It uses a breadth-first search strategy to count the support of item sets and uses a candidate generation function which exploits the downward closure property of support.
Presentation on Apriori Algorithm
Association rule generation is usually split up into two separate steps:
1. First, minimum support is applied to find all frequent item sets in a database.
2. Second, these frequent item sets and the minimum confidence constraint are used to form rules.
Association Rules
In the data mining context, association rule learning is a popular and well-researched method for discovering relations between variables in large databases.
Many algorithms for generating association rules have been presented over time. Some well-known algorithms are Apriori and FP-Growth.
By Dheeraj Reddy Jonnalagadda (800752576) Sravani Reddy Burri (800736309)
It is used for discovering strong rules in databases using different measures.
Apriori Algorithm
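The two steps described in this transcript (find frequent item sets, then form rules) can be sketched as below. This is a simplified, unoptimized illustration rather than the presenters' implementation: the transactions are invented, the function names `apriori` and `rules` are hypothetical, and support is expressed as a raw transaction count instead of a fraction.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: support count} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidate 1-item sets
    frequent = {}
    k = 1
    while current:
        # Test this level's candidates against the data (breadth-first).
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation: extend frequent sets by one item.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Downward closure: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_confidence):
    """Form rules lhs -> rhs with confidence = support(set) / support(lhs)."""
    found = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, confidence))
    return found


# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

frequent = apriori(transactions, min_support=3)
found = rules(frequent, min_confidence=0.9)
for lhs, rhs, confidence in found:
    print(set(lhs), "->", set(rhs), confidence)
```

With these made-up transactions and thresholds, the only rule that survives is beer -> diapers, because every basket containing beer also contains diapers. Lowering `min_confidence` to 0.75 would admit the weaker rules as well.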
Transcript: Subset of database or data warehouse Usually one large table Columns from different tables Filtered from larger databases For DM operations Types of data Purposes of data mining Data sets Need for pre-processing of data Denormalized data OLAP Based on archived data Nightly runs to create DW Database for Data mining Data mart: Small scale data warehouse for specific department / functional area Data warehouse Columns and rows tables relationships Relational databases For quick reading and writing For transactions OLTP system SQL Relational database Data for Data mining Learn something new from data Classification / categorization Predictions Apply new knowledge Organizational data (Master data) Operational data or Transaction data Denormalized Databases
Transcript: You start to ask questions
Conclusion
However, requires long computation time and high cost
What we have learned....
What we have learned from other materials....
Harvard Business Review - Case study: Netflix
- DVD company
- Gained success over Blockbuster
- Recommendation system implemented by using data mining techniques
Inventory control by recommendation system
Find individual's priority by cosine similarity, CRM
Even marital status!
Attribute Selection
Getting more used to data mining programs
Able to use data mining tools
Data Mining Final Project
Visualization
Normalization
ANN Artificial Neural Network
Find other applications that can be adopted to the real world
Ignore Missing Value
Blue >= $50K/yr
Red < $50K/yr
Best: Ensemble-Bagging Worst: Naive Bayes
Reduce the computation time
Normalization enables the computation
Accuracy rate is highly dependent on the preprocessing
The importance of attribute selection in order to analyze data with higher accuracy
<ANN: Training Set>
Decision Tree Naive Bayes ANN SVM Ensemble (Bagging)
Ensemble - Bagging
Result
- 3016 Training set
- 1506 Test set
<Naive Bayes: Test set>
ROC (Weighted Average ROC)
<Naive Bayes: Training Set>
<ANN: Test set>
<Bagging: Test set>
Tool
AND...
Accuracy of Training set and Test set are very similar
It is not under- or overestimated
1. Introduction
Use various kinds of data
WEKA SPSS SAS EXCEL
Data Source: Census Income Data Set from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Census+Income)
<SVM: Training Set>
Naive Bayes
2. Preprocessing
<Decision Tree: Training Set>
<10% of the actual Data set>
Change format into ARFF
How much he earns
SVM Support Vector Machine
Enhance the accuracy
Why did Ensemble-Bagging become the best analysis method?
Future Works...
Attributes
14 nominal and numerical attributes
Nominal attributes have text values
Some highly correlated attributes
Data Instances
32561 training instances, which contain some missing values
16281 test instances, which also contain some missing values
Sampling
Decision Tree Naive Bayes ANN SVM Ensemble
<SVM: Test set>
LOVE DATA MINING
<Bagging: Training Set>
Decision Tree
Good Looking
Education
Multiple iteration
<Decision Tree: Test set>
The importance of preprocessing
Job
Continuous Value: Age, Education num, Capital gain, Capital loss
Accuracy
Getting more comfortable with various data mining methods
Removed: 1800 instances from workclass, about 7 instances from occupation, 556 instances from native-country
3. Data Analysis
eg) Best: Ensemble-Bagging Worst: SVM
Try other data mining methods
Developed new way by Data Mining
You start to like him much
Family
Age
The number indicates the area under the ROC curve
Weighted ROC is a result of the ROC value with regard to the target value
Perhaps, Good body
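Since the slides report the area under the ROC curve for each classifier, a small sketch of what that number measures may help. It is not how WEKA computes it internally, and the labels and scores below are invented rather than taken from the Census Income results. The hypothetical function `roc_auc` uses the rank interpretation of AUC: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: label 1 could stand for income >= $50K/yr.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 1.0 would mean every positive example is ranked above every negative one, while 0.5 is what random scoring would give on average. The pairwise loop is quadratic, so a sort-based version would be preferable for data sets the size of the one in the slides.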