Transcript: The Ethics of Corporate Data Mining. James D'Souza, 4/28/2020.
Premise: Corporations collect personal data for multiple reasons.
Why: The Benefits. It is easier to target advertisements. Knowing what a person prefers helps improve your own algorithm for maximum user attention. Information gained can be sold to help mitigate outlying costs. Knowing the user base more intimately can help guide business decisions.
Counter Argument: Reasons Why It's a Bad Idea (if you really think about it). We do not know who receives our data. Google and Microsoft applications come with the computer. After "reading" a 60-page user agreement...
How much bank are we talking? Stakeholders.
FACEBOOK. Facebook - net worth $86.2 billion. Stuff Facebook collects: your browsing history; the apps you visit and your activity within those apps; the advertisers who uploaded your contact information to Facebook more than two months earlier; ads that you interacted with more than two months prior; age, employer, relationship status, likes and location, expected net worth; facial biometrics, medical conditions. Based on contacts, it can infer the social bubbles you are in. The data of an average American is worth between $0.20 and $0.40. Assume the international mean is $0.15: $0.15 x 1.62 billion users is $243 million.
GOOGLE. Google - net worth $133.3 billion. Stuff Google collects: your browsing history; the apps you visit and your activity within those apps; the advertisers who uploaded your contact information to Google, based on how long you've gone since you cleared your history; ads that you interacted with; age, employer, relationship status, likes and location, expected net worth; medical conditions, purchase history; watch times and engagement with advertisements; ambient noise. Google does not sell personal data. They manage it all in house. By optimizing their algorithm, they let advertisers promote the products they want in a one-stop shop.
BING. Microsoft - net worth $1 trillion. Stuff MS collects: your browsing history; the apps you visit and your activity within those apps; the advertisers who uploaded your contact information to Microsoft; media ads that you interacted with more than two months prior; age, employer, relationship status, likes and location, expected net worth; facial biometrics, medical conditions; watch times and engagement with advertisements. Microsoft does not sell personal data. They manage it all in house. By optimizing their algorithm, they let advertisers promote the products they want in a one-stop shop.
Governments. All chatter gets stored in some facility in New Mexico. I heard it on the internet once. The US government buys bulk data to survey possible security threats. China mass-censors the internet and has set up social credit scores that are affected by internet history, among other things. Russia and China both use bot accounts to influence foreign policy; these accounts benefit from the echo-chamber environment created by the internet.
Forms of Monetization. Types of monetization: advertisements, sponsorships, subscription-based, crowd-sourcing, donations, data mining.
SWOT - Advertisements (Strengths/Weaknesses, Opportunities/Threats). Money from a source that is not your user base. Not effective unless data mining makes targeted advertisements. You do not have to seek out patrons. Disgruntled viewership. Minimal intrusion into viewers' finances. User base gets deals on products they might be interested in. Laura Ingraham incident. Improves relations with a corporate entity via networking. Trickle-down economics failures. Minimal interaction with sponsors.
SWOT - Sponsors (Strengths/Weaknesses, Opportunities/Threats). Money from a source that is not your user base. Free merchandise. No regular paycheck unless the collaboration is a constant thing. False positive reviews. User base gets deals on products they might be interested in. Improves relations with a corporate entity via networking. Economic sustainability of sponsors. Downward spirals faster than advertisements.
Raid is a turn-based RPG done right. In case you've been living under a rock and haven't heard, Raid is a badass mobile game that changes everything. The game is crazy popular, with almost 15 million downloads in the last 6 months. Raid is an epic dark fantasy done right: a hero-collecting, turn-based game with over 400 champions to collect and customize. In Raid you can get knights, orcs, undead and more. Raid with friends in a clan, claim glory in the PvP arena. Some other cool features are multi-battle auto mode: set battles to run in auto mode while you do something else. Spend less time grinding and more time developing your team and finding the fun stuff. They also have weekly tournaments and events, such as fighting in the arena, running special dungeons, or leveling up your heroes. There's always a way to compete and win extra prizes every week. The game is growing In
Transcript: 200700176 Kwon Dongan 200901000 Nam Bo-bae "Quora will be bigger than Twitter." - Telegraph "Bigger than Chuck Norris" - Willie Morris, Twitter user. Mobile-First Social: Real-Time Updates Flipboard 'Death' of Foreign Correspondents Personalized consumption Curated social content YouTube video production? Twitter content curation? Facebook News? Closing the gap Quality of social referrals Sites to watch: StumbleUpon More stringers More reliance on social Bright spots: Asia & Africa Facebook distribution Media Acquisitions & Mergers Connect With Me Soon to be NewsBeast TechCrunch & AOL Kommons Dailybooth Data Introduction Questions? Mobile Networks? Help me investigate Picplz Social Storytelling News.me Porkappolis from Gannett's Cincinnati.com Organizing Social DataMining Final Project Quora @Lavrusik on Twitter email@example.com Rise of Interactive TV Social vs. Search Pulse Accountability & Social Questions Blogging & Desktop Publishing Leakification in Journalism Personal Networks 'The Social Networking Trend of 2011' Presentation: http://bit.ly/smnews11 Post: http://on.mash.to/futurenews11 Mediabistro and 10,000Words Instagram On-the-go location-based news Growing beyond 4% Mobile-only projects Beyond content on mobile: Utility Social News Visualized Location-Based News
Transcript: Model 3. Ensemble as a plus: Bagging/Boosting 2. Mean Integrated Squared Error: Problem Description: Predict a user's star rating on a business, rounded to half-star. Basic Ideas: 1. Overall learning: learn from the user-business review pool. 2. Targeted learning: learn each user individually based on review history. Future Problem 1. How others' choices will affect a user's choice 2. How can we learn a user's interests by digging into similar users' interests Result Data: Yelp Academic Datasets. Split into: 80% training data + 20% testing data Result so far 1. Should we ignore the content about a business? 2. Discontinuous Linear Regression on users with a lot of reviews. Recommendation system based on Yelp 1. More than 40,000 users' reviews 2. More than 10,000 businesses' profiles 3. Sparse matrix with 215,000 non-zero entries (0.05%). 4. More than 7,700 users have more than 40 reviews (20%). c is the treatment cut-off, D is a binary variable that equals 1 when X ≥ c, and h is the bandwidth Future Work: 3. Comparing different evaluation criteria. 2. Regularization on loss function Error may come from: 1. Model is too simple 2. Without any regularization 3. Gray Sheep/Black Sheep 3. Ranked evaluation metric by Heckerman et al. (1998) Learn from the crowd Thank you! 1. Mean Squared Error: 1. Loss function 1. Adding Log-based relevance model into Collaborative Filtering 2. Perform Linear Regression Discontinuity Design Memory-based: Pearson Correlation; Vector Similarity Model-based: Log-based Collaborative Filtering (2006) Data Mining 2. Feature selection: Matrix Factorization such as Singular Value Decomposition Evaluation Learn From the Crowd & Recommendation System Based on Yelp 1. Learn from others: collaborative filtering 1. Build more complex model Data 3. Cross-validation 2. Personalized recommendation: regression discontinuity design for each user Chen Liu, Ruize Lu, Weizhe Ni
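A minimal sketch of the memory-based idea named in this transcript (Pearson correlation between users), together with the Mean Squared Error criterion. The toy ratings, user names, and function names are assumptions for illustration; this is not the authors' code and does not use the Yelp data. The prediction is clipped to 1 to 5 stars and rounded to the nearest half-star, as in the problem description.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the businesses both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[b] for b in common) / len(common)
    mean_b = sum(ratings_b[b] for b in common) / len(common)
    num = sum((ratings_a[b] - mean_a) * (ratings_b[b] - mean_b) for b in common)
    den = sqrt(sum((ratings_a[b] - mean_a) ** 2 for b in common)) * sqrt(
        sum((ratings_b[b] - mean_b) ** 2 for b in common)
    )
    return num / den if den else 0.0


def predict(user, business, ratings):
    """Mean-centered, similarity-weighted average of other users' ratings."""
    own = ratings[user]
    user_mean = sum(own.values()) / len(own)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or business not in theirs:
            continue
        weight = pearson(own, theirs)
        other_mean = sum(theirs.values()) / len(theirs)
        num += weight * (theirs[business] - other_mean)
        den += abs(weight)
    raw = user_mean + num / den if den else user_mean
    clipped = min(5.0, max(1.0, raw))  # keep within the 1-5 star scale
    return round(clipped * 2) / 2      # round to the nearest half-star


def mean_squared_error(predicted, actual):
    """Average squared difference between predicted and actual ratings."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy data: user -> {business: stars}.
ratings = {
    "ann": {"cafe": 5.0, "diner": 3.0, "sushi": 4.0},
    "bob": {"cafe": 4.0, "diner": 2.0, "sushi": 3.0, "tacos": 5.0},
    "cat": {"cafe": 2.0, "diner": 5.0, "tacos": 1.0},
}
print(predict("ann", "tacos", ratings))
```

Mean-centering each neighbour's rating is one common design choice here: it lets a user who rates everything low still contribute useful information about relative preference.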
Transcript: Data Warehousing • A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes (1). How does data mining work? Some of the data mining techniques Oracle • 98% of Fortune 500 companies use Oracle • Oracle generated nearly 40% more revenue in 2011 than in 2010 (Yahoo Finance) Issues related to data mining Network Setting & Cost Privacy and Security Classical Statistics 5. Forecasting (known as Predictive Analytics) 3. Classification Time efficiency Data Mining 2. Sequence (or Path Analysis) i2 Technologies Examining the features of a newly presented object and assigning it to one of a predefined set of classes. Not all data are independent and identically distributed. Integration of data mining and knowledge inference. Make confident decisions based on fact, not feeling or guessing. Better predictions about future outcomes and scenarios (3) Data Mining History Strong relationship with customers Time efficiency High-quality decision making - Data mining is a threat to an individual's privacy - Companies should inform customers about how they will use any data collected from them. 1. Inmon, W.H. "What is a Data Warehouse?" Prism, Volume 1, Number 1, 1995. Define the Problem Dividing a population into a number of subgroups or clusters. The process of collecting large quantities of data and then summarizing and analyzing it to produce previously unknown useful information. Data Gathering Background There are many applications Data Mining 4. Clustering (called Segmentation) Machine Learning 2. Berson, Alex, Stephen Smith, and Kurt Thearling. Building Data Mining Applications for CRM. New York: McGraw-Hill, 2000. Print.
SAP Citation • A technique, used in large retail chains, which studies every purchase made by customers to find out which sales are most commonly made together. Business impacts (3) Example of data mining application Artificial neural networks (neural networks): Mining Complex Knowledge from Complex Data: • The term itself was introduced relatively recently (in the 1990s), but its roots are traced back along three family lines: Salam Almahdi Example: - Finding profitable information - Managers make decisions in a short time. The organization can seek many competitive advantages by using data mining, such as: Statistics vs. Data Mining How does data mining impact business? 3. Lange, Kathy. "Differences Between Statistics and Data Mining." Information Management Dec. 2006. Web. 30 Sept. 2011. <http://www.information-management.com/issues/20061201/1069947-1.html>. Model Building & Evaluation Hussain Alqatari High-quality decision making: Emptoris 4. Noton, Adriana. "Data Mining and Its Impact on Business." Web. 1 Oct. 2011. Short time of activities • What they need and when they need it. • Learn how to serve them better. Consisting of patterns where one event leads to another event (such as the birth of a child and purchasing diapers). Ariba Knowledge Deployment Artificial Intelligence • Statistics is the part of data mining whose job is to differentiate between random noise and significant findings. It also helps estimate the probability of outcomes (2). • Data mining is the entire process of data analysis. It includes statistics, forecasting, and operations research (2). Iveta Guneva (3) Discovering patterns in data that can lead to reasonable predictions about the future. Distinguish between profitable and unprofitable customers. Which customers are likely to switch to an alternative supplier in the near future. • Decision trees: Wasan Alameer Labor Installation Maintenance The Main Data Mining Tasks Nearest neighbor: Customers 1. Association (Market Basket Analysis)
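The association task (market basket analysis) in this transcript is described as finding which purchases are most commonly made together. A minimal sketch of that idea follows, assuming hypothetical baskets and a hypothetical helper name `pair_rules`. It only counts item pairs and reports support and confidence; it is not a full frequent-itemset algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs.

    support    = share of baskets containing both items
    confidence = share of baskets with the antecedent that also hold the consequent
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)


# Hypothetical baskets, for illustration only.
baskets = [
    ["diapers", "wipes", "milk"],
    ["diapers", "wipes"],
    ["milk", "bread"],
    ["diapers", "wipes", "bread"],
]
for antecedent, consequent, support, confidence in pair_rules(baskets):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

With these toy baskets only the diapers/wipes pair clears the 0.5 support threshold, which is the kind of "commonly bought together" finding a retailer would act on.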
Transcript: Housing Prices using Advanced Regression Techniques. Presented by: Xiaoxia Liu, Faiz Nassur, Mrunal Bokil, Chetana Kamble, Vineeta Agarwal. Data Mining Project. Introduction: To predict the residential house price in Ames, Iowa. Target variable: House Sale Price. Independent variables: 79 explanatory variables to predict house price. Data dimensions: 80 columns including the dependent variable (categorical and continuous variables), 2930 rows; 30% test observations and 70% training. Data Exploration: Variables showing skewed distribution. Missing Values: Missing values constituted up to 40%. Feature Correlation: Feature correlation graph; all features; important features (threshold > 0.5). Sale price variation with housing data variables. Main Attributes. Feature Engineering - existing features: Some numerical features were actually categorical, e.g. MSSubclass, MonthSold: converted to ordinal. Feature Engineering - creating new features: e.g. YearBuilt and YearSold = age at time of sale. Encoding the categorical features: string format to numeric, e.g. FoundationType = slab/stone/wood. Created dummy variables (retained the ordinal variables). Data Preprocessing: Missing value imputation (mean for continuous, mode for categorical). Log transformation of skewed continuous variables. Standardized the continuous data after the train/test split. Model Selection: Six methods: 1) Support Vector Regression 2) Linear Regression 3) Random Forest 4) Gradient Boosting Machine 5) LASSO 6) Ridge. Model - RMSE Comparison. Final Model - GBM. Future Scope: A neural-network-based model for real estate price estimation. Actual survey questionnaire data. Implement other feature scaling techniques. Thank You!
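The preprocessing steps listed in this transcript (mean imputation, log transformation of skewed values, standardization after the train/test split) and the RMSE metric can be sketched as follows. This is an illustration under assumptions, not the presenters' pipeline: the feature values and function names are hypothetical, only one continuous feature is shown, and the imputation mean and scaling statistics are fitted on the training rows only so that nothing from the test rows leaks into them.

```python
import math
import statistics


def impute(values, fill):
    """Replace missing entries (None) with a fill value."""
    return [fill if v is None else v for v in values]


def log_transform(values):
    """log(1 + x) compresses the long right tail of a skewed feature."""
    return [math.log1p(v) for v in values]


def fit_scaler(train_values):
    """Mean and population standard deviation, computed on training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Rescale values to zero mean and unit variance using the fitted statistics."""
    return [(v - mean) / std for v in values]


def rmse(predicted, actual):
    """Root mean squared error between predictions and true values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Hypothetical living-area values in square feet; None marks a missing entry.
train_raw = [850.0, None, 1200.0, 2400.0, 1010.0, 1600.0, 5600.0]
test_raw = [900.0, None, 3000.0]

# Fit every statistic on the training split, then apply it to both splits.
train_mean = statistics.fmean([v for v in train_raw if v is not None])
train_logged = log_transform(impute(train_raw, train_mean))
test_logged = log_transform(impute(test_raw, train_mean))

mu, sigma = fit_scaler(train_logged)
train_scaled = standardize(train_logged, mu, sigma)
test_scaled = standardize(test_logged, mu, sigma)
print(train_scaled)
print(test_scaled)
```

The scaled columns would then feed whichever regressor is being compared, and `rmse` would be evaluated on the held-out 30%.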
Transcript: Opinion Spamming. Reviews have become increasingly important. Most important in influencing the sale of a product. As e-commerce becomes more relevant, detecting fake reviews becomes even more important. Fraud Detection. Refers to "illegal" activities (e.g., writing fake reviews, also called shilling) that try to mislead readers or automated systems by giving undeserved positive/negative opinions to some target entities. Opinion spam has many forms: fake reviews (also called bogus reviews), fake comments, fake blogs, fake social network postings, deceptions, and deceptive messages. Credit Card. Online Auctions. Cellphones. Reviews on E-Commerce Website. Financial Trading. Graphical Approach to Fake Review Detection. Types of Fake Reviews: Garbage reviews - totally irrelevant - troll reviews. Reviews about the brand and not the product. Legitimate-looking reviews trying to glorify a product. Legitimate-looking reviews trying to tarnish a product. Guan Wang, Sihong Xie, Bing Liu, Philip S. Yu, University of Illinois at Chicago, Chicago, USA. Fake Review Detection. Himanshu Jindal. Data Mining and Networks. Fraud Detection
Transcript: Create a list of words. Remove stop words. Stem words. Calculate frequency of each stemmed word. Recovery of information, especially in a database stored in a computer. A given word may occur in a variety of syntactic forms: plurals, past tense, gerund forms. A stem is what is left after a word's affixes (prefixes and suffixes) are removed: ed, s, or, ing, and ion are suffixes; pre and post are prefixes. Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Modern information retrieval. Information retrieval, Chapter 2, by Rajendra Akerkar, Pawan Lingras. Definition: information retrieval. Document representation: using keywords; relative weight of keywords. Query representation: keywords; relative importance of keywords. The process of information retrieval: IR. The two main IR approaches are matching words in the query against the database index (keyword searching) and traversing the database using hypertext or hypermedia links. Keyword searching has been the dominant approach to text retrieval since the early 1960s; hypertext has so far been confined largely to personal or corporate information-retrieval applications. IR: As soon as information archives started being built, so did information retrieval techniques: catalogs, indexes, tables of contents. Porter's Stemming Algorithm:
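The four steps that open this transcript (create a list of words, remove stop words, stem words, calculate the frequency of each stemmed word) can be sketched as below. Note the assumption loudly: the `stem` function here is a naive suffix stripper written for illustration, not Porter's stemming algorithm, and the stop-word list and suffix list are small hypothetical samples.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "for", "was"}

# Suffixes to strip, longest first. A real Porter stemmer applies many more
# rules and conditions; this simplification only removes one suffix.
SUFFIXES = ("ing", "ion", "ed", "s")


def stem(word):
    """Strip one known suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Tokenize, drop stop words, stem, and count each stemmed word."""
    words = re.findall(r"[a-z]+", text.lower())        # step 1: list of words
    kept = [w for w in words if w not in STOP_WORDS]   # step 2: remove stop words
    return Counter(stem(w) for w in kept)              # steps 3 and 4: stem, count


print(term_frequencies("The indexed documents are indexing the index"))
```

Counts like these are one way to obtain the "relative weight of keywords" used for document representation.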
Transcript: DATA MINING REDDIT TO UNDERSTAND STUDENTS' CONCERNS. STALLONE EU. Data mining social media is growing in popularity. An academic journal article from Purdue used several '#engineering' hashtags on Twitter to analyze common grievances of their engineering students. The difference between my research and theirs is that Twitter has a reputation for people portraying themselves as they want others to see them. My data also allows me to get a wider spectrum of students for my data set. RESEARCH