Data Mining: Introduction (Kyoto University)

Janez Demsar

on 2 July 2010

Transcript of Data Mining: Introduction (Kyoto University)

Data

Organization of lectures

This and future posters are available at http://prezi.com/ereeyes0csjn/data-mining-kyoto-university/

I love to be interrupted.

I love getting e-mail: janez.demsar@fri.uni-lj.si

Software: http://www.ailab.si/orange (please install!)

Evaluation and deployment
Supervised modelling
Unsupervised modelling
Visualization
Data

hospitals
government institutions
commercial data - market chains, mobile phone operators, web sites
research (bioinformatics, physics...)
other organizations (ecological modelling etc.)

What shall we do with all these data?!

Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data (Fayyad)

Data mining is the process of extracting previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. (Zekulin)

Data mining is the process of discovering advantageous patterns in the data. (John)

Data mining is a decision support process where we look in large data bases for unknown and unexpected patterns of information. (Parsaye)
Janez Demšar (assistant: Andraž Žagar)

Faculty of Computer and Information Science
University of Ljubljana, Slovenia

Basic programming
Data mining

Computer Science -> Artificial Intelligence -> Machine Learning

medical data mining
genetics and bioinformatics
industrial projects
Area: 20.000 sq. km (Japan: 378.000, 19×Slovenia)
Population: 2 mio (Japan: 127 mio, 64×Slovenia)

GapMinder (http://www.gapminder.org/)

Life expectancy vs. Income (http://www.bit.ly/d4hOUy)
#Cell phones per 100 people (http://www.bit.ly/doYy6k)

See also: Hans Rosling's talk at TED (http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html)

AIDS prevalence
Toy exports
Military spending
Alcohol consumption
Housing prices

More:
http://www.dailymail.co.uk/news/article-439315/How-world-really-shapes-up.html

Guys with data hire the guys with hammers ;)

A department within the company is in charge of making the collected data useful
-- or --
The company (institution, organization) hires (or asks) somebody outside to help make sense of the data.

Either case: two parties that need to understand each other.

The "miner" needs to have a basic understanding of the problem/the business and the data
-- and --
The data owner needs to have a basic understanding of the data mining process and methods.

These lectures:
a "miner" talks to the "data owners".

You will
get the basic knowledge of data mining,
learn to do as much by yourself as you can,
understand what you can expect from a professional.

Machine learning
Mining
Visualization

The main goal of data visualization is to communicate information clearly and effectively through graphical means (Friedman, 2008)

Visualization is
fun to use,
puts the patterns right in front of our eyes.

finding the right visualization

Statistics

Mathematical discipline based on probability calculus.

Features methods for:
observing properties of data (mean, variance, correlation...),
fitting models to data (esp. regression, Bayesian models) to predict values, probabilities of events...,
testing hypotheses.

The emphasis of school curricula is on the latter (sadly!).
See e.g.
Cohen (1994): The Earth is Round (p<0.05) (http://www.ics.uci.edu/~sternh/courses/210/cohen94_pval.pdf)
Gigerenzer (2004): Mindless Statistics (http://courses.umass.edu/bioep740/yr2009/topics/Gigerenzer-jSoc-Econ-1994.pdf)
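
As a small illustration of the first two kinds of methods listed above (observing properties of data and fitting a model), here is a minimal sketch in plain Python. The numbers and the names `hours`, `score`, `pearson` and `fit_line` are made up for this example; they are not data or code from the lecture.

```python
from statistics import mean, variance

# Made-up illustrative numbers (assumption, not lecture data).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / (ssx * ssy) ** 0.5


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Observing properties of the data.
print("mean score:", mean(score))
print("sample variance of score:", variance(score))
print("correlation hours/score:", pearson(hours, score))

# Fitting a model and using it to predict a value.
slope, intercept = fit_line(hours, score)
print("predicted score for 6 hours:", slope * 6 + intercept)
```

Tools such as Orange wrap the same ideas in ready-made components; the hand-written functions only serve to show what is being computed.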

Null-hypothesis significance testing requires a strict procedure:
form the hypothesis in advance,
collect appropriate data,
try to reject the null hypothesis.

Using the data in forming the hypothesis
is strictly prohibited.
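
Why the prohibition matters can be sketched with a small simulation. Everything in it is an assumption made for illustration: the data are pure Gaussian noise, the sample size is 30, there are 20 candidate features, and the constant 2.048 is the two-sided 5% critical t-value for 28 degrees of freedom. The sketch compares testing one feature chosen in advance with testing the feature that looks best in the data.

```python
import random
from statistics import mean

N = 30             # samples per experiment (assumption)
N_FEATURES = 20    # candidate features, all pure noise (assumption)
T_CRIT = 2.048     # two-sided 5% critical t-value for N - 2 = 28 d.o.f.
# |r| above this threshold counts as "significant" at p < 0.05.
R_CRIT = T_CRIT / (T_CRIT ** 2 + (N - 2)) ** 0.5


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / (ssx * ssy) ** 0.5


def simulate(trials=300, seed=0):
    """Return (rate for a pre-chosen feature, rate for the best-looking one)."""
    rng = random.Random(seed)
    fixed_hits = picked_hits = 0
    for _ in range(trials):
        target = [rng.gauss(0, 1) for _ in range(N)]
        rs = [
            abs(pearson([rng.gauss(0, 1) for _ in range(N)], target))
            for _ in range(N_FEATURES)
        ]
        # Hypothesis formed in advance: "feature 0 is related to the target".
        fixed_hits += rs[0] > R_CRIT
        # Hypothesis formed from the data: "the best-looking feature is related".
        picked_hits += max(rs) > R_CRIT
    return fixed_hits / trials, picked_hits / trials


fixed_rate, picked_rate = simulate()
print("false positives, hypothesis fixed in advance:", fixed_rate)
print("false positives, hypothesis picked from data:", picked_rate)
```

For independent noise features one would expect the first rate to stay near 5% and the second to approach 1 - 0.95^20, roughly 64%, even though no real pattern exists in the data.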

Null-hypothesis significance testing is
the opposite of data mining.

So - we get to do all the work?!

Area of artificial intelligence which tries to make computers mimic human reasoning
can handle large amounts of data
uncovers patterns in the data
gives understandable (human brain compatible) patterns

the holy grail of ML is to not let the computer know anything and let it discover everything,
ML is not about discovering new knowledge.

interest in models that optimize predictive accuracy, without any regard for understandability (e.g. SVM, the successor to neural networks)
More interested in imitating and following humans than in assisting and leading them

About lecturer
Data Collections
Heart rate