Data Mining: a (very) short introduction

ECAC/DMCF
by

Carlos Soares

on 17 September 2014

Transcript of Data Mining: a (very) short introduction

Data Mining
linkedin: http://pt.linkedin.com/in/cpsoares
plaxo: search for “carlos soares universidade do porto”
Carlos Soares
classification
new observations are Y/N?
decisions:
it's all about data!
I may be exaggerating a little bit...
maybe I could learn something from previous promotions...
who should I send the promotion to?
... a model relating attributes to result
... and use it to make predictions for a new campaign
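The idea above (learn from previous promotions, then predict for a new campaign) can be sketched roughly as follows. This is only an illustration: it assumes a k-nearest-neighbour classifier, which is one of many possible models, and the customer attributes (age, purchases last year), the records, and the function name `predict` are all made up for the example.

```python
from collections import Counter
import math

# Hypothetical records from previous promotions:
# (age, purchases last year) -> did the customer respond (Y/N)?
past_promotions = [
    ((25, 1), "N"),
    ((30, 2), "N"),
    ((35, 1), "N"),
    ((45, 8), "Y"),
    ((50, 10), "Y"),
    ((55, 7), "Y"),
]

def predict(customer, examples, k=3):
    """Predict Y/N by majority vote among the k closest past customers."""
    by_distance = sorted(examples, key=lambda ex: math.dist(customer, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# New observations for the new campaign: are they Y or N?
new_customers = [(48, 9), (28, 1)]
predictions = [predict(c, past_promotions) for c in new_customers]
print(predictions)
```

In a real project the attributes would normally be rescaled first, since raw Euclidean distance lets the attribute with the largest range dominate.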
so, what is data mining?
“non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

(fayyad, piatetsky-shapiro and smyth, 1996)
statistics/
machine learning
databases
artificial intelligence
data
mining
data mining tasks
and why should I care?...
clustering
applications
bank customers
credit card spending
customized shirts
shirt characteristics
shops according to distribution of customer segments
completed
transactions
association
ongoing
transaction
which one will be bought next?
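One simple way to approach "which one will be bought next?" is to score candidate items by the confidence of the rule "ongoing basket → item", estimated from the completed transactions. The sketch below assumes that approach; the item names, the transactions, and the helper `next_item_scores` are hypothetical.

```python
from collections import Counter

# Hypothetical completed transactions (item names are made up).
completed = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def next_item_scores(ongoing, transactions):
    """Score each candidate item by confidence(ongoing -> item):
    the fraction of completed transactions containing the ongoing
    basket that also contain the candidate item."""
    matching = [t for t in transactions if ongoing <= t]
    if not matching:
        return {}
    counts = Counter(item for t in matching for item in t - ongoing)
    return {item: n / len(matching) for item, n in counts.items()}

# Ongoing transaction: the customer has bread in the basket so far.
scores = next_item_scores({"bread"}, completed)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Real association rule miners also filter rules by minimum support, which this sketch leaves out.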
applications
PortalExecutivo.com
recommendation of content
PalcoPrincipal.pt
recommendation of music for playlists
Bivolino
fashion trend analysis
N
Y
circles are Y
applications
INE
error detection in foreign trade transactions
workflow automation in public organization
route incoming documents
information extraction in shoe company
identify document name in scanned invoices
regression
applications
STCP
predict trip time duration
Retail
space-sales elasticity
(breunig, 2000)
outlier detection
applications
INE
error detection in foreign trade transactions
Foam industry
quality control
http://tinyurl.com/ofazhw4
1997
2012
2004
1990
1995
introduction
methodology
text
relational
data
streams
data
graphs
http://noticias.sapo.pt/interativo/omundovistodaqui/artigo/nuno-santos-vao-se-as-imagens-fi_5355.html
signals
Fleck, S. & Straßer, W., 2010. Privacy Sensitive Surveillance for Assisted Living – A Smart Camera Approach. In H. Nakashima, H. Aghajan, & J. C. Augusto, eds. Handbook of Ambient Intelligence and Smart Environments. Boston, MA: Springer US, pp. 985–1014.
http://www.tony5m17h.net/Matrix.html
r&d@csoares
metalearning
label ranking
machine learning for optimization
spatial/temporal data
http://infovis.orgfree.com/exemplos.html#2
sensors
http://www.infovis.net/printMag.php?lang=2&num=190
Ana Anacleto, Ana Isabel Loureiro, Ana Isabel Marques, Ana Lisa Monte, André Rossi, Artur Aiguzhinov, Aurora Cameirão, Bruno Souza, Carla Carvalho, Carla Rebelo, Carla Silva, Carlos Gomes, Catarina Félix, Cátia Cunha, Cláudio Sá, Diana Soares, Emanuel Matos, Fábio Pinto, Filipe Fortuna, Filipe Pinheira, Filipe Rocha, Geraldine Ribeiro, Hélder Paiva, João Sousa, Jorge Kanda, José Pedro Machado, Luís Gomes, Luís Moreira, Marcos Domingues, Mohammad Nozari, Nelson Cunha, Nereida Moreira, Pedro Abreu, Pedro Machado, Pedro Saleiro, Ricardo Leal, Sérgio Almeida, Sérgio Morais, Sandeep Rajoria, Sónia Sousa, Taciana Gomes, Tiago Cunha, Tiago Pereira, Tiago Vicente and Tomy Rodrigues
with a little help from my friends

machine learning/data mining
[management] information systems
decision support systems
activities

data mining/machine learning
(+web+text)
business intelligence
machine learning + optimization

... AI

cooperation with portuguese and foreign universities
U. Leiden, U. Genebra, U. Masaryk, U.P. Valência, USP, UF Pernambuco, U. Waikato
… and companies
bivolino, SONAE, palco principal, ALERT, flex2000, declarativa, sensoria/paletadeideias, portalexecutivo.com/portal ver, STCP, INE, daimlerchrysler
R&ID and consulting
PROJECTS
TEACHING
RESEARCH
SELECTED RESULTS
offspring
2+8 Ph.D.
14+8 M.Sc.
publications
3 books
few dozen papers
conferences
ECMLPKDD 2012
KDD 2009
outlier detection
regression
knowledge management
recommender systems
spatio/temporal profiles in twitter
analyze diffusion of information from news to twitter
sentiment analysis on twitter
web & text mining
algorithm selection
machine learning algorithms
heuristics/meta-heuristics
model reuse
algorithms
k-NN
decision trees
naive bayes
association rules
neural networks
pre-processing methods
discretization
applications
error detection
fault prediction
other algorithms
classification
clustering
regression
clustering
applications
banking
customized shirts
research question
quantify business value of clusters/clusterings
application
optimize shelf allocation in retail
optimize surgery room scheduling
bus trip time duration
business intelligence
applications
textile
shoes
shop management
applications
monitoring editing processes
official document classification
routing incoming mail in a public institution
task: select customers for a single campaign
goal: maximize value of campaign
classification for targeting
CRISP-DM: CRoss Industry Standard Process for Data Mining
http://www.crisp-dm.org/
tool-independent
SEMMA
http://www.sas.com/offices/europe/uk/technologies/analytics/datamining/miner/semma.html
SAS Enterprise Miner
Others
specific ones
http://datalligence.blogspot.com/2008/12/data-mining-methodologies.html

Essentially equivalent
data mining methodologies
CRISP-DM phases
carlos soares
Goller, Hogg and Kalafatis (2002), “A new research agenda for business
segmentation”, European Journal of Marketing, Vol. 36, No. 1/2, pp. 252-271
e.g., market segmentation with clustering
data mining projects:
the big picture
deployment/integration
prepare deployment
training
model monitoring/maintenance
source: http://it.toolbox.com/blogs/infosphere/spinach-how-a-data-quality-mistake-created-a-myth-and-a-cartoon-character-10166
data are never 100% reliable
e.g. the myth concerning the nutritional value of spinach is due to a misplaced decimal point
80%+ of the time in a data mining project is spent on
understanding the data
cleaning the data
transforming the data
data quality
data mining in the cloud

zementis
http://www.zementis.com/
the future (?)
increasing integration of DM into DBMS
eliminate need to transfer data
faster access

SQL server
http://www.sqlserverdatamining.com/ssdm/
oracle
http://www.oracle.com/technology/products/bi/odm/index.html
DB2 intelligent miner
http://www-01.ibm.com/software/data/iminer/
DBMS & DM
commercial
SAS enterprise miner
http://www.sas.com/technologies/analytics/datamining/miner/
SPSS PASW modeler (formerly clementine)
http://www.spss.com/software/modeling/modeler/
xlminer (excel add-in)
http://www.resample.com/xlminer/
academic (?)
WEKA
http://www.cs.waikato.ac.nz/ml/weka/
R
http://www.r-project.org/
rapidminer
http://rapid-i.com/
more information:
http://www.kdnuggets.com/software/suites.html
tools




compare predictions with truth
correct answers in diagonal
total is number of examples
error rate
e.g., (2+1) / (5+1+2+29) = 8.1%
evaluation measures:
e.g., confusion matrix
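The error-rate arithmetic from the confusion matrix example can be checked with a few lines of code. The four counts are those from the slide's example; the row/column layout (rows as true class, columns as predicted class) is an assumption, since the original figure is not part of the transcript.

```python
# Confusion matrix: rows = true class, columns = predicted class.
# Counts are from the slide's example; the layout is assumed.
confusion = [
    [5, 1],
    [2, 29],
]

# Total is the number of examples; correct answers are on the diagonal.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))

# Error rate = off-diagonal count / total = (2+1) / (5+1+2+29).
error_rate = (total - correct) / total
print(f"{error_rate:.1%}")
```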
manage contact with customers
e.g., period between two contacts with the same customer
budget
reality: multiple campaigns
+ high-performance computing
data mining project phases
e.g. customer segmentation using clustering methods
how (not) to estimate performance?
how to estimate performance!
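The transcript does not say which estimation methods the two slides contrast. A common reading, assumed here, is that measuring error on the training data itself is how not to do it, and measuring error on held-out data is how to do it. The sketch below uses a 1-nearest-neighbour classifier and made-up data with a single numeric attribute; all names and values are hypothetical.

```python
import math

def nearest_label(x, examples):
    """1-nearest-neighbour prediction: copy the label of the closest example."""
    return min(examples, key=lambda ex: math.dist(x, ex[0]))[1]

def error_rate(test, train):
    """Fraction of test examples misclassified by a model built on train."""
    wrong = sum(nearest_label(x, train) != y for x, y in test)
    return wrong / len(test)

# Hypothetical labelled data: one numeric attribute, Y/N label, some noise.
data = [((1.0,), "N"), ((1.8,), "N"), ((3.0,), "N"), ((3.6,), "Y"),
        ((5.0,), "Y"), ((5.8,), "Y"), ((7.0,), "Y"), ((7.8,), "N")]

# Simple holdout split: every other example is kept aside for testing.
train, held_out = data[::2], data[1::2]

# How not to: test on the same data used for training (always 0 for 1-NN).
resubstitution = error_rate(train, train)

# How to: test on data the model has never seen.
holdout = error_rate(held_out, train)

print(resubstitution, holdout)
```

The training-set estimate looks perfect only because 1-NN memorizes its examples; the held-out estimate is the one that says something about new observations. With so few examples a single split is noisy, so cross-validation is the usual refinement.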
DIG