Final Project - Data Mining 4.18.2015

use for presentation
by

Ahmad Torjaman

on 22 April 2015


Transcript of Final Project - Data Mining 4.18.2015

Final Project CIDM_6355_70
Data Mining Methods

About our Dataset..
Dataset Background
Attributes of Fields
Data Preparation
RapidMiner
The data relates to the direct marketing campaigns of a Portuguese bank.
Conclusion:

We have unbalanced classes, and about 90% of attributes are labeled UNKNOWN.

When we test performance, we can't just look at accuracy; we also need to look at precision and recall for YES.
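
To make the point about accuracy concrete, here is a minimal sketch of how precision and recall for the YES class are computed from confusion-matrix counts. The counts below are hypothetical numbers chosen for illustration, not results from our project, and the function name is our own.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for an unbalanced test set of 1,000 clients (illustration only):
# tp = predicted yes and actually yes, fp = predicted yes but actually no,
# fn = predicted no but actually yes, tn = predicted no and actually no.
tp, fp, fn, tn = 30, 20, 70, 880

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision_yes, recall_yes = precision_recall(tp, fp, fn)

print(f"accuracy={accuracy:.2f}")
print(f"precision (yes)={precision_yes:.2f}")
print(f"recall (yes)={recall_yes:.2f}")
```

With these made-up counts the accuracy looks high, while recall for YES shows that most subscribers are missed, which is why accuracy alone is not enough for unbalanced classes.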

We will use RapidMiner as our tool and will exclude six columns, either because they have a lot of UNKNOWN values or because they are not useful.

We will apply three prediction methods:
Naive Bayes, Decision Tree, and Neural Network.

For NN we have to use numerical values, but we have to keep in mind that for some fields numerical values can be misleading.
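
A small sketch of why numerical codes can mislead: integer coding imposes an order on categories that have none, while one-hot coding (one 0/1 column per category) avoids that. This is an illustration in plain Python, not the conversion we ran in RapidMiner; the helper names are hypothetical.

```python
def integer_encode(values, categories):
    """Map each nominal value to its position in the category list."""
    index = {category: i for i, category in enumerate(categories)}
    return [index[value] for value in values]


def one_hot_encode(values, categories):
    """Map each nominal value to a 0/1 vector with a single 1."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


marital_categories = ["married", "divorced", "single"]
sample = ["single", "married", "divorced"]

# Integer codes suggest "single" (2) is "more" than "married" (0), which is meaningless.
integer_codes = integer_encode(sample, marital_categories)

# One-hot vectors treat every category as equally distant from the others.
one_hot = one_hot_encode(sample, marital_categories)

print(integer_codes)
print(one_hot)
```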
Now that the data is uploaded, let's apply Neural Networks, but only after we convert nominal values to numerical values!

We will apply the Naive Bayes and Decision Tree methods later on as well.
Ahmad Torjaman
Abani Khanal
Watban Alshammari
Daisy Leal-Balderaz
by Data Drillers
Background:
Dataset Overview
Project Objectives
Dataset Overview:
Project Objectives:
The database we will be using is bank-full.csv (the dataset covers the period from May 2008 to November 2010).
Number of Instances: 45,211 for bank-full.csv

Number of Attributes: 16, plus an output attribute.

The classification goal is to predict whether the client will subscribe to a term deposit (variable Y).
1 - age (numeric)
2 - job: type of job
(Values: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", and "student")
3 - marital: marital status
(Values: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
4 - education
(Values: "unknown", "secondary", "primary", "tertiary")
5 - default: has credit in default? (binary: "yes", "no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes", "no")
8 - loan: has personal loan? (binary: "yes", "no")
9 - contact: contact communication type (categorical: "unknown", "telephone", "cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of previous marketing campaign (categorical: "unknown", "other", "failure", "success")
17 - y (output variable, desired target): has the client subscribed a term deposit? (binary: "yes", "no")
The marketing campaigns were based on phone calls.
Often, more than one contact with the same client was required in order to assess whether or not the product (bank term deposit) would be subscribed.

Test 1: Missing values. Attributes such as “contact” and “poutcome” had missing values.
Test 2: UNKNOWN values. The original data has many fields with UNKNOWN values, such as “education”, “contact”, and “poutcome”.
Test 3: Usefulness for prediction. Many fields won't be useful for prediction, such as day of month and month. Also, “marital” was not useful for data mining with a neural network because a numerical value could not be assigned to marital status.
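
The UNKNOWN check can be sketched as a per-column share of "unknown" values. The rows below are invented for illustration (they are not taken from bank-full.csv), and `unknown_share` is a hypothetical helper, not a RapidMiner feature.

```python
from collections import Counter


def unknown_share(rows):
    """Return the fraction of rows whose value is "unknown", per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value == "unknown":
                counts[column] += 1
    return {column: counts[column] / len(rows) for column in rows[0]}


# Invented sample rows, for illustration only.
rows = [
    {"education": "tertiary", "contact": "unknown", "poutcome": "unknown"},
    {"education": "unknown", "contact": "cellular", "poutcome": "unknown"},
    {"education": "secondary", "contact": "unknown", "poutcome": "success"},
    {"education": "primary", "contact": "telephone", "poutcome": "unknown"},
]

shares = unknown_share(rows)
for column, share in shares.items():
    print(f"{column}: {share:.0%} unknown")
```

A column with a high share is a candidate for exclusion; where to draw the line is a judgment call.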
Attributes to exclude:

Conclusion:
Data Drillers!
(Uploading Dataset)
Selecting the dataset file to upload
Removing quotes and semicolons..
Selecting the last column, Y, as the label..
Excluding problematic fields..
Upload is done, Dataview screen
Metadata Statistics..
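
Outside RapidMiner, the same upload steps (handle the quotes and semicolons, take the last column as the label, drop excluded fields) could look roughly like the sketch below. It uses Python's standard csv module on a tiny invented sample; the `EXCLUDED` set is an illustrative assumption, not the exact list of columns we removed.

```python
import csv
import io

# Invented sample in the semicolon-separated, quoted layout (not real rows).
RAW = '''"age";"job";"day";"month";"duration";"y"
30;"admin.";12;"jun";120;"no"
41;"student";3;"oct";305;"yes"
'''

EXCLUDED = {"day", "month"}  # assumption for illustration only
LABEL = "y"                  # last column, used as the label


def load(text):
    """Parse the text and return (feature rows, labels) without excluded columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";", quotechar='"')
    features, labels = [], []
    for row in reader:
        labels.append(row[LABEL])
        features.append({name: value for name, value in row.items()
                         if name != LABEL and name not in EXCLUDED})
    return features, labels


features, labels = load(RAW)
print(features)
print(labels)
```

Note that the csv module returns every value as a string, so numeric fields such as age would still need converting before modeling.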
Other attributes such as “campaign”, “pdays” and “previous” were removed, as they would have very little impact on the outcome.
Testing Dataset
Neural Networks
Decision Tree
Naive Bayes
Original Data
Filtered data, modified attribute names
NN Process in RapidMiner
NN Results
The Neural Network analysis shows that the model has a good accuracy of 88.52%.

“True no” class recall and “pred. no” class precision are 94.88% and 92.34%, respectively.

However, “true yes” class recall and “pred. yes” class precision are relatively lower.

Given the nature of the dataset, Neural Networks is not the best data mining method, because the dataset has many nominal values that cannot be correctly converted to numerical values.

Other methods such as Decision Tree and Naive Bayes are more suitable for this kind of data. However, the Neural Network study allows a comparison with the other methods to arrive at the best possible result.
A Neural Network only works with numerical values. Therefore, the nominal values will be converted to numerical values.
Evaluating Results of NN:
This dataset is public and available for research.
Citation: [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

Available at: [pdf] http://hdl.handle.net/1822/14838
[bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt
Final comparison of three methods
Original Dataset
Dataset after excluding
Not useful fields
NB Process in RapidMiner
NB Results
Naive Bayes was successfully used to perform the analysis and performs well on the given dataset.

The accuracy, precision, and recall indicate that the model can be used with good confidence to predict whether or not the product (bank term deposit) would be subscribed.
Evaluating Results of NB:
Original Dataset
After excluding not useful fields
DT Process in RapidMiner
Result of DT
The Decision Tree method did not perform well, as a decision tree could not be produced from the dataset.

The result shows that precision and recall are zero for “pred. yes” and “true yes”, respectively.

This indicates that the model does not perform well. Therefore, this method is not the best for performing the analysis.
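
One possible reading of this result (an assumption on our part; the slides do not confirm it) is that the tree collapsed to a single leaf that always predicts "no". The sketch below shows what such a model would score on a hypothetical unbalanced sample: accuracy stays high while precision and recall for "yes" are both zero. The labels are invented, and `evaluate` is a hypothetical helper.

```python
def evaluate(actual, predicted, positive="yes"):
    """Return accuracy plus precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs),
        # Precision is undefined when nothing is predicted positive; report 0.0.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical unbalanced sample: 12 "yes" and 88 "no" (illustration only).
actual = ["yes"] * 12 + ["no"] * 88
# A model that never predicts "yes".
predicted = ["no"] * len(actual)

scores = evaluate(actual, predicted)
print(scores)
```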


Evaluating Results of DT:
Thank you for watching!