The eight steps are:

1. Defining the problem

2. Collecting data

3. Preparing data

4. Data preprocessing

5. Selecting an algorithm

6. Selecting algorithm training parameters

7. Train and test

8. Evaluate final model
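The eight steps can be sketched end-to-end in a few lines. Everything below (function names, the toy egg-size data, the threshold) is a made-up illustration, not a prescribed implementation:

```python
# A skeletal walk through the steps (every name and number is illustrative).
# Step 1, defining the problem, happens before any code is written; here the
# made-up problem is classifying eggs as "small" or "large" from their weight.

def collect_data():                       # Step 2: gather raw data
    return [(48.0, "small"), (66.0, "large"), (52.0, "small"), (70.0, "large")]

def prepare(rows):                        # Step 3: drop impossible weights
    return [(w, y) for w, y in rows if w > 0]

def preprocess(rows):                     # Step 4: already numeric, nothing to do
    return rows

def train(rows, threshold):               # Steps 5-6: algorithm + its parameter
    # (this trivial "algorithm" ignores the data; a real one would learn from it)
    return lambda w: "large" if w > threshold else "small"

rows = preprocess(prepare(collect_data()))
model = train(rows, threshold=60.0)       # Step 7 would test this on held-out data
print(model(68.0))                        # Step 8: use the chosen final model
```

Steps 7 and 8 are where the iteration happens in practice: if the tested accuracy is poor, you go back and change the parameter, the algorithm, or even the problem definition.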

Data Mining Process

**Week 4**


**Defining the problem**

Step 1

This is the most important step of the data mining process, the driver: it determines the project's success or failure. The project's business problem needs to be understood first. Experts then define the project's objectives, boundaries and requirements from a business perspective, which are then translated into a data mining perspective.

Questions that need to be answered:

What problems are suitable for data-driven modeling?

How do you evaluate the results?

Is it a classification or estimation problem?

What are the inputs and outputs?

**Collecting data**

Step 2

Given a data mining problem, the raw data needed to produce the models is collected.

**Preparing data**

Step 3

Cleaning up the raw data: analyzing and removing inconsistent data, handling missing values, converting the data into the right format, and replacing or removing erroneous data.
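A minimal sketch of the cleaning just described, in plain Python. The weights are made-up, and replacing a missing value with the mean of the observed ones is only one common strategy among several:

```python
from statistics import mean

# Hypothetical raw weights; None marks a missing value, a negative is erroneous.
weights = [52.0, None, 61.5, -4.0, 58.0]

# Remove erroneous (impossible) values first.
valid = [w for w in weights if w is None or w > 0]

# Replace missing values with the mean of the observed ones.
observed = [w for w in valid if w is not None]
fill = mean(observed)
cleaned = [w if w is not None else fill for w in valid]

print(cleaned)  # the -4.0 is gone, the None is filled in
```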

**Chu Wen Hsin (Caryn)**

Syed Muhammad Zeeshan

**Data preprocessing**

Step 4

Converts the data into its final form and simplifies the problem.

**Selecting an algorithm**

Step 5

The chosen algorithm must satisfy the problem's hard constraints and its optimization criteria.

**Selecting algorithm training parameters**

Step 6

Specific parameters must be selected after selecting the algorithm. You may iterate: try different learning parameters on the same algorithm, or select a different algorithm altogether.
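The iterate-over-parameters idea can be sketched as a toy grid search. The tiny nearest-neighbour classifier, the data, and the parameter grid (values of k) are all illustrative assumptions:

```python
# Toy grid search for Step 6: iterate over training parameters (values of k
# for a small k-nearest-neighbour classifier) and keep the best-performing one.

def knn_predict(train, x, k):
    # classify x by majority vote among the k nearest training examples
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

train = [(1.0, "A"), (1.2, "A"), (3.0, "B"), (3.3, "B")]
val = [(1.1, "A"), (3.1, "B")]            # held-out validation examples

best_k, best_acc = None, -1.0
for k in (1, 3):                          # the parameter grid to iterate over
    acc = sum(knn_predict(train, x, k) == y for x, y in val) / len(val)
    if acc > best_acc:                    # keep the best parameter so far
        best_k, best_acc = k, acc

print(best_k, best_acc)
```

The same outer loop extends naturally to the bigger iteration the notes mention: swap `knn_predict` for a different algorithm entirely and compare on the same validation data.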

**Train and test**

Step 7

Perform training and testing to evaluate the "goodness" of the model: it has to be tested to see how well it performs.

**Evaluate final model**

Step 8

The best model, based on its estimated accuracy, is chosen as the final model to be used for future prediction or categorisation.
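Steps 7-8 can be sketched as follows. The dataset and the two candidate threshold "models" are toy assumptions chosen so the example stays self-contained:

```python
# Hold out a test set, estimate each candidate model's accuracy on it,
# and keep the best as the final model.

data = [(w, "large" if w > 60 else "small") for w in range(40, 80)]
train = [ex for i, ex in enumerate(data) if i % 4 != 3]   # 30 training examples
test  = [ex for i, ex in enumerate(data) if i % 4 == 3]   # 10 test examples

# Two candidate models (trivial threshold rules, so `train` goes unused here).
def rule_55(w): return "large" if w > 55 else "small"
def rule_60(w): return "large" if w > 60 else "small"

def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)

candidates = {"threshold-55": rule_55, "threshold-60": rule_60}
best = max(candidates, key=lambda name: accuracy(candidates[name], test))
print(best)  # the candidate with the higher estimated accuracy on the test set
```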

Steps are repeated and iterated to improve the solution: different parameters or algorithms may be tried, or the problem may even be redefined. n iterations normally produce n models.

Problems unsuitable for data mining

The problem has a complete, closed-form mathematical solution, i.e. it can be solved exactly with mathematics.

(e.g. estimating the time it takes a dropped ball to hit the ground given the height)

The problem is well understood and has a good analytical or rule-based solution. There is no point in data mining if the existing rule already produces very good results.

(e.g. classifying eggs as "medium", "large" or "extra-large" by weight)

Problems suitable for data mining

There is no good existing solution and the problem has these conditions:

Lots of data (in hand or collectable)

Not well understood (no clear way to approach the problem)

Can be characterized as an input-to-output relationship

Existing models rely on strong and possibly erroneous assumptions (and thus do not perform well)

How do you evaluate the results?

What evaluation method will you use to measure the model performance?

What level of accuracy would be considered successful?

How will you benchmark the performance of a developed solution?

What existing alternatives will you compare against?

What kind of data will be used to evaluate the various models?

What will the models be used for and how well do they support that purpose?

Classification or estimation problem?

Discrete or continuous outputs?

Discrete outputs - classification problem.

Continuous outputs - estimation problem.

Depending on the desired granularity and the application of the output, there can still be borderline cases; it depends on what you want to achieve.

e.g. predicting the Malaysian currency's value is an estimation problem, but it can be re-cast as a classification problem where all we want to know is whether the value goes up or down one month later.
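The currency example can be made concrete in a few lines (the rates are invented numbers):

```python
# Hypothetical monthly exchange-rate values.
rates = [4.20, 4.25, 4.18, 4.30]

# Estimation framing: predict the next continuous value (a regression target).
# Classification framing: only predict whether the value went up or down.
labels = ["up" if b > a else "down" for a, b in zip(rates, rates[1:])]

print(labels)  # one discrete label per month-to-month transition
```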

What are the inputs and outputs?

Inputs - any data related to the problem that can help to determine the desired output values.

Issues to be aware of

Inputs should be causally related to outputs: the input must cause the output to happen, which is different from mere correlation. Don't use non-causal or unintentionally biased data; unintentionally biased data produces models that do not represent future data.

Inputs must contain enough information to be able to generate the desired output, or else it degrades the accuracy of the final model.

The data set must be representative of the future examples presented to the model, or else it shouldn't be used.

How much data is enough data?

Depends on the problem complexity and the amount of noise.

This is however unknown in actual practice.

The behavior of a learning algorithm is well studied and follows a set pattern.

Every learning algorithm has its own learning curve: accuracy increases with increasing training data size.

When it reaches its optimal performance, the accuracy will no longer improve.

Experimentation allows us to know how fast an algorithm's accuracy will increase initially and when it will reach its optimal.

Gradually reduce the training size and plot part of the learning curve with it.

Try different training sizes and if the trend is increasing, we expect the algorithm to perform better when there's more data.

If the trend is beginning to stabilize, it's an indication that it has reached its optimal performance, so more training data will not improve its performance.

This experiment must be carried out with a validation method rather than a simple train/test split, to reduce statistical variations.
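A sketch of this experiment, assuming a toy 1-nearest-neighbour learner and synthetic data. For brevity it uses a single test set, whereas the notes rightly recommend a proper validation method:

```python
# Train the same learner on growing subsets of the data and watch whether test
# accuracy is still rising (collect more data) or has flattened out (the
# algorithm has reached its optimal performance).

def nn_predict(train, x):
    # 1-nearest-neighbour: copy the label of the closest training example
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Toy dataset: class "A" spans 0.0-4.9, class "B" spans 5.0-9.9.
data = ([(i / 10, "A") for i in range(50)] +
        [(5 + i / 10, "B") for i in range(50)])
test = [(0.5, "A"), (9.5, "B"), (4.94, "A"), (5.06, "B")]

accs = []
for size in (2, 10, 50, 100):                            # increasing sizes
    subset = data[:size // 2] + data[50:50 + size // 2]  # keep both classes
    accs.append(sum(nn_predict(subset, x) == y for x, y in test) / len(test))

print(accs)  # accuracy per training size; a rising trend => more data helps
```

Here the borderline test points are only classified correctly once the training set is dense near the class boundary, so accuracy still improves at the largest size: by the notes' criterion, this learner would benefit from more data.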

Question 1. Explain how you are going to decide whether a given problem is suitable for a data mining solution.

Question 2. Suppose the data provided is the last promotional mail-out records which consist of information about each of the 100000 customers (name, address, occupation, salary) and whether each individual customer responded to the mail (i.e., an attribute indicating “yes” or “no”). You are asked to produce a data mining solution, that is, a model describing the characteristics of customers who are likely, as well as unlikely, to respond to the promotional mail-out. The company could then use this model to target customers who are likely to respond to the next promotional mail-out for the same product.

Discuss the following issues:

• is this problem suitable for data mining solution?

• whether this is a classification or estimation problem.

• what are the inputs and output?

• what is the alternative to producing a model?

• how you will use the data for training a model and evaluating the model?

Question 3. Let's say you are given a set of training data with 50% class "positive" and 50% class "negative", and you have explored several models and selected the best one. You can now use the best model for future prediction. Now you are informed that the future data is likely to have the following class distribution: 90% class "positive" and 10% class "negative". Would you go ahead and use the best model for all future data? Provide a reason for your answer. If your answer is no, you should also provide an alternative solution for predicting the future data.

Question 4. How does one decide whether to collect more data or not in a non-time series data mining task?

Since the class ratio changes from 1:1 to 9:1, the training data and the future data no longer represent each other; this will result in poor performance and high error rates.

The chosen model would then be a very poor choice, as its output would not be reliable for the future data; a new model would have to be produced.
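One concrete way to handle the shifted class distribution is prior correction (an illustrative sketch, not the only valid answer): rescale the model's predicted class probabilities by the ratio of the new priors to the old ones.

```python
# Adapt a model trained on a 50/50 class mix to 90/10 future data by scaling
# its predicted P(class | x) with the ratio of new to old class priors.

old_prior = {"positive": 0.5, "negative": 0.5}   # training distribution
new_prior = {"positive": 0.9, "negative": 0.1}   # expected future distribution

def adjust(probs):
    # probs: the model's predicted P(class | x) under the old priors
    scaled = {c: p * new_prior[c] / old_prior[c] for c, p in probs.items()}
    total = sum(scaled.values())
    return {c: v / total for c, v in scaled.items()}  # renormalize

# A borderline prediction flips once the skewed priors are accounted for.
p = adjust({"positive": 0.4, "negative": 0.6})
print(max(p, key=p.get))
```

The other common alternative is simply to retrain, resampling the training data to match the expected 90/10 distribution.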

Yes: there is a lot of raw data available, no clear analytical way to solve the problem, and the inputs are related to the output. It is a classification problem, since the output is discrete. Inputs: the customer attributes; output: whether the customer responds ("yes" or "no").

The alternative to producing a model would be to pick customers randomly.

Split the data and run a train/test method: use one portion of the data for training and the rest as test data.

What is non-time-series data? Data whose examples have no temporal ordering.

Using the learning-curve graph, it depends on whether the model has reached its optimal performance; that is the point at which we stop collecting more data.

Some issues to deal with while collecting data: its size, accuracy, authority, errors, missing values, attributes, noise, ...

The outputs of a classification problem are categories.

The outputs of an estimation problem are continuous values that might be used for future prediction.

Experimentation - Learning Curve

Algorithm A performs better than algorithm B with a small training data size. However, B is better than A in terms of accuracy with a large training data size.
