### Present Remotely

Send the link below via email or IM

CopyPresent to your audience

Start remote presentation- Invited audience members
**will follow you**as you navigate and present - People invited to a presentation
**do not need a Prezi account** - This link expires
**10 minutes**after you close the presentation - A maximum of
**30 users**can follow your presentation - Learn more about this feature in our knowledge base article

# Decision Tree

No description

by

Tweet## Ranadeep Guha

on 12 November 2012#### Transcript of Decision Tree

A Presentation on Decision Trees By

Ranadeep Guha,

Niravra Kar &

Abik Halder. 2

A computer program is said to learn from experience E with respect to a class of tasks T and performance measure P if its performances at tasks T as measured by P, improves with experience E. Machine Learning 3 Task T : Playing checkers

Performance P : Percent of games won by opponents

Experience E : Gained by playing against itself An Example: Checkers learning Problem 4 Concept learning can be formulated as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples.

Much of learning involves acquiring general concepts from specific training examples. Concept Learning Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set.

Pruning is done until further pruning results in reducing the accuracy of decision tree over the validation set. Reduced Error Pruning (contd.) There are many methods by which a decision tree can be pruned:

Reduced Error Pruning

Rule Post Pruning Pruning the decision tree Use a separate set of examples, called Validation set distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.

Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. How to determine the correct final size of tree

Approach that stops growing the tree earlier, before it reached the point where it perfectly classifies the training data

Approach that first over-fits the data, then post prunes the tree. Ways to avoid over-fitting Precision = fff/ /(f11 +f01 )

Recall = f11 /(f11 + f10 )

Example- let there be 8 batsmen present …. Now the prediction is made and said that there are 7 batsmen. It is found that out of these 7 ,5 are batsmen and 2 were bowlers . So precision is 5/7 and recall is 5/8… ------ACCURACY

------ERROR RATE

------PRECISION

------RECALL A few terms: Splitting Based on Continuous Attributes In Boolean Algebra, a formula is in DNF if it is a disjunction of clauses, where a clause is a conjunction of literals. Also known as Sum of Products.

Example: (A ^ B ^ C) V (B ^ C) Disjunctive Normal Form Decision Tree based Methods

Rule-based Methods

Memory based reasoning

Neural Networks

Naïve Bayes and Bayesian Belief Networks

Support Vector Machines Classification Techniques Step 4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances. Step 3 :contd. Step 3: Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy. Step 2: Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Step1: Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over- fitting to occur. In rule post-pruning, one rule is generated for each leaf node in the tree.

Each attribute test along the path from the root to the leaf becomes a rule antecedent (pre-condition) and the classification at the leaf node becomes the rule consequent (post-condition).

Let us consider an example: Post Pruning (contd.)

Prune each rule by removing any preconditions and check its estimated accuracy.

Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances. Rule Post Pruning

The steps of Post Pruning are:

Infer the decision tree from the training set, and allow over-fitting to occur.

Convert the learned tree into an equivalent set of rules by creating one rule for each path. Rule Post Pruning Reduced Error Pruning Performance When training examples contain random errors/ noise.

When the training data consists of small number of examples. Causes of Over-fitting A hypothesis over-fits the training examples if some other hypothesis that fits the training example less well, actually performs better over the entire distribution of instances.

Given a hypothesis space H, a hypothesis h € H is said to over-fit the training data if there exists some alternative hypothesis h’ € H such that h has smaller error than h’ over the training ex, but h’ has a smaller error than h over the entire distribution of instances. Over-fitting Hence f11 and f00 are two cases which have been accurately predicted and f10 and f01 are the two cases which are predicted with an error….

ACCURACY=(f11 +f00))/(f11 + f10+ f01 + f00)

ERROR RATE=(f01 + f10)/(f11 + f10+ f01 + f00)

Clearly accuracy + error rate=1 Meanings of the terms Suppose let us assume that class 1 is +ve, class 0 is –ve

So f11 means that a +ve output – predicted as +ve

f10 means that a +ve output- predicted as -ve

f01 means that a –ve output- predicted as +ve

f00 means that a –ve output- predicted as -ve

SUPPOSE THERE ARE 2 CLASSES:

CLASS 1

CLASS 0 An Example: DT correctness evaluation Prefer the simplest hypothesis that fits the data.

Justification:

there are fewer short(hence simple) hypotheses than long ones, so it is less likely that one will find a short hypothesis that coincidentally fits the training data.

So, we might believe a 5- node tree is less likely to be a statistical coincidence and prefer this hypothesis over the 500-node hypothesis. Occam’s razor Maintains only a single current hypothesis as it searches the space of decision trees.

No backtracking at any step.

Uses all training examples at each step in the search. Features of ID3 Entropy of rain= ¾* lg (4/3) + ¼* log 4=0.810

Checking IG for Wind:

Entropy(weak)= 3/3* lg(3/3) =0

Entropy(strong)= lg 1 =0

Hence, IG(S sunny, wind) = 0.810 – 0 – 0 = 0.810

Checking IG of Humidity:

Entropy(High)= 1* lg 1 =0

Entropy(Normal)=1/3* lg 3 + 2/3* lg (3/2)=0.917

Hence, IG(S sunny, wind) =0.810 – 0 – 0.917*(3/4)=0.122 Further Calculations Zero because there is no randomness. All the examples that have Humidity= high have Play tennis= No and those having Humidity=low have Play Tennis=yes.

IG (S sunny, Humidity)= 0.324- 0 = 0.324

For Wind:

Sv/S * Entropy(weak)=¾*(2/3*lg 3/2) + 1/3*lg 3=0.687.

Sv/S* Entropy(strong)=1/4* 0=0

IG(S sunny, Wind) = 0.324 – 0.687 = -0.363

Clearly, humidity has a higher IG. Calculations: Since sunny and rain have both positive and negative examples, they have fair degrees of randomness and hence need to be classified further.

For sunny :

As computed earlier, Entropy of sunny= 0.324

Now, we need to find the corresponding humidity and wind for those training examples who have outlook=sunny. Further calculations: Sv/S for each of them are as follows:

sunny- 4/10 (means 4 out of 10 examples have sunny as their outlook)

rain - 4/10

overcast- 2/10

Hence, Information gain of outlook

= 0.970–( 4/10 *0.324*2 + 2/10*0)

= 0.711 Calculations: Entropy:

p log (1/p ) + p log ( 1/p )

Information Gain:

Gain(S,A)= Entropy(S) - ∑ |Sv|/|S| Entropy(Sv)

v->Values(A)

Where

S is the collection, A is a particular attribute.

Values(A): set of all positive values of an attribute A.

Sv: subset of S for which attribute A has value v. Quick recap of formulae Day outlook humidity wind play tennis

D1 sunny high weak no

D2 sunny high strong no

D3 overcast high weak yes

D4 rain high weak yes

D5 rain normal weak yes

D6 rain normal strong no

D7 overcast normal strong yes

D8 sunny high weak no

D9 sunny normal weak yes

D10 rain normal weak yes Play Tennis example: revisited Gain(S,A) is the information provided about the target function value, given the value of some other attribute A.

Example: S is a collection described by attributes including Wind, which can have the values Weak or Strong. Assume S has 14 examples.

Then S=[9+, 5-]

S weak= [6+, 2-]

S strong= [3+, 3-] Information Gain: contd. Greedy algorithm that grows the tree top-down.

Begins with the question "which attribute should be tested at the root of the tree?”

A statistical property called information gain is used. Algorithm ID3 Variations of a core algorithm that employs a top-down, greedy search through the space of possible decision trees.

Examples are Hunt’s Algorithm, CART, ID3, C4.5, SLIQ,SPRINT, Mars. Decision Tree Learning Algorithms

Entropy

GINI Index

Misclassification Error Measures of Node Impurity DT Classification Task Decision trees represent a disjunction(OR) of conjunctions(AND) of constraints on the attribute values of instances,

Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.

Hence, DT represents a DNF. Decision Tree: contd. In Boolean Algebra, a formula is in CNF if it is a conjunction of clauses, where a clause is a disjunction of literals. Also known as Product of Sum.

Example: (A V B V C) ^ (B V C) Conjunctive Normal Form

Goal is to create a model that predicts the value of a target variable based on several input variables. Decision Tree Nominal / Categorical

Ordinal

Continuous Attribute Types Simple to understand and interpret.

Requires little data preparation.

Able to handle both numerical and categorical data.

Possible to validate a model using statistical tests.

Robust

Performs well with large data in a short time. Advantages of DT: Consider each of the decision nodes in the tree to be candidates for pruning.

Pruning a node consists of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. Reduced Error Pruning

ACTUAL

CLASS PREDICTED CLASS Roughly, ID3 search strategy-

Selects in favor of shorter trees over longer ones.

Selects trees that place attributes with highest information gain closest to the root.

ID3 employs preference bias. Inductive bias in decision tree learning Inductive bias is the set of assumptions that together with the

training data, deductively justify the classifications assigned by the

learner to future instances. Inductive bias in decision tree learning

Day Outlook Humidity wind Play tennis

D1 sunny high weak no

D2 sunny high strong no

D8 sunny high weak no

D9 sunny normal weak yes

For Humidity:

Sv/S* Entropy(high)= ¾*0=0

Sv/S* Entropy(low)=1/4*0=0 Further calculations Sv/S for High = 5/10

for normal = 5/10

Hence IG( Humidity)

=0.970 – (5/10*0.970 + 5/10*0.7195)

=0.125

Similarly for Wind, the IG is 0.0910.

Hence, IG(Outlook)=0.7110

IG(Humidity)=0.125

IG(Wind)= 0.0910

Comparing the IG s of the 3 attributes, we find Outlook has got the highest IG(0.7110) Calculations: For Humidity:

The training set has 6 positive and 4 negative examples. Hence entropy = 0.970

Humidity can take 2 values- High [3+,2-]

Normal [4+, 1-]

Entropy of High = 3/5* lg (5/3) + 2/5* lg (5/2)

= 0.970

Entropy of Normal = 1/5* lg 5 + 4/5* lg(5/4)

= 0.7195 Calculations: For Outlook:

The training set has 6 positive and 4 negative examples. Hence entropy

=4/10* lg(10/4) + 6/10* lg(10/6)= 0.970

Outlook can have 3 values- sunny [ 1+, 3- ]

rain [ 3+, 1- ]

overcast.[ 1+ ]

Entropy of sunny= ¼ * lg 4 + ¾* lg (4/3) =0.324

Entropy of rain = ¾ * lg (4/3) + ¼ * lg 4 =0.324

Entropy of overcast= 2/4* lg (2/2) =0 Calculations:

Gain(S, wind) = Entropy(S) – (8/14) Entropy(S weak) – (6/14) Entropy(S strong)

= 0.94 – (8/14) 0.811

= 0.048 Information Gain: contd. Expected reduction in entropy caused by partitioning the example according to a particular attribute.

Gain of an attribute A relative to a collection of example S is defined as-

Gain(S,A)= Entropy(S) - ∑ |Sv|/|S| Entropy(Sv)

v->Values(A)

where

Values(A): set of all positive values of an attribute A.

Sv: subset of S for which attribute A has value v. Information Gain Day outlook humidity wind play tennis

D1 sunny high weak no

D2 sunny high strong no

D3 overcast high weak yes

D4 rain high weak yes

D5 rain normal weak yes

D6 rain normal strong no

D7 overcast normal strong yes

D8 sunny high weak no

D9 sunny normal weak yes

D10 rain normal weak yes A set of training examples

When a node p is split into k partitions (children), the quality of split is computed as,

where, ni = number of records at child i,

n = number of records at node p. Splitting Based on GINI In a more general sense,

Entropy= 0, if all members belong to the same class.

= 1, if collection contains equal no. of positive

and negative examples.

= lies between 0 & 1, if there are unequal no. of

positive & negative examples. More on Entropy Let S is a collection of 14 examples, including 9 positive & 5 negative examples, denoted by [9+, 5-].

Then entropy[9+, 5-]

= -9/14 log(9/14) – 5/14 log(5/14)

= 0.94 An example of Entropy 2- way split

Multi- way split Attribute splitting 7 CNF = Conjunctive Normal Form

DNF = Disjunctive Normal Form A quick recap Each internal node tests an attribute.

Each branch corresponds to an attribute value.

Each leaf node assigns a classification. Decision tree representation Any hypothesis found to approximate the target function well over a sufficiently large set of training examples, will also approximate the target function well over other unobserved examples.

Any hypothesis h is said to be consistent with a set of training examples D of target concept c iff h(x)=c(x) for each training example <x , c(x)> Inductive learning hypothesis Positive Examples: those training examples that satisfy the target function, ie. For which c(x)=1 or TRUE.

Negative Examples: those training examples that do not satisfy the target function, ie. For which c(x)=0 or FALSE. Types of Training Examples The problem of learning an optimal decision tree is known to be NP-complete.

Prone to over-fitting.

There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

For data including categorical variables with different number of levels, information gain in DT are biased in favor of those attributes with more levels. Limitations of DT Solid line : accuracy over training data

Broken line : accuracy over independent set of test examples, not included in training examples Over-fitting This decision tree corresponds to the following expression:

(Outlook = Sunny ^ Humidity = Normal) V (Outlook = Overcast) V (Outlook = Rain ^ Wind = Weak)

As we can see this is actually a disjunction of conjunctions ( DNF ). Play Tennis : contd. Now for Rain[ 3+,1-],

Day Outlook Humidity Wind Play tennis

D4 Rain high weak yes

D5 Rain normal weak yes

D6 Rain normal strong no

D10 Rain normal weak yes Further Calculations There are 3 attributes- Outlook, humidity and Wind

We need to choose one of them as the root of the tree. We make this choice based on the information gain(IG) of each of the attributes.

The one with the highest IG gets to be the root.

The calculations are shown in the following slides. Application of ID3 on Play Tennis Algorithm ID3 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1

Gini = 1 – (0)^2 – (1)^2 = 1-0-1= 0

P(C1) = 1/6 P(C2) = 5/6

Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

P(C1) = 2/6 P(C2) = 4/6

Gini = 1 – (2/6)^2 – (4/6)^2 =0.444 Examples for computing GINI Two-way split

(find best partition of values) Multi-way split For each distinct value, gather counts for each class in the dataset

Use the count matrix to make decisions Categorical Attributes: Computing Gini Index (NOTE: p( j | t) is the relative frequency of class j at node t).

Maximum (1 - 1/nc) when records are equally distributed among all classes,

implying least interesting information

Minimum (0.0) when all records belong to one class,

implying most interesting information GINI Index for a given node t : GINI Index yes Rain [3+,1-] Overcast

[2+] Sunny [1+,3-] Hence outlook is chosen as the root of the decision tree.

The partially formed decision tree is as follows: Partially formed tree Result of Pruning Normal[1+] High[3+] Overcast[2+] Rain[3+,1-] Sunny[1+,3-] It characterizes the impurity of an arbitrary collection of examples. It is a measure of randomness.

Entropy(S)= -p log p - p log p

Where , S is a collection containing positive & negative examples of some target concept.

P is the proportion of positive ex in S.

P is the proportion of negative ex in S. Entropy OR Luxury Sports Family CarType Splitting Based on Nominal Attributes OR Large Medium Small Size Splitting Based on Ordinal Attributes {Large} Size {Small, Medium} Multi-way split: Use as many partitions as distinct values.

Binary split: Divides values into two subsets.

Need to find optimal partitioning. {Small} Size {Medium,

Large} Overcast YES NO NO YES Normal High YES Rain Sunny Wind Humidity Outlook Final Decision Tree GINI (Children)

= 7/12 * 0.408+

5/12 * 0.320

= 0.371 GINI(N1)

= 1 – (5/7)2 – (2/7)2

= 0.408

GINI (N2)

= 1 – (1/5)2 – (4/5)2

= 0.320 Node N2 Node N1 No Yes Splits into two partitions

Effect of Weighing partitions:

Larger and Purer Partitions are sought for. Binary Attributes: Computing GINI Index i

Multi-way split: Use as many partitions as distinct values.

Binary split: Divides values into two subsets.

Need to find optimal partitioning. CarType {Family,

Luxury} {Sports} {Family} {Sports, Luxury} CarType Example of a Decision Tree + + + - - - B? Outlook Outlook Humidity No Yes Yes Thank you! Just another

thing... Questions?

Full transcriptRanadeep Guha,

Niravra Kar &

Abik Halder. 2

A computer program is said to learn from experience E with respect to a class of tasks T and performance measure P if its performances at tasks T as measured by P, improves with experience E. Machine Learning 3 Task T : Playing checkers

Performance P : Percent of games won by opponents

Experience E : Gained by playing against itself An Example: Checkers learning Problem 4 Concept learning can be formulated as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples.

Much of learning involves acquiring general concepts from specific training examples. Concept Learning Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set.

Pruning is done until further pruning results in reducing the accuracy of decision tree over the validation set. Reduced Error Pruning (contd.) There are many methods by which a decision tree can be pruned:

Reduced Error Pruning

Rule Post Pruning Pruning the decision tree Use a separate set of examples, called Validation set distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.

Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. How to determine the correct final size of tree

Approach that stops growing the tree earlier, before it reached the point where it perfectly classifies the training data

Approach that first over-fits the data, then post prunes the tree. Ways to avoid over-fitting Precision = fff/ /(f11 +f01 )

Recall = f11 /(f11 + f10 )

Example- let there be 8 batsmen present …. Now the prediction is made and said that there are 7 batsmen. It is found that out of these 7 ,5 are batsmen and 2 were bowlers . So precision is 5/7 and recall is 5/8… ------ACCURACY

------ERROR RATE

------PRECISION

------RECALL A few terms: Splitting Based on Continuous Attributes In Boolean Algebra, a formula is in DNF if it is a disjunction of clauses, where a clause is a conjunction of literals. Also known as Sum of Products.

Example: (A ^ B ^ C) V (B ^ C) Disjunctive Normal Form Decision Tree based Methods

Rule-based Methods

Memory based reasoning

Neural Networks

Naïve Bayes and Bayesian Belief Networks

Support Vector Machines Classification Techniques Step 4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances. Step 3 :contd. Step 3: Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy. Step 2: Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Step1: Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over- fitting to occur. In rule post-pruning, one rule is generated for each leaf node in the tree.

Each attribute test along the path from the root to the leaf becomes a rule antecedent (pre-condition) and the classification at the leaf node becomes the rule consequent (post-condition).

Let us consider an example: Post Pruning (contd.)

Prune each rule by removing any preconditions and check its estimated accuracy.

Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances. Rule Post Pruning

The steps of Post Pruning are:

Infer the decision tree from the training set, and allow over-fitting to occur.

Convert the learned tree into an equivalent set of rules by creating one rule for each path. Rule Post Pruning Reduced Error Pruning Performance When training examples contain random errors/ noise.

When the training data consists of small number of examples. Causes of Over-fitting A hypothesis over-fits the training examples if some other hypothesis that fits the training example less well, actually performs better over the entire distribution of instances.

Given a hypothesis space H, a hypothesis h € H is said to over-fit the training data if there exists some alternative hypothesis h’ € H such that h has smaller error than h’ over the training ex, but h’ has a smaller error than h over the entire distribution of instances. Over-fitting Hence f11 and f00 are two cases which have been accurately predicted and f10 and f01 are the two cases which are predicted with an error….

ACCURACY=(f11 +f00))/(f11 + f10+ f01 + f00)

ERROR RATE=(f01 + f10)/(f11 + f10+ f01 + f00)

Clearly accuracy + error rate=1 Meanings of the terms Suppose let us assume that class 1 is +ve, class 0 is –ve

So f11 means that a +ve output – predicted as +ve

f10 means that a +ve output- predicted as -ve

f01 means that a –ve output- predicted as +ve

f00 means that a –ve output- predicted as -ve

SUPPOSE THERE ARE 2 CLASSES:

CLASS 1

CLASS 0 An Example: DT correctness evaluation Prefer the simplest hypothesis that fits the data.

Justification:

there are fewer short(hence simple) hypotheses than long ones, so it is less likely that one will find a short hypothesis that coincidentally fits the training data.

So, we might believe a 5- node tree is less likely to be a statistical coincidence and prefer this hypothesis over the 500-node hypothesis. Occam’s razor Maintains only a single current hypothesis as it searches the space of decision trees.

No backtracking at any step.

Uses all training examples at each step in the search. Features of ID3 Entropy of rain= ¾* lg (4/3) + ¼* log 4=0.810

Checking IG for Wind:

Entropy(weak)= 3/3* lg(3/3) =0

Entropy(strong)= lg 1 =0

Hence, IG(S sunny, wind) = 0.810 – 0 – 0 = 0.810

Checking IG of Humidity:

Entropy(High)= 1* lg 1 =0

Entropy(Normal)=1/3* lg 3 + 2/3* lg (3/2)=0.917

Hence, IG(S sunny, wind) =0.810 – 0 – 0.917*(3/4)=0.122 Further Calculations Zero because there is no randomness. All the examples that have Humidity= high have Play tennis= No and those having Humidity=low have Play Tennis=yes.

IG (S sunny, Humidity)= 0.324- 0 = 0.324

For Wind:

Sv/S * Entropy(weak)=¾*(2/3*lg 3/2) + 1/3*lg 3=0.687.

Sv/S* Entropy(strong)=1/4* 0=0

IG(S sunny, Wind) = 0.324 – 0.687 = -0.363

Clearly, humidity has a higher IG. Calculations: Since sunny and rain have both positive and negative examples, they have fair degrees of randomness and hence need to be classified further.

For sunny :

As computed earlier, Entropy of sunny= 0.324

Now, we need to find the corresponding humidity and wind for those training examples who have outlook=sunny. Further calculations: Sv/S for each of them are as follows:

sunny- 4/10 (means 4 out of 10 examples have sunny as their outlook)

rain - 4/10

overcast- 2/10

Hence, Information gain of outlook

= 0.970–( 4/10 *0.324*2 + 2/10*0)

= 0.711 Calculations: Entropy:

p log (1/p ) + p log ( 1/p )

Information Gain:

Gain(S,A)= Entropy(S) - ∑ |Sv|/|S| Entropy(Sv)

v->Values(A)

Where

S is the collection, A is a particular attribute.

Values(A): set of all positive values of an attribute A.

Sv: subset of S for which attribute A has value v. Quick recap of formulae Day outlook humidity wind play tennis

D1 sunny high weak no

D2 sunny high strong no

D3 overcast high weak yes

D4 rain high weak yes

D5 rain normal weak yes

D6 rain normal strong no

D7 overcast normal strong yes

D8 sunny high weak no

D9 sunny normal weak yes

D10 rain normal weak yes Play Tennis example: revisited Gain(S,A) is the information provided about the target function value, given the value of some other attribute A.

Example: S is a collection described by attributes including Wind, which can have the values Weak or Strong. Assume S has 14 examples.

Then S=[9+, 5-]

S weak= [6+, 2-]

S strong= [3+, 3-] Information Gain: contd. Greedy algorithm that grows the tree top-down.

Begins with the question "which attribute should be tested at the root of the tree?”

A statistical property called information gain is used. Algorithm ID3 Variations of a core algorithm that employs a top-down, greedy search through the space of possible decision trees.

Examples are Hunt’s Algorithm, CART, ID3, C4.5, SLIQ,SPRINT, Mars. Decision Tree Learning Algorithms

Entropy

GINI Index

Misclassification Error Measures of Node Impurity DT Classification Task Decision trees represent a disjunction(OR) of conjunctions(AND) of constraints on the attribute values of instances,

Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.

Hence, DT represents a DNF. Decision Tree: contd. In Boolean Algebra, a formula is in CNF if it is a conjunction of clauses, where a clause is a disjunction of literals. Also known as Product of Sum.

Example: (A V B V C) ^ (B V C) Conjunctive Normal Form

Goal is to create a model that predicts the value of a target variable based on several input variables. Decision Tree Nominal / Categorical

Ordinal

Continuous Attribute Types Simple to understand and interpret.

Requires little data preparation.

Able to handle both numerical and categorical data.

Possible to validate a model using statistical tests.

Robust

Performs well with large data in a short time. Advantages of DT: Consider each of the decision nodes in the tree to be candidates for pruning.

Pruning a node consists of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. Reduced Error Pruning

ACTUAL

CLASS PREDICTED CLASS Roughly, ID3 search strategy-

Selects in favor of shorter trees over longer ones.

Selects trees that place attributes with highest information gain closest to the root.

ID3 employs preference bias. Inductive bias in decision tree learning Inductive bias is the set of assumptions that together with the

training data, deductively justify the classifications assigned by the

learner to future instances. Inductive bias in decision tree learning

Day Outlook Humidity wind Play tennis

D1 sunny high weak no

D2 sunny high strong no

D8 sunny high weak no

D9 sunny normal weak yes

For Humidity:

Sv/S* Entropy(high)= ¾*0=0

Sv/S* Entropy(low)=1/4*0=0 Further calculations Sv/S for High = 5/10

for normal = 5/10

Hence IG( Humidity)

=0.970 – (5/10*0.970 + 5/10*0.7195)

=0.125

Similarly for Wind, the IG is 0.0910.

Hence, IG(Outlook)=0.7110

IG(Humidity)=0.125

IG(Wind)= 0.0910

Comparing the IG s of the 3 attributes, we find Outlook has got the highest IG(0.7110) Calculations: For Humidity:

The training set has 6 positive and 4 negative examples. Hence entropy = 0.970

Humidity can take 2 values- High [3+,2-]

Normal [4+, 1-]

Entropy of High = 3/5* lg (5/3) + 2/5* lg (5/2)

= 0.970

Entropy of Normal = 1/5* lg 5 + 4/5* lg(5/4)

= 0.7195 Calculations: For Outlook:

The training set has 6 positive and 4 negative examples. Hence entropy

=4/10* lg(10/4) + 6/10* lg(10/6)= 0.970

Outlook can have 3 values- sunny [ 1+, 3- ]

rain [ 3+, 1- ]

overcast.[ 1+ ]

Entropy of sunny= ¼ * lg 4 + ¾* lg (4/3) =0.324

Entropy of rain = ¾ * lg (4/3) + ¼ * lg 4 =0.324

Entropy of overcast= 2/4* lg (2/2) =0 Calculations:

Gain(S, wind) = Entropy(S) – (8/14) Entropy(S weak) – (6/14) Entropy(S strong)

= 0.94 – (8/14) 0.811

= 0.048 Information Gain: contd. Expected reduction in entropy caused by partitioning the example according to a particular attribute.

Gain of an attribute A relative to a collection of example S is defined as-

Gain(S,A)= Entropy(S) - ∑ |Sv|/|S| Entropy(Sv)

v->Values(A)

where

Values(A): set of all positive values of an attribute A.

Sv: subset of S for which attribute A has value v. Information Gain Day outlook humidity wind play tennis

D1 sunny high weak no

D2 sunny high strong no

D3 overcast high weak yes

D4 rain high weak yes

D5 rain normal weak yes

D6 rain normal strong no

D7 overcast normal strong yes

D8 sunny high weak no

D9 sunny normal weak yes

D10 rain normal weak yes A set of training examples

When a node p is split into k partitions (children), the quality of split is computed as,

where, ni = number of records at child i,

n = number of records at node p. Splitting Based on GINI In a more general sense,

Entropy= 0, if all members belong to the same class.

= 1, if collection contains equal no. of positive

and negative examples.

= lies between 0 & 1, if there are unequal no. of

positive & negative examples. More on Entropy Let S is a collection of 14 examples, including 9 positive & 5 negative examples, denoted by [9+, 5-].

Then entropy[9+, 5-]

= -9/14 log(9/14) – 5/14 log(5/14)

= 0.94 An example of Entropy 2- way split

Multi- way split Attribute splitting 7 CNF = Conjunctive Normal Form

DNF = Disjunctive Normal Form A quick recap Each internal node tests an attribute.

Each branch corresponds to an attribute value.

Each leaf node assigns a classification. Decision tree representation Any hypothesis found to approximate the target function well over a sufficiently large set of training examples, will also approximate the target function well over other unobserved examples.

Any hypothesis h is said to be consistent with a set of training examples D of target concept c iff h(x)=c(x) for each training example <x , c(x)> Inductive learning hypothesis Positive Examples: those training examples that satisfy the target function, ie. For which c(x)=1 or TRUE.

Negative Examples: those training examples that do not satisfy the target function, ie. For which c(x)=0 or FALSE. Types of Training Examples The problem of learning an optimal decision tree is known to be NP-complete.

Prone to over-fitting.

There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

For data including categorical variables with different number of levels, information gain in DT are biased in favor of those attributes with more levels. Limitations of DT Solid line : accuracy over training data

Broken line : accuracy over independent set of test examples, not included in training examples Over-fitting This decision tree corresponds to the following expression:

(Outlook = Sunny ^ Humidity = Normal) V (Outlook = Overcast) V (Outlook = Rain ^ Wind = Weak)

As we can see this is actually a disjunction of conjunctions ( DNF ). Play Tennis : contd. Now for Rain[ 3+,1-],

Day Outlook Humidity Wind Play tennis

D4 Rain high weak yes

D5 Rain normal weak yes

D6 Rain normal strong no

D10 Rain normal weak yes Further Calculations There are 3 attributes- Outlook, humidity and Wind

We need to choose one of them as the root of the tree. We make this choice based on the information gain(IG) of each of the attributes.

The one with the highest IG gets to be the root.

The calculations are shown in the following slides. Application of ID3 on Play Tennis Algorithm ID3 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1

Gini = 1 – (0)^2 – (1)^2 = 1-0-1= 0

P(C1) = 1/6 P(C2) = 5/6

Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

P(C1) = 2/6 P(C2) = 4/6

Gini = 1 – (2/6)^2 – (4/6)^2 =0.444 Examples for computing GINI Two-way split

(find best partition of values) Multi-way split For each distinct value, gather counts for each class in the dataset

Use the count matrix to make decisions Categorical Attributes: Computing Gini Index (NOTE: p( j | t) is the relative frequency of class j at node t).

Maximum (1 - 1/nc) when records are equally distributed among all classes,

implying least interesting information

Minimum (0.0) when all records belong to one class,

implying most interesting information GINI Index for a given node t : GINI Index yes Rain [3+,1-] Overcast

[2+] Sunny [1+,3-] Hence outlook is chosen as the root of the decision tree.

The partially formed decision tree is as follows: Partially formed tree Result of Pruning Normal[1+] High[3+] Overcast[2+] Rain[3+,1-] Sunny[1+,3-] It characterizes the impurity of an arbitrary collection of examples. It is a measure of randomness.

Entropy(S)= -p log p - p log p

Where , S is a collection containing positive & negative examples of some target concept.

P is the proportion of positive ex in S.

P is the proportion of negative ex in S. Entropy OR Luxury Sports Family CarType Splitting Based on Nominal Attributes OR Large Medium Small Size Splitting Based on Ordinal Attributes {Large} Size {Small, Medium} Multi-way split: Use as many partitions as distinct values.

Binary split: Divides values into two subsets.

Need to find optimal partitioning. {Small} Size {Medium,

Large} Overcast YES NO NO YES Normal High YES Rain Sunny Wind Humidity Outlook Final Decision Tree GINI (Children)

= 7/12 * 0.408+

5/12 * 0.320

= 0.371 GINI(N1)

= 1 – (5/7)2 – (2/7)2

= 0.408

GINI (N2)

= 1 – (1/5)2 – (4/5)2

= 0.320 Node N2 Node N1 No Yes Splits into two partitions

Effect of Weighing partitions:

Larger and Purer Partitions are sought for. Binary Attributes: Computing GINI Index i

Multi-way split: Use as many partitions as distinct values.

Binary split: Divides values into two subsets.

Need to find optimal partitioning. CarType {Family,

Luxury} {Sports} {Family} {Sports, Luxury} CarType Example of a Decision Tree + + + - - - B? Outlook Outlook Humidity No Yes Yes Thank you! Just another

thing... Questions?