**Solution**

**MOTIVATION**

**What products are often purchased together?**

What are the subsequent purchases after buying a PC?

What kinds of DNA are sensitive to this new drug?

Can we automatically classify web documents?


What is Data Mining?

Data mining is also called knowledge discovery in databases (KDD)

Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images.

Patterns must be: valid, novel, potentially useful, and understandable

Induction or Prediction

Clustering or Segmentation

Associations

Sequence

Deviation identification

Data Mining Approaches

Based on the approach, the data available, and the goal of the study, we select a data mining method to apply

Methods to choose from include:

Classification

Conceptual clustering

Associations

Sequential patterns

Similar time sequence patterns

Visualization

Data Mining Methods

Uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.

The algorithm terminates when no further successful extensions are found.

This approach suffers from redundant processing: the whole data set must be scanned again for each round of candidate generation.

It is therefore not efficient when there are many frequent patterns:

- Repeated database scans are costly

- A low minimum support threshold produces many candidates, making this worse

Issues in Apriori Approach
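The candidate-generation-and-test loop described above can be sketched as follows. This is a minimal illustrative sketch, not the Apriori reference implementation: it omits the subset-based candidate pruning, and all names are our own.

```java
import java.util.*;

// Minimal Apriori sketch: frequent k-itemsets are joined to form
// (k+1)-item candidates, and every candidate is tested by rescanning
// the whole data set -- the source of the repeated-scan cost.
public class AprioriSketch {

    public static List<Set<String>> apriori(List<Set<String>> transactions, int minSupport) {
        List<Set<String>> frequent = new ArrayList<>();

        // Level 1: frequent single items
        Set<String> items = new TreeSet<>();
        for (Set<String> t : transactions) items.addAll(t);
        List<Set<String>> level = new ArrayList<>();
        for (String i : items) {
            Set<String> c = new TreeSet<>(Collections.singleton(i));
            if (support(transactions, c) >= minSupport) level.add(c);
        }

        // Extend one item at a time until no successful extensions remain
        while (!level.isEmpty()) {
            frequent.addAll(level);
            List<Set<String>> next = new ArrayList<>();
            for (int a = 0; a < level.size(); a++) {
                for (int b = a + 1; b < level.size(); b++) {
                    Set<String> c = new TreeSet<>(level.get(a));
                    c.addAll(level.get(b));
                    // keep only (k+1)-item joins, deduplicated, tested against the data
                    if (c.size() == level.get(a).size() + 1
                            && !next.contains(c)
                            && support(transactions, c) >= minSupport) {
                        next.add(c);
                    }
                }
            }
            level = next;
        }
        return frequent;
    }

    // Counting support means scanning the entire transaction list again
    static int support(List<Set<String>> transactions, Set<String> candidate) {
        int n = 0;
        for (Set<String> t : transactions) if (t.containsAll(candidate)) n++;
        return n;
    }
}
```

Note how every pass over the candidates forces another full scan of the transaction list, which is exactly the redundancy noted above.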

**Applications**

- Retail outlets

- Telecommunications

- Banks

- Insurance: link analysis for fraud

- Medical: symptom analysis


**Market Basket Analysis**

**Affinity Analysis**

E.g.: Amazon, Wal-Mart


Apriori Approach

Overcomes the problems of Apriori: it avoids candidate set generation even for long frequent patterns, and avoids repeated data scans.

FP-Growth

Uses an FP-tree to encode the data set and then extracts the frequent item sets from this tree.

Not efficient for long patterns.

Support is calculated only once the entire data set has been added to the FP-tree.

Traversing the tree to find all possible frequent item sets adds extra overhead, because the same item set is retrieved repeatedly.

Issues in FP-Growth
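The FP-tree encoding described above can be sketched as follows. This is an illustrative sketch under simplifying assumptions: a real FP-tree also drops items below the minimum support and maintains header links for mining, both omitted here.

```java
import java.util.*;

// Minimal FP-tree construction sketch: two scans of the data.
// Scan 1 counts global item frequencies; scan 2 inserts each
// transaction, most frequent items first, so that common prefixes
// share nodes and the data set is scanned only twice in total.
public class FPTreeSketch {

    static class Node {
        String item;
        int count;
        Map<String, Node> children = new HashMap<>();
        Node(String item) { this.item = item; }
    }

    public static Node build(List<List<String>> transactions) {
        // Scan 1: global item frequencies
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> t : transactions)
            for (String i : t) freq.merge(i, 1, Integer::sum);

        // Scan 2: insert transactions in descending-frequency order
        Node root = new Node(null);
        for (List<String> t : transactions) {
            List<String> sorted = new ArrayList<>(t);
            sorted.sort((a, b) -> freq.get(b) - freq.get(a));
            Node cur = root;
            for (String i : sorted) {
                cur = cur.children.computeIfAbsent(i, Node::new);
                cur.count++;  // shared prefixes just increment the count
            }
        }
        return root;
    }
}
```

Because the insertion order depends on global frequencies, transactions added later can invalidate the ordering, which is the weakness the CATS and Can tree variants below address.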

Extends the idea of the FP-tree to improve storage compression, and allows frequent-pattern mining without the generation of candidate item sets.

The aim is to build a CATS tree that is as compact as possible.

FELINE Algorithm with the CATS Tree

Solves the problems and weaknesses of the FELINE algorithm:

Items are arranged according to a canonical order that is unaffected by item frequency.

Insertions, deletions, and/or modifications of transactions have no effect on the ordering of items in the tree.

Canonical-Order Tree (Can Tree)
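The frequency-independent ordering can be sketched as follows. This is illustrative only: we use alphabetical order as the canonical order, which is our assumption, not necessarily the order chosen by the Can tree authors.

```java
import java.util.*;

// Minimal Can-tree-style sketch: items are inserted in a fixed
// canonical order (here: alphabetical), so the tree can be built and
// updated incrementally -- inserting, deleting, or modifying a
// transaction never requires reordering existing nodes.
public class CanTreeSketch {

    static class Node {
        String item;
        int count;
        Map<String, Node> children = new TreeMap<>();
        Node(String item) { this.item = item; }
    }

    final Node root = new Node(null);

    // One-pass, incremental insert; no frequency information needed
    public void insert(Collection<String> transaction) {
        List<String> sorted = new ArrayList<>(transaction);
        Collections.sort(sorted);  // canonical, frequency-independent order
        Node cur = root;
        for (String i : sorted) {
            cur = cur.children.computeIfAbsent(i, Node::new);
            cur.count++;
        }
    }
}
```

The trade-off is compression: with a fixed order, transactions {F, G} and {D, F, G} no longer share a prefix (D sorts first), so the tree can be larger than a frequency-ordered one.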

Requires a large amount of computation to search for common items and mergeable paths during the construction of CATS trees.

Needs extra downward traversals during the mining process.

Needs lots of swapping, merging, and splitting of tree nodes, because items in the tree are arranged according to a frequency-dependent ordering.

Issues with the CATS Tree

The time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows.

We observed that there is a trade-off between exactness of the solution and time complexity.

A more rigorous analysis of the data gives an exact solution but is time consuming, while applying a minimum support factor gives a faster, near-optimal solution.

Requirement

Using a hyper-graph, the graphical data representation was good, but the number of comparisons was equivalent to the Apriori approach due to repeated scanning of the data.

With a tree, we need to scan the data only once, and dynamic changes can be made easily.

Present Investigation

**Requirement and investigation**

Consider the following transactions

Consider support: 3

Scan the data and form a matrix

Step 1:

After pruning

Count the number of '1's in each row and rearrange the rows in ascending order of that count. Also discard any row whose count is less than the support.
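This pruning step can be sketched as follows. The sketch is illustrative: we assume the matrix is stored as a map from each item's row to a 0/1 array over the transactions, which is our representation, not necessarily the one used in the implementation.

```java
import java.util.*;

// Sketch of the implicit-pruning step: count the 1's in each item
// row, discard rows whose count is below the support threshold, and
// return the surviving rows sorted by ascending count.
public class MatrixPrune {

    public static List<String> pruneAndOrder(Map<String, int[]> matrix, int support) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, int[]> e : matrix.entrySet()) {
            int ones = Arrays.stream(e.getValue()).sum();
            if (ones >= support) rows.add(e.getKey());  // keep only frequent rows
        }
        // ascending order of the 1-counts
        rows.sort(Comparator.comparingInt(r -> Arrays.stream(matrix.get(r)).sum()));
        return rows;
    }
}
```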

Step 2:

Now sort the matrix after the pruning is done.

Scan row by row and group consecutive 1's in each column. Also find similar groups of 1's that have the same parent trace and end at the same row. We construct the tree while scanning the rows, as soon as we find a group.

Step 3:

As shown, two groups are constructed, and the groups in columns 3 and 5 are the same. As there is no node except the root node, we construct node(s) as necessary for the groups. Since both groups are similar, we record the number of similar groups and show them in only one node. The node is named "FG", because the consecutive 1's are in rows F and G, and its count is "2", because there are two such groups.

Again there are two identical groups, so we construct one node named "FGD", as the consecutive 1's are in rows F, G, and D. To attach this node, we first check whether any existing node has the same initials, i.e., in our case "F" or "FG"; if so, we set that node as the parent of the new node. One condition is that their parent trace according to the matrix must be the same. Here both nodes' groups start from "F", so their parent is "ROOT".

The groups in columns 4 and 6 are also the same, so we construct a node "FGDE" with count "2" and parent trace ROOT. We then search the tree for any existing node with parent trace "ROOT" whose name matches the new node's initials, and select the node with the maximum number of matching initials as the parent of the new node.
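The attachment rule used in these steps can be sketched as follows. This is illustrative only: group detection from the matrix is assumed already done, and only the "longest matching initials" parent search is shown; names and structure are our assumptions.

```java
import java.util.*;

// Sketch of node attachment: a new group node is placed under the
// existing node whose name is the longest prefix of the new node's
// name; if no node matches, it is attached directly under ROOT.
public class GroupTreeSketch {

    static class Node {
        String name;
        int count;
        List<Node> children = new ArrayList<>();
        Node(String name, int count) { this.name = name; this.count = count; }
    }

    final Node root = new Node("ROOT", 0);

    public void addGroup(String name, int count) {
        // descend while some child's name is a prefix of the new name
        Node parent = root;
        Node probe = root;
        while (probe != null) {
            parent = probe;
            probe = null;
            for (Node c : parent.children) {
                if (name.startsWith(c.name)) { probe = c; break; }
            }
        }
        parent.children.add(new Node(name, count));
    }
}
```

Inserting the groups from the example (FG:2, FGD:2, FGDE:2) produces the chain ROOT → FG → FGD → FGDE.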

Traversal: Once the tree has been constructed, we can obtain the counts of the frequent item sets by traversing the tree in the following manner.

We obtain the count of the frequent item set FG by adding the counts of all the nodes having FG in their names. So we get FG as

FG:2 + FGD:2 + FGDE:2 => FG:6.
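The count above can be reproduced with a small sketch. It is illustrative: we sum over a flat map of node names to counts rather than traversing the actual tree, and the substring match relies on items appearing consecutively in the node names.

```java
import java.util.*;

// Sketch of the traversal count: the support of an item set (e.g. FG)
// is the sum of the counts of every node whose name contains it.
public class TraversalSketch {

    public static int countOf(Map<String, Integer> nodeCounts, String itemset) {
        int total = 0;
        for (Map.Entry<String, Integer> e : nodeCounts.entrySet())
            if (e.getKey().contains(itemset)) total += e.getValue();
        return total;
    }
}
```

With the nodes from the example, FG:2, FGD:2, and FGDE:2 all contain "FG", giving FG a total count of 6.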

Final Answer

CONCLUSION

The frequent item set mining problem can be successfully solved with our new tree-based approach.

Initially, we applied implicit pruning, so item sets with a count less than the support factor were removed.

Then we performed explicit pruning to remove the remaining patterns after traversal.

Here, we observed that after explicit pruning we get optimal answers, i.e., the possible frequent item sets with minimum support.

To get the exact answer, i.e., which item sets occur frequently and with what counts, we adopted a traversal algorithm.

We implemented the procedures in Java using the DefaultMutableTreeNode API.

We extended our approach to synthetic data sets with thousands of transactions.

We observed that our approach can handle as many transactions as the computer's memory allows for storing the tree.

Frequent item set mining is used to find underlying regularities in data. The scope of this work is widely applicable in different areas, such as

- Pattern Recognition

- Big Data Analysis and Web Intelligence

- Classification of web documents

- Association Rule Mining

- Incremental mining

- Basket data analysis, cross-marketing, catalogue design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

**Scope**

Thank you

- Viral Patel (100170107005)

- Harsh Shukla (100170107030)

- Dhyey Shah (100170107048)

- Sharan Raghu (100170107056)
