Frequent Itemset Mining
Transcript of Frequent Itemset Mining
What products were often purchased together?
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
What is Data Mining?
Data mining is also called knowledge discovery in databases (KDD)
Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, images.
Patterns must be: valid, novel, potentially useful, and understandable
Induction or Prediction
Clustering or Segmentation
Data Mining Approaches
Based on the goal of the study, the data available, and the approach, we select a data mining method to apply
Methods to choose from include:
Similar time sequence patterns
Data Mining Methods
Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
The algorithm terminates when no further successful extensions are found.
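As a rough sketch, the candidate-generation loop described above might look like this in Java. The transactions and minimum support below are illustrative, not from the presentation; subset-based candidate pruning is omitted for brevity.

```java
import java.util.*;

// Minimal Apriori sketch: extend frequent k-itemsets by one item at a time,
// then test each batch of candidates against the data with a full scan.
class Apriori {
    static Map<Set<String>, Integer> mine(List<Set<String>> transactions, int minSupport) {
        Map<Set<String>, Integer> frequent = new HashMap<>();
        // Level 1: frequent single items
        Map<Set<String>, Integer> current =
                countCandidates(singletons(transactions), transactions, minSupport);
        while (!current.isEmpty()) {
            frequent.putAll(current);
            // Candidate generation: join frequent k-itemsets into (k+1)-itemsets
            Set<Set<String>> candidates = new HashSet<>();
            List<Set<String>> keys = new ArrayList<>(current.keySet());
            for (Set<String> a : keys)
                for (Set<String> b : keys) {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1) candidates.add(union);
                }
            // Test candidates against the data (one more scan per level)
            current = countCandidates(candidates, transactions, minSupport);
        }
        return frequent;
    }

    static Set<Set<String>> singletons(List<Set<String>> transactions) {
        Set<Set<String>> result = new HashSet<>();
        for (Set<String> t : transactions)
            for (String item : t) result.add(new TreeSet<>(Collections.singleton(item)));
        return result;
    }

    static Map<Set<String>, Integer> countCandidates(Set<Set<String>> candidates,
            List<Set<String>> transactions, int minSupport) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> c : candidates) {
            int n = 0;
            for (Set<String> t : transactions) if (t.containsAll(c)) n++;
            if (n >= minSupport) counts.put(c, n);   // keep only frequent candidates
        }
        return counts;
    }
}
```

Note how the data set is re-scanned once per level in `countCandidates`; this repeated scanning is exactly the inefficiency discussed next.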
This approach suffers from redundant processing: the whole data set must be scanned again for each round of candidate generation.
- Not efficient when there are many frequent patterns
- Repeated database scans are costly
- A low minimum support threshold generates many candidates
Issues in Apriori Approach
- Insurance: link analysis for fraud detection
- Medical: symptom analysis
Market Basket Analysis
FP-Growth overcomes the problems of Apriori: it avoids candidate set generation even when there are long frequent patterns, and it avoids repeated data scans.
It uses an FP-tree to encode the data set and then extracts the frequent item sets from this tree.
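A minimal sketch of the FP-tree encoding step in Java, assuming each transaction is re-sorted by descending item frequency before insertion so that shared prefixes share nodes. Pruning of infrequent items and the mining phase are omitted for brevity.

```java
import java.util.*;

// Minimal FP-tree construction: two scans of the data, one to count item
// frequencies and one to insert transactions into a shared-prefix tree.
class FPTree {
    final Map<String, FPTree> children = new LinkedHashMap<>();
    int count = 0;

    // Insert a transaction whose items are already sorted by descending frequency.
    void insert(List<String> items) {
        FPTree node = this;
        for (String item : items) {
            node = node.children.computeIfAbsent(item, k -> new FPTree());
            node.count++;   // shared prefixes share nodes, so counts accumulate
        }
    }

    static FPTree build(List<List<String>> transactions) {
        // First scan: count global item frequencies
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> t : transactions)
            for (String item : t) freq.merge(item, 1, Integer::sum);
        // Second scan: insert each transaction, items ordered by descending
        // frequency (ties broken lexicographically)
        FPTree root = new FPTree();
        for (List<String> t : transactions) {
            List<String> sorted = new ArrayList<>(t);
            sorted.sort(Comparator.comparing((String s) -> freq.get(s)).reversed()
                    .thenComparing(Comparator.naturalOrder()));
            root.insert(sorted);
        }
        return root;
    }
}
```

Because the tree order depends on global frequencies, it can only be fixed after the whole data set has been counted; this is the root of the issues listed next.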
- Not efficient for long patterns
- Support is calculated only once the entire data set has been added to the FP-tree
- Traversing the tree to find all possible frequent item sets adds extra overhead, because the same item set is retrieved repeatedly
Issues in FP-Growth
The CATS tree extends the idea of the FP-tree to improve storage compression and allows frequent-pattern mining without the generation of candidate item sets.
The aim is to build a CATS tree that is as compact as possible.
FELINE Algorithm with the CATS Tree
The CanTree solves the weaknesses of the FELINE algorithm.
Items are arranged according to some canonical order that is unaffected by item frequency.
Any insertions, deletions, and/or modifications of transactions have no effect on the ordering of items in the tree
Canonical-Order Tree (Can Tree)
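A minimal CanTree insertion sketch in Java, assuming lexicographic order as the canonical order (the paper allows any fixed canonical order). Because the order never depends on item frequencies, incremental insertions never require swapping or restructuring nodes.

```java
import java.util.*;

// Minimal CanTree sketch: every transaction is inserted in one fixed canonical
// (here lexicographic) order, so the tree shape is independent of frequency
// and transactions can be added incrementally without reordering the tree.
class CanTree {
    final Map<String, CanTree> children = new TreeMap<>();
    int count = 0;

    void insert(Collection<String> transaction) {
        List<String> items = new ArrayList<>(transaction);
        Collections.sort(items);   // canonical order, frequency-independent
        CanTree node = this;
        for (String item : items) {
            node = node.children.computeIfAbsent(item, k -> new CanTree());
            node.count++;
        }
    }
}
```

A single data scan suffices: each transaction is sorted locally and inserted, with no global frequency count needed first.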
- Requires a large amount of computation for finding common items and mergeable paths during the construction of CATS trees
- Needs extra downward traversals during the mining process
- Needs lots of swapping, merging, and splitting of tree nodes, because items in the tree are arranged according to a frequency-dependent ordering
Issues with the CATS Tree
The time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows.
We observed that there is a trade-off between exact solutions and time complexity.
We can get an exact solution through a more rigorous, time-consuming analysis of the data, while we can get a faster solution by applying a minimum support factor.
Using a hypergraph, the graphical data representation was good, but the number of comparisons was equivalent to the Apriori approach due to repeated scanning of the data.
With a tree, we needed to scan the data only once, and dynamic changes could be made easily.
Requirement and investigation
Consider the following transactions.
Consider support: 3
Scan the data and form a matrix
Step 1 :
Count the number of '1's in each row and re-arrange the rows in ascending order of that count. Also discard any row whose count is less than the support.
Step 2 :
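Steps 1 and 2 can be sketched in Java as follows. The item rows and transactions in the test are hypothetical, since the slide's transaction table is not reproduced in this transcript.

```java
import java.util.*;

// Sketch of Steps 1-2: given an item-by-transaction 0/1 matrix, discard rows
// whose number of 1's is below the support (implicit pruning), then sort the
// surviving rows in ascending order of that count.
class MatrixPruning {
    static List<Map.Entry<String, boolean[]>> prunedRows(
            Map<String, boolean[]> matrix, int support) {
        List<Map.Entry<String, boolean[]>> rows = new ArrayList<>();
        for (Map.Entry<String, boolean[]> row : matrix.entrySet())
            if (countOnes(row.getValue()) >= support)   // implicit pruning
                rows.add(row);
        // Ascending re-arrangement by the count of 1's
        rows.sort(Comparator.comparingInt(
                (Map.Entry<String, boolean[]> r) -> countOnes(r.getValue())));
        return rows;
    }

    static int countOnes(boolean[] row) {
        int n = 0;
        for (boolean b : row) if (b) n++;
        return n;
    }
}
```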
Now sort the matrix after the pruning is done.
Scan row by row and group consecutive 1's within each column. Also find similar groups of 1's that have the same parent trace and end at the same row. We construct the tree while scanning the rows, as soon as we find a group.
Step 3 :
As shown, two groups are constructed: the groups in columns 3 and 5, which are identical. As there is no node except the root node, we construct node(s) as necessary according to the groups. Since both groups are similar, we record the number of similar groups in a single node. The node is named "FG", because the consecutive 1's are in rows F and G, and its count is "2" because there are two such groups.
Now, as there are again two identical groups, we construct one node named "FGD", since the consecutive 1's are in rows F, G, and D. To attach this node, we first check whether any existing node shares its initials (in our case "F" or "FG"); if so, we set that node as the parent of the new node. One condition is that their parent trace according to the matrix must be the same: here, both nodes' groups start from "F", so their parent is the ROOT.
Now, as the groups in columns 4 and 6 are the same, we construct a node "FGDE" with count "2" and parent trace ROOT. We then search the tree for existing nodes with parent trace "ROOT" whose names match the new node's initials, and select the node with the maximum initial match as the parent of the new node.
Traversal : Once the tree has been constructed, we can get the count of the frequent itemsets by traversing the tree in the following manner.
We can obtain the count of the frequent itemset FG by adding the counts of all the nodes having FG in their names. So we get FG as
FG:2 + FGD:2 + FGDE:2 => FG:6.
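The traversal above can be sketched with the DefaultMutableTreeNode API that the implementation uses. The node payloads ({name, count} pairs) are an assumption made here for illustration; the node names and counts (FG:2, FGD:2, FGDE:2) follow the worked example.

```java
import java.util.Enumeration;
import javax.swing.tree.DefaultMutableTreeNode;

// Sketch of the traversal step: visit every node once and sum the counts of
// all nodes whose name contains the target itemset label.
class Traversal {
    static int supportOf(DefaultMutableTreeNode root, String itemset) {
        int total = 0;
        Enumeration<?> nodes = root.preorderEnumeration(); // visits every node once
        while (nodes.hasMoreElements()) {
            DefaultMutableTreeNode node = (DefaultMutableTreeNode) nodes.nextElement();
            Object payload = node.getUserObject();
            if (payload instanceof Object[]) {          // {name, count} pair, by assumption
                Object[] entry = (Object[]) payload;
                if (((String) entry[0]).contains(itemset))
                    total += (Integer) entry[1];
            }
        }
        return total;
    }
}
```

On the example tree (ROOT -> FG:2 -> FGD:2 -> FGDE:2), summing the nodes containing "FG" yields 6, matching the calculation above.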
The frequent item set mining problem can be successfully solved with our new tree-based approach.
Initially, we applied implicit pruning, so the item sets with a count less than the support factor are removed.
Then we performed explicit pruning to remove the remaining patterns after traversal.
Here, we observed that after explicit pruning we get optimal answers, i.e., the possible frequent item sets with minimum support.
To get the exact answer, i.e., which item sets occur frequently and their counts, we adopt a traversal algorithm.
We implemented the procedures in Java using the DefaultMutableTreeNode API.
We extended our approach to synthetic data sets with thousands of transactions.
We observed that our approach can handle as many transactions as the computer's memory can store for the tree.
Frequent item set mining is used to find underlying regularities in data. This work is widely applicable in different areas, such as:
- Pattern Recognition
- Big Data Analysis and Web Intelligence
- Classification of web documents
- Association Rule Mining
- Incremental mining
- Basket data analysis, cross-marketing, catalogue design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
- Viral Patel (100170107005)
- Harsh Shukla (100170107030)
- Dhyey Shah (100170107048)
- Sharan Raghu (100170107056)