Frequent Itemset Mining

by Sharan Raghu
on 27 April 2014


Transcript of Frequent Itemset Mining

Solution
MOTIVATION
What products were often purchased together?
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?

What is Data Mining?
Data mining is also called knowledge discovery in databases (KDD).
Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images.
Patterns must be valid, novel, potentially useful, and understandable.
Induction or Prediction
Clustering or Segmentation
Associations
Sequence
Deviation identification
Data Mining Approaches
Based on the approach, the available data, and the goal of the study, we select a data mining method to apply.
Methods to choose from include:
Classification
Conceptual clustering
Associations
Sequential patterns
Similar time sequence patterns
Visualization
Data Mining Methods
Uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
The algorithm terminates when no further successful extensions are found.
This approach suffers from redundancy in processing. Whole data set is to be scanned again and again for each set of candidate generation.
This approach is not efficient for lots of frequent patterns.
Repeated database scans costly
Low minimum support threshold
Issues in Apriori Approach
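The candidate-generation-and-scan cycle described above can be made concrete with a short sketch. This is a minimal illustration in Java (the language used for the implementation mentioned later in this presentation); the class and method names are our own, and a full Apriori implementation would additionally prune candidates whose (k-1)-subsets are infrequent.

```java
import java.util.*;

// Minimal sketch of one Apriori pass: join frequent (k-1)-itemsets into
// k-item candidates, then count each candidate's support with a full scan
// of the transactions (this repeated scan is the cost noted above).
public class AprioriStep {

    // Generate k-item candidates by unioning pairs of frequent (k-1)-itemsets.
    static Set<Set<String>> generateCandidates(Set<Set<String>> frequentPrev, int k) {
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> a : frequentPrev) {
            for (Set<String> b : frequentPrev) {
                Set<String> union = new TreeSet<>(a);
                union.addAll(b);
                if (union.size() == k) {
                    candidates.add(union);
                }
            }
        }
        return candidates;
    }

    // Count the support of every candidate with one scan over all transactions.
    static Map<Set<String>, Integer> countSupport(List<Set<String>> transactions,
                                                  Set<Set<String>> candidates) {
        Map<Set<String>, Integer> support = new HashMap<>();
        for (Set<String> t : transactions) {
            for (Set<String> c : candidates) {
                if (t.containsAll(c)) {
                    support.merge(c, 1, Integer::sum);
                }
            }
        }
        return support;
    }
}
```

Because countSupport must be called once per level k, the whole transaction set is re-read for every round of candidate generation, which is exactly the redundancy listed above.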
Applications
- Retail outlets
- Telecommunications
- Banks
- Insurance: link analysis for fraud
- Medical: symptom analysis

Market Basket Analysis
Affinity Analysis
e.g., Amazon, Wal-Mart

Apriori Approach
Overcomes the problems of Apriori by avoiding candidate-set generation even when there are long frequent patterns, and by avoiding repeated data scans.
FP-Growth
Uses an FP-tree to encode the data set and then extracts the frequent itemsets from this tree.
Not efficient for long patterns.
Support is calculated only after the entire data set has been added to the FP-tree.
Traversing the tree to find all possible frequent itemsets adds extra overhead, because the same itemset is retrieved repeatedly.
Issues in FP-Growth
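A minimal Java sketch of the FP-tree encoding mentioned above; the node layout and names are illustrative, and the sketch assumes each transaction's items have already been sorted by descending global frequency before insertion.

```java
import java.util.*;

// Minimal sketch of an FP-tree node and of inserting one transaction.
// Shared prefixes reuse existing nodes; only the counts along the path grow.
class FPNode {
    final String item;
    int count = 0;
    final Map<String, FPNode> children = new HashMap<>();

    FPNode(String item) { this.item = item; }

    // Insert one frequency-ordered transaction starting at this node (the root).
    void insert(List<String> transaction) {
        FPNode current = this;
        for (String item : transaction) {
            FPNode child = current.children.get(item);
            if (child == null) {
                child = new FPNode(item);
                current.children.put(item, child);
            }
            child.count++;
            current = child;
        }
    }
}
```

Mining then works on this compressed tree rather than rescanning the raw data, but as noted above, the support counts become available only after the whole data set has been inserted.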
Extends the idea of the FP-tree to improve storage compression and allows frequent-pattern mining without the generation of candidate itemsets.
The aim is to build a CATS tree that is as compact as possible.
FELINE Algorithm with the CATS Tree
Solves the problems and weaknesses of the FELINE algorithm.
Items are arranged according to some canonical order that is unaffected by item frequency.
Any insertions, deletions, and/or modifications of transactions have no effect on the ordering of items in the tree
Canonical-Order Tree (Can Tree)
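A minimal Java sketch of Can-tree insertion, assuming plain alphabetical order as the canonical order; the class names are illustrative.

```java
import java.util.*;

// Minimal sketch of Can-tree insertion: items are kept in a fixed canonical
// order (here simply alphabetical), so inserting, deleting, or modifying
// transactions never forces the existing tree to be reordered.
class CanTreeNode {
    final String item;
    int count = 0;
    final Map<String, CanTreeNode> children = new HashMap<>();

    CanTreeNode(String item) { this.item = item; }
}

class CanTree {
    final CanTreeNode root = new CanTreeNode(null);

    void insert(Collection<String> transaction) {
        List<String> ordered = new ArrayList<>(transaction);
        Collections.sort(ordered);                 // canonical order, independent of item frequency
        CanTreeNode current = root;
        for (String item : ordered) {
            CanTreeNode child = current.children.computeIfAbsent(item, CanTreeNode::new);
            child.count++;                          // shared prefixes just accumulate counts
            current = child;
        }
    }
}
```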
Requires a large amount of computation for searching common items and mergeable paths during the construction of CATS trees.
Needs extra downward traversals during the mining process.
Needs lots of swapping, merging, and splitting of tree nodes, because items in the tree are arranged according to a frequency-dependent ordering.
Issues with the CATS Tree
The time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows.
We observed that there is a trade-off between exact solutions and time complexity.
We can get an exact solution through a more rigorous, time-consuming analysis of the data, or a faster solution by applying a minimum support factor.
Requirement
Using a hypergraph, the graphical representation of the data was good, but the number of comparisons was equivalent to the Apriori approach because of repeated scanning of the data.
With a tree, we need to scan the data only once, and dynamic changes can be made easily.
Present Investigation
Requirement and investigation
Consider the following transaction data.
Consider a support threshold of 3.
Scan the data and form a matrix
Step 1:
After pruning
Count the number of 1s in each row and rearrange the rows in ascending order of that count. Also discard the rows whose count is less than the support.
Step 2:
Now sort the matrix after the pruning is done.
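The pruning and sorting just described can be sketched in Java as follows. This assumes the scan has already produced a binary matrix with one row per item and one column per transaction; the class and method names are illustrative.

```java
import java.util.*;

// Minimal sketch of Steps 1 and 2: drop every row whose number of 1s is
// below the support threshold (implicit pruning), then sort the surviving
// rows in ascending order of that count.
public class MatrixPruning {

    static List<int[]> pruneAndSort(List<int[]> rows, int support) {
        List<int[]> kept = new ArrayList<>();
        for (int[] row : rows) {
            int ones = 0;
            for (int bit : row) {
                ones += bit;                        // count the 1s in this row
            }
            if (ones >= support) {                  // discard infrequent items
                kept.add(row);
            }
        }
        // Ascending order by the row's count of 1s.
        kept.sort(Comparator.comparingInt(r -> Arrays.stream(r).sum()));
        return kept;
    }
}
```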
Scan row by row and form groups of consecutive 1s within each column. Also find similar groups of 1s that have the same parent trace and end at the same row. We construct the tree while scanning the rows, as soon as a group is found.

Step 3:
As shown, two groups are constructed, and the groups of columns 3 and 5 are the same. Since there is no node except the root node, we construct nodes as necessary according to the groups. Here both groups are similar, so we record the number of similar groups in a single node. The node is named “FG”, since the consecutive 1s belong to rows F and G, and its count is “2”, because there are two such groups.

Again there are two groups, and since they are the same we construct one node named “FGD”, as the consecutive 1s belong to rows F, G, and D. To attach this node, we first check whether any existing node has the same initials, in our case “F” or “FG”; if so, we set that node as the parent of the new node. One condition is that their parent traces according to the matrix must be the same. Here both nodes' groups start from “F”, so their parent trace is “ROOT”.

Now, since the groups of columns 4 and 6 are the same, we construct a node “FGDE” with count “2” and parent trace ROOT. We then search the tree for existing nodes with parent trace “ROOT” whose names match the new node's initials, and select the node with the maximum initial match as the parent of the new node.
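The attachment rule just described (pick the existing node whose name shares the longest initial match with the new group's name, otherwise attach the new node directly under ROOT) can be sketched with the DefaultMutableTreeNode API mentioned in the conclusion. The labels and the findParent helper are our own illustrative additions, and for simplicity this sketch matches names only and leaves out the parent-trace check.

```java
import javax.swing.tree.DefaultMutableTreeNode;
import java.util.Enumeration;

// Minimal sketch of attaching group nodes: a new node such as "FGD" becomes
// a child of the existing node whose label is the longest prefix of its own
// name (here "FG"); if nothing matches, it is attached under ROOT.
public class GroupTree {

    static DefaultMutableTreeNode findParent(DefaultMutableTreeNode root, String newLabel) {
        DefaultMutableTreeNode best = root;
        int bestLength = 0;
        Enumeration<?> nodes = root.breadthFirstEnumeration();
        while (nodes.hasMoreElements()) {
            DefaultMutableTreeNode node = (DefaultMutableTreeNode) nodes.nextElement();
            String label = String.valueOf(node.getUserObject());
            if (node != root && newLabel.startsWith(label) && label.length() > bestLength) {
                best = node;
                bestLength = label.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("ROOT");
        root.add(new DefaultMutableTreeNode("FG"));                         // groups of columns 3 and 5
        findParent(root, "FGD").add(new DefaultMutableTreeNode("FGD"));     // attaches under "FG"
        findParent(root, "FGDE").add(new DefaultMutableTreeNode("FGDE"));   // attaches under "FGD"
    }
}
```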
Traversal: Once the tree has been constructed, we can obtain the counts of the frequent itemsets by traversing the tree in the following manner.
We obtain the count of the frequent itemset FG by adding the counts of all nodes having FG in their names. So we get FG as
FG:2 + FGD:2 + FGDE:2 => FG:6.
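A minimal Java sketch of this traversal, again using DefaultMutableTreeNode; the ItemsetNode helper that pairs a node name with its count is our own illustrative addition.

```java
import javax.swing.tree.DefaultMutableTreeNode;
import java.util.Enumeration;

// Minimal sketch of the traversal: the count of an itemset such as FG is the
// sum of the counts of every node whose name contains "FG".
public class ItemsetCount {

    static final class ItemsetNode {
        final String name;
        final int count;
        ItemsetNode(String name, int count) { this.name = name; this.count = count; }
    }

    static int countOf(DefaultMutableTreeNode root, String itemset) {
        int total = 0;
        Enumeration<?> nodes = root.breadthFirstEnumeration();
        while (nodes.hasMoreElements()) {
            Object value = ((DefaultMutableTreeNode) nodes.nextElement()).getUserObject();
            if (value instanceof ItemsetNode && ((ItemsetNode) value).name.contains(itemset)) {
                total += ((ItemsetNode) value).count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode(new ItemsetNode("ROOT", 0));
        DefaultMutableTreeNode fg = new DefaultMutableTreeNode(new ItemsetNode("FG", 2));
        DefaultMutableTreeNode fgd = new DefaultMutableTreeNode(new ItemsetNode("FGD", 2));
        DefaultMutableTreeNode fgde = new DefaultMutableTreeNode(new ItemsetNode("FGDE", 2));
        root.add(fg); fg.add(fgd); fgd.add(fgde);
        System.out.println(countOf(root, "FG"));   // prints 6: FG:2 + FGD:2 + FGDE:2
    }
}
```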

Final Answer
CONCLUSION
The frequent itemset mining problem can be solved successfully with our new tree-based approach.
Initially, we apply implicit pruning, so itemsets with a count less than the support factor are removed.
Then we perform explicit pruning to remove patterns after traversal.
Here, we observed that after explicit pruning we get optimal answers, i.e., the possible frequent itemsets with minimum support.
To get the exact answer, i.e., which itemsets occur frequently and their counts, we adopt a traversal algorithm.
We implemented the procedures in Java using the DefaultMutableTreeNode API.
We extended our approach to synthetic data sets with thousands of transactions.
We observed that our approach can handle as many transactions as the computer's memory allows for storing the tree.
Frequent itemset mining is used to find underlying regularities in data. This work is widely applicable in different areas, such as:

- Pattern Recognition
- Big Data Analysis and Web Intelligence
- Classification of web documents
- Association Rule Mining
- Incremental mining
- Basket data analysis, cross-marketing, catalogue design, sales campaign analysis, web log (click-stream) analysis, and DNA sequence analysis
Scope
Thank you
- Viral Patel (100170107005)
- Harsh Shukla (100170107030)
- Dhyey Shah (100170107048)
- Sharan Raghu (100170107056)
FREQUENT ITEM SET
APPROACHES
ISSUES IN THESE APPROACHES