Copy of Scaling up for high dimensional data and high speed data str


Er Hardeep

on 16 March 2014

Transcript of Copy of Scaling up for high dimensional data and high speed data str

Scaling up for high dimensional data and high speed data streams
High dimensional data
A data stream is an unbounded sequence of real-time data items that arrive at a very high rate and can be read only once.
Data stream clustering faces many problems because of limits on memory usage and processing speed.
Many data streams are high-dimensional by nature, and high-dimensional data are inherently more complex to cluster.
There is an inherent temporal component to the stream mining process. This is because the data may evolve over time. This behavior of data streams is referred to as temporal locality.
Therefore, a straightforward adaptation of one-pass mining algorithms may not be an effective solution to the task.
Stream mining algorithms need to be carefully designed with a clear focus on the evolution of the underlying data.
Design Challenge
As the volume of data increases, it is no longer possible to process the data efficiently by using multiple passes.
Rather, one can process a data item at most once. This leads to constraints on the implementation of the underlying algorithms.
Therefore, stream mining algorithms typically need to be designed so that the algorithms work with one pass of the data.
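
The one-pass constraint can be illustrated with a small sketch. The class below is not from the presentation; `RunningStats` and `summarize` are hypothetical names, and Welford's method is used here only as a familiar example of a summary that is updated once per item in constant memory.

```python
class RunningStats:
    """Running mean and variance, updated one item at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Each item is seen exactly once and is not stored afterwards.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one item has been seen.
        return self._m2 / self.n if self.n else 0.0


def summarize(stream):
    """Consume an iterable once and return its running statistics."""
    stats = RunningStats()
    for item in stream:
        stats.update(item)
    return stats
```

Because only three numbers are kept, memory use does not grow with the length of the stream, which is the property the one-pass requirement is after.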
General Processing Model
Data stream mining
In a single pass

Pruning Strategies
We can generate the FIXTree via:

For BIFI mining, it is unnecessary to generate and count all nodes. Instead, we try to generate as few nodes of the FIXTree as possible, as long as the BIFI can still be identified.

Search Strategies
Traversing the search space by:
Brute-force: 2^|I| candidate itemsets
Clever use of the basic property of itemsets:
A ⊆ B ⇒ supp(A) ≥ supp(B)
Note 1: All subsets of a known frequent itemset are also frequent.
Note 2: All supersets of a known infrequent itemset are also infrequent.
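
Notes 1 and 2 can be turned into a level-wise search that skips most of the 2^|I| subsets. The sketch below is a generic Apriori-style illustration, not the FIXTree or BIFI method of the presentation; the function names, and the assumption that each transaction is a Python set and that support is an absolute count, are choices made here for the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Level-wise search of the subset lattice with subset-based pruning."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # candidates of size 1
    frequent = {}
    k = 1
    while level:
        # Count support only for the candidates that survived pruning.
        current = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        # Join frequent k-itemsets into candidates of size k + 1.
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Note 2: a candidate with an infrequent subset cannot be frequent,
        # so keep only candidates whose (k - 1)-subsets are all frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent
```

For example, with the transactions `{a,b,c}`, `{a,b}`, `{a,c}`, `{b,c}`, `{a,b,c}` and a minimum support of 3, the search reports the three single items and the three pairs, and it stops after finding that `{a,b,c}` occurs only twice.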

Solution Approach
Discover all bi frequent itemsets in a given transaction database
Traverse the search space (the subset lattice of I) and count the support of each itemset in DB

Terminology and Notations
set of items: I = {i1, i2, …, in}
set of transactions: DB = {T1, T2, …, Tm}, Ti ⊆ I
(k-)itemset: N ⊆ I (|N| = k)
support of itemset N: supp(N)
frequent itemset (fi)
bi frequent itemset (bifi)
set of all frequent (k-)itemsets: FI, FIk
set of all bifi

Sliding Window
To analyze recent data, a sliding window is used. Newly arrived data are analyzed together with summarized versions of previous data. This technique is used in the comprehensive data mining system MAIDS [2].
Old items are removed from the window and replaced by newly arrived items.
Two types of windows
Count-based windows
With count-based windows, the latest N items are stored.
Time-based windows
With time-based windows, only those items that were generated or arrived in the last T units of time are stored [1].
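
The two window types can be sketched as follows. This is a minimal illustration rather than the MAIDS implementation; `CountWindow` and `TimeWindow` are hypothetical names, timestamps are assumed to arrive in non-decreasing order, and an item that is exactly T units old is treated as expired.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps the latest N items."""

    def __init__(self, n):
        self.items = deque(maxlen=n)

    def add(self, item):
        # A bounded deque drops the oldest item automatically when full.
        self.items.append(item)


class TimeWindow:
    """Time-based window: keeps items from the last T units of time."""

    def __init__(self, t):
        self.t = t
        self.items = deque()  # (timestamp, item) pairs, oldest first

    def add(self, timestamp, item):
        self.items.append((timestamp, item))
        self.expire(timestamp)

    def expire(self, now):
        # Remove items whose timestamp is T or more units before `now`.
        while self.items and self.items[0][0] <= now - self.t:
            self.items.popleft()
```

A count-based window always holds at most N items regardless of arrival rate, whereas the size of a time-based window varies with how many items arrive within T.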
