analysis of ordered datasets

by Sujen Shah
on 25 April 2014


Transcript of analysis of ordered datasets

Challenges
Efficient analysis of ordered data sets using MapReduce and Cassandra
Existing System
As a retailer, you need to know not only how many visitors your entire store attracts, but also how many visit a specific product group or department in-store, in order to discover how effective a given product group or category really is at selling. By understanding your conversion rate per department or product group, you can see which of your merchandising decisions have been successful.
We are working on a product based on indoor positioning to help retail stores track customers within their stores. This product can help them generate analytics such as:
footfall
total time spent by customers within a store
time spent within a particular section of a store

We are able to uniquely identify a customer and hence we can generate metrics like visit frequency and visit recency. Retail stores can identify paths traveled by a customer within the store.
Stores can use all this data to:
improve their marketing campaigns
improve their merchandising and the placement of products within a store
manage staffing requirements based on trends in customer visits
run relevant promotions, since they know their customers' visit patterns
Just as we have GPS outdoors, imagine an indoor positioning system (IPS) that can track your position relative to the building you are currently in and answer questions like: where can you get coffee in the mall you're in right now? Or better, how much time do you need to get from your cabin to a meeting room on the other side of your workplace, so you arrive on time?
The same movement-tracking information is just as valuable in use cases like retail shops, recommender systems, high-security conferences, and so on.
Efficiently analysing data on a large scale is vital for data owners to gain useful business intelligence, but it becomes much harder when the incoming data is not sorted. We work with data that is the output of wireless sensors.
The wireless sensors are used to perform triangulation in an indoor environment.
This data first needs to be sorted temporally, which increases the analytical workload tremendously.
Here we propose a solution for centrally managing erratic data collected from disparate sources, such as hundreds of wireless sensors spread across a city as big as Mumbai.
Indoor positioning systems have already been available for several years.
Many successful approaches use Wi-Fi radio signals, since Wi-Fi networks are widely deployed, and commercial solutions are now offered by several companies.

For example:
Apple's parking lot app
Euclid Analytics
IBM and Cisco, inside their own offices
1:
Mapping the floor plan onto a database model so that the required data can be retrieved effectively. Moreover, the same design should be scalable to a future 3D space model.

Solution: We used Neo4j, an open-source graph database. We realized that the floor plan is essentially a graph problem, and mapping it onto a graph model makes it very easy to reason about the indoor space abstractly. The workload consists of repetitive, similar queries, and the expensive joins of MySQL would slow this down, so a NoSQL database is used.
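
As an illustration, here is a minimal sketch of how the floor plan could be modelled with the Spring Data Neo4j annotations listed later under the software requirements. The Zone class, the CONNECTED_TO relationship type and the ZoneRepository finder are illustrative assumptions, not the project's actual schema.

import java.util.Set;

import org.neo4j.graphdb.Direction;
import org.springframework.data.neo4j.annotation.GraphId;
import org.springframework.data.neo4j.annotation.Indexed;
import org.springframework.data.neo4j.annotation.NodeEntity;
import org.springframework.data.neo4j.annotation.RelatedTo;
import org.springframework.data.neo4j.repository.GraphRepository;

// A zone on the floor plan (aisle, department, entrance), stored as a graph node.
@NodeEntity
public class Zone {

    @GraphId
    private Long id;

    @Indexed
    private String name;                 // e.g. "Electronics", "Entrance A"

    // Physical adjacency between zones: every walkable connection becomes a relationship.
    @RelatedTo(type = "CONNECTED_TO", direction = Direction.BOTH)
    private Set<Zone> neighbours;

    // getters and setters omitted for brevity
}

// CRUD and simple finders over Zone nodes, provided by Spring Data Neo4j.
interface ZoneRepository extends GraphRepository<Zone> {
    Zone findByName(String name);
}

With this model, a question like "which path did a customer take through the store?" becomes a traversal over CONNECTED_TO relationships rather than a chain of relational joins, which is exactly why the graph database fits the floor-plan problem.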

2:
In Hadoop, the major bottleneck to efficient processing is the disk access at the various stages of MapReduce. The input to the Map stage is usually a data file, and the Map stage produces another set of intermediate files that are then acted upon by the Reduce stage. All this disk access increases the latency of the system, so finding a way to avoid it would allow the available parallelism to be used more effectively.

Solution:
We propose running batch jobs with Hadoop but storing the data in Cassandra. This keeps latency low during data retrieval and computation. One potential problem to be handled is fault tolerance, but Cassandra takes care of this for end users.
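
For illustration, a minimal Hadoop batch job in this style: it counts observations per zone, a footfall-style metric. The comma-separated input record format and the class names are assumptions, and the sketch reads and writes plain files so it stays self-contained; in the proposed system the input and output would instead go through Cassandra's Hadoop integration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ZoneVisitCount {

    // Assumed input record: one "timestamp,deviceId,zoneId" line per observation.
    public static class VisitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text zone = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 3) {
                zone.set(fields[2].trim());
                ctx.write(zone, ONE);            // one observation in this zone
            }
        }
    }

    // Sums the observations per zone.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text zone, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            ctx.write(zone, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "zone-visit-count");
        job.setJarByClass(ZoneVisitCount.class);
        job.setMapperClass(VisitMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}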

Floor planning with Neo4j
Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

Features:
Storm is simple and can be used with any programming language.
It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queuing and database technologies we are using.
Storm will handle the parallelization, partitioning, and retrying on failures when necessary.

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node.
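
As a sketch of this programming model (not the project's actual bolt), here is a Storm bolt that tags each sensor reading with a one-minute time bucket so that downstream stages receive temporally grouped data; the tuple field names and the bucket size are assumptions.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Assumed input tuple: (deviceId, zoneId, timestamp) from the sensor stream.
public class TimeBucketBolt extends BaseBasicBolt {

    private static final long BUCKET_MS = 60 * 1000L;   // assumed 1-minute buckets

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String deviceId = input.getStringByField("deviceId");
        String zoneId   = input.getStringByField("zoneId");
        long timestamp  = input.getLongByField("timestamp");

        // Truncate the timestamp to the start of its bucket and re-emit the reading.
        long bucket = (timestamp / BUCKET_MS) * BUCKET_MS;
        collector.emit(new Values(deviceId, zoneId, timestamp, bucket));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("deviceId", "zoneId", "timestamp", "bucket"));
    }
}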


Kestrel is a light-weight persistent message queue server written in Scala. It was originally developed by Twitter and is a port of Starling, which is written in Ruby.
Each server handles a set of reliable, ordered message queues, with no cross communication, resulting in a cluster of k-ordered ("loosely ordered") queues. Kestrel is fast, small, and reliable.

Features:
• Fan-out queues: one writer, many readers.
• Durable: Queues are stored in memory for speed, but logged into a journal on disk so that servers can be shut down or moved without losing any data.
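
Because Kestrel speaks the memcache text protocol, any memcache client can act as a producer or consumer. The following minimal sketch uses the spymemcached client; the host, port (22133 is Kestrel's default memcache port), queue names and payload format are assumptions, since the project's actual client code is not shown here.

import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class KestrelQueueDemo {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("localhost", 22133));

        // Producer: a sensor gateway enqueues one reading (SET appends to the queue).
        String reading = "{\"deviceId\":\"ab:cd\",\"zoneId\":\"electronics\",\"ts\":1398412800000}";
        client.set("sensor_readings", 0, reading).get();

        // Consumer: GET dequeues the next item. Reading from "sensor_readings+storm"
        // instead would consume a fan-out copy of the queue, so several independent
        // readers can each see the full stream.
        Object next = client.get("sensor_readings");
        System.out.println("dequeued: " + next);

        client.shutdown();
    }
}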
Neo4j is a robust (fully ACID) transactional property graph database. Due to its graph data model, Neo4j is highly agile and very fast; for connected-data operations it can be orders of magnitude faster than a relational database.
Development tools
Software Requirements
For development purposes, a Linux-based OS is required. The IDE (integrated development environment) used will be Eclipse. Git, a distributed version control system hosted on bitbucket.com, will be used to maintain and develop the system code concurrently. The whole application will be written in Java.

Other software development kits (SDKs) and application programming interfaces (APIs) used are:
1. Hector API for Cassandra.
2. Spring Data Neo4j Java API.

The software required to be installed on the machines is:
1. Cassandra
2. Hadoop
3. Storm
4. Git – distributed revision control and source code management (SCM).
5. Gradle – a build automation tool.

Hardware Requirements
The following are the hardware requirements of the project:
1. Machines with LAN connectivity.
2. Machines should have at least 4 GB of RAM and an Intel Core processor or better.
3. Router to connect to the internet.
4. Back-up power supply in case of power failures.

Product Features
Here are just a few of the real-world metrics we plan to provide:

• CAPTURE RATE - How well do your window displays pull shoppers into the store? Find out your Capture Rate and actively improve it.

• REPEAT VISITOR RATIO - Are most of your visitors regular customers or first-timers? Optimize for the segment that drives the most sales.

• WALKBYS - How much business is walking right by the door? Launch your sales on the days with the highest walk-by traffic.

• VISIT DURATION - How long do your shoppers spend in the store? Are they engaging with staff or waiting in line?

• VISIT FREQUENCY & RECENCY - How often do shoppers come in the door each month? Are your daily deals generating loyal customers?

• ENGAGEMENT & BOUNCE RATES - Are shoppers staying long enough to make a purchase? Or do they "bounce" in just a few minutes? Measure the percentage that stay for a period of time you specify.

Implemented System

Architecture (central server): data from the sensors is pushed into a Kestrel queue, Storm consumes the queue and processes the stream, and the results are stored in HDFS and Cassandra, with the floor plan maintained in Neo4j.
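
A minimal sketch of how this pipeline could be wired together in Storm. The inline SensorQueueSpout stands in for the Kestrel-backed spout used in the real system and just fabricates readings so the example runs on its own; component names and parallelism hints are assumptions.

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class IpsTopology {

    // Stand-in for the Kestrel-backed spout: emits fabricated sensor readings.
    public static class SensorQueueSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private int i = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // In the real system this would dequeue from the Kestrel queue.
            collector.emit(new Values("device-" + (i % 3), "electronics",
                    System.currentTimeMillis()));
            i++;
            Utils.sleep(100);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("deviceId", "zoneId", "timestamp"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("sensor-readings", new SensorQueueSpout(), 1);

        // TimeBucketBolt is the bolt sketched earlier; fields grouping keeps all
        // tuples of one device on the same task.
        builder.setBolt("time-bucket", new TimeBucketBolt(), 4)
               .fieldsGrouping("sensor-readings", new Fields("deviceId"));

        // In the implemented system a further bolt would persist the bucketed
        // readings into Cassandra (see the Hector sketch below); omitted here.

        Config conf = new Config();
        conf.setDebug(true);

        // Local mode for development; StormSubmitter would be used on a cluster.
        new LocalCluster().submitTopology("ips-topology", conf, builder.createTopology());
    }
}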
The Apache Cassandra database helps when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Features:
Decentralized: Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master, as every node can service any request.
Supports replication: Replication strategies are configurable.
Scalability: Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
Fault-tolerant: Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
Tunable consistency: Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle.
MapReduce support: Cassandra has Hadoop integration, with MapReduce support.

Cassandra is in use at Netflix, eBay, Cisco, and other companies with large, active data sets. The largest known Cassandra cluster holds over 300 TB of data across more than 400 machines.
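
As a sketch of how the Hector API (listed under the software requirements) could store sensor readings so that they come back already ordered, each device gets one row and each observation one column named by its timestamp. The keyspace, column family and sample values below are assumptions.

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceQuery;

public class SensorStore {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("ips-cluster", "127.0.0.1:9160");
        Keyspace keyspace = HFactory.createKeyspace("ips", cluster);

        StringSerializer ss = StringSerializer.get();
        LongSerializer ls = LongSerializer.get();

        // Write: row key = device id, column name = timestamp, value = zone id.
        Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
        mutator.addInsertion("device-ab-cd", "sensor_readings",
                HFactory.createColumn(1398412800000L, "electronics", ls, ss));
        mutator.addInsertion("device-ab-cd", "sensor_readings",
                HFactory.createColumn(1398412860000L, "checkout", ls, ss));
        mutator.execute();

        // Read: slice the first 100 observations of one device, already in time order.
        SliceQuery<String, Long, String> query =
                HFactory.createSliceQuery(keyspace, ss, ls, ss);
        query.setColumnFamily("sensor_readings");
        query.setKey("device-ab-cd");
        query.setRange(null, null, false, 100);

        QueryResult<ColumnSlice<Long, String>> result = query.execute();
        for (HColumn<Long, String> column : result.get().getColumns()) {
            System.out.println(column.getName() + " -> " + column.getValue());
        }

        HFactory.shutdownCluster(cluster);
    }
}

Because Cassandra keeps the columns of a row sorted by column name, a slice query returns each device's observations in time order, which is exactly the ordering property the batch analysis needs.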


Project guide:
Prof. Kiran Bhowmick

By:
Nisarg Modi
Sujen Shah

Problems
We are watching your movements :)
Kestrel is connected to Storm through a spout (ISpout implementation) built on Kestrel's Thrift interface.
Reports for Analysis
Future Scope

This project can be extended to other use cases. For example, just as we have implemented it in retail shops, it could be deployed in theme parks, parking lots, and many other indoor locations.

Secondly, just as we have used Wi-Fi sensors for collecting data, Bluetooth (the Bluetooth 4.0 / Low Energy protocol) could also be used once it is widely supported in consumer devices.
Also, by installing more sensors in a particular block, a 3D model of the whole floor plan can be generated.

According to a survey, around 40% of users forget to turn off their Wi-Fi when they leave their homes. This number will only increase as more shops, restaurants, and other venues start providing free Wi-Fi, which will help give us more accurate analytics.