Big Data


Supreet Singh, 3 December 2013


Transcript of Big Data

Big Data is not just data
Map Reduce

What is Big Data?
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

Who's generating Big Data?
  • Social media and networks (all of us are generating data)
  • Scientific instruments (collecting all sorts of data)
  • Mobile devices (tracking all objects all the time)
  • Sensor technology and networks (measuring all kinds of data)

The speed at which the data is flowing.
Stream Computing
- Perform analytics on a volume and variety of data while it is still in motion.
Late decisions mean missing opportunities.
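The idea of analytics on data in motion can be sketched in a few lines: process records one at a time as they arrive, keeping only a running aggregate instead of storing the whole stream. This is a minimal illustrative sketch, not a real stream-computing engine; the sensor readings are made up.

```python
# Minimal sketch of stream-style analytics: maintain a running aggregate
# over values as they arrive, never holding the full stream in memory.

def running_average(stream):
    """Yield the mean of all values seen so far, one value at a time."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count

# Example: a (hypothetical) sensor feed arriving value by value.
readings = [10, 20, 30, 40]
print(list(running_average(readings)))  # [10.0, 15.0, 20.0, 25.0]
```

Because the generator only keeps `total` and `count`, memory use stays constant no matter how long the stream runs, which is the point of analyzing data while it is still in motion.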

Data volumes are becoming unmanageable.
Data Volume
A 44x increase from 2009 to 2020: from 0.8 zettabytes to 35 zettabytes.
Data volume is increasing exponentially.
Various formats, types, and structures:
text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc…
Static data vs. streaming data.
A single application can be generating/collecting many types of data.
Harnessing Big Data
Online Transaction Processing (DBMSs)
Online Analytical Processing (Data Warehousing)
Real-Time Analytics Processing (Big Data architecture & technology)

RDBMSs and SQL are no longer efficient for this data.

Scaling up with multi-core processors and large amounts of memory drives up the cost of the system.
Traditional Machine Learning
Big Data is unstructured.
It is different from the usual corporate data that can be stored in a database.
It includes emails, web logs, instant messages, presentations, documents, images, audio files, and Flash content.

New architectures, algorithms, and techniques are needed.

Parallel Processing / MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Simple View
Map Reduce
Input: key-value pairs
Output: key-value pairs
map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
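The signatures above can be made concrete with the classic word-count example, sketched here as a single-process Python toy (the grouping step stands in for the framework's shuffle phase; names like `map_fn` and `reduce_fn` are our own, not a real API):

```python
from collections import defaultdict

# map: (document_id, text) -> list of (word, 1) pairs
def map_fn(_key, text):
    return [(word, 1) for word in text.split()]

# reduce: (word, [counts]) -> [(word, total)]
def reduce_fn(word, counts):
    return [(word, sum(counts))]

def map_reduce(inputs):
    # Shuffle phase: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: merge all values associated with the same key.
    result = []
    for k2, values in sorted(groups.items()):
        result.extend(reduce_fn(k2, values))
    return result

docs = [(1, "big data is not just data")]
print(map_reduce(docs))
# [('big', 1), ('data', 2), ('is', 1), ('just', 1), ('not', 1)]
```

In a real system the map and reduce calls would run on many machines, but the data flow is exactly this: map emits intermediate pairs, the framework groups them by key, and reduce merges each group.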

Google first used MapReduce with GFS (the Google File System).
Hadoop uses MapReduce with HDFS.
It is also used at Facebook, Twitter, LinkedIn, and Netflix.
Mining Massive Datasets
"MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat,
http://research.google.com/en//archive/mapreduce-osdi04.pdf
Video lectures (MOOC): Machine Learning - Map Reduce and Data Parallelism
Apache Hadoop - Petabytes and Terawatts, by LinkedInTalks

Fraud detection
Weather/storm prediction
Healthcare industry - detecting the side effects of drugs
Multi-target prediction
Traditional methods in machine learning and statistics provide data-driven models for predicting one-dimensional targets, such as binary outputs in classification and real-valued outputs in regression.
We need multi-target prediction for recommender systems and for tag prediction on images, videos, and music.
This requires suitable algorithms, infrastructure, and data.
There is scope for improvement in storage infrastructure and in how the data is processed.
One direction is training machine learning classifiers in a streaming fashion.
What’s driving Big Data?
Supreet Singh

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
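The automatic parallelization described above can be loosely mimicked on a single machine with Python's `multiprocessing` pool, which likewise partitions the input, schedules tasks across workers, and hides those details from the programmer. This is only an analogy sketch: a real MapReduce run-time additionally handles machine failures and inter-machine communication across a cluster.

```python
from multiprocessing import Pool

# The "map" task each worker runs; in a real cluster the framework
# would ship this function out to many machines.
def square(x):
    return x * x

if __name__ == "__main__":
    # The pool partitions the input, schedules it across 4 worker
    # processes, and collects the results back in order.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The programmer writes only the per-element function; partitioning, scheduling, and result collection are handled by the pool, which is the same division of labor the MapReduce run-time provides at cluster scale.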
