Transcript of Big Data
Big Data is not just data
What is Big Data?
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Who’s generating Big Data?
Social media and networks
(all of us are generating data)
(collecting all sorts of data)
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
The speed at which the data is flowing.
- Perform analytics on the volume and variety of data while it is still in motion.
Late decisions lead to missed opportunities
Data volumes are becoming unmanageable
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
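As a rough check on the figures above, the implied yearly growth can be worked out with a few lines of Python. This is only a sketch: it assumes the growth compounds evenly over the 11 years from 2009 to 2020, which the slides do not state.

```python
# Figures taken from the slide: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009  # 11 years

# Overall growth factor (the slide rounds this to 44x).
overall_factor = end_zb / start_zb

# Assumption: constant compound growth, so annual_factor ** years == overall_factor.
annual_factor = overall_factor ** (1 / years)

print(f"overall growth: {overall_factor:.2f}x")
print(f"implied annual growth: {(annual_factor - 1) * 100:.0f}% per year")
```

Under that assumption the volume grows by roughly 41% every year, which is what "exponentially" means here: a fixed percentage increase per year rather than a fixed amount.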
Various formats, types, and structures
Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
Static data vs. streaming data
A single application can be generating/collecting many types of data
Harnessing Big Data
Online Transaction Processing (DBMSs)
Online Analytical Processing (Data Warehousing)
Real-Time Analytics Processing (Big Data architecture and technology)
RDBMSs and SQL are no longer efficient for this kind of data.
Adding computational power through multi-core processors and more memory drives up the cost of the system.
Traditional Machine Learning
Big Data is unstructured.
It differs from the usual corporate data that can be stored in a database.
It includes emails, web logs, instant messages, presentations, documents, images, audio files, and Flash content.
New architectures, algorithms, and techniques are needed
Parallel Processing / MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Input: key-value pairs
Output: key-value pairs
map (k1,v1) -> list(k2,v2)
reduce (k2,list(v2)) -> list(v2)
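The map and reduce signatures above can be illustrated with the classic word-count example. The sketch below is a single-process simulation in plain Python, not Hadoop or Google's implementation; the names `map_fn`, `reduce_fn`, and `run_mapreduce` and the sample documents are made up for illustration.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the text."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(word, counts):
    """reduce(k2, list(v2)) -> list(v2): sum all counts seen for one word."""
    return [sum(counts)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical sequential driver: map, group by key, then reduce."""
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)  # group values by intermediate key
    return {k2: reduce_fn(k2, values) for k2, values in intermediate.items()}

# Illustrative input: (document id, document text) pairs.
documents = [
    ("doc1", "big data is not just data"),
    ("doc2", "Big Data is unstructured"),
]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The grouping step in the middle is what a real framework performs across machines; the user only supplies the two functions.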
Google first used MapReduce with GFS
Hadoop uses MapReduce with HDFS
Facebook, Twitter, LinkedIn, Netflix.
Mining Massive Datasets
MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat
MOOC video lectures - Machine Learning: MapReduce and Data Parallelism
Apache Hadoop - Petabytes and Terawatts by LinkedInTalks
Healthcare industry - to discover the side effects of drugs
Traditional methods in machine learning and statistics provide data-driven models for predicting one-dimensional targets, such as binary outputs in classification and real-valued outputs in regression.
Multi-target prediction is needed for recommender systems and for tag prediction of images, videos, and music.
Suitable algorithms, infrastructure, and data.
Scope for improvement in storage infrastructure and data processing.
Training machine learning classifiers in a streaming fashion.
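One way to picture training on a stream is an online learner that sees each example once and updates its model immediately, so the full data set never has to fit in memory. The slides do not name an algorithm; the sketch below uses a simple perceptron as a stand-in, and the function names and toy data are hypothetical.

```python
def predict(weights, bias, features):
    """Return +1 or -1 depending on the sign of the linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0 else -1

def train_on_stream(stream, num_features, learning_rate=1.0):
    """Perceptron updates, one example at a time; labels must be +1 or -1."""
    weights = [0.0] * num_features
    bias = 0.0
    for features, label in stream:
        if predict(weights, bias, features) != label:
            # Mistake-driven update: nudge the model toward the true label.
            weights = [w + learning_rate * label * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * label
    return weights, bias

def example_stream():
    """Hypothetical toy stream: label is +1 when the first feature dominates."""
    samples = [([2.0, 0.0], 1), ([0.0, 2.0], -1),
               ([3.0, 1.0], 1), ([1.0, 3.0], -1)]
    for _ in range(5):
        yield from samples

weights, bias = train_on_stream(example_stream(), num_features=2)
print(weights, bias)
```

Because `train_on_stream` only iterates over a generator, the same loop would work unchanged on data arriving from a socket or a message queue.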
What’s driving Big Data?
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
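The division of labour described above (the run-time partitions the input and schedules the work, the programmer only writes the per-record logic) can be sketched on a single machine. This is an assumption-laden stand-in: a thread pool plays the role of the cluster, there is no failure handling, and `partition`, `map_task`, and `run_parallel` are illustrative names rather than any framework's API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, num_parts):
    """Split the input into roughly equal slices, one per map task."""
    return [records[i::num_parts] for i in range(num_parts)]

def map_task(lines):
    """One map task: count the words in its own slice of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def run_parallel(records, num_workers=3):
    parts = partition(records, num_workers)
    # The pool stands in for the run-time system that schedules the tasks.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(map_task, parts))
    # Reduce step: merge the per-task results by key.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

lines = [
    "big data is not just data",
    "data volume is increasing",
    "static data vs streaming data",
    "big data is unstructured",
]
totals = run_parallel(lines)
print(totals.most_common(3))
```

Note that `map_task` contains no threading code at all; the parallelism lives entirely in `run_parallel`, which mirrors how the framework hides distribution from the programmer.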