Transcript of Big Data : Definition and basic concepts
2.7 zettabytes of data exist in the digital universe today.
Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
More than 5 billion people worldwide are calling, texting, tweeting, and browsing on mobile phones.
Google processes 20,000 terabytes of data a day.
YouTube users upload 48 hours of new video every minute.
100 terabytes of data are uploaded to Facebook daily.
30 billion pieces of content are shared on Facebook every month.
40,000 search queries are made on Google every second.
Big Data : Motivations
What is Big Data ?
Characteristics of Big Data
Retail transaction logs
Vehicle GPS traces
Social Data (Facebook, Twitter, etc.)
Big Data is similar to "small data", but it has other characteristics that set it apart:
Production of data every moment
Structured and unstructured data
Data in real time
Production of data every moment
Facebook ingests 500 terabytes of new data every day
Clickstreams and ad impressions capture user behavior at millions of events per second.
Infrastructure and sensors generate massive log data in real time
Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
Big Data isn't just numbers, dates, and strings. It is also geospatial data, 3D data, audio and video, and unstructured text (log files, data derived from social media, etc.).
RDBMS (Oracle, MySQL, PostgreSQL, etc.)
Column-store (Vertica, MonetDB)
Key-value stores (MongoDB, Bigtable, Cassandra, etc.)
Graph DB (Neo4j, etc.)
Big Data problems
Data storage, search, etc.
A global clock
The MapReduce Model
GFS file system
The Hadoop project
A programming model based on two functions, Map and Reduce, published by a Google research team in 2004.
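The two functions can be illustrated with a classic word-count sketch in plain Python (the function names here are illustrative, not any framework's API): Map emits key/value pairs, the framework groups values by key, and Reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user's Map function to every input record and collect
    # all of the (key, value) pairs it emits.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    # Group intermediate values by key; on a real cluster the framework
    # does this while moving data between Map and Reduce nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Apply the user's Reduce function once per key group.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count: Map emits (word, 1); Reduce sums the counts per word.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

lines = ["big data big ideas", "data beats ideas"]
counts = reduce_phase(shuffle(map_phase(lines, wc_map)), wc_reduce)
print(counts)
```

The appeal of the model is that `map_fn` and `reduce_fn` are the only parts the programmer writes; everything else can be parallelized by the framework.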
Distributed File System
Based on a master/slave architecture
Large Data Files
Replication of data
Created in 2005 to support distributed search for the Nutch engine project.
The project was funded by Yahoo.
Yahoo gave the project to the Apache Software Foundation in 2006.
It implements many frameworks and services:
Implements a GFS-like file system
Based on a NameNode, DataNodes, and a Secondary NameNode
The NameNode stores metadata
DataNodes store the physical data
The Secondary NameNode periodically checkpoints the NameNode's metadata
Configurable block size
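The NameNode/DataNode split can be sketched as a toy in-memory model (all names, the 4-byte block size, and the round-robin placement are illustrative simplifications; real HDFS defaults to 128 MB blocks, 3 replicas, and rack-aware placement):

```python
BLOCK_SIZE = 4     # tiny block size for the demo only
REPLICATION = 2    # copies kept of each block

datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}  # block_id -> bytes
namenode = {}  # filename -> list of (block_id, [datanode names])

def put(filename, data):
    # Split the file into fixed-size blocks and store each block on
    # REPLICATION DataNodes; only metadata goes to the "NameNode".
    blocks = []
    node_names = list(datanodes)
    for i in range(0, len(data), BLOCK_SIZE):
        block_no = i // BLOCK_SIZE
        block_id = f"{filename}_blk{block_no}"
        # simple round-robin placement (HDFS is rack-aware instead)
        targets = [node_names[(block_no + r) % len(node_names)]
                   for r in range(REPLICATION)]
        for node in targets:
            datanodes[node][block_id] = data[i:i + BLOCK_SIZE]
        blocks.append((block_id, targets))
    namenode[filename] = blocks

def get(filename):
    # Ask the "NameNode" for block locations, then read each block
    # from its first replica and reassemble the file.
    return b"".join(datanodes[targets[0]][block_id]
                    for block_id, targets in namenode[filename])

put("log.txt", b"hello hdfs!")
print(get("log.txt"))
```

If a DataNode is lost, every block it held still exists on another replica, which is the point of the replication factor.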
An implementation of the MapReduce model, based on two services
MapReduce is the system used to process data in the Hadoop cluster
It consists of two phases: Map, and then Reduce
Each Map task operates on a discrete portion of the overall dataset
Typically one HDFS data block
After all Maps are complete, the MapReduce system distributes
the intermediate data to nodes which perform the Reduce phase
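The flow above, independent Map tasks over HDFS blocks followed by distribution of intermediate data to Reduce nodes, can be sketched in Python (the names and the two-reducer setup are illustrative; hash-based partitioning is, however, Hadoop's default way of routing keys to reducers):

```python
from collections import defaultdict

NUM_REDUCERS = 2

def map_task(block):
    # One Map task per HDFS block: emit (word, 1) for every word.
    return [(w, 1) for line in block for w in line.split()]

def partition(pairs):
    # Hash each key to decide which reducer receives it, mirroring
    # how the framework distributes intermediate data across nodes.
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for key, value in pairs:
        buckets[hash(key) % NUM_REDUCERS][key].append(value)
    return buckets

def reduce_task(bucket):
    # Each reducer sees every value for the keys routed to it.
    return {key: sum(values) for key, values in bucket.items()}

blocks = [["big data big ideas"], ["data beats ideas"]]  # two "HDFS blocks"
intermediate = [pair for blk in blocks for pair in map_task(blk)]
results = {}
for bucket in partition(intermediate):
    results.update(reduce_task(bucket))
print(results)
```

Because every occurrence of a given key hashes to the same reducer, each reducer can compute its totals without coordinating with the others.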
Apache Spark is an open source project, originally developed at the University of California, Berkeley, in the AMPLab.
It is a MapReduce-style framework, but it keeps intermediate data in memory to improve response time compared with other implementations such as the one in Hadoop above.
This MapReduce implementation is represented as Spark Core. The project also provides other services such as:
Spark SQL and DataFrames
Who uses Hadoop?
More than 100,000 CPUs in >40,000 computers running Hadoop.
>60% of Hadoop jobs within Yahoo are Apache Pig jobs.
Twitter uses Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter.
The New York Times uses EC2 to run Hadoop on a large virtual cluster.
Facebook uses Apache Hadoop to store copies of internal log and dimension data sources.
There are currently two major clusters (1,100 machines and 300 machines).