Big Data : Definition and basic concepts


Wissem Inoubli

20 February 2017



2.7 zettabytes of data exist in the digital universe today.

Facebook stores, accesses, and analyzes more than 30 petabytes of user-generated data.

More than 5 billion people are calling, texting, tweeting, and browsing the web worldwide on mobile phones.

Google processes some 20,000 terabytes of data a day.

YouTube users upload 48 hours of new video every minute.

100 terabytes of data are uploaded to Facebook daily.

30 billion pieces of content are shared on Facebook every month.

40,000 search queries are made on Google every second.
Big Data : Motivations
What is Big Data ?
Characteristics of Big Data
Machine/System Logs

Retail transaction logs

Vehicle GPS traces

Social data (Facebook, Twitter, etc.)
Big Data is similar to "small data", but additional factors set it apart:

Production of data every moment

Data in structured and unstructured forms

Data in real time
Production of data every moment

Facebook ingests 500 terabytes of new data every day.

Clickstreams and ad impressions capture user behavior at millions of events per second.

Infrastructure and sensors generate massive log data in real time.

Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text (log files, data drawn from social media, etc.).

RDBMS (Oracle, MySQL, PostgreSQL, etc.)
Column stores (Vertica, MonetDB)
Key-value and document stores (MongoDB, Bigtable, Cassandra, etc.)
Graph databases (Neo4j, etc.)
Big Data problems
Data storage, search, etc.
Fault tolerance
Parallel/Distributed Computing
Parallel systems: a global clock, shared memory

Distributed systems: no global clock, no shared memory, coordination through distributed algorithms
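The contrast above can be sketched in code. The following is a small, hypothetical Python example (not from the presentation): one set of workers cooperates through shared memory, as in a parallel system, while another never touches shared state and only sends messages over a channel, the way distributed algorithms must.

```python
import threading
import queue

# Shared-memory style (parallel system): all workers update one counter
# that every thread can see, protected by a lock.
counter = {"value": 0}
lock = threading.Lock()

def shared_memory_worker(n):
    for _ in range(n):
        with lock:
            counter["value"] += 1

# Message-passing style (distributed system): workers send partial
# results over a channel; an aggregator combines them.
results = queue.Queue()

def message_passing_worker(n):
    results.put(n)  # send a partial count instead of touching shared state

workers = [threading.Thread(target=shared_memory_worker, args=(1000,)) for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()

senders = [threading.Thread(target=message_passing_worker, args=(1000,)) for _ in range(4)]
for t in senders: t.start()
for t in senders: t.join()
total = sum(results.get() for _ in range(4))

print(counter["value"], total)  # both styles arrive at 4000
```

Both approaches compute the same answer; the difference is that the second needs no shared memory or common clock, which is why it also works across machines.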
Other solutions
The MapReduce Model
GFS file system
The Hadoop project
A programming model based on two functions, Map and Reduce, created by a Google research team (published in 2004).

Distributed file system
Based on a master/slave architecture
Large data files
Replication of data
Fault tolerance

Apache Hadoop
Created in 2005 to support distributed search for the Nutch search engine project.

The project was funded by Yahoo.

Yahoo gave the project to the Apache Software Foundation in 2006.

It implements many frameworks and services:
MapReduce Implementation

HDFS service
Implements the GFS design

Based on a NameNode, DataNodes, and a Secondary NameNode

The NameNode stores metadata
DataNodes store the physical data
The Secondary NameNode periodically checkpoints the NameNode's metadata
Large, configurable block size
Data replication
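To make the replication idea concrete, here is a small, hypothetical Python sketch (the node names and the round-robin policy are illustrative, not HDFS's actual placement algorithm): a file is split into fixed-size blocks, and each block is assigned to several data nodes so that losing one node loses no data.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into blocks and assign each block to `replication`
    data nodes, round-robin. Returns {block_index: [node, ...]}."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MB file with 128 MB blocks needs 3 blocks; each block lives on
# 3 of the 4 data nodes, so any single node can fail and every block
# is still readable from another replica.
plan = place_blocks(300, 128, ["dn1", "dn2", "dn3", "dn4"])
print(plan)
```

In real HDFS the NameNode holds this placement map as metadata, while the DataNodes hold the block contents themselves.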
MapReduce implementation

Implementation of the MapReduce model, based on two services: an Application Manager and a Resource Manager.

MapReduce is the system used to process data in the Hadoop cluster.

Consists of two phases: Map, and then Reduce

Each Map task operates on a discrete portion of the overall dataset
Typically one HDFS data block

After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase.

MapReduce Example
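Since the example itself did not survive the transcript, here is a minimal word-count sketch in plain Python (illustrative only, not Hadoop's actual API) showing the three stages just described: Map over input splits, shuffle by key, then Reduce per key.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for every word in one split.
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

splits = ["big data is big", "data is everywhere"]  # two input splits
mapped = [pair for doc in splits for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In Hadoop, each call to `map_phase` would run on the node holding that HDFS block, and each key's Reduce would run on whichever node the shuffle sent it to.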
Apache Spark is an open-source project, originally developed at the University of California, Berkeley, in the AMPLab.
It is a MapReduce-style framework, but it keeps intermediate data in memory to improve response time compared with other implementations, such as the Hadoop one described above.

This MapReduce implementation forms Spark Core. The project also offers other services, such as:

Spark Streaming
Spark GraphX
Spark SQL and DataFrames
Spark architecture
Who uses Hadoop?
Yahoo: more than 100,000 CPUs in over 40,000 computers running Hadoop; more than 60% of Hadoop jobs within Yahoo are Apache Pig jobs.

Twitter: uses Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter.

The New York Times: uses EC2 to run Hadoop on a large virtual cluster.

Facebook: uses Apache Hadoop to store copies of internal log and dimension data sources; currently two major clusters (1,100 machines and 300 machines).