Loading presentation...

Present Remotely

Send the link below via email or IM


Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.


BigData@Globant Verizon

No description

Mariana Martinez

on 1 May 2013

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of BigData@Globant Verizon

Clearly this volume of data cannot be processed or stored by a single computer How did all start? "Rather than spending money in high-end machines, Google prefers to invest its money in fault-tolerant software." Marissa Mayer, vice president of search products and user experience Ecosystem distributed configuration synchronization selecting a leader
5 out of 11 nodes may disconnect without incident can perform complex joins with minimal code
SQL like scripting; no need for programming skills
batch processing and transformations (ETLs) Hadoop Hive Hadoop Pig MapReduce requires Java skills, Pig presents a limited amount of keywords offering a non declarative or procedural language Let's look at what the Market says... When do we recommend using it and when not? Yes No Success cases @Globant Thank you! 22% annual growth for structured data 72% annual increase for unstructured data but... moving all this data through the network brings us to another problem! distribute and replicate the data [DFS]
bring the processing to the nodes [Map]
agregate the process result [Reduce] several software and infrastructure tools are ready to help with this! RDBMS vs NoSQL big volumes (tera and peta) processing
license cost
data locality
structure of the data to analyze
ACID vs BASE Distributed Memory Cache Key/Value databases Big Table databases What?

Business Statistic Reports
Large volumes of data aggregation


Unstructured data
For long term analysis: the more data you get, the better the analysis results Less considerable volumes

Complex data model

Considerable data change

Transactional data handling Distribution! Transferring these huge amounts of data is impossible And regardless of how fast we transfer the data
there is not much we can do to reduce the latency.
We don't yet know how to increase the speed of light As an example, even if we transmitted data at the speed of light (186,282 miles per second), during the time the data gets to a computer located just 30 feet away, a simple i7 intel computer could execute more than 5,000 instructions Hadoop - MapReduce Used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors. CAP Theorem Introducing Big Data analysis in a company involves:
taking data policy into account
technology and infrastructure preparation
talent incorporation
access to data Distributed FS GFS two processing phases: Map <key, values> and Reduce <key, value>
processing is moved to the nodes
production release last December Document Databases the keys multiple hadoop clusters
the biggest has 2500 cpu cores and 1 Petabyte of disk space
250 GB of data are loaded into the cluster every day
used to extract statistics, determine modules quality and fight spam with the aid of Hive 100 nodes Cluster to process millions of sessions daily for analytics
Amazon's product search indexes
using Java and Streaming API More than 100,000 CPUs in >40,000 computers running Hadoop and Pig
biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)
research for Ad Systems and Web Search to store and process tweets, log files
Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements 120 Nehalem-based nodes, with 2x4 cores, 24GB RAM, 8x1TB storage using ext4 in a JBOD configuration on CentOS 5.5
520 Westmere-based nodes, with 2x4 cores, 24GB RAM, 6x2TB storage using ext4 in a JBOD configuration on CentOS 5.5
Apache's Hadoop and Pig distributions for discovering People You May Know and other fun facts Hadoop and HBase in several areas from social services to structured data storage and processing for internal use
30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes on both production and development ETL style processing and statistics generation to running advanced algorithms for doing behavioral analysis and targeting
150 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk 532 nodes cluster (8 * 532 cores, 5.3PB)
For search optimization and research Large scale image conversions
Used EC2 to run hadoop on a large virtual cluster 44 nodes
Dual quad-core Xeon L5520 (Nehalem) @ 2.27GHz, 16GB RAM, 4TB/node storage.
Used for charts calculation, log analysis, A/B testing
20 dual quad-core nodes, 32GB RAM , 5x1TB
Used for user profile analysis, statistical analysis,cookie level reporting tools. Blue Cloud
University Initiative to Address Internet-Scale Computing Challenges together with Google How would you integrate Apache Hadoop with your existing architecture? Real time Big Data Big Stream Processing Flume - Cloudera
Scribe - Facebook
Kafka - LinkedIn
Storm - Backtype Twitter McKinsey Global Institute projected that the United States needs 140,000 to 190,000 more workers with “deep analytical” expertise and 1.5 million more data-literate managers countless digital sensors worldwide in:
industrial equipment
electrical meters
shipping crates + = BIG DATA What if you look for a more Near Real Time Analysis? - single point of failure in NN
Development paradigm change for MapReduce implementations
not suitable for random access and transaccionability Easier and less expensive to scale the Cluster
High availability by data replication
Good resource usage by moving data processing to the nodes
Quick and accurate response from the Hadoop community
RDBMS have lots of restrictions: do we need transactionality for business reporting? + Provides fresh up-to-the-minute Real Time Bidding data delivered on the web through the Cloud
Allow automatic anomalies detection
Drill downs on Dashboards
Predictive analytics using R and Python Sentiment analysis on Social Networks to try to predict Market values behavior can take HDFS as storage (data HFiles or logs WAL) for high speed lookups
relies on HDFS chunks replication
Column oriented
Cells are addressed with a row key/column family/cell qualifier/timestamp tuple
splits the virtual table space into regions that can be spread out across many servers
The default size of individual files in a region is 256MB middle Job scheduler
support Pig, Hive, Java Processes, Scoop
started by events: time, new data
DAG of jobs and actions
oozie by Yahoo
Azkaban by LinkedIn Azkaban + - Query execution via MapReduce from Query Languague (HiveQL)
Data is organized in databases, tables, partitions and buckets
Hierarchy of Data types and implicit conversions + <STRING> to <DOUBLE>
Complex types Structs, Maps and Arrays
Needs traditional relational database for the metastore as it should be optimized for online transactions with random accesses and updates. A file system like HDFS is not suited Near Real Time response because of the latency introduced to the Hadoop jobs submission
Online transactions processing
HiveQL currently does not support updating and deleting rows in existing tables - + you can store pretty much anything in a column family without having to know what it will be in advance
allows you to essentially store one-to-many relationships in a single row
eliminates joins ( if your child entities are truly subordinate, they can be stored with their parent, eliminating all join operations)
good for sparsed and versioned data
scaling by partition horizontally
batch processing
suitable for small updates vs HDFS append only
random reads
doesn’t allow for querying on non-primary-key values, at least directly
not good for storing large amounts of binary data
functions best when the rows are (relatively) small
The closer to the region limit you make each row, the more overhead you are paying to host those rows
for big files, use HDFS for that
near real time applications data organized in keyspaces (schema), column families (tables) and super columns
no Master node; gossip protocol to exchange status nodes data
supports per family column compression
support MR jobs using cluster as input and outpus - + no secondary indexes except for column families
queries should match your structures in column (families)
Hive support is not mature
big files storage
bug in deletion No SPOF
CQL as query languague
stores only needed data, no unused space
supports Java, Thrift and MR jobs Virtual EMR
New 4x and 8x large instances for HPC requirements
elasticity on demand
security restrictions
Load Balancer
Replication in different coasts
DynamoDB storage
Supports MongoDB Json format database
data organized in collection of documents with fields as key-value pairs
Javascript queries
integrates MapReduce functions
all NoSQL advantages
provides file system storage for binary files - indexes consume considerable amount of RAM (B-Tree)
doesn't work with the network topology
writes are unsafe by default + Very similar to interacting to a database
simple for development (like serializing objects)
querying by indexes is very fast
no SPOF by replica-sets
very good documentation It is recommended to use RAID on Namenode
The Jobtracker sends the task to only one datanode where the replica is. It will take the node with less load. If all are really busy, it sends the data to another node (non-local).
Map-only jobs: Good for filtering/cleaning data. Recommendations
Full transcript