BigData@Globant (pb's changes) Roadshow WC EC


Pablo Brenner

on 24 March 2012


Why distributed, low-cost infrastructure?

Distributed data volumes are growing more and more every day, and single-machine processing power is not enough. We need low-cost infrastructure to scale, and we want to avoid transferring data between nodes.

Introduction

Huge torrents of data flow through the Internet: the use of smartphones, social media and other devices such as laptops and PCs is growing at an accelerated pace, companies tend to interact with their clients through the Internet, and multimedia content is increasing the amount of data stored and transferred. For example, each second of high-definition video generates more than 2,000 times as many bytes as required to store a single text page.

By 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data per company with more than 1,000 employees, twice the size of the US retailer data warehouse in 1999 (McKinsey study, 2011). YouTube claims that 24 hours of video are uploaded every minute.

What's going on?

Imagine the value of the information hiding beneath this sheer volume of data, for the public and private sectors alike! Clearly, such a volume cannot be processed or stored by a single computer; the ability to store, aggregate and combine the data in a distributed way, and then perform analysis, becomes vital.

What are the benefits of Big Data?

  • Know your customer better
  • Customize the customer experience
  • Find out your product's position
  • Generate information to make better decisions
  • Innovate by staying closer to the market's evolution

Which is the open-source option?

Hadoop: a DFS (distributed file system) plus MapReduce architecture.

Hadoop/HDFS facts:

  • Open-source software for reliable, scalable, distributed computing
  • Moves computation to the data nodes
  • Good for processing large volumes of data in a distributed, parallel way
  • Reliable data storage

Who were the first ones there?

"Rather than spending money in high-end machines, Google prefers to invest its money in fault-tolerant software." (Marissa Mayer, vice president of search products and user experience)

What's the trend in the market?

Ecosystem

  • A coordination service for distributed configuration, synchronization and leader election; 5 out of 11 nodes may disconnect without incident, since the remaining 6 still form a majority quorum
  • Hadoop Hive: SQL-like scripting that can perform complex joins with minimal code
  • Hadoop Pig: while plain MapReduce requires Java skills, Pig offers a limited set of keywords in a procedural (non-declarative) language

The advantages of these tools: an abstraction on top of Hadoop, an easier ramp-up, and a reporting tool.

Let's look at what the market says: 22% annual growth for structured data, and a 72% annual increase for unstructured data. But moving all this data through the network brings us to another problem! The answer:

  • distribute and replicate the data [DFS]
  • bring the processing to the nodes [Map]
  • aggregate the processing results [Reduce]

Several software and infrastructure tools are ready to help with this.

RDBMS vs NoSQL: what to weigh

  • Big volumes (tera- and petabytes) to process
  • License cost
  • Data locality
  • Structure of the data to analyze
  • ACID vs BASE

NoSQL families: distributed memory caches, key/value databases and Big Table-style databases.

When do we recommend using it, and when not?

Yes:

Business Statistic Reports
Large volumes of data aggregation


Unstructured data
For long-term analysis: the more data you get, the better the analysis results

No:

Less considerable volumes

Complex data model

Considerable data change

Transactional data handling

Advantages:

Easier and less expensive to scale the cluster

High availability by data replication

Good resource usage by moving data processing to the nodes

Quick and accurate response from the Hadoop community

Tools like Pig and Hive to perform an abstraction on top of MapReduce

Saves one step over Datawarehousing (data aggregation)

RDBMSs have lots of restrictions: do we really need transactionality for business reporting?

But you should keep in mind...

The development API is still being changed

The complete ecosystem needs to be upgraded

Development paradigm change for MapReduce implementations

Bear in mind that you will need to develop a custom Partitioner if you want a globally sorted collection of keys
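The idea behind that Partitioner caveat can be shown with a toy sketch in plain Python (Hadoop's real Partitioner is a Java class; the names and the first-letter bucketing rule here are purely illustrative):

```python
import string

def range_partition(key, num_reducers):
    # Toy range partitioner: bucket keys by their first letter so that
    # reducer i only ever sees keys that sort before those of reducer i+1.
    # Each reducer sorts its own partition, so concatenating the reducer
    # outputs in order yields a globally sorted collection of keys --
    # something the default hash partitioner cannot guarantee.
    letters = string.ascii_lowercase
    return letters.index(key[0].lower()) * num_reducers // len(letters)

keys = ["spark", "hadoop", "pig", "hive", "zookeeper"]
print([range_partition(k, 2) for k in keys])  # [1, 0, 1, 0, 1]
```

With two reducers, "hadoop" and "hive" land on reducer 0 and "pig", "spark" and "zookeeper" on reducer 1, so their sorted outputs concatenate into one globally ordered list.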

Name Node high availability is still in development

Distribution! The benefits...

Transferring these huge amounts of data is impossible, and regardless of how fast we transfer the data, there is not much we can do to reduce the latency: we don't yet know how to increase the speed of light. As an example, even if we transmitted data at the speed of light (186,282 miles per second), in the time the data takes to reach a computer located just 30 feet away, a simple Intel i7 computer could execute more than 5,000 instructions, and more than 2 billion instructions while the data travels from SFO to NY.

Amdahl's Law: used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup when using multiple processors.

CAP Theorem: a distributed system can provide at most two of consistency, availability and partition tolerance at the same time.

Introducing Big Data analysis in a company involves taking data policy into account, preparing the technology and infrastructure, incorporating talent, and providing access to the data.

Hadoop - MapReduce

  • Distributed FS: GFS
  • Two processing phases: Map <key, values> and Reduce <key, value>
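The two processing phases can be sketched in plain Python. This is a local simulation of the model, not Hadoop code; the function names and sample documents are illustrative:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit <key, value> pairs -- here (word, 1) for every occurrence
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Between the phases, the framework groups all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # Reduce: aggregate each key's value list into a single result
    return {key: sum(values) for key, values in grouped}

counts = reduce_phase(shuffle(map_phase(["big data big volumes", "data locality"])))
print(counts)  # {'big': 2, 'data': 2, 'volumes': 1, 'locality': 1}
```

In a real cluster the map and reduce functions run on the nodes that hold the data blocks, and the shuffle moves only the grouped intermediate pairs, not the raw input.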
  • Processing is moved to the nodes

(Document databases are one more NoSQL family, alongside those listed earlier.)

"Every day, we hear about people doing amazing things with Hadoop. The variety of applications across industries is clear evidence that Hadoop is radically changing the way data is processed at scale. To drive that point home, we're excited to host a guest blog post from the University of Maryland's Michael Schatz. Michael and his team have built a system using Hadoop that drives the cost of analyzing a human genome below $100, and there's more to come! Like Michael, we're excited about the power that Hadoop offers biotech researchers. Thanks, Michael!" -Christophe

Applications:

  • Loyalty plans
  • "You may also like..." recommendations
  • Spam filters
  • Flight delay predictions
  • Finance analysis
  • Human genome analysis
  • Google Apps

Finance analysis examples:

  • Stock market data analysis
  • Finance data patterns detection
  • Financial behavior prediction from Twitter

Success cases

  • Multiple Hadoop clusters; the biggest has 2,500 CPU cores and 1 petabyte of disk space; 250 GB of data are loaded into the cluster every day; used to extract statistics, determine module quality and fight spam, with the aid of Hive
  • A 100-node cluster to process millions of sessions daily for analytics
  • Amazon's product search indexes, using Java and the Streaming API
  • More than 100,000 CPUs in more than 40,000 computers running Hadoop and Pig; the biggest cluster has 4,500 nodes (2x4-CPU boxes with 4x1 TB of disk and 16 GB of RAM); research for ad systems and web search
  • Storing and processing tweets and log files; Pig is used heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements
  • 120 Nehalem-based nodes (2x4 cores, 24 GB RAM, 8x1 TB storage, ext4 in a JBOD configuration on CentOS 5.5) and 520 Westmere-based nodes (2x4 cores, 24 GB RAM, 6x2 TB storage, ext4 in a JBOD configuration on CentOS 5.5); Apache's Hadoop and Pig distributions for discovering People You May Know and other fun facts
  • Hadoop and HBase in several areas, from social services to structured data storage and processing for internal use; 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes, on both production and development
  • Everything from ETL-style processing and statistics generation to advanced algorithms for behavioral analysis and targeting; 150 machines (Intel Xeon, dual processor, dual core, each with 16 GB of RAM and an 800 GB hard disk)
  • A 532-node cluster (8x532 cores, 5.3 PB) for search optimization and research
  • Large-scale image conversions, using EC2 to run Hadoop on a large virtual cluster
  • 44 nodes (dual quad-core Xeon L5520 "Nehalem" @ 2.27 GHz, 16 GB RAM, 4 TB of storage per node), used for chart calculation, log analysis and A/B testing
  • 20 dual quad-core nodes (32 GB RAM, 5x1 TB), used for user profile analysis, statistical analysis and cookie-level reporting tools
  • Blue Cloud: a university initiative to address Internet-scale computing challenges, together with Google
  • A sentiment system that takes a 10% random sample of Twitter feeds, compares positive with negative comments, and defines six moods: calm, alert, sure, vital, kind and happy (http://www.telegraph.co.uk/technology/twitter/8755587/Twitter-becomes-latest-tool-for-hedge-fund-managers.html)
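Several of the deployments above mention the Streaming API, which lets mappers and reducers be written in any language that reads standard input and writes standard output, one "key TAB value" line at a time. A minimal local sketch in Python, counting page views per user from log lines (the function names, sample logs and the local sorted() call standing in for the shuffle are all illustrative):

```python
from itertools import groupby

def mapper(lines):
    # Streaming mapper: read raw log lines, emit "user<TAB>1" per hit
    for line in lines:
        user = line.split()[0]
        yield f"{user}\t1"

def reducer(sorted_lines):
    # The framework sorts mapper output by key before the reduce step,
    # so equal keys arrive contiguously and groupby can sum them
    pairs = (line.split("\t") for line in sorted_lines)
    for user, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{user}\t{sum(int(count) for _, count in group)}"

# Locally, sorted() emulates the shuffle; in a real job the two functions
# would be separate scripts wired together by the streaming jar
mapped = sorted(mapper(["alice /home", "bob /cart", "alice /buy"]))
print("\n".join(reducer(mapped)))
```

Because the contract is just sorted text lines on stdin/stdout, the same job could be written in Python, Ruby or shell tools without touching Java.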