Sabina A. Schneider

on 11 September 2011

Transcript of BigData@Globant

Introduction

Big torrents of data are flowing through the Internet.
The use of smartphones, social media and other devices such as laptops and PCs is growing at an accelerated pace.
Companies tend to interact with clients through the Internet.
Multimedia content has helped increase the volume of stored and transferred data.

For example...
Each second of high-definition video generates more than 2,000 times as many bytes as required to store a single text page.

What is going on?

...imagine the very important and valuable information for the private and public sectors that is hiding under those big data volumes!

Nevertheless, all these generated volumes of data can no longer be processed or stored by a single PC.
The ability to store, aggregate and combine data and to perform analysis is therefore vital.

The keys...

What are the benefits of Big Data?

By 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data (twice the size of US retailer Wal-Mart’s data warehouse in 1999) per company with more than 1,000 employees. (McKinsey Study 2011)

find out your product's position in the social networks
customize customer experience
know your customer better
generate information to make better decisions
innovate by being closer to the market evolution

Introducing Big Data analysis in a company involves: taking data policy into account, technology and infrastructure preparation, talent incorporation and access to data.

Which is the open-source option?

Low-cost infrastructure to scale
Distributed data volumes are growing more and more every day
One machine's processing power is not enough

HDFS and MapReduce Architecture

HDFS Facts

open-source software for reliable, scalable, distributed computing
moves computation to the data nodes
good for processing big volumes of data in a distributed, parallel way
provides distributed access to large volumes of data by spreading data between different machines
reliable data storage
fast access to data

Who were the first ones there?

"Rather than spending money in high-end machines, Google prefers to invest its money in fault-tolerant software." Marissa Mayer, vice president of search products and user experience

Going back to Hadoop....

What's the tendency in the market?

Ecosystem

distributed configuration
synchronization
selecting a leader
5 out of 11 nodes may disconnect without incident

Can perform complex joins with minimal code
SQL-like scripting

Hadoop Hive

Hadoop Pig

MapReduce requires Java skills; Pig presents a limited number of keywords, offering a procedural rather than declarative language

Let's look at what the market says...

When do we recommend using it and when not?

Yes

No

Advantages

Abstraction on top of Hadoop
Easier ramp up
Reporting tool

Success cases @Globant

Thank you!

22% annual growth for structured data
72% annual increase for unstructured data

but... moving all this data through the network brings you to another problem!

distribute and replicate the data [DFS]
bring the processing to the nodes [Map]
aggregate the processing results [Reduce]

Examples

Tesco’s loyalty program
“you may also like ...” Amazon
Spam filter (Yahoo)
Flight delay prediction (Flight Caster)
Financial pattern detection
Human genome analysis (University of Maryland’s Michael Schatz)

several software and infrastructure tools are ready to help with this!

RDBMS vs NoSQL

big volumes (tera and peta) processing
license cost
data locality
structure of the data to analyze
ACID vs BASE

Cloud computing is evolution, not revolution

Distributed memory cache (Oracle Coherence, memcached)
Key/value databases (HBase, MemcacheDB)
BigTable databases (Google Bigtable, Cassandra)
Document databases (CouchDB, MongoDB)

What?
Business Statistic Reports
Generate value aggregating big volumes of data
Unstructured data
For long-term analysis: the more data you get, the better the analysis results
Yahoo: spam filter
Analyze customers' navigation logs

Less considerable volumes
Structured data
Considerable data change: it is not designed for moving and updating data
Transactional

Easier and less expensive to scale the cluster
High availability
Processing goes to the nodes: good resource use
Quick and accurate response from the Hadoop community
You have tools like Pig and Hive to perform an abstraction on top of MapReduce
Saves one step over data warehousing (data aggregation)
ORMs have lots of restrictions: do we need transactional guarantees for business reporting?
Customization to the extreme
Development stages ease debugging of the cluster

Lessons learnt

Development API being changed
The complete ecosystem needs to be upgraded
Development paradigm change for MapReduce implementations
Bear in mind the development of a Partitioner if you want a collection of sorted keys
NameNode high availability still in development

Distribution!

YouTube claims that 24 hours of video are uploaded every minute

The benefits...
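
The [Map] and [Reduce] steps and the Partitioner lesson from this transcript can be illustrated with a small in-memory sketch. It is not Hadoop code and not the presenters' implementation: the function names (`map_phase`, `range_partition`, `reduce_phase`, `run_job`) and the sample boundaries are hypothetical, chosen only to show the idea. Hadoop's default partitioner is hash-based, so each reducer's output is sorted but the concatenation of all outputs is not. A range partitioner sends each key range to one reducer, so the combined output comes out sorted.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def range_partition(key, boundaries):
    """Hypothetical range partitioner: keys below boundaries[0] go to
    reducer 0, keys below boundaries[1] go to reducer 1, and so on.
    Keys at or above the last boundary go to the last reducer."""
    for index, upper in enumerate(boundaries):
        if key < upper:
            return index
    return len(boundaries)


def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def run_job(lines, boundaries):
    """Simulate map, partition/shuffle and reduce in a single process."""
    # One bucket per simulated reducer.
    partitions = [defaultdict(list) for _ in range(len(boundaries) + 1)]
    for line in lines:
        for key, value in map_phase(line):
            partitions[range_partition(key, boundaries)][key].append(value)

    output = []
    # Each reducer sees its own keys in sorted order. Because the partitions
    # cover increasing key ranges, concatenating them keeps a global order.
    for partition in partitions:
        for key in sorted(partition):
            output.append(reduce_phase(key, partition[key]))
    return output


if __name__ == "__main__":
    sample = ["big data big volumes", "data nodes process data"]
    print(run_job(sample, boundaries=["d", "p"]))
```

In a real cluster the boundaries would have to be chosen so that each reducer receives a similar share of the keys; here they are fixed by hand for the two sample lines.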