Big Data Brighton: Hadoop and its ecosystem
What makes Hadoop different?
- Open source: the Hadoop Distributed File System (HDFS) + MapReduce
- Scaling is linear and predictable
- Fault tolerant
- Handles unstructured and structured data
- Cheaper than traditional data platforms
- The ability to ask future questions of your data and get deeper insights into it
- Turns data into a strategic asset

The ecosystem:
- User interface: Hue
- Workflow and scheduling: Oozie
- FS mount: FUSE-DFS
- Languages / compilers: Pig, Hive
- Fast read/write: HBase
- Coordination: ZooKeeper

Origin of Hadoop (how does an elephant sneak up on you?):
- 2002: Nutch, an open source web crawler project, created by Doug Cutting
- Google publishes the GFS and MapReduce papers
- 2005: Hadoop, an open source MapReduce and HDFS project, created by Doug Cutting
- 2008: Yahoo! runs a 4,000-node Hadoop cluster; Hadoop wins the terabyte sort benchmark; Hive launches SQL support for Hadoop
- 2011: Cloudera releases CDH3 and Cloudera Enterprise

Data integration:
- Sqoop transfers bulk data between Hadoop and structured datastores such as relational databases
- Flume: a distributed, reliable, and available system for collecting, aggregating, and moving large amounts of data from many different sources to a centralised data store
- Sqoop: import/export any JDBC-supported database; high-performance connectors available

Apache Pig
- High-level language: Pig Latin
- Compiler, plus the interactive Grunt shell:

grunt> A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
grunt> B = GROUP A BY f1;
grunt> C = FOREACH B GENERATE COUNT(A);
grunt> DUMP C;

Apache Hive
- Data warehouse
- Query files stored in HDFS with an SQL-like language
- Query execution via MapReduce

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> LOAD DATA LOCAL INPATH './examples/kv1.txt'
    > OVERWRITE INTO TABLE pokes;
hive> SELECT a.foo FROM pokes a WHERE a.bar='2008-08-15';

Apache Oozie
- Workflow/coordination system to manage Hadoop job pipelines
- Three concepts: workflows, coordinators, and bundles

<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.1">
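<!-- A hypothetical minimal workflow body (not from the talk): a single Pig
     action with the conventional ok/error transitions. The node name
     "pig-node", the script name, and the ${jobTracker}/${nameNode}
     properties are illustrative placeholders. -->
<start to="pig-node"/>
<action name="pig-node">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>wordcount.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
</action>
<kill name="fail">
    <message>Pig action failed</message>
</kill>
<end name="end"/>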
</workflow-app>

(workflow.xml)

- Coordinators trigger workflows based on time (cron) and/or data availability
- Bundles are batches of coordinators used to create data pipelines

Apache HBase
- A distributed, scalable, big data store based on Google's BigTable
- Random, realtime read/write access to big data
- Use case: petabyte-scale data volumes

What's next?
- Dremel: interactive analysis of web-scale datasets
- Spanner: Google's globally-distributed database
- Storm: distributed, realtime stream processing
- Distributions: CDH (Cloudera's distribution of Hadoop), HDP (Hortonworks Data Platform), M3 (MapR)
- The platform is just the beginning

Thanks! @jrkinley
firstname.lastname@example.org
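The MapReduce model that underpins everything above can be illustrated with a tiny word-count simulation in plain Python — a conceptual sketch only, not Hadoop code; all function names and the sample input are invented for illustration:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: each input record (a line) yields (word, 1) pairs, mirroring
    # the map tasks Hadoop runs in parallel across HDFS blocks.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: the framework groups intermediate pairs by key
    # before handing each group to a reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each reducer aggregates the values for one key.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data brighton", "hadoop and its ecosystem", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["hadoop"])  # -> 2 1
```

Because each map and reduce call touches only its own slice of data, the same program scales out across a cluster — which is what makes Hadoop's scaling linear and predictable.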