Big Data Brighton: Hadoop and its ecosystem

by James Kinley on 26 September 2012


Transcript of Big Data Brighton: Hadoop and its ecosystem

Hadoop and its ecosystem

About me
  • Solutions Architect at Cloudera
  • Committer on Apache MRUnit
  • Contributor to Apache Oozie
  • Background in UK & US government
  • Live in Southampton

Today's talk
  • What is Big Data?
  • The Hadoop ecosystem
  • Use cases

What is Big Data?
  • Capture, storage, search, analysis, visualization
  • Terabytes, petabytes, exabytes
  • Velocity, variety

Hadoop
  • Flexible, scalable, low cost
  • Store & process any type of data
  • Scale-out architecture on commodity hardware
  • Open source
  • Hadoop Distributed File System (HDFS) + MapReduce
  • Scaling is linear and predictable

What makes Hadoop different?
  • Fault tolerant
  • Unstructured / structured data
  • Cheaper
  • Ability to ask future questions of data
  • Ability to get deeper insights into data
  • Turns data into a strategic asset
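To make the HDFS + MapReduce pairing concrete, here is a minimal word-count job against the standard org.apache.hadoop.mapreduce API. This is a sketch rather than anything from the slides; the class names and the two path arguments are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count"); // Hadoop 1.x style
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}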
The Hadoop ecosystem
  • User interface (HUE)
  • Workflow (OOZIE)
  • FS mount (FUSE-DFS)
  • Languages / compilers (PIG, HIVE)
  • Fast read / write (HBase)
  • Coordination (ZOOKEEPER)
  • Scheduling (OOZIE)

Origin of Hadoop: how does an elephant sneak up on you?
  • 2002: open source web crawler project created by Doug Cutting
  • Google publishes the GFS & MapReduce papers
  • 2005: open source MapReduce and HDFS project created by Doug Cutting
  • 2008: a 4,000-node Hadoop cluster in production; Hadoop wins the terabyte sort benchmark; SQL support for Hadoop launches
  • 2011: Cloudera releases CDH3 and Cloudera Enterprise
Data integration (FLUME, SQOOP)
  • Flume: a distributed, reliable, and available system for collecting, aggregating, and moving large amounts of data from many different sources to a centralised data store
  • Sqoop: transfers bulk data between Hadoop and structured datastores such as relational databases; import/export any JDBC-supported database; high-performance connectors available
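For illustration of how these are typically driven (every agent, host, path, database, and table name below is invented), first a Flume NG-style properties file that tails a log file into HDFS:

# flume.properties: one agent with an exec source, a memory channel,
# and an HDFS sink (names and paths are illustrative)
agent.sources  = logfile
agent.channels = mem
agent.sinks    = hdfs-out

agent.sources.logfile.type = exec
agent.sources.logfile.command = tail -F /var/log/app.log
agent.sources.logfile.channels = mem

agent.channels.mem.type = memory

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/events
agent.sinks.hdfs-out.channel = mem

And a Sqoop import that pulls a relational table into HDFS over JDBC:

$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username reporting \
    --table orders \
    --target-dir /user/hadoop/orders \
    --num-mappers 4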
Apache Pig
  • High-level language: Pig Latin
  • Compiler
  • Grunt shell

grunt> A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int); -- load data, declaring a schema
grunt> B = GROUP A BY f1;          -- group records on the first field
grunt> C = FOREACH B GENERATE COUNT(A); -- count the records in each group
grunt> DUMP C;                     -- run the pipeline and print the result
Apache Hive
  • Data warehouse
  • Query files stored in HDFS with an SQL-like language
  • Query execution via MapReduce

hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> LOAD DATA LOCAL INPATH './examples/kv1.txt'
    > OVERWRITE INTO TABLE pokes;
hive> SELECT a.foo FROM pokes a WHERE a.bar='2008-08-15';
Apache Oozie
  • Workflow/coordination system to manage Hadoop job pipelines
  • Workflows, coordinators, bundles

workflow.xml:

<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirstHadoopJob">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
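            <!-- prepare: clean up output from any previous run before the job starts -->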
            <prepare>
                <delete path="hdfs://foo:9000/usr/tucu/output-data"/>
            </prepare>
            <job-xml>/myfirstjob.xml</job-xml>
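            <!-- inline configuration: the input and output paths for this job -->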
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/usr/tucu/input-data</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/usr/tucu/output-data</value>
                </property>
                ...
            </configuration>
        </map-reduce>
        <ok to="myNextAction"/>
        <error to="errorCleanup"/>
    </action>
    ...
</workflow-app>

Coordinators trigger workflows based on:
  • time (cron), and/or
  • data availability
Bundles: a batch of coordinators, used to create data pipelines.
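A minimal time-triggered coordinator might look like the following (the app name, dates, and workflow path are illustrative; a data-availability trigger would add <datasets> and <input-events> elements):

coordinator.xml:

<coordinator-app name="daily-foo" frequency="${coord:days(1)}"
                 start="2012-09-27T00:00Z" end="2012-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <!-- path to the deployed workflow application, e.g. foo-wf above -->
            <app-path>hdfs://bar:9000/usr/tucu/foo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>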
Apache HBase
  • A distributed, scalable, big data store
  • Random, realtime read/write access to big data
  • Based on Google's BigTable
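As a taste of the client API, the sketch below writes and reads one cell using the classic HTable interface; the table name, row key, and column family are invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        // Reads cluster connection details from hbase-site.xml on the classpath
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events"); // table 'events' is illustrative

        // Random realtime write: one cell in column family 'd'
        Put put = new Put(Bytes.toBytes("row-20120926"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes("1"));
        table.put(put);

        // Random realtime read of the same cell
        Get get = new Get(Bytes.toBytes("row-20120926"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"))));

        table.close();
    }
}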
Use cases
  • Volume: petabyte scale

What's next?
  • Dremel: Interactive Analysis of Web-Scale Datasets
  • Spanner: Google's Globally-Distributed Database
  • Storm: distributed, realtime stream processing

Distributions
  • CDH: Cloudera's distribution of Hadoop
  • HDP: Hortonworks Data Platform
  • M3: MapR

The platform is just the beginning.

Thanks!
kinley@cloudera.com
@jrkinley