Big Data with Hadoop

by Brian Hickey
on 10 October 2014


Transcript of Big Data with Hadoop


A little look at
BIG DATA
WITH HADOOP AND ITS ECOSYSTEM

Velocity
- generating data faster than ever

Variety
- a wide variety of data types and formats (and not always clean)

Volume
- shares, Facebook likes, photos, tweets, Google indexing, credit card transactions, etc.
Goals of Hadoop
- Designed to be scalable and economical
Adding load results in a graceful decline in performance of individual jobs, not failure of the system
Increasing resources supports a proportional increase in load capacity
- Distributed and fault-tolerant from day one
Developers don't need to worry about network programming, temporal dependencies, etc.
- No special hardware required
- "Shared Nothing" architecture
- Originated at Google

3 Key Parts
Hadoop Distributed File System (HDFS) - storage
MapReduce - data processing
Infrastructure to make them work (file system administration utilities, job scheduling, monitoring, etc.)
Key Terms
Cluster
- a collection of servers running Hadoop software. More nodes means more scale. Scale out incrementally (and also elastically, e.g. on AWS). A cluster can hold several thousand nodes (e.g. Yahoo in 2010 had a 4,000-node cluster storing 14PB).
Node
- an individual server within a cluster. A node stores and processes data.

But...
More nodes => greater chance that one will fail
Redundancy
is built into the system, which handles failures automatically
If a node fails, its workload is assumed by another still-functioning node
If the failed node recovers, it can rejoin the system without a full restart of the entire system
Files loaded into HDFS are
replicated
across nodes in the cluster as they enter the system
If a node fails, its data is re-replicated using one of the other copies
Data processing jobs are broken into small tasks
Each takes a small amount of data
Tasks run in parallel across nodes
If a node fails, its tasks are rescheduled elsewhere
All automatically
Redundancy is configurable - the default is 3 copies
HDFS
Inexpensive (standard industry commodity hardware - cost per GB continues to decrease) and reliable (redundancy) storage for massive amounts of data
Optimised for sequential access to a small number of large files (typically >100MB, often into the GB range)
Highly available, with no single point of failure
Usage is similar to UNIX (e.g. /data/lpt/returns.txt)
Similar file ownership/permissions
But: cannot modify files once written
Master/slave architecture
Master(s) = NameNodes - manage the namespace and metadata, and monitor the slaves
Slaves = DataNodes - read and write the actual data
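
Not on the slides themselves, but as a small illustration of the UNIX-like paths and ownership model just described, here is a minimal sketch using the Hadoop Java FileSystem API (the paths and file names are invented for the example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS (write once - the file cannot be modified later)
            fs.copyFromLocalFile(new Path("/tmp/returns.txt"), new Path("/data/lpt/returns.txt"));

            // List the directory, much like 'ls' in UNIX
            for (FileStatus status : fs.listStatus(new Path("/data/lpt"))) {
                System.out.println(status.getPath() + " owner=" + status.getOwner()
                        + " replication=" + status.getReplication());
            }
            fs.close();
        }
    }
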
MapReduce
A programming model/pattern, not a language (typically implemented in Java, but can be others)
Simple, efficient, flexible and scalable
Spreads load efficiently across many nodes
Two functions - map and reduce (run in that order)
Map is used to filter, transform or parse the data
Reduce (optional) is used to summarise data from the map function (e.g. aggregation)
Each is simple, but powerful when combined
Master/slave architecture (YARN cluster management framework)
The master is the ResourceManager - allocates cluster resources for a job and starts the ApplicationMaster
On the slaves, NodeManagers run; one ApplicationMaster per job divides the job into tasks and assigns them to NodeManagers, which start the tasks, do the actual work, and report status back to the ApplicationMaster
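
The deck's own code isn't captured in this transcript; to make map and reduce concrete, here is the classic word-count sketch in Java (a minimal version with illustrative class and path names, not the presentation's code):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: parse each line and emit (word, 1) pairs
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts for each word
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
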
Ecosystem
Apache Pig: high-level data processing (an abstraction over low-level MapReduce code). Especially good at joining and transforming data. Scripts are written in the Pig Latin language

Apache Hive: another MapReduce abstraction, uses a SQL-like language called HiveQL

Both Pig and Hive run on a local client machine, convert the script/query into MapReduce jobs and submit the jobs to the cluster

Cloudera Impala: massively parallel SQL engine on Hadoop; can query data stored in HDFS or HBase; typically 10 times faster than Hive or MapReduce; uses a high-level query language

Apache HBase: "the Hadoop database" - up to PBs in a table; 1000s of columns in a table; scale provides high write throughput (hundreds of thousands of inserts per second); NoSQL, uses an API rather than a query language

Apache Sqoop: for exchanging data between a relational DB and Hadoop (in either direction)

Apache Flume: imports data into HDFS as it is being generated by the source (program output, UNIX syslog, log files, IoT sensors, etc.)

Apache Oozie: for managing process workflows - coordinates execution and control of individual jobs; includes MR jobs, Pig/Hive scripts, Java or shell programs, HDFS commands, remote commands, email messages
APACHE PIG
Platform for data analysis and processing on Hadoop (alternative to writing MapReduce code directly)

Originally a Yahoo project
Main components:
Data flow language - Pig Latin
Interactive shell for executing Pig Latin statements - Grunt
Pig interpreter and execution engine
KEY FEATURES
HDFS manipulation

UNIX shell commands

Relational operations

Positional references for fields

Common mathematical functions

Support for custom functions and data formats

Complex data structures
USE CASES
For data analysis:
Finding records in a massive data set - e.g. extracting valuable info out of web server log files
Querying multiple data sets
Calculating values from input data
Data sampling

For data processing:
Reorganising data
Joining data from multiple sources
ETL

USING PIG
Interactively, using the Grunt shell
Useful for ad-hoc inspection of data
Useful for interacting with HDFS and UNIX

A Pig script (".pig")
Useful for automation and batch execution
Run it directly from the UNIX shell

Use "Local mode" for development/testing before deploying a job to production

PIG LATIN
A data flow language - a flow of data is expressed as a sequence of statements
PIG LATIN
Made up of the following (see the sketch after this list):
Keywords
-- LOAD AS, STORE INTO, DUMP, FILTER BY, DESCRIBE, FOREACH, GENERATE, DISTINCT, ORDER BY, LIMIT, GROUP BY, GROUP ALL, FLATTEN, SPLIT, etc.
The interpreter knows how to convert these into MapReduce

Identifiers
- used like variables in a traditional program

Comments

Built-in field-level functions
- UPPER, TRIM, RANDOM, ROUND, SUBSTRING, SIZE, TOKENIZE, etc.

Built-in aggregation functions and relational operations
- SUM, AVG, MIN, MAX, COUNT, COUNT_STAR, DIFF, ISEMPTY; CROSS, UNION, COGROUP, JOIN (LEFT|RIGHT|FULL) OUTER, etc.

Operators
-- arithmetic, comparison, is (not) null, boolean -- similar to SQL

Data types
-- int, long, float, double, boolean, datetime, chararray, bytearray, map

Loaders
-- e.g. PigStorage, HCatalog

Fields, Tuples (collections of Fields), Bags (collections of Tuples), Relations (named Bags)
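
A minimal sketch of several of the keywords above, driven from Java via Pig's PigServer class in local mode (the file name, field layout and aliases are invented for the example; this is an illustration, not the deck's own script):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigLatinExample {
        public static void main(String[] args) throws Exception {
            // Local mode: runs against the local filesystem, handy for development/testing
            PigServer pig = new PigServer(ExecType.LOCAL);

            // LOAD ... AS: read a tab-delimited file into a relation with named, typed fields
            pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage('\\t') "
                    + "AS (ip:chararray, url:chararray, resp_bytes:int);");

            // FILTER BY and GROUP BY: keep large responses, then group them per IP address
            pig.registerQuery("big = FILTER logs BY resp_bytes > 100000;");
            pig.registerQuery("grouped = GROUP big BY ip;");

            // FOREACH ... GENERATE with aggregation functions (COUNT, SUM)
            pig.registerQuery("summary = FOREACH grouped GENERATE group AS ip, "
                    + "COUNT(big) AS hits, SUM(big.resp_bytes) AS total_bytes;");

            // STORE INTO: write the result out (DUMP would print it instead)
            pig.store("summary", "big_responses_by_ip");
            pig.shutdown();
        }
    }
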
EXTENDING PIG
Supports parameters, passed in at runtime (either from the command line or from a file)

Supports macros (similar(!!!) to a function in a programming language)
Can be imported into whichever scripts need them

Supports user-defined functions written in Java, Python, JavaScript, Ruby or Groovy
For Java, package the function into a JAR file, REGISTER it in a Pig script (optionally provide an alias for it), and invoke it using its fully-qualified classname (see the sketch below)


Community-contributed UDFs are available from the Piggy Bank and DataFu
Piggy Bank ships with Pig
Examples from Piggy Bank include ISOToUnix, UnixToISO, LENGTH, DiffDate
Examples from DataFu include Quantile, Median, Sessionize, HaversineDistInMiles
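
A minimal sketch of a Java UDF of the kind described above (the class name and field handling are invented for the example):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A trivial field-level UDF: trims and upper-cases a chararray
    public class CleanString extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return input.get(0).toString().trim().toUpperCase();
        }
    }

In a Pig script it would then be registered and called by its fully-qualified classname, e.g. REGISTER myudfs.jar; followed by cleaned = FOREACH logs GENERATE CleanString(url); (the JAR and relation names here are hypothetical).
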

ADVANCED USAGE
Data sampling - SAMPLE and ILLUSTRATE (more intelligent)

Dry run command option - used to see the script after parameters and macros are processed

EXPLAIN plan
APACHE HIVE
A high-level abstraction on top of MapReduce

Uses a SQL-like language called HiveQL (or HQL)
Generates MR jobs that run on the Hadoop cluster

Originally developed at Facebook, now an open-source Apache project

WHY USE IT?
More productive than writing MR directly


Re-use existing SQL knowledge / no software development experience required


Interoperable (extensible via Java and external scripts)
HOW DOES IT WORK?
Queries operate on tables just like an RDBMS
A table is simply an HDFS directory containing one or more files
A table belongs to a 'database' (i.e. a schema in Oracle land)
Views are supported
Many formats for data storage/retrieval
Structure and location of the files are specified when you create the table
=> This metadata is stored in Hive's metastore
Maps raw data in HDFS to named columns of specific types

BUT... IT'S NOT A DATABASE!
No transactions

No modifications of existing records

USE CASES
Log file analysis

Sentiment analysis
HOW TO USE IT
Hive Shell
An interactive command line tool
Enter statements at the "hive>" prompt, or call scripts to execute them
Can also execute system commands

HOW TO USE IT
HUE

Web-based UI called Beeswax
Features include creating/browsing tables, running queries, saving queries for later execution

HOW TO USE IT
HiveServer2
Centralised Hive server
JDBC/ODBC connection
Kerberos authentication
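
Not shown on the slide, but a minimal sketch of the JDBC route through HiveServer2 from Java (the host, table and credentials are invented; a Kerberos-secured cluster would need additional connection properties):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
                 Statement stmt = conn.createStatement()) {

                // The HiveQL is compiled into MapReduce jobs on the cluster
                ResultSet rs = stmt.executeQuery(
                    "SELECT level, COUNT(*) AS events FROM web_logs GROUP BY level");
                while (rs.next()) {
                    System.out.println(rs.getString("level") + "\t" + rs.getLong("events"));
                }
            }
        }
    }
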
SYNTAX
DATA TYPES
BUILT-IN FUNCTIONS
DATA MANAGEMENT
'Schema on read' - data is not validated on insert - files are simply moved into place
Loading data into tables is very fast
Errors in the file format are discovered when queries are performed
Can load from Hadoop (command line), Hive commands, or Sqoop (from an existing RDBMS)
Altering
The usual DROP database/table is supported
RENAME columns and tables
MODIFY column types
REORDER and REPLACE columns
Can store query results in another table (INSERT OVERWRITE and INSERT INTO)
And create a table based on a select (CTAS)
Can write output to the filesystem (HDFS and the local client)

TABLES
ADVANCED TEXT PROCESSING
MORE TEXT PROCESSING
SERDE
SENTIMENT ANALYSIS
OPTIMISATION
When it comes to Query speed...
Fastest involve only metadata
DESCRIBE table

Next fastest simply read from HDFS
SELECT * FROM table

Then, a query that requires a map-only job
SELECT * FROM table WHERE col = value

Then, a query that requires both map and reduce phases
SELECT COUNT(x) FROM table WHERE col = value

Slowest require multiple MapReduce jobs
SELECT x, COUNT(y) AS alias FROM table GROUP BY x ORDER BY alias LIMIT 10
OPTIMISATION
Other tools:
Execution plan (EXPLAIN)

Sorting (ORDER BY vs SORT BY)

Parallel execution

Local execution

Partitioning

Bucketing

Indexing (much more limited than in an RDBMS and comes at the cost of increased disk and CPU usage)
EXTENDING HIVE
SerDes for reading and writing records - control the row format of a table
Field delimiters, Regex, Columnar (e.g. for RCFile), HBase

TRANSFORM USING
Transform and manipulate data through external scripts/programs
Like the STREAM operator we saw earlier with Pig

User-defined functions
Written in Java (only, for the moment)
Import and register a JAR
Plenty of UDFs available on the web

Parameterised queries
CLOUDERA IMPALA
High performance SQL engine designed for vast amounts of data

Massively parallel-processing (MPP)

Query latency measured in milliseconds

Runs on Hadoop clusters, can query HDFS or HBase tables

100% open source

Supports a subset of SQL-92, with a few extensions (almost identical to HiveQL)

Similar interfaces - command line, Hue web app, ODBC/JDBC

Uses same metastore as Hive (so tables created there are visible in Impala, and vice versa)

WHY USE IT?
More productive than writing MapReduce

No software development required, just SQL

Leverage existing SQL knowledge

Highly optimised for querying

Supports user-defined functions in Java or C++, and can re-use Hive's

Almost always at least 5x faster than Pig or Hive (often >20x)
SURELY NOT! HOW?
Hive and Pig answer queries by running MapReduce jobs
MR is a general purpose computation framework, not an optimised interactive query execution tool
It therefore introduces overhead and latency (even a trivial query takes 10 seconds or more)

Impala does not use MapReduce
Uses a custom execution engine allowing queries to complete in a fraction of a second
Exploits its metastore to go direct to HDFS for the data it needs

Hive and Pig suited to long-running batch processes such as data transformation tasks

Impala best for interactive/ad-hoc queries
USE CASES
Business Intelligence

Ad-hoc queries and reports
COMPARED TO AN RDBMS
BUT IT'S CURRENTLY MISSING SOME STUFF...
Complex data types such as arrays, maps and structs

BINARY data types

External transformations

Custom SerDe's

Indexing

Bucketing and table sampling

ORDER BY requires a LIMIT

Fault tolerance - if a node fails, the query fails

TO SUMMARISE
MapReduce = low-level processing and analysis

Pig = procedural data flow language executed using MR

Hive = SQL-based queries executed using MR

Impala = high performance SQL-based queries using custom execution engine
COMPARISON
COMPARED AGAINST RDBMS
WOULD YOU USE THESE INSTEAD OF AN RDBMS?
Not if RDBMS is used for its intended purpose:
Small amounts of data (relatively)
Immediate results
In-place modification (update/delete) of data

Pig, Hive and Impala optimised for
large amounts of read-only data
Extensive scalability at low cost

Pig and Hive better for batch processing

Impala and RDBMSs better for interactive use

APACHE HBASE
A NoSQL database that runs on top of HDFS
What is NoSQL? Defined by the absence of a property? Then "NoRel" would be more appropriate!
No, read it as "Not Only SQL"
Departs from the relational model altogether
Most NoSQL implementations do not attempt to provide ACID properties, in order to improve performance and scale
Different types
Column, Document, Key-Value, Graph
HBase is a column-oriented data store
High speed, highly scalable, low complexity, moderate flexibility

Makes it highly available and fault tolerant

Open source Apache project
Based originally on Google's BigTable (what Google uses to search the internet)
WHAT YOU USE IT FOR
Massive amount of data - 100s of GB to petabytes (e.g. storing the entire internet)

High read and write throughput - 1000s/second per node (e.g. Facebook messages - 75 billion operations per day, 1.5 million operations per second)

Scalable in-memory caching

Key lookups (no penalty for sparse columns)

Random write, random read, or both (but not neither)

1000s of operations per second on very large datasets

Access patterns are well understood and simple

Missing or different columns
WHEN NOT TO USE IT
Only appends are done to the dataset, and the whole thing is read when processing

Ad-hoc analysis (i.e. ill-defined access patterns)

Small enough data that it all fits on one large node

Need transactions
WHO USES IT?
eBay

Facebook

StumbleUpon

TrendMicro

Twitter

etc...
COMPARISON WITH RDBMS
It is a database, but not a traditional RDBMS
No nicely laid-out tables and columns
No powerful query language
No transactions

To scale an RDBMS you typically need to partition or shard
HBase does this automatically

In an RDBMS, normalisation and joining to retrieve data are commonplace
HBase doesn't support explicit joins
Implicit joins via column families if necessary
HBase data is not normalised - redundancy and duplication are common
SHOULD HBASE REPLACE MY RDBMS?
Not a drop-in replacement!
Requires different mindset
Significant re-architecting of application

Major differences
Data layout
Data access

RDBMS schema design:
Determine all types of data to be stored
Determine relationships between the data elements
Create tables, columns and foreign keys
"Relationship-centric" design
HBase design:
Determine types of data to be stored
Create data layouts and keys
"Data-centric" design

SO, SHOULD I USE IT?
Not a silver bullet

Will solve data access issues, but not issues further up the stack (i.e. application scaling or design limitations)

It's very good at certain tasks, but like any tool, difficult/impossible/inefficient to use for others

Approaching it with an RDBMS mindset will lead to failure.

HBASE IN DEPTH
Stores data in tables (it is a database)

Tables in HBase are
sorted, distributed maps
(think of associative arrays or hashes in your programming language)

Data is split into HDFS blocks and stored on multiple nodes in the cluster

Tables themselves are split into Regions
Region = section of the table
Similar to a shard or partition in a RDBMS

HBASE IN DEPTH
Served to clients by
RegionServer
daemons
A RegionServer runs on each slave node in the cluster
Typically serves multiple regions, from several tables
Unlikely that one RS will serve all regions for a particular table

HBase master daemon coordinates RegionServers
And which Regions are managed by each RegionServer
This changes as data gets added/deleted
Handles new table creation and other housekeeping ops
Can be multiple Masters for high availability, but only one runs the cluster at a time

HBASE IN DEPTH
Zookeeper
Handles coordination of master
ZK daemons run on master nodes
Masters compete to run the cluster - first to connect wins control

If controlling master fails, remaining masters compete again to run cluster
HBASE TABLES
Rows, columns and column families

Every row has a row key (analogous to primary key in RDBMS)

Stored sorted
by row key for fast lookup

Columns hold the data for the table

Columns can be
created on the fly

A column exists for a particular row
ONLY IF
it has data in that row

The table's cells are arbitrary arrays of bytes
HBASE TABLES
All columns belong to a column family

A column family is a collection of columns

A table has one or more column families, created as part of table definition

All column family members have the same prefix
e.g. for "contactinfo:fname", the ":" is a delimiter between the column family name ("contactinfo") and the column name ("fname")

Tuning and storage settings can be specified for each column family
E.g. number of versions of each cell to be stored

Columns within a family are sorted and stored together

SO, WHY COLUMN FAMILIES?
Separate column families are useful for:
Data that is not frequently accessed together

Data that uses different column family options, e.g. compression

Some attributes
Compression - GZ, LZO, Snappy
Versions to be kept (>=1)
TTL - automatically deleted after specified time (X seconds to FOREVER)
Min_Versions (if TTL set) (0+)
Blocksize (1b - 2GB)
In_Memory
BlockCache
BloomFilter - improves read performance

DATA IN HBASE
HBase can store anything that can be converted to an array of bytes
Strings, numbers, complex objects, images, etc.
Practical limits to size of values
Typically, cell size should not consistently be above 10MB
Physically stored on a per-column family basis
Empty cells are
not stored

HBase table = distributed sorted map
HBASE OPERATIONS
Get = retrieves a single row using the row key

Scan = retrieves all rows
Scan can be constrained to retrieve rows between a start row key and end row key

Put = add a new row, identified by row key

Delete = marks data as having been deleted
Removes the row identified by row key
Not removed from HDFS during the call, but is marked for deletion, which happens later

Increment = set or increment an atomic counter
Only for cells containing values of the long datatype
Atomicity allows for concurrent access without fear of corruption or overwriting with a previous number
Consistency is synchronised by the RegionServer, NOT the client

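
A minimal sketch of these operations, using the HBase Java client API covered a couple of slides later (assuming the 1.x-era client API; the table, column family and row names are invented for the example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseOperationsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                byte[] cf = Bytes.toBytes("contactinfo");

                // Put: insert (or update) a row identified by its row key
                Put put = new Put(Bytes.toBytes("user-0001"));
                put.addColumn(cf, Bytes.toBytes("fname"), Bytes.toBytes("Brian"));
                table.put(put);

                // Get: retrieve a single row by row key
                Result result = table.get(new Get(Bytes.toBytes("user-0001")));
                System.out.println(Bytes.toString(result.getValue(cf, Bytes.toBytes("fname"))));

                // Scan: retrieve a range of rows between a start and stop row key
                Scan scan = new Scan(Bytes.toBytes("user-0001"), Bytes.toBytes("user-0100"));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }

                // Increment: atomically bump a long counter cell
                table.incrementColumnValue(Bytes.toBytes("user-0001"), cf,
                        Bytes.toBytes("logins"), 1L);

                // Delete: mark the row for deletion (physically removed later, at compaction)
                table.delete(new Delete(Bytes.toBytes("user-0001")));
            }
        }
    }
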
HOW TO USE IT
HBase Shell
Interactive shell for sending commands to HBase and viewing responses
Uses JRuby
Wraps Java client calls in Ruby
Allows Ruby syntax for commands
Means parameter usage is a little different from most shells
E.g. command parameters are single-quoted
HBase> command 'parameter1', 'parameter2'
Allows scripting written in Ruby/JRuby
Parameters can be passed in to expand functionality
Use for administration, running scripts and small data accesses and quick tests
HOW TO USE IT
Java API
The only first-class citizen
Other languages supported via other ecosystem projects (Thrift and a REST interface)
The Java API is also augmented by other projects
Simply add the HBase JARs and instantiate objects just like any other
Note: HBase stores everything as byte arrays
Many API methods require a byte array as the value
Use the utility class to do the conversions
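
The utility class in question is org.apache.hadoop.hbase.util.Bytes; a tiny sketch of the round-trip conversions (values invented for the example):

    import org.apache.hadoop.hbase.util.Bytes;

    public class BytesExample {
        public static void main(String[] args) {
            // Strings and numbers both end up as byte arrays in HBase cells
            byte[] name = Bytes.toBytes("Brian");
            byte[] visits = Bytes.toBytes(42L);

            // ...and are converted back when read
            System.out.println(Bytes.toString(name));   // Brian
            System.out.println(Bytes.toLong(visits));   // 42
        }
    }
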

Apache Thrift
Originally developed at Facebook
Supports 14 languages including Java, C++, Python, PHP, Ruby, C#
MANAGING TABLES
Creation
Only the table name and its column families must be specified at creation

Every table must have at least one column family

Any optional settings can be changed later

All tables are in the same namespace

Different from RDBMS table creation
No columns necessary
No strict relationships
No constraints or foreign keys
No namespaces

TABLE CREATION
In the shell
In the Java API
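
The code on these slides isn't captured in the transcript; a minimal sketch of the Java API route, assuming the HBase 1.x admin API (the table and family names are invented - in the shell the equivalent would be roughly: create 'users', 'contactinfo'):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CreateTableExample {
        public static void main(String[] args) throws Exception {
            try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = connection.getAdmin()) {

                // Only the table name and at least one column family are required
                HTableDescriptor table = new HTableDescriptor(TableName.valueOf("users"));
                HColumnDescriptor contactInfo = new HColumnDescriptor("contactinfo");
                contactInfo.setMaxVersions(3);   // optional per-family setting, e.g. versions kept
                table.addFamily(contactInfo);

                if (!admin.tableExists(TableName.valueOf("users"))) {
                    admin.createTable(table);
                }
            }
        }
    }
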
ALTERING TABLES
Entire table can be modified

Column families added, changed or deleted

Table must be in maintenance mode for changes, then enabled when complete

HANDLING TABLES
List - lists all tables in the database

Describe - shows the structure of a table

Disable/enable - allows table to be maintained and blocks all client access

DELETING TABLES
'drop' command
The table must be disabled first
Removes the table from HBase, then deletes all related files in HDFS

'truncate' command
Removes all rows
The table does not need to be disabled first
ACCESSING DATA
Get
Used to retrieve a single row (or specific versions of columns)

Must know the rowkey

Try to constrain the request to as few column families or column descriptors as possible
Speeds up queries
Reduces transfer times
Java API allows for get batching (not a query - grouping a bunch of gets together into a single request)

ACCESSING DATA
Scan
Used when exact row key is not known, or a group of rows needs to be accessed

Can be bounded by a start and stop row key

Can be limited to certain column families or column descriptors

ACCESSING DATA
Put (Adding/Updating)
HBase doesn't distinguish between an insert and an update

If the rowkey doesn't exist, it gets added; if it does exist, it's an update

ACCESSING DATA
Delete/DeleteAll
Marks row/column/column family/column versions for deletion - physically performed later

Can be batched as with Get

Deleteall removes entire row

ACCESSING DATA
Counters
Automatic-increment integer/long counters


Atomic Put and Delete
Checks and verifies previous data
ACCESSING DATA
Co-Processors
Allows developers to extend functionality. Two types:
Endpoints - analogous to stored procedures in RDBMSs, exposes a new method to HBase. Mostly algorithmic features, such as sums, counts, averages, group by or any other more complicated logic that you want performed in HBase and not the client

Observers - analogous to a trigger, hooks into HBase operations. Get, Put and Scan all ship with pre and post methods that observers can hook into to perform an operation before and after the call. E.g. more granular security checks (pre) or secondary indexes (post)

However - very easy to destabilise the entire cluster! Debugging support is also currently quite poor


ACCESSING DATA
Filters
Scans pass back all relevant rows - so there may be work left to do on the client
Filters augment scans by allowing logic to be run on the server before the data are returned; a row is only sent back if it passes the logic
i.e. reducing the data sent back to the client
A filter is made up of:
Which piece of information is to be processed (e.g. column family, column value, etc.)
How to process it (the comparator)
The comparison operator for the result (equal, not equal, greater than, etc.)

Filters can be combined
Lots of built-in ones
Binary, null, regex, substring comparators

Can create your own (a Java class implementing the Filter interface, packaged as a JAR)

ACCESSING DATA
Filters

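
The filter slide itself isn't captured here; a minimal sketch of one of the built-in filters (SingleColumnValueFilter with a substring comparator), assuming the 1.x client API and invented table/column names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.filter.SubstringComparator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FilterExample {
        public static void main(String[] args) throws Exception {
            try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // What to inspect (column family + column), how to compare, and the operator:
                SingleColumnValueFilter filter = new SingleColumnValueFilter(
                        Bytes.toBytes("contactinfo"), Bytes.toBytes("email"),
                        CompareFilter.CompareOp.EQUAL, new SubstringComparator("@example.com"));
                filter.setFilterIfMissing(true);  // skip rows that don't have the column at all

                // The filter runs on the RegionServers, so only matching rows come back
                Scan scan = new Scan();
                scan.setFilter(filter);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }
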
SO WHAT?
WHY WOULD I USE ALL THAT INSTEAD OF MY TRADITIONAL FAMILIAR RDBMS?
WHAT ADVANTAGE(S) DOES HBASE GIVE ME?
THAT ALL SOUNDS WONDERFUL, BUT...
MOSTLY....
ITS ABILITY TO
SCALE
LET'S GO BACK TO THE ARCHITECTURE
SPECIFICALLY, LET'S LOOK AT REGIONS
A TABLE'S REGIONS
Tables are broken into smaller pieces called Regions
A Region is defined by the start and end rows that it serves
A Region is served to a client by a RegionServer
Let's say our table looks like this...
So, how do we know where the data are?
CATALOG TABLES
Up to now, we've been talking about userspace tables - those that store our data
HBase uses special tables called catalog tables (-ROOT- and .META.) to keep track of the locations of the RegionServers and which Regions they hold
1. Ask Zookeeper where the -ROOT- table is

2. The -ROOT- table indicates where the .META. table is

3. .META. lists all the Regions and where they're located

4. Query the RegionServer where the Region for the data is

The client caches the information from steps 1-3
THE ROLE OF HDFS
Remember that HDFS gives us:
High availability for NameNodes to avoid SPoF

Durability for data (stored 3 times across 3 nodes)

Regions can be written/read from anywhere across HDFS
Region Server can run anywhere on the cluster

Storage scalability
Add more nodes
Regions are stored as files in HDFS
AND COLUMN FAMILIES?
Within each Region, every column family has its own Store
Store files are in HFile format, in HDFS
STORE FILE REPLICATION
MEMSTORE
HOW DOES HBASE HANDLE & MANAGE DATA
WRITE PATH
HOW DOES HBASE HANDLE & MANAGE DATA
READ PATH
REGION SPLITS
Regions automatically split when they get too big (>10GB):
At compaction stage, original data rewritten into separate files and original files deleted
The .META. table then updated to reflect newly split region

Region management is a key part of an efficient, scalable HBase implementation
You don't want a high number of regions for a small amount of data
A high region count (>3000) has an impact on performance
But too low a region count prevents parallel scalability
The preferred range is 20 to 200 regions per RegionServer



REGION SPLITS
Load Balancer
Automatically moves regions around to balance cluster load

Compactions
Minor compactions merge small Store Files into a big Store file
Major compaction (usually once per day) merges all store files in a Region and performs other housekeeping (physical deletions, etc)
Provides eventual data locality

Pre-splitting during large data loads
E.g. ensuring 100GB of data loaded into a new table is split into regions of ~10GB each
ROW-KEY DESIGN
RDBMS schema design favours normalisation - aim for no redundant data
Leads to lots of smaller tables (and lots of joins at query time)
HBase typically has only a few, large tables
Denormalised to avoid joins
Small number of column families
With hundreds (even thousands) of columns
Single row key (not many index columns as with RDBMSs)
More effort goes into row key planning for HBase than with RDBMSs
This is a big reason why retrofitting HBase into existing code is not trivial
WHICH MEANS...
Schema design for HBase is
application-centric
, not data/relationship-centric as with RDBMSs

We
must
look closely at how the applications will
access
the data

Understanding the
access pattern
is integral to success
how frequent?
what type of data?
what is the nature of the operations?
more reads than writes?
are writes additions or updates?
etc ...
ACCESS PATTERN: READ-HEAVY
Must
avoid joins

Aim to place all data in a single row (or set of contiguous rows)

Denormalise
(yes, that's not a typo, Denormalise)
one-to-many relationships by pre-materialising them
But remember that updating denormalised data is now a larger task

Optimise
row key and column family layouts
Row keys need to
allow for quick reading
Column families need to be optimised to
use the block cache efficiently

ACCESS PATTERN: WRITE-HEAVY
Optimise row key to spread the load
evenly across the RegionServers
(maximise the I/O throughput of the cluster)

Decide
when joins make sense
depending on how the application needs the data
Denormalising is a
tradeoff
between read speed and write performance

ACCESS PATTERN: HYBRID ACCESS PATTERN
An app may be both read and write heavy on a table in different areas of the application
Or more than one app uses the table

Must consider the primary use and the heavier load, but try to optimise settings as best as possible to improve the rest

SOME EXAMPLE PATTERNS
Time Series and Sequential Data Types
E.g. timestamped log data
Use timestamp as the row key??

Geospatial Data Types

Social Data Types
A high-profile celebrity's twitter feeds in a single region
High read on that region as a result
Other regions without high-profile accounts have much fewer reads


SOME EXAMPLE PATTERNS - ROWKEY DESIGN
For the above examples, a composite rowkey might be a better approach
Instead of <timestamp>, use <timestamp><source><event>
Provides more granularity for scans

Rowkeys cannot be changed - the row must be deleted and re-inserted with the new key

Rows are sorted by rowkey as they are put into the table (meaning scans don't have to do this)

Ordered lexicographically, so numbers need to be padded
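
A small sketch of building a composite, zero-padded row key of the <timestamp><source><event> shape just described (the field widths and delimiter are arbitrary choices for the example):

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyExample {
        // Build <timestamp><source><event> with fixed-width, zero-padded numbers so that
        // lexicographic ordering matches chronological ordering
        static byte[] compositeKey(long timestampMillis, String sourceId, String eventType) {
            String key = String.format("%013d|%-10s|%s", timestampMillis, sourceId, eventType);
            return Bytes.toBytes(key);
        }

        public static void main(String[] args) {
            byte[] key = compositeKey(1412899200000L, "sensor-42", "TEMP");
            System.out.println(Bytes.toString(key));  // 1412899200000|sensor-42 |TEMP
        }
    }
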

ROWKEY DESIGN - IMPACT ON PERFORMANCE
Querying on row keys has highest performance
Can skip store files

Querying on column values has the worst performance
Value-based filtering is a full table scan - each column's value has to be checked
Therefore,
commonly-queried data should be part of the row key


ROWKEY TYPES
Sequential/incremental
ID or time series
Best scan performance
Worst write performance
All writes are likely to hit one Region Server





Salted

Calculated hash in front of the real data to randomise the row key
Scans can ignore the salt, so scans still fast
Write performance improved as salt prefix distributes across RegionServers
ROWKEY TYPES
Promoted field key
Another field is used to prefix the ID or time series key
E.g. <sourceid><timestamp> instead of <timestamp><sourceid>
Still allows scanning by ignoring (or using) promoted field
Improves write performance as prefix can distribute writes across multiple RegionServers




Random (hashed)

E.g. using a one-way hash such as MD5
Worst read performance (have to read all values)
Best write performance as random values are distributed evenly across all RegionServers

ROWKEY TYPES - SUMMARY
MAINTENANCE AND ADMINISTRATION
HBASE
MONITORING - MASTER
HBase
Master
handles many critical functions for the cluster
Coordinates RegionServers and manages failovers
Handles changes to tables and column families (updating .META. accordingly)
Manages region changes (region splits or assignments)

Multiple masters can run at the same time, but only one is active
Others compete if active master fails (updating -ROOT- accordingly)

Has a
LoadBalancer
process
Balances based on number of regions served by a RegionServer
Table-aware and will spread a table's regions across many RegionServers

CatalogJanitor
checks for unused regions to garbage collect

LogCleaner
deletes old WAL files



MONITORING - REGION SERVER
Handles data movement
Retrieves the data for Gets and Scans and returns to client
Stores all data from Puts
Records and marks deletions

Handles major and minor compactions

Handles region splitting

Maintains block cache

Manages Memstore and WAL

Receives new regions from Master (replaying its WAL if necessary)




MONITORING - ZOOKEEPER
Distributed coordination service making HBase highly available

Used to maintain state and configuration information and provide failure detection

SETUP & CONFIG - SIZING
Sizing the cluster is complex

Depends on access pattern

A read-heavy workload has different requirements to a write-heavy workload

Some rules of thumb exist (relating raw disk space to Java heap usage)

SETUP & CONFIG - SECURITY
Built-in security available to limit access and permissions

Requires Kerberos-enabled Hadoop underneath

User permissions granted globally or at a table, column family or column descriptor level
Read, write, execute, create or administer (i.e. manage tables)

Can decrease throughput by 5-10%


REPLICATION
Use built-in
inter-cluster replication
Done on a per-column-family basis
Done by WAL-shipping => asynchronous copy and eventually consistent (so any write that bypasses the WAL will not be auto-replicated)
Zookeeper maintains state about the replication
BUT:
Time differences
between clusters
Bandwidth
(remember, we could be serving lots of data requests while replication is ongoing)
Column family setting changes are not replicated
No 'canary' service or verification process

Or have applications write to all clusters
Maybe you don't want the data structured the same way in the replica
E.g. a primary service database with a reporting database

Or manually copy tables
Several methods...
BACKUP & RESTORE
CopyTable
- copies a table within a cluster or to another

Point in time backup using
mapreduce export
(only data, no metadata)
Restore using mapreduce import

ImportTSV
- delimited data files

LoadIncrementalHFiles
- HFile import for efficient bulk loading

Full backup -
copies HBase directory
in HDFS (cluster must be offline)

Snapshot
- metadata-only backup (references to the files, not specific row/column data)
Can then be exported to another cluster (bypasses the RegionServers, using HDFS directly)

HBASE ECOSYSTEM
A bunch of projects and tools exist to extend/enhance native HBase functionality

OpenTSDB - a time-series database

Kiji

Hive - allows SQL and database knowledge to be re-used, instead of programming
OpenTSDB
Good for managing metrics (real-time stats, SLA times and data points) - used by StumbleUpon, Box and Tumblr
KIJI
Centralises data types and deserialises them
AND LASTLY... CERTIFICATION
Cloudera offer HBase certification - the only certification for HBase

"
Cloudera Certified Specialist in Apache HBase (CCSHB)
"