Big Data with Hadoop
A LITTLE LOOK AT BIG DATA WITH HADOOP AND ITS ECOSYSTEM
Velocity
- generating data faster than ever
Variety
- producing a wide variety of data formats (and not always clean data)
Volume
- shares, Facebook likes, photos, tweets, Google indexing, credit card transactions, etc.
Goals of Hadoop
- Designed to be scalable and economical
Adding load results in graceful decline of performance of individual jobs, not failure of the system
Increasing resources supports a proportional increase in load capacity
- Distributed and fault-tolerant from day one
Developers don't need to worry about network programming, temporal dependencies or low-level fault handling
- No special hardware required
- "Shared Nothing" architecture
- Core concepts originated at Google (the GFS and MapReduce papers)
3 Key Parts
Hadoop Distributed File System (HDFS)
- Storage
MapReduce
=> Data Processing
Infrastructure
to make them work (file system administration utilities, job scheduling, monitoring, etc.)
Key Terms
Cluster
- a collection of servers running Hadoop software. More nodes means increased scale; scale out incrementally (and elastically, e.g. on AWS). A cluster can hold several thousand nodes (e.g. Yahoo in 2010 had a 4,000-node cluster storing 14PB).
Node
- an individual server within a cluster. A node stores and processes data
But...
More nodes => greater chance that one will fail
Redundancy is built into the system, which handles failures automatically
If a node fails, its workload will be assumed by another still-functioning node
If the failed node recovers, it can rejoin the system without a full restart of the entire system
Files loaded into HDFS are replicated across nodes in the cluster as they enter the system
If a node fails, its data is re-replicated using one of the other copies
Data processing jobs are broken into small tasks
Each takes a small amount of data
Tasks run in parallel across nodes
If a node fails, its tasks are rescheduled elsewhere
All automatically
Replication is configurable - the default is 3 copies
HDFS
Inexpensive (standard industry commodity hardware - cost per GB continues to decrease) and reliable (redundancy) storage for massive amounts of data
Optimised for sequential access to a small number of large files (typically hundreds of MB to multiple GB)
Highly available, with no single point of failure
Usage is similar to UNIX paths (e.g. /data/lpt/returns.txt)
Similar file ownership/permissions
But: cannot modify files once written
Master/Slave architecture
Master(s) = NameNodes - manage namespace and metadata, and monitor slaves
Slaves = read and write actual data
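To make the HDFS usage above concrete, here is a minimal sketch in Java using the Hadoop FileSystem API. It assumes the cluster configuration files are on the classpath; the local file name and HDFS paths (reusing /data/lpt/returns.txt from above) are illustrative only.

```java
// Minimal sketch: basic HDFS interaction via the Hadoop FileSystem API.
// Assumes core-site.xml/hdfs-site.xml are on the classpath; paths are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTour {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads the cluster configuration
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS

        // Copy a local file into HDFS (files are write-once: no in-place edits)
        fs.copyFromLocalFile(new Path("/tmp/returns.txt"),
                             new Path("/data/lpt/returns.txt"));

        // List the directory, much like 'ls' in UNIX
        for (FileStatus status : fs.listStatus(new Path("/data/lpt"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```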
MapReduce
Programming model/pattern, not a language (typically Java, but can be others)
Simple, efficient, flexible and scalable
Spreads load efficiently across many nodes
Two functions - map and reduce (run in that order)
Map used to filter, transform or parse the data
Reduce (optional) used to summarise data from the map function (e.g. aggregation)
Each is simple, but powerful when combined
Master/slave architecture (YARN cluster management framework)
Master is "Resource Manager" - allocates cluster resources for a job, starts application master
Slaves run Node Managers; each job gets an Application Master, which divides the job into tasks and assigns them to Node Managers. The Node Managers start the tasks, do the actual work and report status back to the Application Master
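As a concrete illustration of the map and reduce functions described above, here is a minimal word-count sketch using the standard Hadoop Java MapReduce API. Class names are illustrative; input and output paths are taken from the command line.

```java
// Minimal word-count sketch: the map function parses/filters input lines,
// the reduce function summarises (sums) the values emitted for each key.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                 // aggregate the counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework handles splitting the input, scheduling the tasks across nodes and re-running any task whose node fails, exactly as described above.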
Ecosystem
Apache Pig
: high-level data processing (an abstraction over low-level MapReduce code). Especially good at joining and transforming data. Scripts are written in the Pig Latin language
Apache Hive
: Another MapReduce abstraction, uses SQL-like language called HiveQL
Both Pig and Hive run on a local client machine, convert into MapReduce jobs and submit the job to the cluster
Cloudera Impala
: Massively parallel SQL engine on Hadoop; Can query data stored in HDFS or HBase; typically 10 times faster than Hive or MapReduce; Uses high-level query language
Apache HBase
: "the Hadoop database" - up to PB in a table; 1000s of columns in table; scale provides high write throughput (hundreds of thousands of inserts per second); NoSQL, uses API, not a querying language;
Apache Sqoop
: For exchanging data between relational DB and Hadoop (and vice versa)
Apache Flume
: imports data into HDFS as it is generated by the source (program output, UNIX syslog, log files, IoT devices, sensors, etc.)
Apache Oozie
: for managing process workflows - coordinates execution and control of individual jobs; these can include MR jobs, Pig/Hive scripts, Java or shell programs, HDFS commands, remote commands and email messages
APACHE PIG
Platform for data analysis and processing on Hadoop (alternative to writing MapReduce code directly)
Originally a Yahoo project
Main components:
Data flow language - Pig Latin
Interactive shell for executing Pig Latin statements - Grunt
Pig interpreter and execution engine
KEY FEATURES
HDFS manipulation
UNIX shell commands
Relational operations
Positional references for fields
Common mathematical functions
Support for custom functions and data formats
Complex data structures
USE CASES
For
data analysis
Finding records in a massive data set - e.g. extracting valuable info out of web server log files
Querying multiple data sets
Calculating values from input data
Data sampling
For
data processing
Reorganising data
Joining data from multiple sources
ETL
USING PIG
Interactively, using Grunt shell
Useful for ad-hoc inspection of data
Useful for interacting with HDFS and UNIX
A Pig script (".pig")
Useful for automation and batch execution
Run it directly from UNIX shell
Use the "Local mode" for development/testing before deploying a job to production
PIG LATIN
A data flow language - a flow of data is expressed as a sequence of statements
PIG LATIN
Made up of:
Keywords
-- LOAD AS, STORE INTO, DUMP, FILTER BY, DESCRIBE, FOREACH, GENERATE, DISTINCT, ORDER BY, LIMIT, GROUP BY, GROUP ALL, FLATTEN, SPLIT, etc.
The interpreter knows how to convert these into MapReduce
Identifiers
- used like variables in a traditional program
Comments
Built-in field-level functions
- UPPER, TRIM, RANDOM, ROUND, SUBSTRING, SIZE, TOKENIZE, etc.
Built-in aggregation and relational functions
- SUM, AVG, MIN, MAX, COUNT, COUNT_STAR, DIFF, IsEmpty, CROSS, UNION, COGROUP, JOIN (LEFT|RIGHT|FULL) OUTER, etc.
Operators
-- arithmetic, comparison, IS (NOT) NULL, boolean -- similar to SQL
Data types
-- int, long, float, double, boolean, datetime, chararray, bytearray, map
Loaders
-- e.g. PigStorage, HCatalog
Fields, Tuples (collections of Fields), Bags (collections of Tuples), Relations (named Bags)
EXTENDING PIG
Supports parameters, passed in at runtime (either from the command line or from a file)
Supports macros (similar to a function in a programming language)
A macro can be imported into whichever scripts need it
Supports user-defined functions written in Java, Python, JavaScript, Ruby or Groovy
For Java, package the function into a JAR file, REGISTER it in a Pig script (optionally providing an alias), and invoke it using its fully-qualified classname (see the sketch at the end of this section)
Community-contributed UDFs are available from the Piggy Bank and DataFu
Piggy Bank ships with Pig
Examples from Piggy Bank include ISOToUnix, UnixToISO, LENGTH, DiffDate
Examples from DataFu include Quantile, Median, Sessionize, HaversineDistInMiles
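As a sketch of the Java UDF route described above, here is a minimal EvalFunc that upper-cases a chararray field. The package and class names are illustrative.

```java
// Minimal Pig UDF sketch: an EvalFunc that upper-cases its first argument.
// Package the class into a JAR, REGISTER it in the Pig script and call it
// by its fully-qualified classname.
package com.example.pig;   // illustrative package name

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                      // Pig treats null as "no value"
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

In a script this would be invoked roughly as REGISTER myudfs.jar; followed by a FOREACH ... GENERATE com.example.pig.UpperCase(name); (JAR and field names again illustrative).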
ADVANCED USAGE
Data sampling - SAMPLE and ILLUSTRATE (more intelligent)
Dry run command option - used to see the script after parameters and macros are processed
EXPLAIN plan
APACHE HIVE
High-level abstraction on top of map-reduce
Uses a SQL-like language called HiveQL (or HQL)
Generates MR jobs that run on Hadoop cluster
Originally developed at Facebook, now an open-source Apache project
WHY USE IT?
More productive than writing MR directly
Re-use existing SQL knowledge / no software development experience required
Interoperable (extensible via Java and external scripts)
HOW DOES IT WORK?
Queries operate on tables just like an RDBMS
A table is simply an HDFS directory containing one or more files
Table belongs to a 'database' (i.e. schema in Oracle land)
Views supported
Many formats for data storage/retrieval
Structure and location of files specified when you create the table
=> Metadata, stored in Hive's metastore
Maps raw data in HDFS to named columns of specific types
BUT... IT'S NOT A DATABASE!
No transactions
No modifications of existing records
USE CASES
Log file analysis
Sentiment analysis
HOW TO USE IT
Hive Shell
An interactive command-line tool: enter statements at the "hive>" prompt, or call scripts to execute them
Can also execute system commands
HOW TO USE IT
HUE
Web-based UI called Beeswax
Features include creating/browsing tables, running queries, saving queries for later execution
HOW TO USE IT
HiveServer2
Centralised Hive server
JDBC/ODBC connection
Kerberos authentication
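To show what the HiveServer2 JDBC route looks like in practice, here is a minimal Java sketch. The hostname, port (10000 is the usual HiveServer2 default), database, credentials and the table being queried are all illustrative assumptions.

```java
// Minimal sketch: querying Hive through HiveServer2 over JDBC.
// Host, port, credentials and the queried table are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL is sent as ordinary SQL text; Hive turns it into MapReduce jobs
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT col, COUNT(*) FROM logs GROUP BY col")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```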
SYNTAX
DATA TYPES
BUILT-IN FUNCTIONS
DATA MANAGEMENT
'Schema on read' - data not validated on insert - files simply moved into place
Loading data into tables very fast
Errors in file format discovered when queries performed
Can load via Hadoop commands (command line), Hive commands, or Sqoop (from an existing RDBMS)
Altering
Usual DROP database/table supported
RENAME columns and tables
MODIFY col type
REORDER and REPLACE columns
Can store query results in another table (INSERT OVERWRITE and INSERT INTO)
And create a table based on a select (CTAS)
Can write output to filesystem (HDFS and local client)
TABLES
ADVANCED TEXT PROCESSING
MORE TEXT PROCESSING
SERDE
SENTIMENT ANALYSIS
OPTIMISATION
When it comes to query speed...
Fastest involve only metadata
DESCRIBE table
Next fastest simply read from HDFS
SELECT * FROM table
Then, a query that requires a map-only job
SELECT * FROM table WHERE col = value
Then, a query that requires both map and reduce phases
SELECT COUNT(x) FROM table WHERE col = value
Slowest require multiple MapReduce jobs
SELECT x, COUNT(y) AS alias FROM table GROUP BY x ORDER BY y LIMIT 10
OPTIMISATION
Other tools:
Execution plan (EXPLAIN)
Sorting (ORDER BY vs SORT BY)
Parallel execution
Local execution
Partitioning
Bucketing
Indexing (much more limited than in an RDBMS, and comes at the cost of increased disk and CPU usage)
EXTENDING HIVE
SerDes for reading and writing records - control the row format of a table
Field delimiters, Regex, Columnar (e.g. for RCFile), HBase
TRANSFORM USING
Transform and manipulate data through external scripts/programs
Like the STREAM operator we saw earlier with Pig
User-defined functions
Written in Java (only, for the moment)
Import and register a JAR
Plenty of UDFs available on the web
Parameterised queries
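As a sketch of the Java UDF mechanism mentioned above, here is a minimal function using the classic org.apache.hadoop.hive.ql.exec.UDF base class (the simplest of Hive's UDF APIs). The package and class names are illustrative.

```java
// Minimal Hive UDF sketch: strips leading/trailing whitespace from a string.
// Build into a JAR, then in Hive use ADD JAR and CREATE TEMPORARY FUNCTION to register it.
package com.example.hive;   // illustrative package name

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class StripSpaces extends UDF {
    // Hive resolves the evaluate() method by reflection
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().trim());
    }
}
```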
CLOUDERA IMPALA
High performance SQL engine designed for vast amounts of data
Massively parallel-processing (MPP)
Query latency measured in milliseconds
Runs on Hadoop clusters, can query HDFS or HBase tables
100% open source
Supports a subset of SQL-92, with a few extensions (almost identical to HiveQL)
Similar interfaces - command line, Hue web app, ODBC/JDBC
Uses same metastore as Hive (so tables created there are visible in Impala, and vice versa)
WHY USE IT?
More productive than writing MapReduce
No software development required, just SQL
Leverage existing SQL knowledge
Highly optimised for querying
Supports user-defined functions in Java or C++, and can re-use Hive's
Almost always at least 5x faster than Pig or Hive (often >20x)
SURELY NOT! HOW?
Hive and Pig answer queries by running MapReduce jobs
MR is a general purpose computation framework, not an optimised interactive query execution tool
It therefore introduces overhead and latency (even a trivial query can take 10 seconds or more)
Impala does not use MapReduce
Uses a custom execution engine allowing queries to complete in a fraction of a second
Exploits its metastore to go direct to HDFS for the data it needs
Hive and Pig suited to long-running batch processes such as data transformation tasks
Impala best for interactive/ad-hoc queries
USE CASES
Business Intelligence
Ad-hoc queries and reports
COMPARED TO AN RDBMS
BUT IT'S CURRENTLY MISSING SOME STUFF...
Complex data types such as arrays, maps and structs
BINARY data types
External transformations
Custom SerDe's
Indexing
Bucketing and table sampling
ORDER BY requires a LIMIT
Fault tolerance - if a node fails, the query fails
TO SUMMARISE
MapReduce = low-level processing and analysis
Pig = procedural data flow language executed using MR
Hive = SQL-based queries executed using MR
Impala = high performance SQL-based queries using custom execution engine
COMPARISON
COMPARED AGAINST RDBMS
WOULD YOU USE THESE INSTEAD OF AN RDBMS?
Not if RDBMS is used for its intended purpose:
Small amounts of data (relatively)
Immediate results
In-place modification (update/delete) of data
Pig, Hive and Impala are optimised for large amounts of read-only data
Extensive scalability at low cost
Pig and Hive better for batch processing
Impala and RDBMSs better for interactive use
APACHE HBASE
A NoSQL database that runs on top of HDFS
What is NoSQL? Defined by an absence of a property? Then NoRel would be more appropriate!
No, read it as Not Only SQL.
Departs from relational model altogether
Most NoSQL implementations do not attempt to provide ACID properties, in order to improve performance and scale
Different types
Column, Document, Key-Value, Graph
HBase is a column-oriented data store
High speed, highly scalable, low complexity, moderate flexibility
This design makes it highly available and fault tolerant
Open source Apache project
Based originally on Google's BigTable (the storage system behind Google's web index, among other services)
WHAT YOU USE IT FOR
Massive amount of data - 100s of GB to petabytes (e.g. storing the entire internet)
High read and write throughput - 1000s/second per node (e.g. Facebook messages - 75 billion operations per day, 1.5 million operations per second)
Scalable in-memory caching
Key lookups (no penalty for sparse columns)
Random write, random read, or both (but not neither)
1000s of operations per second on very large datasets
Access patterns are well understood and simple
Sparse data - rows with missing or differing columns
WHEN NOT TO USE IT
Only appends are done to the dataset, and the whole thing is read when processing
Ad-hoc analysis (i.e. ill-defined access patterns)
Small enough data that it all fits on one large node
Need transactions
WHO USES IT?
eBay
Facebook
StumbleUpon
TrendMicro
Twitter
etc...
COMPARISON WITH RDBMS
It is a database, but not a traditional RDBMS
A traditional RDBMS gives you neatly laid-out tables and columns, a powerful query language and transactions; HBase does not
To scale an RDBMS you typically need to partition or shard manually
HBase does this automatically
In an RDBMS, normalisation and joining to retrieve data are commonplace
HBase doesn't support explicit joins
Implicit joins from column families if necessary
HBase data is not normalised - redundancy and duplication are common
SHOULD HBASE REPLACE MY RDBMS?
Not a drop-in replacement!
Requires different mindset
Significant re-architecting of application
Major differences
Data layout
Data access
RDBMS schema design:
Determine all types of data to be stored
Determine relationships between the data elements
Create tables, columns and foreign keys
"Relationship-centric" design
HBase design:
Determine types of data to be stored
Create data layouts and keys
"Data-centric" design
SO, SHOULD I USE IT?
Not a silver bullet
Will solve data access issues, but not issues further up the stack (e.g. application scaling or design limitations)
It's very good at certain tasks, but like any tool, difficult/impossible/inefficient to use for others
Approaching it with an RDBMS mindset will lead to failure.
HBASE IN DEPTH
Stores data in tables (it is a database)
Tables in HBase are sorted, distributed maps (think of associative arrays or hashes in your programming language)
Data is split into HDFS blocks and stored on multiple nodes in the cluster
Tables themselves are split into Regions
Region = section of the table
Similar to a shard or partition in a RDBMS
HBASE IN DEPTH
Regions are served to clients by RegionServer daemons
A RegionServer runs on each slave node in the cluster
Typically serves multiple regions, from several tables
Unlikely that one RS will serve all regions for a particular table
The HBase Master daemon coordinates the RegionServers and tracks which Regions are managed by each RegionServer
This changes as data gets added/deleted
Handles new table creation and other housekeeping ops
Can be multiple Masters for high availability, but only one runs the cluster at a time
HBASE IN DEPTH
Zookeeper
Handles coordination of master
ZK daemons run on master nodes
Masters compete to run the cluster - first to connect wins control
If controlling master fails, remaining masters compete again to run cluster
HBASE TABLES
Rows, columns and column families
Every row has a row key (analogous to primary key in RDBMS)
Rows are stored sorted by row key for fast lookup
Columns hold the data for the table
Columns can be created on the fly
A column exists for a particular row ONLY IF it has data in that row
The table's cells are arbitrary arrays of bytes
HBASE TABLES
All columns belong to a column family
A column family is a collection of columns
A table has one or more column families, created as part of table definition
All column family members have the same prefix
e.g. for "contactinfo:fname", the ":" is a delimiter between the column family name ("contactinfo") and the column name ("fname")
Tuning and storage settings can be specified for each column family
E.g. number of versions of each cell to be stored
Columns within a family are sorted and stored together
SO, WHY COLUMN FAMILIES?
Separate column families are useful for:
Data that is not frequently accessed together
Data that uses different column family options, e.g. compression
Some attributes
Compression - GZ, LZO, Snappy
Versions to be kept (>=1)
TTL - automatically deleted after specified time (X seconds to FOREVER)
Min_Versions (if TTL set) (0+)
Blocksize (1b - 2GB)
In_Memory
BlockCache
BloomFilter - improves read performance
DATA IN HBASE
HBase can store anything that can be converted to an array of bytes
Strings, numbers, complex objects, images, etc.
Practical limits to size of values
Typically, cell size should not consistently be above 10MB
Physically stored on a per-column family basis
Empty cells are not stored
HBase table = distributed sorted map
HBASE OPERATIONS
Get = retrieves a single row using the row key
Scan = retrieves all rows
Scan can be constrained to retrieve rows between a start row key and end row key
Put = add a new row, identified by row key
Delete = marks data as having been deleted
Removes the row identified by row key
Not removed from HDFS during the call, but is marked for deletion, which happens later
Increment = set or increment an atomic counter
Only for cells containing values of the long datatype
Atomicity allows for concurrent access without fear of corruption or overwriting with a previous number
Consistency is synchronised by the RegionServer, NOT the client
HOW TO USE IT
HBase Shell
Interactive shell for sending commands to HBase and viewing responses
Uses JRuby
Wraps Java client calls in Ruby
Allows Ruby syntax for commands
This means parameter usage is a little different from most shells
E.g. command parameters are single-quoted
hbase> command 'parameter1', 'parameter2'
Allows scripting written in Ruby/JRuby
Parameters can be passed in to expand functionality
Use for administration, running scripts and small data accesses and quick tests
HOW TO USE IT
Java API
The only first-class citizen
Other languages supported via other ecosystem projects (Thrift and a REST interface)
Java API also augmented by other projects
Simply add the HBase JARs and instantiate objects just like any other
Note: HBase stores everything as byte arrays
Many API methods require byte array as value
Use the Bytes utility class to do conversions (see the sketch at the end of this section)
Apache Thrift
Originally developed at Facebook
Supports 14 languages including Java, C++, Python, PHP, Ruby, C#
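A minimal sketch of the Java API and the Bytes conversions mentioned above, using the HTable-based client of the HBase versions current when this deck was written. The table name ("people"), the "contactinfo" column family and the row keys are illustrative.

```java
// Minimal sketch of the HBase Java client: Put, Get and a bounded Scan.
// Uses the classic HTable API; table, family and row names are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        HTable table = new HTable(conf, "people");

        // Put: insert or update a row (HBase doesn't distinguish the two)
        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("fname"), Bytes.toBytes("Brian"));
        table.put(put);

        // Get: fetch a single row by its row key
        Result result = table.get(new Get(Bytes.toBytes("row-0001")));
        byte[] value = result.getValue(Bytes.toBytes("contactinfo"), Bytes.toBytes("fname"));
        System.out.println("fname = " + Bytes.toString(value));   // everything is byte arrays

        // Scan: bounded by start/stop row keys, limited to one column family
        Scan scan = new Scan(Bytes.toBytes("row-0001"), Bytes.toBytes("row-1000"));
        scan.addFamily(Bytes.toBytes("contactinfo"));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        } finally {
            scanner.close();
        }
        table.close();
    }
}
```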
MANAGING TABLES
Creation
Only the table name and its column families must be specified at creation
Every table must have at least one column family
Any optional settings can be changed later
All tables are in the same namespace
Different from RDBMS table creation
No columns necessary
No strict relationships
No constraints or foreign keys
No namespaces
TABLE CREATION
In the shell
In the Java API
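A sketch of creating a table through the Java API with the HBaseAdmin class, matching the rules above (only the table name and at least one column family are required; optional settings can be changed later). The table and family names are illustrative; the shell equivalent is shown as a comment.

```java
// Minimal sketch: creating an HBase table from Java with HBaseAdmin.
// Shell equivalent (roughly): create 'people', 'contactinfo'
// Table and column family names are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreatePeopleTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Only the table name and its column families are required up front
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("people"));
        HColumnDescriptor contactInfo = new HColumnDescriptor("contactinfo");
        contactInfo.setMaxVersions(3);        // optional per-family setting
        table.addFamily(contactInfo);

        admin.createTable(table);
        admin.close();
    }
}
```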
ALTERING TABLES
Entire table can be modified
Column families added, changed or deleted
Table must be disabled (maintenance mode) for changes, then re-enabled when complete
HANDLING TABLES
List - lists all tables in the database
Describe - shows the structure of a table
Disable/enable - allows table to be maintained and blocks all client access
DELETING TABLES
'drop' command
Table must be disabled first
Removes the table from HBase, then deletes all related files in HDFS
'truncate' command
Removes all rows
Table does not need to be disabled first
ACCESSING DATA
Get
Used to retrieve a single row (or specific versions of columns)
Must know the rowkey
Try to constrain the request to as few column families or column descriptors as possible
Speeds up queries
Reduces transfer times
Java API allows for get batching (not a query - grouping a bunch of gets together into a single request)
ACCESSING DATA
Scan
Used when exact row key is not known, or a group of rows needs to be accessed
Can be bounded by a start and stop row key
Can be limited to certain column families or column descriptors
ACCESSING DATA
Put (Adding/Updating)
HBase doesn't distinguish between an insert and an update
If the rowkey doesn't exist, the row is added; if it does exist, it's an update
ACCESSING DATA
Delete/DeleteAll
Marks row/column/column family/column versions for deletion - physically performed later
Can be batched as with Get
Deleteall removes entire row
ACCESSING DATA
Counters
Atomically incremented long counters
Atomic check-and-Put and check-and-Delete operations, which verify existing data before applying the change (see the sketch below)
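A small sketch of the counter and check-and-put calls referred to above, again using the HTable client; the table, family and qualifier names are illustrative.

```java
// Minimal sketch: atomic counters and check-and-put with the HTable client.
// Table, family and qualifier names are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CountersExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics");

        // Atomically add 1 to a long counter cell; returns the new value
        long hits = table.incrementColumnValue(Bytes.toBytes("page-42"),
                Bytes.toBytes("stats"), Bytes.toBytes("hits"), 1L);
        System.out.println("hits = " + hits);

        // Check-and-put: apply the Put only if the current value matches
        Put put = new Put(Bytes.toBytes("page-42"));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("owner"), Bytes.toBytes("team-b"));
        boolean applied = table.checkAndPut(Bytes.toBytes("page-42"),
                Bytes.toBytes("stats"), Bytes.toBytes("owner"),
                Bytes.toBytes("team-a"),   // expected current value
                put);
        System.out.println("checkAndPut applied: " + applied);

        table.close();
    }
}
```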
ACCESSING DATA
Co-Processors
Allows developers to extend functionality. Two types:
Endpoints - analogous to stored procedures in RDBMSs; expose a new method to HBase. Mostly algorithmic features such as sums, counts, averages, group-bys, or any other more complicated logic that you want performed in HBase rather than in the client
Observers - analogous to a trigger, hooks into HBase operations. Get, Put and Scan all ship with pre and post methods that observers can hook into to perform an operation before and after the call. E.g. more granular security checks (pre) or secondary indexes (post)
However - very easy to destabilise the entire cluster! Debugging support is also currently quite poor
ACCESSING DATA
Filters
Scans pass back all relevant rows - the client may still have work to do
Filters augment scans, allowing logic to be run before the data are returned; only rows that pass the logic are sent
i.e. reducing the data sent back to the client
Filters are made up of:
Which piece of information is to be processed (e.g. column family, column value, etc.)
How to process the filter
The comparison operator for the result (equal, not equal, greater than, etc.)
Can be combined
Lots of built-in ones
Binary, null, regex, substring comparators
Can create your own (a Java class implementing the Filter interface, packaged as a JAR)
ACCESSING DATA
Filters
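A sketch of attaching a built-in filter to a scan, so only matching rows come back to the client. SingleColumnValueFilter is one of the built-in filters; the table, column names and value are illustrative.

```java
// Minimal sketch: a Scan with a built-in SingleColumnValueFilter so that
// only rows whose contactinfo:city equals "Dublin" are returned to the client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "people");

        // What to inspect (family + column), how to compare, and the operand
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("contactinfo"), Bytes.toBytes("city"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("Dublin"));

        Scan scan = new Scan();
        scan.setFilter(filter);                 // filtering happens on the RegionServers

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        } finally {
            scanner.close();
        }
        table.close();
    }
}
```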
SO WHAT?
WHY WOULD I USE ALL THAT INSTEAD OF MY TRADITIONAL FAMILIAR RDBMS?
WHAT ADVANTAGE(S) DOES HBASE GIVE ME?
THAT ALL SOUNDS WONDERFUL, BUT...
MOSTLY....
ITS ABILITY TO
SCALE
LET'S GO BACK TO THE ARCHITECTURE
SPECIFICALLY, LET'S LOOK AT REGIONS
A TABLE'S REGIONS
Tables are broken into smaller pieces called Regions
A Region is defined by the start and end rows that it serves
A Region is served to a client by a Region Server
Let's say our table looks like this...
So, how do we know where the data are?
CATALOG TABLES
HBase uses special tables called Catalog tables [-ROOT- and .META.] to keep track of the locations of Region Servers and what Regions these hold
Up to now, we've been talking about userspace tables - those that store our data
1. Ask Zookeeper where -ROOT- table is
2. -ROOT- table indicates where .META. table is
3. .META. lists all the Regions and where they're located
4. Query the RegionServer that holds the region for the data
Client caches information from Steps 1-3
THE ROLE OF HDFS
Remember that HDFS gives us:
High availability for NameNodes to avoid SPoF
Durability for data (replicated 3 times across nodes by default)
Regions can be written/read from anywhere across HDFS
Region Server can run anywhere on the cluster
Storage scalability
Add more nodes
Regions are stored as files in HDFS
AND COLUMN FAMILIES?
Column families are divided into Stores
Files in HFile format, in HDFS
STORE FILE REPLICATION
MEMSTORE
HOW DOES HBASE HANDLE & MANAGE DATA
WRITE PATH
HOW DOES HBASE HANDLE & MANAGE DATA
READ PATH
REGION SPLITS
Regions automatically split when they get too big (>10GB):
At the compaction stage, the original data is rewritten into separate files and the original files are deleted
The .META. table is then updated to reflect the newly split regions
Region management is a key part of an efficient scalable HBase implementation
Don't want a high number of regions for a small amount of data - a high region count (>3000) has an impact on performance
But a low region count prevents parallel scalability
Preferred range is 20 to 200 regions per Region Server
REGION SPLITS
Load Balancer
Automatically moves regions around to balance cluster load
Compactions
Minor compactions merge small Store Files into a big Store file
Major compaction (usually once per day) merges all store files in a Region and performs other housekeeping (physical deletions, etc)
Provides eventual data locality
Pre-splitting during large data loads
E.g. ensuring that 100GB of data loaded into a new table is split into regions of ~10GB each
ROW-KEY DESIGN
RDBMS schema design favours normalisation - aim for no redundant data
Leads to lots of smaller tables (joined at query time)
HBase typically has only a few, large tables
Denormalised to avoid joins
Small number of column families
With hundreds (even thousands) of columns
Single row key (not many index columns as with RDBMSs)
More effort goes into row key planning for HBase than with RDBMSs
This is a big reason why retrofitting HBase into existing code is not trivial
WHICH MEANS...
Schema design for HBase is application-centric, not data/relationship-centric as with RDBMSs
We must look closely at how the applications will access the data
Understanding the access pattern is integral to success
how frequent?
what type of data?
what is the nature of the operations?
more reads than writes?
are writes additions or updates?
etc ...
ACCESS PATTERN: READ-HEAVY
Must avoid joins
Aim to place all data in a single row (or set of contiguous rows)
Denormalise (yes, that's not a typo: denormalise) one-to-many relationships by pre-materialising them
But remember that updating denormalised data is now a larger task
Optimise row key and column family layouts
Row keys need to allow for quick reading
Column families need to be optimised to use the block cache efficiently
ACCESS PATTERN: WRITE-HEAVY
Optimise the row key to spread the load evenly across the RegionServers (maximising the I/O throughput of the cluster)
Decide when joins make sense, depending on how the application needs the data
Denormalising is a tradeoff between read speed and write performance
ACCESS PATTERN: HYBRID ACCESS PATTERN
An app may be both read and write heavy on a table in different areas of the application
Or more than one app uses the table
Must consider the primary use and the heavier load, but try to optimise settings as best as possible to improve the rest
SOME EXAMPLE PATTERNS
Time Series and Sequential Data Types
E.g. timestamped log data
Use timestamp as the row key??
Geospatial Data Types
Social Data Types
A high-profile celebrity's Twitter feed ends up in a single region
High read load on that region as a result
Other regions without high-profile accounts see far fewer reads
SOME EXAMPLE PATTERNS - ROWKEY DESIGN
For the above examples, a composite rowkey might be a better approach
Instead of <timestamp>, use <timestamp><source><event>
Provides more granularity for scans
Rowkeys cannot be changed - row must be deleted and re-inserted with the new key
Rows are sorted by rowkey as they are put into the table (means the scans don't have to do this)
Ordered lexicographically, so numbers need to be padded
ROWKEY DESIGN - IMPACT ON PERFORMANCE
Querying on row keys has highest performance
Can skip store files
Querying on column values has the worst performance
Value-based filtering is a full table scan - each column's value has to be checked
Therefore, commonly-queried data should be part of the row key
ROWKEY TYPES
Sequential/incremental
ID or time series
Best scan performance
Worst write performance
All writes are likely to hit one Region Server
Salted
Calculated hash in front of the real data to randomise the row key
Scans can ignore the salt, so scans still fast
Write performance improved as salt prefix distributes across RegionServers
ROWKEY TYPES
Promoted field key
Another field is used to prefix the ID or time series key
E.g. <sourceid><timestamp> instead of <timestamp><sourceid>
Still allows scanning by ignoring (or using) promoted field
Improves write performance as prefix can distribute writes across multiple RegionServers
Random (hashed)
E.g. using a one way hash such as MD5
Worst read performance (have to read all values)
Best write performance as random values distributed evenly across all RegionServers
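A small sketch of how a salted or promoted-field row key might be built client-side, following the patterns above. The bucket count and key layouts are illustrative design choices, not part of HBase itself.

```java
// Minimal sketch: building salted and promoted-field row keys client-side.
// Bucket count and key layouts are illustrative design choices.
public class RowKeys {

    private static final int SALT_BUCKETS = 16;   // illustrative

    // Salted key: a short hash-derived prefix spreads writes across RegionServers,
    // while a scan can still enumerate all buckets for a given logical key.
    public static String salted(String logicalKey) {
        int bucket = Math.abs(logicalKey.hashCode()) % SALT_BUCKETS;
        return String.format("%02d-%s", bucket, logicalKey);
    }

    // Promoted-field key: lead with a source id so each source's data is contiguous,
    // then a zero-padded timestamp so rows sort chronologically within a source.
    public static String promoted(String sourceId, long timestampMillis) {
        return String.format("%s-%013d", sourceId, timestampMillis);
    }

    public static void main(String[] args) {
        System.out.println(salted("sensor-0042"));
        System.out.println(promoted("sensor-0042", System.currentTimeMillis()));
    }
}
```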
ROWKEY TYPES - SUMMARY
HBASE MAINTENANCE AND ADMINISTRATION
MONITORING - MASTER
The HBase Master handles many critical functions for the cluster
Coordinates RegionServers and manages failovers
Handles changes to tables and column families (updating .META. accordingly)
Manages region changes (region splits or assignments)
Multiple masters can run at the same time, but only one is active
Others compete if active master fails (updating -ROOT- accordingly)
Has a LoadBalancer process
Balances based on number of regions served by a RegionServer
Table-aware and will spread a table's regions across many RegionServers
CatalogJanitor checks for unused regions to garbage collect
LogCleaner deletes old WAL files
MONITORING - REGION SERVER
Handles data movement
Retrieves the data for Gets and Scans and returns to client
Stores all data from Puts
Records and marks deletions
Handles major and minor compactions
Handles region splitting
Maintains block cache
Manages Memstore and WAL
Receives new regions from Master (replaying its WAL if necessary)
MONITORING - ZOOKEEPER
Distributed coordination service making HBase highly available
Used to maintain state and configuration information and provide failure detection
SETUP & CONFIG - SIZING
Sizing the cluster is complex
Depends on access pattern
A read-heavy workload has different requirements to a write-heavy workload
Some rules of thumb exist (relating raw disk space to Java heap usage)
SETUP & CONFIG - SECURITY
Built-in security available to limit access and permissions
Requires Kerberos-enabled Hadoop underneath
User permissions granted globally or at a table, column family or column descriptor level
Read, write, execute, create or administer (i.e. manage tables)
Can decrease throughput by 5-10%
REPLICATION
Use built-in inter-cluster replication
Done on a per-column-family basis
Done by WAL-shipping => asynchronous copy and eventually consistent (so any write that bypasses the WAL will not be auto-replicated)
Zookeeper maintains state about the replication
BUT:
Time differences between clusters
Bandwidth (remember, we could be serving lots of data requests while replication is ongoing)
Column family setting changes not replicated
No 'canary' service or verification process
Alternatively, have applications write to all clusters
Maybe you don't want the data structured the same way in the replica
E.g. a primary service database with a reporting database
Or manually copy tables
Several methods...
BACKUP & RESTORE
CopyTable - copies a table within a cluster or to another cluster
Point-in-time backup using MapReduce export (only data, no metadata); restore using MapReduce import
ImportTSV - imports delimited data files
LoadIncrementalHFiles - HFile import for efficient bulk loading
Full backup - copies the HBase directory in HDFS (cluster must be offline)
Snapshot - metadata-only backup (references to the files, not specific row/column data)
Can then be exported to another cluster (bypasses the RegionServer, using HDFS directly)
HBASE ECOSYSTEM
A bunch of projects and tools exist to extend/enhance native HBase functionality
OpenTSDB - a time-series database
Kiji
Hive - allows SQL and database knowledge to be re-used, instead of programming
OpenTSDB
Good for managing metrics - real-time stats, SLA times and data points (used by StumbleUpon, Box and Tumblr)
KIJI
Centralises data types and deserialises them
AND LASTLY... CERTIFICATION
Cloudera offer HBase certification - the only certification for HBase
"
Cloudera Certified Specialist in Apache HBase (CCSHB)
"Full transcript
by
TweetBrian Hickey
on 10 October 2014Transcript of Big Data with Hadoop
BIG DATA
A little look at
WITH HADOOP AND ITS ECOSYSTEM
Velocity
- generating data faster than ever
Variety
- producing a wide variety (and not always clean)
Volume
- shares, Facebook likes, photos, tweets, google indexing, credit card transactions, etc.
Goals of Hadoop
- Designed to be scalable and economical
Adding load results in graceful decline of performance of individual jobs, not failure of the system
Increasing resource supports proportional increase in load capacity
- Distributed and fault-tolerant from day one
Developers don't need to worry about network programming, temporal
- No special hardware required
- "Shared Nothing" architecture
- Originated in Google
3 Key Parts
Hadoop Distributed File System (HDFS)
- Storage
MapReduce
=> Data Processing
Infrastructure
to make them work (file system administration utilities, job scheduling, monitoring, etc.)
Key Terms
Cluster
- a collection of servers running Hadoop software. More nodes, increased scale. Scale out incrementally (and also elastically, e.g. AWS). Cluster can hold several thousand nodes (e.g. Yahoo in 2010 had a 4000 node cluster using 14PB).
Node
- an individual server within a cluster. A node stores and processes data
But...
More nodes => greater chance that one will fail
Redundancy
built in to system and it handles failure automatically
If a node fails, its workload will be assumed by another still-functioning node
If the failed node recovers, it can rejoin the system without a full restart of the entire system
Files loaded in HDFS are
replicated
across nodes in cluster as they enter the system
If a node fails, its data is re-replicated using one of the other copies
Data processing jobs are broken into small tasks
Each takes a small amount of data
Tasks run in parallel across nodes
If a node fails, its tasks are rescheduled elsewhere
All automatically
Redundancy configurable - default 4 copies
HDFS
Inexpensive (standard industry commodity hardware - cost per GB continues to decrease) and reliable (redundancy) storage for massive amounts of data
Optimised for sequential access to small number of large files (>100MB - >1GB)
Highly available, with no single point of failure
Use similar to UNIX (e.g. /data/lpt/returns.txt)
Similar file ownership/permissions
But: cannot modify files once written
Master/Slave architecture
Master(s) = NameNodes - manage namespace and metadata, and monitor slaves
Slaves = read and write actual data
MapReduce
Programming model/pattern, not a language (typically Java, but can be others)
Simple, efficient, flexible and scalable
Spreads load efficiently across many nodes
Two functions - map and reduce (run in that order)
Map used to filter, transform or parse the data
Reduce (optional) used to summarise data from the map function (e.g. aggregation)
Each is simple, but powerful when combined
Master/slave architecture (YARN cluster management framework)
Master is "Resource Manager" - allocates cluster resources for a job, starts application master
Slaves: one Application Master - divides jobs into tasks and assigns to Node Managers, which start tasks, do actual work, report status back to application master
Ecosystem
Apache Pig
: high-level data processing (abstraction to low-level MapReduce code). Especially good at joining and transforming data. Written in Pig Latin language
Apache Hive
: Another MapReduce abstraction, uses SQL-like language called HiveQL
Both Pig and Hive run on a local client machine, convert into MapReduce jobs and submit the job to the cluster
Cloudera Impala
: Massively parallel SQL engine on Hadoop; Can query data stored in HDFS or HBase; typically 10 times faster than Hive or MapReduce; Uses high-level query language
Apache HBase
: "the Hadoop database" - up to PB in a table; 1000s of columns in table; scale provides high write throughput (hundreds of thousands of inserts per second); NoSQL, uses API, not a querying language;
Apache Sqoop
: For exchanging data between relational DB and Hadoop (and vice versa)
Apache Flume
: imports data into HDFS as it's being generated by the source (program output, UNIX syslog, log files, IoT, sensors, etc
Apache Oozie
: for managing process workflows - coordinates execution and control of individual jobs; includes MR jobs, Pig/Hive scripts, java or shell programs, HDFS commands, remote commands, email messages;
APACHE PIG
Platform for data analysis and processing on Hadoop (alternative to writing MapReduce code directly)
Originally a Yahoo project
Main components:
Data flow language - Pig Latin
Interactive shell for executing Pig Latin statements - Grunt
Pig interpreter and execution engine
KEY FEATURES
HDFS manipulation
UNIX shell commands
Relational operations
Positional references for fields
Common mathematical functions
Support for custom functions and data formats
Complex data structures
USE CASES
For
data analysis
Finding records in a massive data set - e.g. extracting valuable info out of web server log files
Querying multiple data sets
Calculating values from input data
Data sampling
For
data processing
Reorganising data
Joining data from multiple sources
ETL
USING PIG
Interactively, using Grunt shell
Useful for ad-hoc inspection of data
Useful for interacting with HDFS and UNIX
A Pig script (".pig")
Useful for automation and batch execution
Run it directly from UNIX shell
Use the "Local mode" for development/testing before deploying a job to production
PIG LATIN
A data flow language - a flow of data is expressed as a sequence of statements
PIG LATIN
Made up of:
Keywords
-- LOAD AS, STORE INTO, DUMP, FILTER BY, DESCRIBE, FOREACH, GENERATE, DISTINCT, ORDER BY, LIMIT, GROUP BY, GROUP ALL, FLATTEN, SPLIT, etc.
The interpreter knows how to convert these into MapReduce
Identifiers
- used like variables in a traditional program
Comments
Built-in field-level functions
- UPPER, TRIM, RANDOM, ROUND, SUBSTRING, SIZE, TOKENISE , etc.
Built-in aggregation functions
- SUM, AVG, MIN, MAX, COUNT, COUNT_STAR, DIFF, ISEMPTY, CROSS, UNION, COGROUP, JOIN (LEFT|RIGHT|FULL) OUTER, etc.
Operators
-- arithmetic, comparison, is (not) null, boolean-- similar to SQL
Data types
-- int, long, float, double, boolean, datetime, chararray, bytearrays, maps
Loaders
-- e.g. PigStorage, Hcatalog
Fields
,
Tuples
(collection of Fields),
Bags
(collection of Tuples),
Relations
(named Bags)
EXTENDING PIG
Supports
parameters
, passed in at runtime (either from command line or from a file)
Supports
macros
(similar(!!!) to a function in a programming language)
Can import it into whichever scripts need it
Supports
user-defined functions
from Java, Python, JavaScript, Ruby, Groovy
For Java, package into JAR file, REGISTER it in a Pig script (optionally provide an alias for it), and invoke it using fully-qualified classname
Community-contributed UDFs available from the
Piggy Bank
and
DataFu
Piggy Bank ships with Pig
Examples from Piggy Bank include ISOToUnix, UnixToISO, LENGTH, DiffDate
Examples from DataFu include Quantile, Median, Sessionize, HaversineDistInMiles
ADVANCED USAGE
Data sampling -
SAMPLE
and
ILLUSTRATE
(more intelligent)
Dry run
command options - used to see script after parameters and macros are processed
EXPLAIN plan
APACHE HIVE
High-level abstraction on top of map-reduce
Uses a SQL-like language called HiveQL (or HQL)
Generates MR jobs that run on Hadoop cluster
Originally developed at Facebook, now an open-source Apache project
WHY USE IT?
More productive than writing MR directly
Re-use existing SQL knowledge / no software development experience required
Interoperable (extensible via Java and external scripts)
HOW DOES IT WORK?
Queries operate on tables just like an RDBMS
A table is simply a HDFS directory containing one or more files
Table belongs to a 'database' (i.e. schema in Oracle land)
Views supported
Many formats for data storage/retrieval
Structure and location of files specified when you create the table
=> Metadata, stored in Hive's metastore [09-11]
Maps raw data in HDFS to named columns of specific types
BUT... IT'S NOT A DATABASE!
No transactions
No modifications of existing records
USE CASES
Log file analysis
Sentiment analysis
HOW TO USE IT
Hive Shell
interactive command line tool,
enter statements at the "hive>" prompt
or call scripts to execute them
Also execute system commands
HOW TO USE IT
HUE
Web-based UI called Beeswax
Features include creating/browsing tables, running queries, saving queries for later execution
HOW TO USE IT
HiveServer2
Centralised Hive server
JDBC/ODBC connection
Kerberos authentication
SYNTAX
DATA TYPES
BUILT-IN FUNCTIONS
DATA MANAGEMENT
'Schema on read' - data not validated on insert - files simply moved into place
Loading data into tables very fast
Errors in file format discovered when queries performed
Can load from Hadoop (command line), Hive commands, sqoop (from an existing RDBMS)
Altering
Usual DROP database/table supported
RENAME columns and tables
MODIFY col type
REORDER and REPLACE columns
Can store query results in another table (INSERT OVERWRITE and INSERT INTO )
And create a table based on a select (CTAS)
Can write output to filesystem (HDFS and local client)
TABLES
ADVANCED TEXT PROCESSING
MORE TEXT PROCESSING
SERDE
SENTIMENT ANALYSIS
OPTIMISATION
When it comes to Query speed...
Fastest involve only metadata
DESCRIBE table
Next fastest simply read from HDFS
SELECT * FROM table
Then, a query that requires a map-only job
SELECT * FROM table WHERE col = value
Then, a query that requires both map and reduce phases
SELECT COUNT(x) FROM table WHERE col = value
Slowest require multiple MapReduce jobs
SELECT x, COUNT(y) AS alias FROM table GROUP BY x ORDER BY y LIMIT 10
OPTIMISATION
Other tools:
Execution plan (EXPLAIN)
Sorting (ORDER BY vs SORT BY)
Parallel execution
Local execution
Partitioning
Bucketing
Indexing (much more limited than a RDBMS and comes at cost of increased disk and CPU usage)
EXTENDING HIVE
SerDe's for reading and writing records - controls row format of table
Field delimiters, Regex, Columnar (e.g. for RCFile), HBase
TRANSFORM USING
Transform and manipulate data through external scripts/programs
Like STREAM we saw earlier with Pig
User-defined functions
Written in Java (only, for the moment)
Import and register a JAR
Plenty of UDFs available on the web
Parameterised queries
CLOUDERA IMPALA
High performance SQL engine designed for vast amounts of data
Massively parallel-processing (MPP)
Query latency measured in milliseconds
Runs on Hadoop clusters, can query HDFS or HBase tables
100% open source
Supports a subset of SQL-92, with a few extensions (almost identical to HiveQL)
Similar interfaces - command line, Hue web app, ODBC/JDBC
Uses same metastore as Hive (so tables created there are visible in Impala, and vice versa)
WHY USE IT?
More productive than writing MapReduce
No software development required, just SQL
Leverage existing SQL knowledge
Highly optimised for querying
Supports User-Defined Functions in Java or C++
and can re-use Hive's
Almost always at least 5x faster than Pig or Hive (often >20x)
SURELY NOT! HOW?
Hive and Pig answer queries by running MapReduce jobs
MR is a general purpose computation framework, not an optimised interactive query execution tool
It therefore introduces overhead and latency (even a trivial query takes 10 seconds or more)
Impala does not use MapReduce
Uses a custom execution engine allowing queries to complete in a fraction of a second
Exploits its metastore to go direct to HDFS for the data it needs
Hive and Pig suited to long-running batch processes such as data transformation tasks
Impala best for interactive/ad-hoc queries
USE CASES
Business Intelligence
Ad-hoc queries and reports
COMPARED TO AN RDBMS
BUT IT'S CURRENTLY MISSING SOME STUFF...
Complex data types such as arrays, maps and structs
BINARY data types
External transformations
Custom SerDe's
Indexing
Bucketing and table sampling
ORDER BY requires a LIMIT
Fault tolerance - if a node fails, the query fails
TO SUMMARISE
MapReduce = low-level processing and analysis
Pig = procedural data flow language executed using MR
Hive = SQL-based queries executed using MR
Impala = high performance SQL-based queries using custom execution engine
COMPARISON
COMPARED AGAINST RDBMS
WOULD YOU USE INSTEAD OF AN RDBMS?
Not if RDBMS is used for its intended purpose:
Small amounts of data (relatively)
Immediate results
In-place modification (update/delete) of data
Pig, Hive and Impala optimised for
large amounts of read-only data
Extensive scalability at low cost
Pig and Hive better for batch processing
Impala and RDBMSs better for interactive use
A NoSQL database that runs of top of HDFS
What is NoSQL? Defined by an absence of a property? Then NoRel would be more appropriate!
No, read it as Not Only SQL.
Departs from relational model altogether
Most NoSQL implementations do not attempt to provide ACID properties, in order to improve performance and scale
Different types
Column, Document, Key-Value, Graph
HBase is a column-oriented data store
High speed, highly scalable, low complexity, moderate flexibility
Makes it highly available and fault tolerant
Open source Apache project
Based originally on Google's BigTable (what Google uses to search the internet)
WHAT YOU USE IT FOR
Massive amount of data - 100s of GB to petabytes (e.g. storing the entire internet)
High read and write throughput - 1000s/second per node (e.g. Facebook messages - 75billion operations per day, 1.5 million operations per second)
Scalable in-memory caching
Key lookups (no penalty for sparse columns)
Random write, random read, or both (but not neither)
1000s of operations per second on very large datasets
Access patterns are well understood and simple
Missing or different columns
WHEN NOT TO USE IT
Only appends done to dataset, and whole thing is read when processing
Ad-hoc analysis (i.e. ill-defined access patterns)
Small enough data that it all fits on one large node
Need transactions
WHO USES IT?
eBay
StumbleUpon
TrendMicro
etc...
COMPARISON WITH RDBMS
It is a database, but not a traditional RDBMS
Nicely laid-out tables and columns
Powerful query language
Transaction
To scale RDBMS typically need to partition or shard
HBase does this automatically
Normalisation and joining to retrieve data is commonplace
HBase doesn't support explicit joins
Implicit joins from column families if necessary
Not normalised - redundancy and duplication is common
SHOULD HBASE REPLACE MY RDBMS?
Not a drop-in replacement!
Requires different mindset
Significant re-architecting of application
Major differences
Data layout
Data access
RDBMS schema design:
Determine all types of data to be stored
Determine relationships between the data elements
Create tables, columns and foreign keys
"Relationship-centric" design
HBase design:
Determine types of data to be stored
Create data layouts and keys
"Data-centric" design
SO, SHOULD I USE IT?
Not a silver bullet
Will solve data access issues, but not issues further up the stack (i.e. application scaling or design limitations)
It's very good at certain tasks, but like any tool, difficult/impossible/inefficient to use for others
Approaching it with an RDBMS mindset
will lead to failure
.
HBASE IN DEPTH
Stores data in tables (it is a database)
Tables in HBase are
sorted, distributed maps
(think of associative arrays or hashes in your programming language)
Data is split into HDFS blocks and stored on multiple nodes in the cluster
Tables themselves are split into Regions
Region = section of the table
Similar to a shard or partition in a RDBMS
HBASE IN DEPTH
Served to clients by
RegionServer
daemons
A RegionServer runs on each slave node in the cluster
Typically serves multiple regions, from several tables
Unlikely that one RS will serve all regions for a particular table
HBase master daemon coordinates RegionServers
And which Regions are managed by each RegionServer
This changes as data gets added/deleted
Handles new table creation and other housekeeping ops
Can be multiple Masters for high availability, but only one runs the cluster at a time
HBASE IN DEPTH
Zookeeper
Handles coordination of master
ZK daemons run on master nodes
Masters compete to run the cluster - first to connect wins control
If controlling master fails, remaining masters compete again to run cluster
HBASE TABLES
Rows, columns and column families
Every row has a row key (analogous to primary key in RDBMS)
Stored sorted
by row key for fast lookup
Columns hold the data for the table
Columns can be
created on the fly
A column exists for a particular row
ONLY IF
it has data in that row
The table's cell are arbitrary arrays of bytes
HBASE TABLES
All columns belong to a column family
A column family is a collection of columns
A table has one or more column families, created as part of table definition
All column family members have the same prefix
e.g. for "contactinfo:fname", the ":" is a delimiter between the column family name ("contactinfo") and the column name ("fname")
Tuning and storage settings can be specified for each column family
E.g. number of versions of each cell to be stored
Columns within a family are sorted and stored together
SO, WHY COLUMN FAMILIES?
Separate column families are useful for:
Data that is not frequently accessed together
Data that uses different column family options, e.g. compression
Some attributes
Compression - GZ, LZO, Snappy
Versions to be kept (>=1)
TTL - automatically deleted after specified time (X seconds to FOREVER)
Min_Versions (if TTL set) (0+)
Blocksize (1b - 2GB)
In_Memory
BlockCache
BloomFilter - improves read performance
DATA IN HBASE
HBase can store anything that can be converted to an array of bytes
Strings, numbers, complex objects, images, etc.
Practical limits to size of values
Typically, cell size should not consistently be above 10MB
Physically stored on a per-column family basis
Empty cells are
not stored
HBase table = distributed sorted map
HBASE OPERATIONS
Get = retrieves a single row using the row key
Scan = retrieves all rows
Scan can be constrained to retrieve rows between a start row key and end row key
Put = add a new row, identified by row key
Delete = marks data as having been deleted
Removes the row identified by row key
Not removed from HDFS during the call, but is marked for deletion, which happens later
Increment = set or increment an atomic counter
Only for cells containing values of the long datatype
Atomicity allows for concurrent access without fear of corruption or overwriting with a previous number
Consistency synchronised by RegionServer NOT the client
HOW TO USE IT
HBase Shell
Interactive shell for sending commands to HBase and viewing responses
Uses JRuby
Wraps Java client calls in Ruby
Allows Ruby syntax for commands
Means parameter usage a little different than most shells
E.g. command parameters are single-quoted
HBase> command 'parameter1', 'parameter2'
Allows scripting written in Ruby/JRuby
Parameters can be passed in to expand functionality
Use for administration, running scripts and small data accesses and quick tests
HOW TO USE IT
Java API
Only first class citizen
Other languages supported via other ecosystem projects (Thrift and a REST interface)
Java API also augmented by other projects
Simply add the HBase JARs and instantiate objects just like any other
Note: HBase stored everything as byte arrays
Many API methods require byte array as value
Use utility class to do conversions
Apache Thrift
Originally developed at Facebook
Supports 14 languages including Java, C++, Python, PHP, Ruby, C#
MANAGING TABLES
Creation
Only tables and column families must be specified at creation
Every table must have at least one column family
Any optional settings can be changed later
All tables are in the same namespace
Different from RDBMS table creation
No columns necessary
No strict relationships
No constraints or foreign keys
No namespaces
TABLE CREATION
In the shell
In the Java API
ALTERING TABLES
Entire table can be modified
Column families added, changed or deleted
Table must be in maintenance mode for changes, then enabled when complete
HANDLING TABLES
List - lists all tables in the database
Describe - shows the structure of a table
Disable/enable - allows table to be maintained and blocks all client access
DELETING TABLES
'
drop
' command
Must be disabled first
removes the table from HBase, then deletes all related files in HDFS
'
truncate
' command
Removes all rows
Does not need to be disabled
ACCESSING DATA
Get
Used to retrieve a single row (or specific versions of columns)
Must know the rowkey
Try to constrain to the least amount of column families or descriptors as possible
Speeds up queries
Reduces transfer times
Java API allows for get batching (not a query - grouping a bunch of gets together into a single request)
ACCESSING DATA
Scan
Used when exact row key is not known, or a group of rows needs to be accessed
Can be bounded by a start and stop row key
Can be limited to certain column families or column descriptors
ACCESSING DATA
Put (Adding/Updating)
HBase doesn't distinguish between an insert and an update
If the rowkey doesn't exist, it gets added; if does exist, it's an update
ACCESSING DATA
Delete/DeleteAll
Marks row/column/column family/column versions for deletion - physically performed later
Can be batched as with Get
Deleteall removes entire row
ACCESSING DATA
Counters
Automatic-increment integer/long counters
Atomic Put and Delete
Checks and verifies previous data
ACCESSING DATA
Co-Processors
Allows developers to extend functionality. Two types:
Endpoints - analogous to stored procedures in RDBMSs, exposes a new method to HBase. Mostly algorithmic features, such as sums, counts, averages, group by or any other more complicated logic that you want performed in HBase and not the client
Observers - analogous to a trigger, hooks into HBase operations. Get, Put and Scan all ship with pre and post methods that observers can hook into to perform an operation before and after the call. E.g. more granular security checks (pre) or secondary indexes (post)
However - very easy to destabilise the entire cluster! Debugging support is also currently quite poor
ACCESSING DATA
Filters
Scans pass back all relevant rows - may have work to do on the client
Filters augment scans allowing logic to be run before the data are returned, and only send when passes the logic
i.e. reducing data sent back to client
Filters made up of
Which piece of information is to be processed (e.g. column family, column value, etc)
How to process the filter
The operand for the result (equal, not equal, greater than, etc)
Can be combined
Lots of built-in ones
Binary, null, regex, substring comparators
Can create your own (java class implementing the Filter interface - packaged as a jar)
ACCESSING DATA
Filters
SO WHAT?
WHY WOULD I USE ALL THAT INSTEAD OF MY TRADITIONAL FAMILIAR RDBMS?
WHAT ADVANTAGE(S) DOES HBASE GIVE ME?
THAT ALL SOUNDS WONDERFUL, BUT...
MOSTLY....
ITS ABILITY TO
SCALE
LET'S GO BACK TO THE ARCHITECTURE
SPECIFICALLY, LET'S LOOK AT REGIONS
A TABLE'S REGIONS
A Region is served to a client by a Region Server
Tables are broken into smaller pieces called Regions
A Region is defined by the start and end rows that it serves
Let's say our table looks like this...
So, how do we know where the data are?
CATALOG TABLES
HBase uses special tables called Catalog tables [-ROOT- and .META.] to keep track of the locations of Region Servers and what Regions these hold
Up to now, we've been talking about userspace tables - those that store our data
1. Ask Zookeeper where -ROOT- table is
2. -ROOT- table indicates where .META. table is
3. .META. lists all the Regions and where they're located
4. Query RegionServer where the region for the data is
Client caches information from Steps 1-3
THE ROLE OF HDFS
Remember that HDFS gives us:
High availability for NameNodes to avoid SPoF
Durability for data (stored 3 times across 3 nodes)
Regions can be written/read from anywhere across HDFS
Region Server can run anywhere on the cluster
Storage scalability
Add more nodes
Regions are stored as files in HDFS
AND COLUMN FAMILIES?
Each column family within a Region is held in its own Store
Store files are in HFile format, stored in HDFS
STORE FILE REPLICATION
MEMSTORE
HOW DOES HBASE HANDLE & MANAGE DATA
WRITE PATH
HOW DOES HBASE HANDLE & MANAGE DATA
READ PATH
REGION SPLITS
Regions automatically split when they get too big (>10GB):
At compaction stage, original data rewritten into separate files and original files deleted
The .META. table then updated to reflect newly split region
Region management is a key part of an efficient scalable HBase implementation
Don't want a high number of regions for a small amount of data - a high region count (>3000) has an impact on performance
But too low a region count prevents parallel scalability
Preferred range is 20 to 200 regions per Region Server
REGION SPLITS
Load Balancer
Automatically moves regions around to balance cluster load
Compactions
Minor compactions merge small Store Files into a big Store file
Major compaction (usually once per day) merges all store files in a Region and performs other housekeeping (physical deletions, etc)
Provides eventual data locality
Pre-splitting during large data loads
E.g. ensuring a 100GB load into a new table is split across regions of ~10GB each (see the sketch below)
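A sketch of creating a pre-split table with the classic HBaseAdmin API (org.apache.hadoop.hbase.client.HBaseAdmin, plus HTableDescriptor and HColumnDescriptor from org.apache.hadoop.hbase); the "events" table, "d" column family and single-digit split points are illustrative assumptions:

    static void createPreSplitTable(Configuration conf) throws IOException {
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("events");      // hypothetical table
        desc.addFamily(new HColumnDescriptor("d"));
        // 9 split points => 10 initial regions, assuming rowkeys are evenly prefixed with digits
        byte[][] splits = new byte[9][];
        for (int i = 1; i <= 9; i++) {
            splits[i - 1] = Bytes.toBytes(Integer.toString(i));
        }
        admin.createTable(desc, splits);
        admin.close();
    }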
ROW-KEY DESIGN
RDBMS schema design favours normalisation - aim for no redundant data
Leads to lots of smaller tables (joined at query time)
HBase typically has only a few, large tables
Denormalised to avoid joins
Small number of column families
With hundreds (even thousands) of columns
Single row key (not many index columns as with RDBMSs)
More effort goes into row key planning for HBase than with RDBMSs
This is a big reason why retrofitting HBase into existing code is not trivial
WHICH MEANS...
Schema design for HBase is application-centric, not data/relationship-centric as with RDBMSs
We must look closely at how the applications will access the data
Understanding the access pattern is integral to success
how frequent?
what type of data?
what is the nature of the operations?
more reads than writes?
are writes additions or updates?
etc ...
ACCESS PATTERN: READ-HEAVY
Must avoid joins
Aim to place all data in a single row (or set of contiguous rows)
Denormalise (yes, that's not a typo - denormalise) one-to-many relationships by pre-materialising them
But remember that updating denormalised data is now a larger task
Optimise row key and column family layouts
Row keys need to allow for quick reading
Column families need to be optimised to use the block cache efficiently
ACCESS PATTERN: WRITE-HEAVY
Optimise row key to spread the load evenly across the RegionServers (maximise the I/O throughput of the cluster)
Decide when joins make sense depending on how the application needs the data
Denormalising is a tradeoff between read speed and write performance
ACCESS PATTERN: HYBRID
An app may be both read and write heavy on a table in different areas of the application
Or more than one app uses the table
Must consider the primary use and the heavier load, but try to optimise settings as best as possible to improve the rest
SOME EXAMPLE PATTERNS
Time Series and Sequential Data Types
E.g. timestamped log data
Use timestamp as the row key??
Geospatial Data Types
Social Data Types
A high-profile celebrity's twitter feeds in a single region
High read on that region as a result
Other regions without high-profile accounts have much fewer reads
SOME EXAMPLE PATTERNS - ROWKEY DESIGN
For the above examples, a composite rowkey might be a better approach (see the sketch below)
Instead of <timestamp>, use <timestamp><source><event>
Provides more granularity for scans
Rowkeys cannot be changed - row must be deleted and re-inserted with the new key
Rows are sorted by rowkey as they are put into the table (means the scans don't have to do this)
Ordered lexicographically, so numbers need to be padded
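A minimal sketch of building such a composite, zero-padded rowkey (the field widths and the '|' separator are illustrative choices; only the Bytes utility from the earlier sketches is needed):

    // Fixed-width, zero-padded fields so lexicographic order matches chronological order
    static byte[] compositeRowkey(long timestampMillis, String sourceId, String eventType) {
        return Bytes.toBytes(String.format("%013d|%s|%s", timestampMillis, sourceId, eventType));
    }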
ROWKEY DESIGN - IMPACT ON PERFORMANCE
Querying on row keys has highest performance
Can skip store files
Querying on column values has the worst performance
Value-based filtering is a full table scan - each column's value has to be checked
Therefore, commonly-queried data should be part of the row key
ROWKEY TYPES
Sequential/incremental
ID or time series
Best scan performance
Worst write performance
All writes are likely to hit one Region Server
Salted
Calculated hash in front of the real data to randomise the row key
Scans can ignore the salt, so they are still fast
Write performance improved as salt prefix distributes across RegionServers
ROWKEY TYPES
Promoted field key
Another field is used to prefix the ID or time series key
E.g. <sourceid><timestamp> instead of <timestamp><sourceid>
Still allows scanning by ignoring (or using) promoted field
Improves write performance as prefix can distribute writes across multiple RegionServers
Random (fully hashed)
E.g. using a one way hash such as MD5
Worst read performance (have to read all values)
Best write performance as random values distributed evenly across all RegionServers
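Hedged sketches of a salted key and a promoted-field key (the bucket count, separator and field layout are illustrative assumptions):

    // Salted key: a small calculated prefix spreads sequential keys across RegionServers
    static byte[] saltedKey(long timestampMillis, int numBuckets) {
        long salt = Math.abs(timestampMillis) % numBuckets;
        return Bytes.toBytes(String.format("%02d|%013d", salt, timestampMillis));
    }

    // Promoted-field key: <sourceid><timestamp> rather than <timestamp><sourceid>
    static byte[] promotedFieldKey(String sourceId, long timestampMillis) {
        return Bytes.toBytes(String.format("%s|%013d", sourceId, timestampMillis));
    }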
ROWKEY TYPES - SUMMARY
MAINTENANCE
AND
ADMINISTRATION
HBASE
MONITORING - MASTER
The HBase Master handles many critical functions for the cluster
Coordinates RegionServers and manages failovers
Handles changes to tables and column families (updating .META. accordingly)
Manages region changes (region splits or assignments)
Multiple masters can run at the same time, but only one is active
Others compete if active master fails (updating -ROOT- accordingly)
Has a LoadBalancer process
Balances based on number of regions served by a RegionServer
Table-aware and will spread a table's regions across many RegionServers
CatalogJanitor checks for unused regions to garbage collect
LogCleaner deletes old WAL files
MONITORING - REGION SERVER
Handles data movement
Retrieves the data for Gets and Scans and returns to client
Stores all data from Puts
Records and marks deletions
Handles major and minor compactions
Handles region splitting
Maintains block cache
Manages Memstore and WAL
Receives new regions from Master (replaying its WAL if necessary)
MONITORING - ZOOKEEPER
Distributed coordination service making HBase highly available
Used to maintain state and configuration information and provide failure detection
SETUP & CONFIG - SIZING
Sizing the cluster is complex
Depends on access pattern
A read-heavy workload has different requirements to a write-heavy workload
Some rules of thumb exist (relating raw disk space to Java heap usage)
SETUP & CONFIG - SECURITY
Built-in security available to limit access and permissions
Requires Kerberos-enabled Hadoop underneath
User permissions granted globally or at a table, column family or column descriptor level
Read, write, execute, create or administer (i.e. manage tables)
Can decrease throughput by 5-10%
REPLICATION
Use built-in inter-cluster replication
Done on a per column-family basis
Done by WAL-shipping => asynchronous copy and eventually consistent (so any write that bypasses the WAL will not be auto-replicated)
Zookeeper maintains state about the replication
BUT:
Time differences between clusters
Bandwidth (remember, we could be serving lots of data requests while replication is ongoing)
Column family setting changes not replicated
No 'canary' service or verification process
Have applications write to all clusters
Maybe you don't want the data structured the same way in the replica
E.g. Primary service database with a reporting database
Manually copy tables
Several methods...
BACKUP & RESTORE
CopyTable - copies a table within a cluster or to another cluster
Point-in-time backup using MapReduce Export (data only, no metadata)
Restore using MapReduce Import
ImportTsv - loads delimited data files
LoadIncrementalHFiles - HFile import for efficient bulk loading
Full backup - copies the HBase directory in HDFS (cluster must be offline)
Snapshot - metadata-only backup (references to the files, not specific row/column data)
Can then be exported to another cluster (bypasses the RegionServer, using HDFS directly)
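As one illustration, snapshots can also be taken and cloned programmatically; a minimal sketch assuming the classic HBaseAdmin client (snapshot support arrived around HBase 0.94.6; the snapshot and table names are hypothetical):

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.snapshot("orders_snap_20140101", "orders");            // metadata-only, near-instant
    admin.cloneSnapshot("orders_snap_20140101", "orders_copy");  // materialise the snapshot as a new table
    admin.close();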
HBASE ECOSYSTEM
A bunch of projects and tools exist to extend/enhance native HBase functionality
OpenTSDB - a time-series database
Kiji
Hive - allows SQL and database knowledge to be re-used, instead of programming
OpenTSDB
Good for managing metrics - real-time stats, SLA times and data points (used by StumbleUpon, Box, Tumblr)
KIJI
Centralises data types and deserialises them
AND LASTLY... CERTIFICATION
Cloudera offer HBase certification - the only certification for HBase
"
Cloudera Certified Specialist in Apache HBase (CCSHB)
"