Big Data and NoSQL: Overview

by

Valentin Kropov

on 3 September 2014


Transcript of Big Data and NoSQL: Overview

Big Data
Gartner's definition of BigData
"Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Volume
Velocity:
input
2012: 200 GB/day
2016: 28 TB/day
Sloan Digital Sky Survey
NASA Center for Climate Simulation
Stores 32 petabytes of climate observations and simulations.
Big Data: problems
Data
storing
(cheap and fast storage)
Variety
Velocity:
output
150 million
sensors delivering data 40 million times per second.
Large Hadron Collider
when you can,
keep everything

Data sources
Is it worth it?
$100 billion industry
in 2010, 10% growth per year
Groups of Big Data Technologies
Massively Parallel Processing (MPP)

Distributed Data Storage and Processing Frameworks (Hadoop)

NoSQL technologies (Not-Only SQL)
Massively Parallel Processing (MPP) Data Store / Database
Data partitioned
across multiple servers or nodes
EMC Greenplum
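The "data partitioned across multiple servers" idea can be sketched in a few lines of Python; the node count and key names below are illustrative, not from any particular MPP product:

```python
# Toy hash partitioning: each row is routed to a node by hashing its
# distribution key, so every node holds a disjoint slice of the data.
import hashlib

NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Map a distribution key to a node id deterministically."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NODES

rows = ["order-1001", "order-1002", "order-1003", "order-1004"]
placement = {key: node_for(key) for key in rows}
# The mapping is stable across runs, which is what lets each node
# scan only its own slice when answering a query.
```

Because the hash is deterministic, every node can compute where any key lives without a central lookup table.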
Distributed Data Storage and Processing Frameworks
A software framework for distributed processing of large data sets
Apache Hadoop: main modules
Hadoop Common:
core libraries

HDFS:
distributed file system

Hadoop YARN:
job scheduler and resource manager

Hadoop MapReduce:
parallel processing of large data sets
Hadoop: MapReduce
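The MapReduce model (the original slide showed a diagram) boils down to three phases; a minimal single-process word-count sketch in Python, not actual Hadoop API code:

```python
# Minimal MapReduce word count: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums the counts per word.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data beats opinions"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "ideas": 1, "beats": 1, "opinions": 1}
```

In real Hadoop the map and reduce tasks run on different machines and the shuffle moves data over the network; the logic per phase is the same.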
Hadoop: who uses it?
http://wiki.apache.org/hadoop/PoweredBy
Use Hadoop when:
You have perfect Java programming skills
Hadoop: how it works
NoSQL ("Not Only SQL")
Use MPP when:
Performance
is very important (get answers quickly)
A NoSQL database provides a mechanism for storage and retrieval of data that uses
looser consistency
models than traditional relational databases in order to achieve
horizontal scaling and higher availability
.

Useful when working with a
huge quantity of data
(especially big data) when the data's nature
does not require a relational model
(e.g. Twitter posts, server farm logs, etc.) and when
performance is most important
.
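One concrete way the "looser consistency for higher availability" trade-off shows up is tunable quorums in Dynamo-style stores (e.g. Cassandra, Riak); a sketch of the arithmetic, with illustrative N/R/W values:

```python
# Dynamo-style quorum arithmetic: with N replicas, a write acknowledged
# by W nodes and a read that consults R nodes are guaranteed to overlap
# (and thus observe the latest write) whenever R + W > N.
def read_your_writes(n: int, r: int, w: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n

assert read_your_writes(n=3, r=2, w=2)      # overlapping quorums: consistent reads
assert not read_your_writes(n=3, r=1, w=1)  # faster, but only eventually consistent
```

Lowering R and W buys latency and availability at the price of possibly stale reads, which is exactly the knob these systems expose.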
Key-value
stores
Document-oriented
databases
Graph databases
Big table
structures
Caching data stores
NoSQL
Document store: overview
The central concept of a document store is the notion of a "document". While each document-oriented database implementation differs on the details of this definition, in general, they all assume that documents encapsulate and encode data (or information) in some standard formats or encodings. Encodings in use include XML, YAML, and JSON as well as binary forms like BSON, PDF and Microsoft Office documents (MS Word, Excel, and so on).
{
FirstName:"Bob",
Address:"5 Oak St.",
Hobby:"sailing"
}
Document store: examples
{
FirstName:"Jonathan",
Address:"15 Wanamassa Point Road",
Children:[
{Name:"Michael",Age:10},
{Name:"Jennifer", Age:8},
{Name:"Samantha", Age:5},
{Name:"Elena", Age:2}
]
}
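Note that the two example documents above share no schema; a toy Python sketch of storing and querying them (the `find` helper is invented for illustration, field names follow the slides):

```python
# Toy document store: schema-free dicts, queried by matching fields.
docs = [
    {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"},
    {"FirstName": "Jonathan", "Address": "15 Wanamassa Point Road",
     "Children": [{"Name": "Michael", "Age": 10},
                  {"Name": "Jennifer", "Age": 8},
                  {"Name": "Samantha", "Age": 5},
                  {"Name": "Elena", "Age": 2}]},
]

def find(collection, **criteria):
    """Return documents whose fields match all the given criteria."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

sailors = find(docs, Hobby="sailing")           # matches Bob only
parents = [d for d in docs if "Children" in d]  # documents may differ in shape
```

Real document databases add indexes, nested-field queries, and persistence, but the core contract is the same: documents in one collection need not look alike.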
Document store: main players
NoSQL by popularity
http://db-engines.com/en/ranking
Key-value stores: overview
Key-value stores allow the application developer to store schema-less data. This data usually consists of a string that represents the key and the actual data that is considered the value in the "key-value" relationship.
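The whole contract of a key-value store is get/put/delete on opaque values; a minimal in-memory sketch (the key reuses an example from a later slide):

```python
# Minimal key-value store: string keys, opaque byte values, no schema.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KVStore()
store.put("User1321_description", b'{"role": "admin"}')
value = store.get("User1321_description")
```

The sub-types listed next mostly differ in where the data lives (RAM, disk, hosted) and what extra guarantees they layer on top of this interface.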
Key-value stores: sub-types (page 1)
Key-value stores: sub-types (page 2)
Object database:
db4o
Eloquera
GemStone/S
InterSystems Caché
JADE
NeoDatis ODB
ObjectDB
Objectivity/DB
ObjectStore
OpenLink Virtuoso
Versant Object Database
Wakanda
ZODB
Eventually‐consistent
key‐value store:
Apache Cassandra
Dynamo
Hibari
OpenLink Virtuoso
Project Voldemort
Riak
Hierarchical key–value store:
GT.M
InterSystems Caché
Hosted services:
Freebase
OpenLink Virtuoso
Datastore on Google Appengine
Amazon DynamoDB
Cloudant Data Layer (CouchDB)
Key–value cache in RAM:
memcached
OpenLink Virtuoso
Oracle Coherence
Redis
Hazelcast
Tuple space
Velocity
IBM WebSphere eXtreme Scale
JBoss Infinispan
Key–value stores on solid state or
rotating disk:
Aerospike
BigTable
CDB
Couchbase Server
Keyspace
LevelDB
MemcacheDB (using Berkeley DB)
MongoDB
OpenLink Virtuoso
Tarantool
Tokyo Cabinet
Tuple space
Oracle NoSQL Database
Ordered key–value stores:
Berkeley DB
FoundationDB
IBM Informix C-ISAM
InfinityDB
MemcacheDB
NDBM
Multivalue databases:
Northgate Information Solutions Reality
Extensible Storage Engine
(ESE/NT)
jBASE
OpenQM
Revelation Software's OpenInsight
Rocket U2
D3 Pick database
InterSystems Caché
InfinityDB
RDF database:
Meronymy SPARQL Database Server
Tuple store:
Apache River
OpenLink Virtuoso
Tarantool
Tabular:
Apache Accumulo
BigTable
Apache HBase
Hypertable
Mnesia
OpenLink Virtuoso
Key
User Data
App Config Data
Garbage
User1321_description
User1344_description
Graph databases: overview
This kind of database is designed for data whose relations are well represented as a graph (elements interconnected with an undetermined number of relations between them). The kind of data could be social relations, public transport links, road maps or network topologies, for example.
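The natural query over such data is a traversal; a sketch of breadth-first search over a small social graph (the names and relations are invented):

```python
# Breadth-first search over an adjacency list: the "how many hops of
# relations apart" query a graph database answers natively.
from collections import deque

friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "eve"],
    "eve": ["dave"],
}

def degrees_apart(graph, start, goal):
    """Number of hops between two people, or None if unconnected."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        if person == goal:
            return dist
        for friend in graph.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None

assert degrees_apart(friends, "alice", "eve") == 3  # alice -> bob -> dave -> eve
```

In a relational database this query turns into a chain of self-joins of unknown depth; graph databases store the edges directly so traversal cost does not grow with total table size.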
Graph databases: example
Graph databases: main players
SAP HANA: in-memory database
IBM SAP HANA Server
(100TB of main memory)
SAP HANA DB
(or HANA DB) refers to the database technology itself.

SAP HANA Studio
refers to the suite of tools provided by SAP for modeling.

SAP HANA Appliance
refers to HANA DB as delivered on partner-certified hardware as an appliance. It also includes the modeling tools from HANA Studio as well as replication and data transformation tools to move data into HANA DB.

SAP HANA One
refers to a deployment of SAP HANA certified for production use on the Amazon Web Services (AWS) cloud.

SAP HANA Application Cloud
refers to the cloud based infrastructure for delivery of applications (typically existing SAP applications rewritten to run on HANA).
Reference / links
http://en.wikipedia.org/wiki/Big_data
http://strata.oreilly.com/2012/01/what-is-big-data.html
http://www.gartner.com/newsroom/id/2207915
http://habrahabr.ru/company/jelastic/blog/166845/
http://techcrunch.com/2012/12/06/big-data-leader-cloudera-raises-65m-to-fuel-further-hadoop-adoption/
http://mashable.com/2012/06/19/big-data-myths/
http://www.cubrid.org/blog/dev-platform/database-technology-for-large-scale-data/
http://hadoop.apache.org/
http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/
http://www.ibm.com/developerworks/ru/library/l-hadoop-1/
http://www.cloudera.com/content/cloudera/en/why-cloudera/hadoop-and-big-data.html
http://www.informationweek.co.uk/software/information-management/ibm-answers-oracle-exadata/240008709
http://dwarehouse.wordpress.com/2012/12/28/introduction-to-massively-parallel-processing-mpp-database/
http://www.oracle.com/us/industries/communications/brm-exadata-performance-wp-362789.pdf
http://www-01.ibm.com/software/data/infosphere/hadoop/
http://developer.yahoo.com/blogs/hadoop
http://gigaom.com/2010/09/29/survey-hadoop-is-great-but-challenges-remain-2/
http://www.jaspersoft.com/blog-entry/hadoop-challenges
https://en.wikipedia.org/wiki/NoSQL
http://db-engines.com/en/ranking
https://en.wikipedia.org/wiki/Graph_database
http://www.graph-database.org/
http://www.saphana.com/community/blogs/blog/2013/04/22/massively-parallel-processing-on-hana
http://fcw.com/articles/2013/12/12/cobol-legacy.aspx
Oracle
Big Data Appliance
Pre-integrated full rack configuration with 18 of Oracle's Sun servers

Cloudera distribution including Apache Hadoop to acquire and organize data

Oracle NoSQL Database Community Edition to acquire data

Additional system software including Oracle Linux, Oracle Java Hotspot VM, and an open source distribution of R
Next Steps
Try
installing Hadoop
cluster yourself:

http://hortonworks.com/blog/building-hadoop-vm-quickly-ambari-vagrant/

Download
HortonWorks Hadoop Sandbox
and play with Hadoop products by following the tutorials

http://hortonworks.com/products/hortonworks-sandbox/
Agenda
Introduction into Big Data

Massive Parallel Processing Systems

Distributed Storage and Processing Frameworks

NoSQL Systems Overview
2006:
161 Exabytes (161 million TB)
2012:
2.8 Zettabytes (2.8 billion TB)
2015:
8.5 Zettabytes
2020:
40 Zettabytes
Why is it important?
When you withdraw $100 from an ATM, will you
take only $20 and leave $80?
Key-value stores: distributed databases
328 CPU cores
2TB of RAM
14 Storage Servers
22TB of SDD cache
40GB/sec InfiniBand
224TB of storage
Up to 18 Exadata boxes in cluster
Produced more data in its first few weeks than the entire history of astronomy preceding it.
600 times the information in all books ever written.
If they did not filter all the unnecessary data, it would be
500 exabytes per day
.
Oracle
Exadata: overview
MPP usually...
Proprietary enterprise-level
software
4PB of space! (4096TB)
Really?
Facebook has 100PB
(Aug, 2012)
Usually:
Open-source
Facebook
(1100-machine cluster with 8800 cores and about 12 PB raw storage).
eBay
532 nodes cluster (8 * 532 cores, 5.3PB).
LinkedIn
(~4,100 machines in different clusters)
NetSeer
(1000 instances on Amazon EC2, Data storage in Amazon S3)
Yahoo!
(More than 100,000 CPUs in >40,000 computers running Hadoop)
Real Big Data is required (hundreds of PB)
Your data is structured, semi-structured or unstructured
Especially unstructured
OR YOU DON'T?
Hadoop is not an ugly duckling anymore
Pig
- shell-script-like language to access data
Hive
— SQL-like language to access data
Flume
— a tool to load log-file data to Hadoop
Sqoop
— to load structured data from RDBMS to Hadoop and vice versa
Ambari
— to automate provisioning, maintenance and support
HBase
— NoSQL database
Quick Start (Hortonworks Hadoop Sandbox)
http://hortonworks.com/products/hortonworks-sandbox/
http://goo.gl/LIVxfr
Value
Proprietary enterprise-level
hardware
Integrates
with existing enterprise systems
Designed to work with
structured data
Later
evolved to work with unstructured data
Commodity
hardware (try to buy 1000 Exadata)
Doesn't require
to be structured
Better not to
be structured at all
Each node has its
own memory/processors
to process data locally
All communication is via a
network interconnect
No disk-level sharing or contention to be concerned with (
'shared-nothing'
architecture).
4.4 million new jobs
by the end of 2015 (Gartner, 2012)
US initiates
84 different Big Data programs
Data is
structured or semi-structured
rather than unstructured
... and amount of
data is "small"
(TB rather than PB)
Integration
with existing enterprise software is important
Need to get a
solution ASAP
(days rather than months)
Support
contract is required
You have
money
... a lot of money (average Oracle Exadata server
costs $500,000
)
Scales out
very well (thousands of machines rather than tens)
Very high degree of
fault tolerance
based on
software
’s ability to detect and handle failures at the application layer rather than relying on high-end hardware
Stay in
touch
valentin.kropov
@vallkor
valentine.kropov@gmail.com
99.9999999999%
of data is being filtered.
Which is equal to
all data available on Internet
(2009).
Still leaving them with
50GB per day :)
17 times more than IT engineers in Ukraine
5% growth in other IT industries
While 70% of their IT systems are legacy
Data
processing
(get answers in hours rather than days)
Streaming
processing (get answers in seconds rather than hours)
By installing Hadoop on top of it
:)
Traditional (Relational) Database
ACID
Transactions (Atomicity, Consistency, Isolation, Durability)
Structured and
normalized data
Well-defined
Schema
SQL
is the main data access tool
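The ACID guarantee that NoSQL systems most visibly relax is atomicity: a failed multi-statement change leaves no partial state behind. A sketch with Python's built-in sqlite3 module (table and values invented):

```python
# Atomicity demo: a transfer that fails mid-way is rolled back,
# so the database never exposes a half-applied state.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        raise RuntimeError("crash before crediting bob")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances == {"alice": 100, "bob": 0}: the debit was rolled back
```

Either both the debit and the credit happen, or neither does; that is the contract distributed NoSQL stores typically weaken in exchange for availability.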
80% of the data is
not structured
Use Cases: Healthcare
Healthcare (
1 percent
efficiency gain could yield more than
$63 billion
)
Treat
more patients more efficiently
Validate
and correct course of treatment
Use Cases: Sales and Marketing
Adaptive
price
Next product
to buy
Clickstream
analysis
Use Cases: IBM Watson
Speaks
Reads
AI
Thinks?
Use Cases: Best Use Case!
?
Enabling
new products
Slides: http://goo.gl/fof53B
Veracity