NoSQL & MongoDB

Presentation about the R&D project on NoSQL technology and more specifically MongoDB

Igor Gentil

on 12 August 2013

Transcript of NoSQL & MongoDB

Thinking differently with NoSQL
NoSQL & MongoDB
Igor Gentil
Infrastructure Team
First Thing about NoSQL
Not Only SQL
Storing non-relational data outside of RDBMS
Has sub-types and several implementations
Used to store and manage huge amounts of data (petabytes and up)
Optimized for Parallel Processing and Clustering
Understanding the New Way of Thinking
How Clustering, Sharding and MapReduce can help applications manage BigData
NoSQL Databases
How much data is Big Data?
Data generated per minute on the web:
YouTube: 48h of video uploaded
Instagram: 3,600 new photos
Twitter: 100k new tweets
Apple: 47,000 app download requests
Facebook: 684,478 pieces of content
Sharding means splitting the data between servers so that each server is responsible for a piece - or shard - of the whole dataset.

Each shard is delegated to a server that stores, manages and delivers its data to clients. Shards can be replicated individually for redundancy and read performance.
Like Oracle's table partitioning?
Not quite. In Oracle, each 'shard' of data is stored in a different segment under the same database. In MongoDB, each shard is delegated to a different server, completely independent of the others.
Servers hosting the shards don't need to share anything, only a network connection between each other.
Sharding happens at the Collection level: a field is chosen as the "shard key", and each new document gets routed to the right shard based on that field's value.
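The routing idea can be sketched in plain JavaScript (a toy illustration only, not MongoDB's actual router — the hostnames and key ranges here are made up):

```javascript
// Toy sketch of range-based shard routing. Each shard owns a range of the
// shard key ("_id" here); the router picks the shard whose range contains
// the document's key value.
var shards = [
  { host: "localhost:5001", min: 0,   max: 100 },
  { host: "localhost:5002", min: 100, max: 200 },
  { host: "localhost:5003", min: 200, max: 300 }
];

function routeDocument(doc) {
  for (var i = 0; i < shards.length; i++) {
    if (doc._id >= shards[i].min && doc._id < shards[i].max) {
      return shards[i].host;
    }
  }
  throw new Error("no shard owns key " + doc._id);
}

console.log(routeDocument({ _id: 42 }));  // localhost:5001
console.log(routeDocument({ _id: 150 })); // localhost:5002
```

The same field value always lands on the same shard, which is why choosing a well-distributed shard key matters.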
... and the setup is REALLY easy!
# mongo localhost:3001/admin
> db.runCommand({"addshard" : "localhost:5001"})
> db.runCommand({"addshard" : "localhost:5002"})
> db.runCommand({"addshard" : "localhost:5003"})
> db.runCommand({"enablesharding" : "my_db"})
> db.runCommand({"shardcollection" : "my_db.my_col", "key" : {"_id" : 1}})
Connect to the MONGOS process and start Sharding
NoSQL & BigData
New Way of Thinking
Sharding & MapReduce
Managing amounts of data this big would traditionally require expensive hardware, such as mainframes and SANs. NoSQL technology enables scaling OUT: the work is spread across commodity hardware top to bottom.
No expensive storage arrays or big MPP servers to manage, store and process the data. Everything is done with off-the-shelf hardware running Linux.
Map and Reduce!
Understanding MongoDB
Document Oriented
Runs on Windows, Linux and Unix
Internal language is JavaScript
Supports Clustering, Replication, Sharding and has some internal MapReduce functions
Has drivers to most languages, such as Java, Python, Ruby, PHP, C...
The MongoDB database process is mongod
Each mongod process serves a different database and listens on its own port
Shards are managed by the mongos process, which depends on a Configuration DB
MongoDB's internal structure resembles MySQL's
Each Instance is made of Databases
Each Database is made of Collections
Collections are sets of Documents, that may or may not be alike
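Since collections impose no schema, documents in the same collection can differ in shape. A plain-JavaScript sketch of the idea (the documents and field names here are invented for illustration):

```javascript
// A collection is just a set of documents; they need not share fields.
var employees = [
  { name: "joao",  skills: ["Java", "Scala"] },
  { name: "maria", team: "Infra", flat: 379 } // different fields, same collection
];

// A query simply skips documents that lack the queried field.
var infraTeam = employees.filter(function (doc) {
  return doc.team === "Infra";
});

console.log(infraTeam.length); // 1
```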
Clustering & Replica Sets
Master-Slave setup
Slave can be a Read-Only Database
Processes are explicitly started as master and slave
The evolution of Clustering: Replica Sets
ReplicaSets work like Automatic Clusters, where one of the nodes is elected as Master and, in case of a failure, the surviving nodes elect a new master.
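The election behaviour can be sketched with a toy function (a deliberate simplification: real replica-set elections also weigh member priorities, oplog freshness and more):

```javascript
// Toy replica-set election: an election only succeeds if a majority of the
// ORIGINAL set is still up (this is what prevents split-brain), and the
// surviving node with the highest priority becomes the new master.
function electMaster(nodes) {
  var alive = nodes.filter(function (n) { return n.up; });
  if (alive.length * 2 <= nodes.length) return null; // no majority, no master
  return alive.reduce(function (best, n) {
    return n.priority > best.priority ? n : best;
  });
}

var set = [
  { host: "db1", up: false, priority: 2 }, // old master, crashed
  { host: "db2", up: true,  priority: 1 },
  { host: "db3", up: true,  priority: 1 }
];

console.log(electMaster(set).host); // a surviving node takes over
```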
MongoDB Sharding Architecture
MongoDB Sharding is accomplished by 3 components:
Shard databases (mongod), Config Databases, and Routers (mongos)
Start the Shard DBs, Config DBs and Routers
# mongod --fork --dbpath /data/shard1 --port 5001 --logpath /data/shard1.log
# mongod --fork --dbpath /data/shard2 --port 5002 --logpath /data/shard2.log
# mongod --fork --dbpath /data/shard3 --port 5003 --logpath /data/shard3.log

# mongod --fork --dbpath /data/config --port 4001 --logpath /data/config.log

# mongos --configdb localhost:4001 --port 3001
Documents and Collections
Collections are sets of documents
Documents are sets of Key-Value pairs
Keys are Strings
Values can be anything (including other documents)
Document format is JSON (JavaScript Object Notation)
{
    "name" : "Igor Gentil",
    "team" : "Infra",
    "skills" : ["Oracle", "PL/SQL", "MongoDB", "Unix", "IAS"],
    "address" : {
        "street" : "Rua da República 379",
        "flat" : 379,
        "city" : "Porto Alegre"
    }
}
Thank You!
Working with Data
Adding & Updating
> db.employees.insert({"name" : "joao"})
> db.employees.update({"name" : "joao"}, {"$set" : {"skills" : ["Java", "Scala"]}})
> db.employees.find({"skills" : "Java"})
var customFind = function() {
    for (var i in this.list) {
        if (this.list[i] == this.field) return true;
    }
    return false;
}
> db.employees.find({"$where" : customFind})
Complex query and aggregation tasks must be performed by the application or use MapReduce functions...
MapReduce is a term used to describe the act of splitting a problem into smaller pieces and delegating these pieces to different nodes to be processed - in parallel.

MapReduce greatly reduces - =) - the time taken to accomplish a task using only commodity hardware.
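The idea can be illustrated in plain JavaScript, outside MongoDB (a toy in-memory word count, the canonical MapReduce example — no actual parallelism here):

```javascript
// Toy MapReduce: the map step emits (key, value) pairs, the framework groups
// them by key, and the reduce step collapses each group. In a real cluster,
// each group could be reduced on a different node, in parallel.
function mapReduce(docs, map, reduce) {
  var groups = {};
  docs.forEach(function (doc) {
    map(doc, function emit(key, value) {  // map: doc -> (key, value) pairs
      (groups[key] = groups[key] || []).push(value);
    });
  });
  var result = {};
  for (var key in groups) {               // reduce: collapse each group
    result[key] = reduce(key, groups[key]);
  }
  return result;
}

// Word count over two "documents".
var counts = mapReduce(
  [{ text: "big data" }, { text: "big deal" }],
  function (doc, emit) { doc.text.split(" ").forEach(function (w) { emit(w, 1); }); },
  function (key, values) { return values.length; }
);

console.log(counts); // { big: 2, data: 1, deal: 1 }
```

MongoDB's mapReduce command (shown on the next slide) follows the same map/emit/reduce contract, just expressed as server-side JavaScript.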
MapReduce using MongoDB
map = function() {
    for (var i in this.tags) {
        var recency = 1/(new Date() - this.date);
        var score = recency * this.score;
        emit(this.tags[i], {"urls" : [this.url], "score" : score});
    }
}

reduce = function(key, emits) {
    var total = {urls : [], score : 0};
    for (var i in emits) {
        emits[i].urls.forEach(function(url) { total.urls.push(url); });
        total.score += emits[i].score;
    }
    return total;
}

> db.runCommand({"mapreduce" : "mr", "map" : map, "reduce" : reduce, "out" : "mr_out"})
It's called BIG DATA for a reason...
100 terabytes of data uploaded daily to Facebook.
In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.
94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail.
AT&T has both the largest single database (312 TB) and the largest number of rows in a single database (1.9 trillion).
And there's a lot of money going around it...
-IDC has released a new forecast that shows the big data market is expected to grow from $3.2 billion in 2010 to $16.9 billion in 2015.

-Big data is a top business priority and drives enormous opportunity for business improvement. Wikibon’s own study projects that big data will be a $50 billion business by 2017.

-By the end of 2012 more than 90 percent of the Fortune 500 will likely have at least some big data initiatives under way

-Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year.
SmallData, Data & BigData