Want to know more?
Please reach out to us! We are happy to help
Architecting BigData Solution and Infrastructure
A data lake for data scientists, software engineers, business analysts, expert systems, and more...
Hadoop Components and Tools
Use Case and Problems
About the Speaker
Melvin Ramos, MBA, owner and founder of TeraMint Technologies LLC, has over 17 years of experience in software development and is currently part of Zebra-Motorola ETO as Manager/Lead of BigData Solutions. He architected and helped build Zatar Analytics from scratch, and is currently developing a solution for realtime inventory reporting and tracking. He is a technology-driven, results-oriented technologist.
The architecture of the old days, and what changed?
Apache Spark/Apache Storm/Logstash
Apache Cassandra and ElasticSearch
Data: it's all about data!
Use Case and Problems
What and why BigData?
"BigData is data characterized by 3 attributes: volume, variety and velocity"
"BigData is data characterized by 4 attributes: volume, variety, velocity and value"
Do we care about BigData?
Improved processes through different variation points
Learning and automation: repetitive tasks can be automated
Behavior analysis and targeted ads
Recommendation and prescriptive analytics
Predictive analytics and forecasting
Manufacturing and service companies can now track the usage of their devices and industrial machines, eliminating downtime and detecting patterns for continuous improvement.
Robo-financial advisors learn an individual's financial goals and, based on data, execute trades to achieve retirement goals.
Web targeted ads
Netflix uses this to recommend shows to watch, based on your basic information and click rates.
LinkedIn and Amazon do the same.
In the Obama 2012 campaign, BigData was utilized to find and target voters and urge them to vote.
Financial markets, hedge funds, and money managers forecast the risks associated with positions they take in the marketplace.
Data: it's all about data!!!
In the medical field, predicting the likelihood of developing cancer based on your genome.
Convergence of Data Science and Data Software Engineers
Realtime Analytics and E-Commerce Inventory Management
South Side Pet e-Store
Validating Insurance claims
Integration/Broker: Apache Kafka
Aggregation Engine and ML:
NoSQL Store: Apache
Search for Indexing:
Managing, maintaining, and deploying Hadoop components and services.
A high-throughput, low-latency streaming broker.
Built by a team at LinkedIn to process user click events and realtime aggregation of logs.
JVM-based; uses the kernel page cache for commit logs, and zero-copy for networking.
Node scaling and producer/consumer distribution are coordinated via Apache ZooKeeper.
Can replay all event streams via a configurable retention (SLA) setting.
ActiveMQ and other enterprise brokers don't do replays!
A lightweight alternative to Hadoop MapReduce, and much faster!
Data is wrapped in RDDs and/or SQL-based DataFrames, a concept borrowed from Python (pandas).
Supports both streaming and batch.
Streaming is based on micro-batches, and time-series data is aggregated over windows.
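The micro-batch idea can be sketched without Spark at all: timestamped events are grouped into fixed (tumbling) time windows, and each window is aggregated independently. A minimal pure-Python sketch — the event data, window size, and function name here are made up for illustration:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs):
    """Group (timestamp, key) events into fixed windows and count per key."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_secs) * window_secs  # bucket start time
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Made-up click events: (epoch seconds, product id)
events = [(0, "a"), (3, "b"), (4, "a"), (12, "a"), (13, "b"), (21, "b")]
print(tumbling_window_counts(events, 10))
# {0: {'a': 2, 'b': 1}, 10: {'a': 1, 'b': 1}, 20: {'b': 1}}
```

A streaming engine does the same grouping continuously, flushing each window's aggregate as its time range closes.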
It has a machine learning library (MLlib)! Apache Mahout looks to expand its algorithms with Spark MLlib.
One of the most active Apache projects, with over 100 commits per day on average.
Has a huge and growing support base in the open source community; IBM built a business around Apache Spark.
Twitter's open source realtime analytics and streaming engine.
Low-latency streaming: a million tuples processed per second per node.
Based on spouts and bolts: spouts are the start of each stream, and bolts are the processing/aggregating nodes.
Uses the Nimbus server to distribute all aggregation and processing across worker nodes.
Processing is idempotent and the delivery guarantee is "at least once." Failures are handled via replays, by anchoring each tuple to the succeeding node in the chain.
The streaming transaction guarantee is "exactly once."
For strong "exactly once" guarantees, it offers Trident, which is similar to Spark Streaming.
Recently it offers an ML library via Trident as well.
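The at-least-once mechanism can be sketched as a toy: a spout holds each tuple as pending until a bolt acks it, and an unacked (failed) tuple is simply replayed. This is a made-up miniature model, not Storm's API — class and method names are illustrative only:

```python
# Toy model of at-least-once delivery: pending tuples are replayed until acked.
class Spout:
    def __init__(self, data):
        self.pending = list(data)     # tuples not yet acknowledged

    def next_tuple(self):
        return self.pending[0] if self.pending else None

    def ack(self, t):
        self.pending.remove(t)        # tuple fully processed downstream

class CountBolt:
    def __init__(self):
        self.counts = {}
        self.fail_first = True        # simulate one transient failure

    def execute(self, t):
        if self.fail_first:
            self.fail_first = False
            return False              # no ack -> spout will replay this tuple
        self.counts[t] = self.counts.get(t, 0) + 1
        return True

spout = Spout(["a", "b", "a"])
bolt = CountBolt()
while (t := spout.next_tuple()) is not None:
    if bolt.execute(t):
        spout.ack(t)

print(bolt.counts)  # {'a': 2, 'b': 1} -- every tuple was processed despite the failure
```

In real Storm the anchoring of tuples to downstream bolts is what lets the acker know which replays are needed; "exactly once" (Trident) additionally deduplicates replayed work.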
RDDs are either stored in memory or on the filesystem; in case of failure, another node can deserialize the RDD and restart the calculation.
A log parser and transformation engine.
Can ingest data from anywhere: a network socket or a file.
Pluggable egress data stores.
Written in Ruby, but runs on the JVM via JRuby for scalability.
One of the most successful NoSQL DBs, based on Google's BigTable and Amazon's Dynamo.
Apple had the largest installation, with 75k nodes and 10 petabytes of data; Netflix has 2,500 nodes, 420 TB, and over 1 trillion requests per day.
Was incubated at Facebook and has been successful ever since.
DB transactions are eventually consistent.
Based on the Apache Lucene index.
Document/JSON-based storage; it's also free-form, i.e. schemaless.
Basic aggregations: metrics, bucket, and pipeline.
REST-based API for almost all services.
Highly available and scalable master-slave setup where data is sharded across nodes.
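At the core of a Lucene index (and hence ElasticSearch) is an inverted index: a map from each term to the documents that contain it, so search is a lookup rather than a scan. A toy sketch with made-up documents — real Lucene adds analysis, scoring, and on-disk segment structures on top of this idea:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Return sorted doc ids matching a single term."""
    return sorted(index.get(term.lower(), set()))

docs = {1: "big data at scale", 2: "data lake design", 3: "realtime analytics"}
index = build_inverted_index(docs)
print(search(index, "data"))  # [1, 2]
```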
Enables DevOps/engineers to set up a cluster faster and leaner.
Turns VMs and hardware into commodity servers, as containers can now run anywhere.
Based on the Linux cgroup hierarchy.
Enables running micro-services that are deployed as a unit.
Helps isolate processes across your VMs and containers.
Secure in terms of ports exposed to the host machine.
Docker Swarm or Google's Kubernetes can be used to manage the cluster.
Maintains and deploys to nodes and server machines via a configurable template language (Ruby), with a server that handles all manifests and deployment setup.
AP of CAP, with focus on Availability and Partition Tolerance.
Rq 1 : Report/Analytics within a given specific time granularity. (Hours or Minutes)
Rq 2 : Can handle a minimum of around 1,000 to 2,000 events per minute.
Rq 3 : Can handle 20,000-60,000 products in some variation.
Op 1 : Perform realtime machine learning to predict when a product will be depleted, and automatically stage an order.
Rq 4 : The connection is secure and the application can be accessed from one place only.
Op 2 : Perform machine learning and provide rerouting of product orders.
Op 3 : Make recommendations based on customer buying patterns, behavior, and characteristics, and offer products.
1) BigData problem:
2,000 events/min × 60 × 24 = 2,880,000 events/day; × 60,000 products = 172,800,000,000 data points/day
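The back-of-the-envelope volume above can be checked directly; the rates are the ones stated in the requirements:

```python
events_per_min = 2_000                         # Rq 2 upper bound
events_per_day = events_per_min * 60 * 24      # minutes -> hours -> day
products = 60_000                              # Rq 3 upper bound
data_points_per_day = events_per_day * products
print(f"{events_per_day:,} events/day -> {data_points_per_day:,} data points/day")
# 2,880,000 events/day -> 172,800,000,000 data points/day
```

That is ~172.8 billion data points per day, which is why a single RDBMS cannot keep up with per-minute reporting.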
Rq 5 : Ability to query historical records and perform additional analytics
2) Continuously run reports (aggregation, analytics and machine learning) every minute.
Rq 6 : Scalable and highly available Report/Data engine.
Rq 1 : Report/Analytics within 30 seconds after clicking 2-4 items.
Rq 2 : Highly available and scalable data collection. Resilient WebAPP to serve customers.
Rq 3 : Periodically run targeted ads/promotions via email or when the user signs in.
Rq 4 : A product can have 200 to 300 viewers within a minute, with 4,000-9,000 users logged in at any given time.
Rq 5 : Must be able to handle 5000 transactions every minute.
Rq 6 : Must be able to do ad-hoc analysis on historical data.
Rq 7 : Visualize the application's software performance, with the ability to give a snapshot of performance at any given time.
1) Currently have 5 major insurance companies (Aetna, Cigna, UHC, BCBS, etc.).
2) Files from the different companies vary, ranging from 1 GB to 12 TB in total.
3) Each month, each insurance company sends claims ranging from 200M to 500M claims.
4) An ETL business-rules loader streamlines and preselects 50M claims.
5) Out of the 50M, another transformation engine filters the claims down to around 1K, based on common claim codes.
6) The 1K claims are reviewed manually by an adjuster or RN.
7) If an error is found, a commission is charged and becomes part of revenue.
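The claim funnel described above implies drastic reduction ratios at each stage. A quick sketch of those ratios, taking the slide's low-end figure of 200M monthly claims as the starting point (the stage names are illustrative):

```python
# Monthly claim volumes per pipeline stage, from the slide (low-end figures).
stages = {"received": 200_000_000, "preselected": 50_000_000, "for_review": 1_000}

prev = None
for name, count in stages.items():
    if prev is not None:
        print(f"{name}: {count:,} ({count / prev:.4%} of previous stage)")
    prev = count
# preselected: 50,000,000 (25.0000% of previous stage)
# for_review: 1,000 (0.0020% of previous stage)
```

Only 1 in 50,000 preselected claims survives to manual review, which is what makes a human-in-the-loop step affordable at this volume.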
Monitoring, managing and deploying apps.
Client - Server Done!
Invented by Google circa 2004.
A new programming model that helps solve large-scale processing.
Traditional RDBMSs cannot handle large-scale processing and simple analytic queries at that scale.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
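The same word count can be sketched in plain Python to make the map and reduce stages of the model explicit; the input lines here are made up, and on a cluster each stage would run distributed across nodes:

```python
from collections import Counter
from itertools import chain

lines = ["big data big tools", "data at scale"]

# map stage: split each line into (word, 1) pairs
pairs = ((word, 1) for word in chain.from_iterable(l.split() for l in lines))

# reduce-by-key stage: sum the counts per word
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'tools': 1, 'at': 1, 'scale': 1}
```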
Client-Server and 3 Tier Architecture
Rq 8 : Check the mobile sensor data for mobile activity
Rq 9 : Collect weather and city pollution index data to understand pets' allergy problems, and enable mobile alerts based on pet and medical history.
How does this scale?
It's how the components are made.
A combination of tools.
The algorithms used make sure:
1 Data is correctly calculated
2 Complexities are handled in a distributed, highly concurrent application
3 OS-level facilities are leveraged when applying the algorithms
FB generates 500 TB of data daily; "a typical day for Twitter: more than 500 million Tweets sent, an average of 5,700 TPS."
Typical Hadoop Solution
Web or App Server
ETL Solutions Server
UI tools for Visualization:
SPICE: Super-fast, Parallel, In-memory Calculation Engine
Support for Scala, Python, and Java (native APIs)
UI and Visualization tools
"CAP Twelve Years Later: How the 'Rules' Have Changed" by Eric Brewer — http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
Hadoop Components and tools