Big Data Analytics Seminar


Pedro Martins Dusso

on 25 January 2013

Transcript of Big Data Analytics Seminar

Agenda
  • Introduction
  • The Big Data Scenario
  • Parallel Data Processing Systems
  • Extensible Operator Models
  • Conclusions and Future Work

Pedro Martins Dusso

Extensible Operator Models: Analytics Advantages

The Big Data Scenario
  • Size >>> typical DB tools
  • Challenge >>> "just" size: capture, analyze
  • IDC: The Diverse and Exploding Digital Universe
  • Variety, veracity
  • We have many tools, but we don't have all the tools
  • Current analytical tools are limited
  • Not their fault: they weren't designed for "Big"

Three recurring issues: excess of simplicity, excess of complexity, and optimization. The Extensible Operator Model is a solution for simplicity x complexity x optimization.

Current Analytical Tools: Issues
  • Simplicity: operators for simple, SQL-like analysis, inherited from Data Warehousing; code what SQL can't do; handle huge amounts of (parallelizable) data ... analytically!
  • Complexity: complex solutions + exabyte-scale data; huge clusters
  • Optimization: system-driven optimization throughout the whole data-processing flow (MapReduce)

Goal: support the construction of
  • New, complex operators
  • From simple ones
  • With transparent semantics

Information landscape (The Big Data Zoo: Taming the Beasts, Barry Devlin, 2012)

Process-mediated
  • Highly structured, ready-to-process, precise and accurate
  • Stored in databases, analyzed in data warehouses (well-known solutions)
  • Strategic decision-making (Business Intelligence)
  • What the current analytic models are designed and built for

Machine-generated
  • Millions of networked sensors
  • Mobile phones, smart energy meters, automobiles, industrial machines...
  • "Internet of Things"

Human-sourced
  • Loosely structured, many forms
  • Different formats, different time constraints, different access rights
  • May not represent "reality"

Solution: a diverse and deeply integrated platform for all information

Programming models and execution engines
  • Hadoop stack: MapReduce (programming model); Pig, Jaql (programming models 2.0); Hadoop (execution engine)
  • Stratosphere: Sopremo (programming model); Hyracks (execution engine); architecture of the Stratosphere system
  • Microsoft: DryadLINQ, C# .NET LINQ libraries, Scope Script (programming models); Dryad (execution engine); execution graph

Criticism
  • MapReduce (MR x RDBMS): low level; the key/value model does not fit all; pipelining MR stages is possibly inefficient
  • Dryad: a superset of MR, generic; less restrictive semantics bring high-level complexity; the whole stack runs on proprietary solutions

(The Next Generation: Sopremo & Hyracks)

Extensible Operator Model
  • Build and manage the group of operators it contains
  • Perform group-level actions
  • Specific to a particular set of users

Advantages
  • Classes of users
  • Decrease complexity
  • Increase modularity and reuse
  • Enable application-specific function optimization
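The core idea above, building new complex operators from simple ones while keeping their internal plan visible to the system, can be sketched in a few lines. This is an illustrative plain-Python mock-up, not the actual Sopremo/Hyracks API; all class and operator names are hypothetical.

```python
# Hypothetical sketch of an extensible operator model: simple operators
# compose into a named complex operator, and the composition stays
# inspectable rather than becoming a black box.

class Operator:
    """A simple operator: a named transformation over a record stream."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __call__(self, records):
        return self.fn(records)

class CompositeOperator(Operator):
    """A complex operator built from simple ones; its internal plan
    (the list of sub-operators) remains visible to the system."""
    def __init__(self, name, steps):
        self.name = name
        self.steps = steps

    def __call__(self, records):
        for step in self.steps:
            records = step(records)
        return records

    def plan(self):
        # Transparent semantics: the system can inspect the sub-steps.
        return [s.name for s in self.steps]

# Simple building blocks.
select_errors = Operator("select_errors",
                         lambda rs: (r for r in rs if r["level"] == "ERROR"))
project_msg   = Operator("project_msg",
                         lambda rs: ({"msg": r["msg"]} for r in rs))

# A reusable complex operator assembled from the simple ones.
error_report = CompositeOperator("error_report", [select_errors, project_msg])

logs = [{"level": "INFO", "msg": "ok"}, {"level": "ERROR", "msg": "disk full"}]
print(list(error_report(logs)))   # [{'msg': 'disk full'}]
print(error_report.plan())        # ['select_errors', 'project_msg']
```

Because `plan()` exposes the sub-operators instead of hiding them behind one opaque function, a system could in principle reason about and reorder them.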

Classes of users
  • Native user model (human users)
  • Operator implementor (developer)

Hyracks
  • Descriptors (contract)
  • Activities (internal sub-steps)
  • Tasks (execution of activities)

Sopremo
  • Semantically rich operator model
  • Develop and integrate extensions
  • Meta-information
  • List of inputs
  • List of property specifications (similarity measures or threshold properties)

Hyracks execution
  • Activity Node graph
  • Parallel instantiation of the graph
  • Script, plan and program, execution
  • MapReduce user model (compatibility layer, Hyracks only)
  • MPI, MapReduce, generic (MR, Hive, Pig...), Hyracks & Sopremo

Hyracks & Sopremo
  • Avoid recoding: high-level query languages are not enough
  • Standard libraries: operators can be developed, maintained, and shared
  • The MR model is effective only for a set of problems; complex analytical problems aren't included in this set
  • Result: complex operations treated as black boxes
  • Solution: make operators' semantics transparent
  • Let the query compiler and query optimizer work: pipelining, optimal execution order, buffer management...
  • Operator implementor: describes execution orders, sequencing constraints, and input and output configurations

Conclusions

Excess of complexity
  • Problem: using the wrong tools for the job
  • Solution: use the programming model suitable for the task; abstract complex operators from simple ones; build and reuse a library of operators

Optimization
  • Problem: application-specific functions are treated as black boxes, preventing potential optimizations of the data flow
  • Solution: implement and use operators with transparent semantics

Excess of simplicity
  • Problem: massively parallel data processing systems have operators that are too simple from the Big Data analytics point of view
  • Solutions: build complex operators from simple ones; augment data warehousing architectures with the MapReduce paradigm; the best of both worlds

Integration
  • Let MapReduce key/value handle machine-generated data
  • Let Sopremo/Hyracks handle human-sourced data
  • Let relational databases handle process-mediated data
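The "transparent semantics" solution can be made concrete with a toy rule-based optimizer (hypothetical, plain Python): if each operator in a plan declares whether it is a cheap filter, the system can reorder the plan so filters run before expensive operators, whereas a black-box UDF declares nothing, so no reordering can be proven safe. Real optimizers also verify that a reordering preserves results; this sketch elides that check.

```python
# Toy illustration only: operators carry a declared property ("is_filter"),
# which a rule-based optimizer exploits. A black-box UDF would expose no
# properties, so the plan would have to run in its written order.

def optimize(plan):
    """Stable sort: operators flagged as filters move ahead of non-filters,
    preserving relative order within each group."""
    return sorted(plan, key=lambda op: not op["is_filter"])

plan = [
    {"name": "expensive_join", "is_filter": False},
    {"name": "select_recent",  "is_filter": True},
]
print([op["name"] for op in optimize(plan)])
# ['select_recent', 'expensive_join']
```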


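The division of labor above ("let MapReduce key/value handle machine-generated data") fits simple per-key aggregations well. A framework-free sketch in plain Python follows (no Hadoop; the sensor data is invented for illustration):

```python
from collections import defaultdict

# Illustrative MapReduce-style aggregation: average sensor readings per
# meter. The map phase emits (key, value) pairs; the shuffle groups values
# by key; the reduce phase aggregates each group.

def map_reading(record):
    meter, value = record            # map: emit a (key, value) pair
    yield meter, value

def reduce_avg(meter, values):       # reduce: aggregate values per key
    return meter, sum(values) / len(values)

readings = [("meter-1", 10.0), ("meter-2", 4.0), ("meter-1", 20.0)]

groups = defaultdict(list)           # stand-in for the shuffle phase
for rec in readings:
    for k, v in map_reading(rec):
        groups[k].append(v)

result = dict(reduce_avg(k, vs) for k, vs in groups.items())
print(result)  # {'meter-1': 15.0, 'meter-2': 4.0}
```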

Business-aware

Thank you!