Copy of 35Min SciDB at NERSC, Analyze and Share Terabytes of Data

Yushu Yao

on 28 October 2013

Transcript of Copy of 35Min SciDB at NERSC, Analyze and Share Terabytes of Data

SciDB Testbed at NERSC
Powerful Back-end to an Online Web Service
Easy-to-use, fast, interactive analytic framework
Want to accelerate your
Discovery Process?
To Try out SciDB at NERSC
Email yyao@lbl.gov
Any Science Project Welcome
- Share/Analyze Terabytes of Science Data
Scientific Discovery Through Data
- an Iterative Process
Decision Paralysis
Array Like Science Data
- More Common than you think
SciDB for Array Data
- Easy and Powerful
Case Study: OpenMSI
Online Mass Spectrometry Analysis
Old Style Science Gateway
Search Catalog. Download File. Do-your-own Analysis
Climate Simulation Output
- Terabytes of Output Per Run
Brain MRI Image
Many people -> 4th Dimension
Gene Labeling
- Large Sparse Array
Gene (Billions)
Feature (Thousands to Millions)


SciDB Testbed @ NERSC
Partner up with Science Teams
10+ Science Projects
Complicated Algorithms
Multiple Science Domains:
Astronomy, Climate, Bio-imaging, Genomic
Smart New Science Gateway
with SciDB as Backend
Allow Complicated Queries that Aggregate TB of data, and return an Answer
SciDB lets you search through 100s of GB of raw data and find image features inside it
Spectrum Taken from Sloan Digital Sky Survey
Spectra for 0.5 Billion Objects
Infrastructure/middle-ware is very important for efficiency
TB data, no MPI, don't worry about Parallel IO.
The Ideal Analysis Framework
Write your own data analysis code for an HPC system?
Understand the Parallel Architecture
Learn about MPI-IO or some file format
Worry about parallel programming
What the #?X is an OST?
Why are my files corrupted?
Why isn't my sort working with 1000 cores?
Query: get all interesting data, aggregate over some dimension, then do a K-means clustering
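The query pattern above (filter, aggregate over a dimension, then cluster) can be sketched in plain Python on a toy array. This is only an illustration of the three steps, not SciDB syntax: the cell layout `(x, t, value)`, the threshold filter, and the function names are all assumptions made for the example, and the clustering is a basic one-dimensional Lloyd's k-means.

```python
import random


def filter_cells(cells, threshold):
    """Keep only the 'interesting' cells: those whose value exceeds threshold."""
    return [cell for cell in cells if cell[2] > threshold]


def aggregate_over_time(cells):
    """Sum values over the time dimension, leaving one total per x position."""
    totals = {}
    for x, _t, value in cells:
        totals[x] = totals.get(x, 0.0) + value
    return totals


def kmeans_1d(values, k, iterations=20, seed=0):
    """Basic Lloyd's k-means on a list of floats; returns sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)


# Toy data: (x, t, value) cells, with two low-value noise cells.
cells = [
    (0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.1), (1, 1, 1.1),
    (2, 0, 5.0), (2, 1, 5.0), (3, 0, 5.2), (3, 1, 5.2),
    (4, 0, 0.01), (5, 1, 0.02),
]
totals = aggregate_over_time(filter_cells(cells, threshold=0.5))
centroids = kmeans_1d(list(totals.values()), k=2)
print(centroids)
```

In a database like SciDB the point is that all three steps run next to the data, in parallel, so only the small final answer travels back to the user.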
Catalog of Billions of Stars
[Figure: sky map of the star catalog; Right Ascension axis from 0 Deg to 360 Deg, other axis from -90 Deg to 90 Deg]
Big Data needs advanced mathematics:
- Statistics / Machine learning / Mining at Scale
Big Data Tasks in SciDB@NERSC Projects
SciDB Case Studies and Performance Comparisons
Match Supernova Observation with Simulation
Simulation of Supernova Explosion produces (many thousands to millions) spectra
Web user can search/plot like any other gateway
NEW: Web user can upload a spectrum, and Odetta will find the most "similar" spectrum in the database:
Very Compute/IO intensive
For 55K spectra, SciDB returns a result in 10 sec (compared to 20 min in PostgreSQL+Python)
For 1 million spectra, SciDB returns a result in 2 min
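A minimal sketch of the "find the most similar spectrum" search, in plain Python. The slides do not say which similarity measure Odetta uses, so the sum of squared flux differences below is an assumption, as are the function names and the toy library; all spectra are assumed to be sampled on the same wavelength grid.

```python
def squared_distance(a, b):
    """Sum of squared flux differences between two spectra on the same grid."""
    if len(a) != len(b):
        raise ValueError("spectra must share a wavelength grid")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def most_similar(query, library):
    """Return (name, distance) of the library spectrum closest to the query."""
    best = min(library, key=lambda name: squared_distance(query, library[name]))
    return best, squared_distance(query, library[best])


# Hypothetical simulated spectra (flux values only) and one uploaded spectrum.
library = {
    "sim_a": [1.0, 2.0, 3.0, 2.0],
    "sim_b": [0.5, 0.5, 4.0, 4.0],
    "sim_c": [3.0, 3.0, 3.0, 3.0],
}
uploaded = [1.1, 1.9, 3.2, 2.1]
name, distance = most_similar(uploaded, library)
print(name, distance)
```

Every uploaded spectrum has to be compared against the whole library, which is why the search is so compute and IO intensive at the scale of a million spectra.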
Given 2 sets of observed objects, return the objects observed in both sets (~300GB for 1billion stars)
Spatial query not efficient in SQL
Cross Matching Catalogs of Stars in the Sky
In SciDB, you can lay out the stars in a 2D table and overlay them. In parallel.
50 times faster to match 1 billion stars in SciDB (5min) than PostgreSQL (5hr)
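The "lay the stars out in a 2D table and overlay them" idea can be sketched as a grid-based cross-match in plain Python. This is an illustration of the technique only, not SciDB code: the function names and the `(id, ra, dec)` tuple layout are assumptions, and the distance is a flat-plane approximation that ignores the wrap-around at 0/360 Deg and the shrinking of Right Ascension distances toward the poles.

```python
import math
from collections import defaultdict


def build_grid(stars, cell_deg):
    """Bucket (id, ra, dec) tuples into a 2D grid keyed by integer cell indices."""
    grid = defaultdict(list)
    for star_id, ra, dec in stars:
        key = (math.floor(ra / cell_deg), math.floor(dec / cell_deg))
        grid[key].append((star_id, ra, dec))
    return grid


def cross_match(catalog_a, catalog_b, tolerance_deg):
    """Return sorted (id_a, id_b) pairs whose positions differ by at most tolerance_deg."""
    # Cells are one tolerance wide, so a match can only sit in the same
    # cell or in one of its eight neighbours.
    grid_b = build_grid(catalog_b, tolerance_deg)
    matches = []
    for id_a, ra, dec in catalog_a:
        cx = math.floor(ra / tolerance_deg)
        cy = math.floor(dec / tolerance_deg)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for id_b, ra_b, dec_b in grid_b.get((cx + dx, cy + dy), []):
                    if math.hypot(ra - ra_b, dec - dec_b) <= tolerance_deg:
                        matches.append((id_a, id_b))
    return sorted(matches)


catalog_a = [("a1", 10.0, 20.0), ("a2", 150.0, -30.0), ("a3", 200.0, 45.0)]
catalog_b = [("b1", 10.0002, 20.0001), ("b2", 150.5, -30.0), ("b3", 200.0, 45.0003)]
print(cross_match(catalog_a, catalog_b, tolerance_deg=0.001))
```

Because each star is only compared with stars in nearby cells, the work splits naturally by region of the sky, which is what makes the overlay easy to run in parallel.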
Metagenome Analysis Workflow
Aggregate E.g.
Sparse array, each cell contains some properties (e.g. a score). 3.5 billion non-empty cells (0.5 TB)
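As a rough sketch of what an aggregate over such a sparse array looks like, the plain-Python example below stores only the non-empty cells and aggregates along one dimension. The dictionary layout keyed by `(row, column)`, the choice of sum and count as aggregates, and the function name are assumptions for illustration, not the workflow's actual schema.

```python
from collections import defaultdict


def aggregate_by_row(sparse_cells):
    """Sum and count the non-empty cells in each row of a sparse 2D array."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (row, _col), score in sparse_cells.items():
        totals[row] += score
        counts[row] += 1
    return {row: (totals[row], counts[row]) for row in totals}


# Only non-empty cells are stored: (row, column) -> score.
cells = {(0, 5): 0.5, (0, 900): 0.25, (7, 5): 1.0}
print(aggregate_by_row(cells))
```

Empty cells cost nothing here, which is the property that makes a sparse layout practical when only a small fraction of a very large array is filled.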
When to Use SciDB on Your Data
10+GB Data (and will grow)
Looks like an array (dense or sparse)
Write Once (or accumulate slowly) and read a lot
Lots of Filtering and Aggregating
Want to do Joins like SQL
Do most of calculation inside the Database
Linear Algebra on your Data
20 Jesup Nodes (8 Core, 24GB Memory each)
Too little memory!!!
Commodity 512GB SSD (OCZ Vertex 4)
Carver IB Network
Next Steps for the NERSC SciDB Test Bed
Get Broader Audience:
Automate creation/resizing of SciDB Clusters
User-controlled SciDB Instances (start/stop their own SciDB cluster)
NGF-backed storage
Same Hardware (Almost)
Still the same:
Kick off a new project by holding the team's hand to:
Load first batch of data
Do the first round of analysis
When not to use SciDB
SciDB is for Analysis, NOT Transactions
For crunching through large data and returning one small result, not for returning millions of small results at high throughput