Copy of 35Min SciDB at NERSC, Analyze and Share Terabytes of Data

by Yushu Yao on 28 October 2013

Transcript of Copy of 35Min SciDB at NERSC, Analyze and Share Terabytes of Data

SciDB Testbed at NERSC
Powerful Back-end to an Online Web Service
Easy-to-use, fast, interactive analytic framework
Want to accelerate your discovery process?
To Try out SciDB at NERSC
Email yyao@lbl.gov
Any Science Project Welcome
- Share/Analyze Terabytes of Science Data
Scientific Discovery Through Data
- an Iterative Process
Decision Paralysis
Array Like Science Data
- More Common than you think
SciDB for Array Data
- Easy and Powerful
Case Study: OpenMSI
Online Mass Spectrometry Analysis
Old Style Science Gateway
Search Catalog. Download File. Do-your-own Analysis
Climate Simulation Output
-Terabytes of Output Per Run
Brain MRI Image
Many people -> 4th dimension
[Figure axes: X, Y, Z, Person]
Gene Labeling
- Large Sparse Array
Gene (Billions)
Feature (Thousands to Millions)
Science Gateway
Interactive Analysis

SciDB Testbed @ NERSC
Partner up with Science Teams
10+ Science Projects
Complicated Algorithms
Multiple Science Domains:
Astronomy, Climate, Bio-imaging, Genomic
Smart New Science Gateway
with SciDB as Backend
Allow Complicated Queries that Aggregate TB of data, and return an Answer
SciDB lets you search through hundreds of GB of raw data and find image features inside it
[Figure: data cube with axes X, Y, M/Z]
Spectrum Taken from Sloan Digital Sky Survey
Spectra for 0.5 Billion Objects
[Figure labels: Student, Intern]
Infrastructure/middle-ware is very important for efficiency
TB data, no MPI, don't worry about Parallel IO.
The Ideal Analysis Framework
Write your own data analysis code for an HPC system?
Understand the Parallel Architecture
Learn about MPI-IO or some file format
Worry about parallel programming
What the #?X is an OST?
Why are my files corrupted?
Why isn't my sort working with 1000 cores?
Pre-processing/Loading
Query: get all interesting data, aggregate over some dimension, then do a K-means clustering
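The query pattern above (filter, aggregate over a dimension, cluster) can be sketched in plain Python on a toy 3D array. This is only an illustration of the three steps, not SciDB's query language or its implementation. The function names, the threshold filter, the choice of summing over the last dimension, and the simple 1D k-means with evenly spread starting centers are all assumptions made for the example.

```python
from collections import defaultdict


def filter_cells(cells, threshold):
    """Keep only the 'interesting' cells: here, values above a threshold."""
    return {coord: v for coord, v in cells.items() if v > threshold}


def aggregate_over_last_dim(cells):
    """Sum values over the last coordinate, leaving one total per (x, y)."""
    totals = defaultdict(float)
    for (x, y, _z), v in cells.items():
        totals[(x, y)] += v
    return dict(totals)


def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's k-means on scalars with deterministic starting centers."""
    if not values or k < 1:
        raise ValueError("need at least one value and k >= 1")
    # Hypothetical seeding: spread the initial centers over the sorted
    # distinct values so the result does not depend on random state.
    distinct = sorted(set(values))
    centers = [distinct[i * (len(distinct) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # A center with no assigned values keeps its previous position.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers
```

On a real array the same three steps would run inside the database, in parallel, rather than in a single Python process.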
OLD
NEW
Yushu Yao/LBNL-NERSC
Catalog of Billions of Stars
[Figure axes: Declination from -90 Deg to 90 Deg, Right Ascension from 0 Deg to 360 Deg]
Big Data needs advanced mathematics:
-Statistics / Machine learning / Mining at Scale
Big Data Tasks in SciDB@NERSC Projects
SciDB Case Studies and Performance Comparisons
Match Supernova Observation with Simulation
Simulations of supernova explosions produce many thousands to millions of spectra
Web user can search/plot like any other gateway
NEW: Web user can upload a spectrum, and Odetta will find the most "similar" spectrum in the database:
Very Compute/IO intensive
For 55K spectra, SciDB returns a result in 10 sec (compared to 20 min in PostgreSQL+Python)
For 1 million spectra, SciDB returns a result in 2 min
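The "most similar spectrum" search can be sketched as a brute-force nearest-neighbour scan. The slides do not say which similarity measure is used, so the Euclidean distance below is an assumption, as are the function names and the toy spectra. The sketch also assumes every spectrum is sampled on the same wavelength grid.

```python
import math


def distance(a, b):
    """Euclidean distance between two spectra on the same wavelength grid."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of samples")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query, library):
    """Return (spectrum_id, distance) of the library spectrum closest to query."""
    best_id = min(library, key=lambda sid: distance(query, library[sid]))
    return best_id, distance(query, library[best_id])


# Hypothetical simulated spectra, four flux samples each.
library = {
    "sim_a": [1.0, 2.0, 3.0, 2.0],
    "sim_b": [0.0, 0.5, 4.0, 0.5],
    "sim_c": [3.0, 3.0, 3.0, 3.0],
}
uploaded = [1.0, 2.0, 3.0, 1.0]
print(most_similar(uploaded, library))  # prints ('sim_a', 1.0)
```

Every upload is compared against every stored spectrum, which is why the task is compute and I/O intensive at a million spectra.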
MATCH
Given 2 sets of observed objects, return the objects observed in both sets (~300GB for 1 billion stars)
Spatial query not efficient in SQL
Cross Matching Catalogs of Stars in the Sky
In SciDB, you can lay out the stars in a 2D table and overlay them, in parallel.
50 times faster to match 1 billion stars in SciDB (5min) than PostgreSQL (5hr)
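The idea of laying stars out on a 2D grid and overlaying two catalogs can be sketched as follows. This is a single-process illustration, not SciDB's implementation. The function names, the box-shaped tolerance test, and the choice of a grid cell equal to the tolerance are assumptions. For brevity it treats the sky as flat and ignores the wrap-around at Right Ascension 0/360 Deg.

```python
from collections import defaultdict


def cell_of(ra, dec, cell_deg):
    """Map a sky position in degrees to an integer grid cell."""
    return (int(ra // cell_deg), int((dec + 90.0) // cell_deg))


def cross_match(catalog_a, catalog_b, tol_deg):
    """Return sorted (id_a, id_b) pairs whose RA and Dec differ by <= tol_deg."""
    # Lay catalog B out on a grid whose cell size equals the tolerance,
    # so any match must sit in the same cell or one of its 8 neighbours.
    grid = defaultdict(list)
    for id_b, (ra, dec) in catalog_b.items():
        grid[cell_of(ra, dec, tol_deg)].append((id_b, ra, dec))

    matches = []
    for id_a, (ra, dec) in catalog_a.items():
        cx, cy = cell_of(ra, dec, tol_deg)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for id_b, ra_b, dec_b in grid.get((cx + dx, cy + dy), []):
                    if abs(ra - ra_b) <= tol_deg and abs(dec - dec_b) <= tol_deg:
                        matches.append((id_a, id_b))
    return sorted(matches)
```

Because each star is compared only with stars in neighbouring cells, the work splits naturally by region of the sky, which is what makes a parallel overlay possible.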
Metagenome Analysis Workflow
Aggregate, e.g. count(score>10)
Biclustering
Sparse array, each cell contains some properties (e.g. a score). 3.5 billion non-empty cells (0.5TB)
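An aggregate such as count(score>10) over a sparse array can be sketched with a dictionary that stores only the non-empty cells. The slides do not name the array's dimensions for this workflow, so the (gene, feature) keys, the sample scores, and the function name are assumptions for illustration.

```python
from collections import Counter

# Sparse array: only non-empty cells are stored, keyed by (gene, feature).
scores = {
    ("gene_1", "feat_a"): 12.0,
    ("gene_1", "feat_b"): 3.5,
    ("gene_2", "feat_a"): 25.0,
    ("gene_2", "feat_c"): 11.0,
    ("gene_3", "feat_b"): 9.9,
}


def count_above(cells, threshold):
    """Count, per gene, the non-empty cells whose score exceeds threshold."""
    counts = Counter()
    for (gene, _feature), score in cells.items():
        if score > threshold:
            counts[gene] += 1
    return dict(counts)


print(count_above(scores, 10))  # prints {'gene_1': 1, 'gene_2': 2}
```

Empty cells cost nothing here, which is the point of a sparse layout when only a small fraction of the gene-by-feature grid is populated.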
When to Use SciDB on Your Data
10+GB Data (and will grow)
Looks like an array (dense or sparse)
Write Once (or accumulate slowly) and read a lot
Lots of Filtering and Aggregating
Want to do Joins like SQL
Do most of the calculation inside the database
Linear Algebra on your Data
20 Jesup Nodes (8 Core, 24GB Memory each)
Too little memory!!!
Storage
Commodity 512GB SSD (OCZ Vertex 4)
NGF PROJECT
Carver IB Network
Next Steps for the NERSC SciDB Test Bed
Get Broader Audience:
Automate creation/resizing of SciDB Clusters
User-controlled SciDB Instances (start/stop their own SciDB cluster)
NGF-backed storage
Same Hardware (Almost)
Still the same:
Kick off a new project by holding the team's hand to
Load first batch of data
Do the first round of analysis
When Not to Use SciDB
SciDB is for Analysis, NOT Transactions
For crunching through large data and returning one small result, not for returning millions of small results at high throughput