Using Galaxy...Differently

by

John Chilton

on 2 October 2013

Transcript of Using Galaxy...Differently

Using Galaxy... Differently
John Chilton
Proteomics
Clinical Galaxy
Galaxy is a browser-based app allowing biologists to easily leverage advanced computing resources for genomics research.
@jmchilton
Galaxy-P
Multi-file Samples
Windows Applications
3 Ways to Galaxy-P
usegalaxyp.org
getgalaxyp.org
bit.ly/galaxyp-cloud
Public
Local
Cloud
Two Big Problems Solved?
Build target database - download and translate EST databases or perform gene prediction with Augustus.
Numerous tools for identification and text manipulation.
Workflow utilizing BLAST to identify novel peptides.
Tool to assess peptide-spectrum matches and visualize spectra.
Visualize id-ed peptides on genome.
Jagtap's 150 step seamless proteogenomic workflow
Poster @ ISMB/ECCB
Tool Enhancements
Tools can take in multiple inputs all at once.
Convenient for specifying a few inputs, necessary for dozens or hundreds.
...group multiple
files into a single dataset to "flatten" workflow.
Workflow Enhancements
Enables the proteogenomic workflow to work with 1, 2, 53, or 200 input files.
Sample tracking
Keeping histories manageable & responsive
No need for shared file system
Cross-platform
Secure
Manage queues internally or delegate to external queue via DRMAA, Torque CLI, or Condor CLI
LWR
Install LWR webapp on (remote) system to allow it
to act as a Galaxy worker node.
remote staging (optional caching)
Not just for Windows
SSL + private token authentication
Accounting
Dedicated vs. Shared Clusters
usegalaxy.org (for now) runs jobs on a 'dedicated' cluster (just for Galaxy).
Large institutional Galaxy instances want/do run Galaxy on 'shared' clusters (with users doing many different things).
At MSI we have 200 cores for Galaxy,
9,000 cores on our largest shared cluster.
Shared Cluster Challenges
Scheduling
When submitting a job to a cluster, one must estimate the resources required (job run time, number of cores, amount of RAM, ...)
Traditionally in Galaxy this is done on a per tool basis.
e.g. cufflinks jobs will take 5 hours and use 24GB of RAM.
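The per-tool approach can be pictured as a static lookup table. The sketch below is illustrative only: the names `TOOL_RESOURCES`, `DEFAULT_RESOURCES`, and `resources_for`, and the fallback values, are assumptions made for this example and are not Galaxy configuration. Only the cufflinks figures come from the slide.

```python
# Hypothetical static table: one fixed resource estimate per tool.
TOOL_RESOURCES = {
    "cufflinks": {"walltime_hours": 5, "ram_gb": 24},
}

# Assumed fallback for tools without an entry (placeholder values).
DEFAULT_RESOURCES = {"walltime_hours": 1, "ram_gb": 4}


def resources_for(tool_id):
    """Return the fixed estimate for a tool, ignoring input size and user."""
    return TOOL_RESOURCES.get(tool_id, DEFAULT_RESOURCES)


if __name__ == "__main__":
    print(resources_for("cufflinks"))
```

Because the lookup sees only the tool id, a tiny cufflinks job and a huge one would request the same 5 hours and 24GB.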
Institutions like MSI spend hundreds of thousands of dollars a year determining how to divvy up CPU hours and disk usage and enforcing these decisions.
Many complicated systems in place that depend on the fact that the 'unix' user running the job is the person utilizing the resources.
People are doing clinical research with Galaxy;
is Minnesota the only place to transition to clinical diagnostics?
CLIA Certification

Required for handling of clinical diagnoses.
Certification is an expensive, time consuming process.
Results must be essentially exactly reproducible between certifications.
The Galaxy server at MSI was unsuited. Operating system updates, package updates, Galaxy updates, none of it compatible with CLIA certification.

Implemented infrastructure allowing identical, independent cloud environments to be stood up on Amazon EC2 for each sample.
Problem:
Solution:
Galaxy & the JVM
The Java Virtual Machine is a popular platform for "enterprise" and bioinformatics infrastructure
blend4j
galaxy
bootstrap
JGalaxy
TINT
clj-blend
CLC Bio
(Plugin)
Java library providing API access
to Galaxy and Tool Shed
github.com/jmchilton/blend4j
GalaxyInstance galaxyInstance =
    GalaxyInstanceFactory.get(url, apiKey);
HistoriesClient historiesClient =
    galaxyInstance.getHistoriesClient();
for (History history : historiesClient.getHistories()) {
    String name = history.getName();
    String id = history.getId();
    String template = "Found history with name %s and id %s";
    String message = String.format(template, name, id);
    System.out.println(message);
}
Tools, Histories, Workflows, Data Libraries,
Roles, Permissions, Users, Tool Shed Repositories, & more.
Many useful examples on GitHub.
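For comparison, the same history listing can be sketched in Python directly against Galaxy's REST API, which is what a client library like blend4j wraps. This is a minimal sketch, not blend4j's implementation: the helper names `histories_url`, `fetch_histories`, and `describe` are invented for this example, and it assumes the `/api/histories` endpoint accepts the API key as a `key` query parameter and returns a JSON list of objects with `name` and `id` fields.

```python
import json
import urllib.parse
import urllib.request


def histories_url(galaxy_url, api_key):
    """Build the URL for the history listing (helper name is hypothetical)."""
    query = urllib.parse.urlencode({"key": api_key})
    return galaxy_url.rstrip("/") + "/api/histories?" + query


def fetch_histories(galaxy_url, api_key):
    """Fetch and decode the JSON list of histories from a running Galaxy."""
    with urllib.request.urlopen(histories_url(galaxy_url, api_key)) as response:
        return json.loads(response.read().decode("utf-8"))


def describe(histories):
    """Format one message per history, mirroring the Java loop above."""
    template = "Found history with name %s and id %s"
    return [template % (h["name"], h["id"]) for h in histories]
```

With a reachable server, `for line in describe(fetch_histories(url, api_key)): print(line)` would print the same messages as the Java loop.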
Java Web Start application
Batch download and upload files
Get around browser limitations
github.com/jmchilton/TINT
github.com/jmchilton/JGalaxy
github.com/jmchilton/galaxy-bootstrap
Java library to download and configure Galaxy. Used to drive blend4j automated testing.
Galaxy clojure client built
on blend4j by Brad Chapman
Run Galaxy workflows from CLC Bio
by Marc Logghe
Why Galaxy?
Allows sequencing core and clinicians to share an analysis environment, QC, reports, and visualizations.
Challenge?
Big Team Effort
CloudBioLinux
Extending More Than Just Galaxy...
CloudMan
http://bit.ly/prodcloudman
Security
Reporting
Advanced Job Configuration
Services
(Database, File Servers)
Authentication + SSL
usegalaxyp.org
One Command - Install all OS Packages - Ubuntu, CentOS, Scientific Linux
One Command - All Custom Bioinformatics Software - Multiple Versions
Added benefits
Looking for a new name.
Runner
Local: Default runner, jobs run right on same computer.

DRMAA: Uses a standard API to submit jobs to a variety of cluster queue managers (e.g. PBS, Grid Engine, SLURM).
Tools describe what jobs to run; runners and destinations describe how a job is to be run.
e.g.
A little coarse, tighter bounds mean
quicker jobs, greater efficiency.
Dynamic Job Destinations
with help from Nate!
Allow describing job destinations on many more criteria - user, inputs, system status.
These are implemented as Python functions (plugins) for maximum flexibility.
from galaxy.jobs import JobDestination
import os

def ncbi_blastn_wrapper(job):
    # Allocate extra time
    # Collect the job's input datasets by parameter name
    inp_data = dict( [ ( da.name, da.dataset )
                       for da in job.input_datasets ] )
    inp_data.update( [ ( da.name, da.dataset )
                       for da in job.input_library_datasets ] )
    query_file = inp_data[ "query" ].file_name
    query_size = os.path.getsize( query_file )
    # Queries larger than 1 MB get a longer walltime
    if query_size > 1024 * 1024:
        walltime_str = "walltime=24:00:00/"
    else:
        walltime_str = "walltime=12:00:00/"
    params = {"Resource_List": walltime_str}
    return JobDestination(runner="pbs", params=params)
Give more time to big input datasets
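The size-based rule in the function above can be tried without a Galaxy install by isolating it from the job object. The sketch below is a stand-alone rewrite for experimentation only: the names `ONE_MEGABYTE`, `choose_walltime`, and `walltime_for_file` are assumptions of this example, while the threshold and walltime strings are copied from the slide.

```python
import os
import tempfile

ONE_MEGABYTE = 1024 * 1024


def choose_walltime(query_size_bytes):
    """Return the walltime string for a query of the given size."""
    if query_size_bytes > ONE_MEGABYTE:
        return "walltime=24:00:00/"
    return "walltime=12:00:00/"


def walltime_for_file(path):
    """Look up the file size on disk and apply the same rule."""
    return choose_walltime(os.path.getsize(path))


if __name__ == "__main__":
    # Write a small throwaway query file and show which walltime it gets.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b">seq1\nACGT\n")
    print(walltime_for_file(handle.name))
    os.remove(handle.name)
```

Keeping the rule in a pure function like this makes it easy to check the threshold before wiring it into a dynamic destination.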
from galaxy.jobs.mapper import JobMappingException

DEFAULT_JOB_DESTINATION_ID = "local"

def has_license(user):
    user_group_assocs = user.groups or []
    user_has_license = 'have_license' in \
        [user_group_assoc.group.name for \
         user_group_assoc in user_group_assocs]
    if not user_has_license:
        raise JobMappingException("No license, no tool!")
    else:
        return DEFAULT_JOB_DESTINATION_ID
"We want to have tools available only to users who provided a license for this tool." - Vandeweyer Geert
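The license rule can be exercised outside Galaxy with simple stand-ins for the user model. In this sketch `Group`, `GroupAssoc`, and `User` are hypothetical stand-ins that only mimic the attributes the slide's function reads (`user.groups` and `assoc.group.name`), and `JobMappingException` is a local substitute for the class in `galaxy.jobs.mapper`.

```python
from collections import namedtuple

# Hypothetical stand-ins for Galaxy's model objects.
Group = namedtuple("Group", ["name"])
GroupAssoc = namedtuple("GroupAssoc", ["group"])
User = namedtuple("User", ["groups"])


class JobMappingException(Exception):
    """Local substitute for galaxy.jobs.mapper.JobMappingException."""


DEFAULT_JOB_DESTINATION_ID = "local"


def has_license(user):
    """Return the default destination only for members of 'have_license'."""
    group_names = [assoc.group.name for assoc in (user.groups or [])]
    if "have_license" not in group_names:
        raise JobMappingException("No license, no tool!")
    return DEFAULT_JOB_DESTINATION_ID


if __name__ == "__main__":
    licensed = User(groups=[GroupAssoc(Group("have_license"))])
    print(has_license(licensed))
```

A user with no groups, or with only other groups, triggers the exception, which is how the rule keeps the tool from running for them.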
files/001/
dataset_1000.dat
dataset_1001.dat
dataset_1002.dat
Typically all files owned by 'galaxy' user, jobs read and write data directly to Galaxy's file store.
To run 'galaxy' jobs as random users, all these files must be readable/writable by everyone on cluster.
....
files/001/
dataset_1000.dat
dataset_1001.dat
dataset_1002.dat
....
jobs/1/
inputs/
outputs/
working_directory/
setup
finish
GALAXY
LWR
Roughly speaking...
Cluster permissions match Galaxy's permissions!
Modify permission on each job.
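The per-job layout above can be sketched as a small staging step: create the job directory, copy the inputs in, and adjust permissions on that directory only. This is a rough illustration of the idea, not LWR's code. The function name `stage_job` is invented here, and restricting the directory to its owner stands in for handing it to the submitting user, which would need privileges this sketch does not assume.

```python
import os
import shutil
import stat
import tempfile


def stage_job(job_root, job_id, input_paths):
    """Create jobs/<id>/{inputs,outputs,working_directory} and copy inputs in."""
    job_dir = os.path.join(job_root, str(job_id))
    dirs = {}
    for name in ("inputs", "outputs", "working_directory"):
        path = os.path.join(job_dir, name)
        os.makedirs(path)
        dirs[name] = path
    for src in input_paths:
        shutil.copy(src, dirs["inputs"])
    # Permissions change per job, not on Galaxy's whole file store.
    os.chmod(job_dir, stat.S_IRWXU)
    return dirs


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    dataset = os.path.join(root, "dataset_1000.dat")
    with open(dataset, "w") as handle:
        handle.write("example data\n")
    print(stage_job(os.path.join(root, "jobs"), 1, [dataset]))
```

The point of the design is that the job only ever touches its own copy of the inputs, so the files under `files/001/` can stay owned by the 'galaxy' user.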
lwr.readthedocs.org
Could it do more?
Galaxy
BootStrap
Galaxy
blend4j
ToolShed
Setup
Install Tools
and Workflows
Could provide simple JVM API to quickly and transparently leverage powerful workflows with minimal prerequisites.
./launch.sh sample_1.tsv sample_1
Populated with users,
data, workflows.
Alignment workflow for sequencing core, before data released.
Variant detection workflow
executed by clinicians.
Data released to Clinic
Tool to shut down VM and transfer final results to MSI.
...my contribution
"Implementation of Cloud based Next Generation Sequencing data analysis in a clinical laboratory"
Getiria Onsongo, Jesse Erdmann, Michael D Spears, John Chilton, Kenneth Beckman, Adam Hauge, Sophia Yohe, Matthew Schomaker, Matthew Bower, Kevin A.T. Silverstein and Bharat Thyagarajan
Infrastructure for spinning up VM, configuring it, and transferring data.
Auto-populating Galaxy data.
* All my work merged into CloudBioLinux
Thanks Greg!