Galaxy-P: Beyond Proteomics
John Chilton, James Johnson, Getiria Onsongo, Ebbing de Jong, Pratik Jagtap, Timothy Griffin
Three Ways to Galaxy-P
What is Galaxy-P?
...or maybe 4
Three-year NSF-funded grant to build a mass spec & proteomics data analysis platform on top of Galaxy.
Come chat about Galaxy proteomics more broadly (not just Galaxy-P) at the BoF during lunch today...
Binary Data Types
Large Numbers of Files
No need for shared file system
Well documented (lwr.readthedocs.org)
Client runner in Galaxy now
CloudBioLinux/CloudMan integration (in progress)
Support for public servers.
Install LWR webapp on remote system to allow it
to act as a Galaxy worker node.
w/experimental caching support
Not just for Windows
SSL + private token authentication
Java Web Start Application
Batch Download and Upload Files
Get around browser upload limits
Many API enhancements contributed upstream
Built on blend4j
Optional Galaxy extensions for direct access
Tools can now take in multiple inputs all at once.
Replace dataset <repeat> elements with <param type="data" multiple="true">.
Convenient for specifying a few inputs, necessary for dozens or hundreds.
(in your Galaxy now)
"An Automated Pipeline for High-Throughput Label-Free Quantitative Proteomics"
(J. Proteome Res., 2013, PMID: 23391308).
Arbitrary # Inputs
Merged into one output
for subsequent steps.
Applications run in
parallel (once per input)
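The split-and-merge pattern described above (an arbitrary number of inputs, one application run per input in parallel, results merged into one output) can be sketched in plain Python. This is only an illustration of the idea, not Galaxy's or Galaxy-P's task-parallelism code: `run_tool`, `merge_outputs`, and `scatter_gather` are hypothetical names, and the "tool" here just counts non-empty lines.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tool(text):
    """Hypothetical per-input 'application': count non-empty lines."""
    return sum(1 for line in text.splitlines() if line.strip())


def merge_outputs(results):
    """Hypothetical merge step: combine per-input results into one output."""
    results = list(results)
    return {"per_input": results, "total": sum(results)}


def scatter_gather(inputs, max_workers=4):
    """Run run_tool once per input in parallel, then merge into one result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with inputs.
        results = list(pool.map(run_tool, inputs))
    return merge_outputs(results)


if __name__ == "__main__":
    inputs = ["a\nb\n", "c\n", "d\ne\nf\n"]
    print(scatter_gather(inputs))
```

The number of inputs is not fixed anywhere in the sketch, which is the point: the same merge step works for three files or for several hundred.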
Galaxy can run these tools, but it cannot build this workflow, Galaxy-P can!
Multiple File Datasets
...group multiple files into a single dataset to "flatten" workflow.
Merging or many-to-one tools...
Jagtap's "Shamelessly Seamless" Proteogenomics workflow
150+ steps - Multiple identifications, BLAST, custom tools for spectral validation and genome mapping.
Takes 3 days to run on a fractionated sample of 52 RAW files
Multiple LWR steps.
Tools and datatypes are NOT adapted for multiple file datasets
Not just for proteomics...
Not just for toy workflows...
Track or merge multiple file datasets into your Galaxy.
nothing proteomics related
"... it is needed for the community. I don't think we have other options for our requirements, [the] multiple file datasets implementation was a real savior for us."
Alex Khassapov, CSIRO
Hagai Cohen, Hebrew University
I am using it with success for chip-seq and rna-seq analysis.
nearing end of year 1
Not just improved grouping...
Tools with many-to-many outputs...
Aligns feature space of each input against all others...
With the current implementation, the tool does "need to know" that it produces a multiple file dataset.
Implemented an early version of multiple file datasets at tool and datatype level.
Dannon, Nate, Enis, Brad, Jeremy, Ira
Galaxy-P and Galaxy Teams @ MSI
w/special, special thanks to Anne-Francoise Lamblin & Benjamin Lynch
People in Galaxy community who helped with various parts of this
& Dave Clements
Hardest working man in Galaxy
Point LWR at a toolbox to describe what it can run.
Markup your standard Galaxy tool XML to lock down how it runs.
Proteomics category has the 3rd
most repositories on the tool shed.
Jagtap's Minnesota Two Step
Huge presence at ASMS 2013, generated a lot of buzz.
then any Galaxy instance can run
these tools on your resource
Easily share access to specialized hardware, large datasets, etc...
One click access...
Publish tools with default public server instances to the tool shed
and provide one-click access to your compute and data.
Sample tracking throughout complex workflows...
Dataset names get mangled throughout complex workflows, but "part names" remain consistent.
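One way to picture why part names survive while dataset names do not: assume a multiple file dataset is modelled as a mapping from part name to file content. That model is an assumption made for illustration, not Galaxy-P's actual data model, and `MultiFileDataset` and `apply_step` are hypothetical names. Each step is then free to rename the dataset while carrying the part keys through unchanged.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class MultiFileDataset:
    """Hypothetical model: one dataset holding several named parts."""
    name: str
    parts: Dict[str, str] = field(default_factory=dict)


def apply_step(dataset: MultiFileDataset, step_name: str,
               func: Callable[[str], str]) -> MultiFileDataset:
    """Apply func to every part; the dataset name changes, part names do not."""
    new_name = f"{step_name} on {dataset.name}"
    new_parts = {part: func(content) for part, content in dataset.parts.items()}
    return MultiFileDataset(name=new_name, parts=new_parts)


raw = MultiFileDataset("fractions", {"fraction_01": "abc", "fraction_02": "de"})
converted = apply_step(raw, "Convert", str.upper)
searched = apply_step(converted, "Search", lambda s: s[::-1])

if __name__ == "__main__":
    print(searched.name)           # dataset name has been rewritten twice
    print(sorted(searched.parts))  # part names are the same as in `raw`
```

After two steps the dataset name no longer resembles the original, but each result can still be traced back to its sample by part name.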
Addresses additional limitations of Galaxy.
Have mass spec data?
Three ways to Galaxy-P...
or maybe 4...
Because the Galaxy team would like a more
flexible, deeply integrated concept of dataset
I do too!
Galaxy team doesn't want to support this.
Encumber the code base with special cases. (Technical debt.)
Make me a committer on Bitbucket, problem solved!
#86: Datatype Tracking Enhancements
#169: Support config files with task parallelism.
#87: Framework for per job task parallelism.
#122: Eliminate hard coding of upload1 tool hacks.
#133: Fix file_ext in image datatypes.
#123: Refactor huge functions in library_common.
#83: More flexible task merging.
#142: Simplify task splitting input specification.
#156: Fix 'from_work_dir' with task splitting.