The Spider-Web

A dependency driven process management tool

Akos Farkas

on 9 October 2015

Transcript of The Spider-Web

make my life easier
learn some GUI programming with Python
something that can be shown to clients as the process we'll hand over to them
The Spider-Web
General application
General reception
The Spider-Web
A ghetto queue
Fewest possible parameters
Process monitoring
Data Mining
What to do
Why bother
"You got carried away :)"
created cookbook for DB API and common techniques
"I want to do this to learn Java!"
some started learning Qt to build DM tools
... and when the guy leaves for holidays
A dependency driven
process management tool

Akos Farkas

creating a data-mart of a dozen tables
extracting hundreds of features from DWH
mostly percent ranks across the population
e.g. "how many times did this customer change plans in the last 6 months compared to others?"
heavy use of analytical SQL functions
runs for days, one query for hours
dies depending on other load on DWH
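The percent-rank features described above can be sketched with an analytical window function. This is a minimal illustration, not the actual DWH queries: it uses an in-memory SQLite database (which supports `PERCENT_RANK()` since 3.25, like Oracle) and made-up table and column names.

```python
import sqlite3

# In-memory stand-in for the DWH; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plan_changes (customer_id INTEGER, changes_6m INTEGER)")
conn.executemany(
    "INSERT INTO plan_changes VALUES (?, ?)",
    [(1, 0), (2, 2), (3, 5), (4, 2), (5, 9)],
)

# "How many times did this customer change plans compared to others?"
# expressed as a percent rank across the population.
rows = conn.execute("""
    SELECT customer_id,
           changes_6m,
           PERCENT_RANK() OVER (ORDER BY changes_6m) AS pct_rank
    FROM plan_changes
""").fetchall()
for customer_id, changes, pct in rows:
    print(customer_id, changes, round(pct, 2))
```

Customers 2 and 4 tie on two changes and get the same percent rank; the real feature-extraction queries ran hundreds of such expressions over the whole customer base, which is where the hours-per-query runtimes came from.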
SPSS Clementine as primary tool of mathematicians
R and SAS as alternatives
build and test models in Clementine
reprogram scoring in Oracle using PL/SQL or Java procedures (that's me)
Logistic regression
K-Means clustering
Latent class analysis
Churn probability
Product affinity
Need based segmentation
PowerPoint presentation with tons of histograms and guesses (I was already bored by the 20th chart)
Lift values - how much better we are than random
output of scoring in a table for every customer periodically
hard to sell to new clients - everything is a one-off manual artifact
hand off to an administrator
runs the processes for 3-4 projects every month
keeps half of the logs in an Excel sheet
some of the status is in his head
expects failures and schedules pessimistic retries - create; create;
resume in the morning
new clients see this will be a burden
model processes as a general directed graph, something like a Petri-net
create a tool to visualize the flow
a lightweight process execution framework that poses minimum requirements for programmers
something that DB developers are less likely to object to, leaving control in their hands
but leave room for statistical tools as well
parallelize independent processes so that one can die and the rest will be executed
retry execution after cool-off
manual intervention
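The retry-after-cool-off requirement can be sketched as a simple loop; the function name, attempt count, and the flaky task below are all hypothetical, not part of the actual tool.

```python
import time

def run_with_cooloff(task, max_attempts=3, cooloff=1.0):
    """Retry a failing task after a cool-off period; give up after
    max_attempts so an operator can intervene manually."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # leave the task failed for manual intervention
            time.sleep(cooloff)

# Hypothetical flaky task: fails twice (e.g. DWH under load), then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("DWH under load")
    return "done"

result = run_with_cooloff(flaky, cooloff=0.01)
print(result)  # → done
```

Because independent processes run in parallel, one such failure only delays its own branch of the graph; the rest keeps executing.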
it was ubiquitous and I was a DB programmer
will host the scheduling logic and process meta-data
needs to work without a GUI as much as possible
provide a GUI
glue together client side components
access file system for model parameters from Clementine
cross-platform GUI framework from Nokia, bindings with PyQt
"Rapid GUI Programming with Python and Qt" from Mark Summerfield - perfect :)
the most convenient and easy to learn framework I have seen
need to visualize a directed graph
don't want an editor with manual placement because I had existing meta-data to work with
provides balanced layouts
open source statistical package
run scripts on client machines
read and write the database
call through RPy
proprietary statistical framework, everybody hated it
automate if possible through COM
connects to the DB, we just need to call it but it only exists on one desktop
just populate two tables: processes and their dependencies
specify: name, system, technology, script, parameters, dependency type (AND, OR) and conditional fork-joins
database driven worker queue
a task is enabled if all its dependencies are ready or refreshed
... unless it is scheduled for later
domino effect
heavy use of "connect by" queries for graph traversal in SQL
instantiate predecessors based on parameter matching
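The enablement rule and the graph traversal can be sketched as below. This uses SQLite with a recursive CTE where the actual Oracle implementation used `CONNECT BY`; the schema, task names, and status values are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (name TEXT PRIMARY KEY, status TEXT);  -- READY / DONE
CREATE TABLE deps (task TEXT, depends_on TEXT);
INSERT INTO tasks VALUES ('extract', 'DONE'), ('clean', 'DONE'),
                         ('score', 'READY'), ('report', 'READY');
INSERT INTO deps VALUES ('score', 'extract'), ('score', 'clean'),
                        ('report', 'score');
""")

# AND semantics: a READY task is enabled when no dependency is non-DONE.
enabled = [r[0] for r in conn.execute("""
    SELECT t.name FROM tasks t
    WHERE t.status = 'READY'
      AND NOT EXISTS (
          SELECT 1 FROM deps d JOIN tasks p ON p.name = d.depends_on
          WHERE d.task = t.name AND p.status <> 'DONE')
""")]
print(enabled)  # → ['score']

# All transitive predecessors of 'report' -- the kind of traversal Oracle
# does with CONNECT BY PRIOR, written here as a recursive CTE.
ancestors = [r[0] for r in conn.execute("""
    WITH RECURSIVE pred(name) AS (
        SELECT depends_on FROM deps WHERE task = 'report'
        UNION
        SELECT d.depends_on FROM deps d JOIN pred p ON d.task = p.name)
    SELECT name FROM pred
""")]
print(sorted(ancestors))  # → ['clean', 'extract', 'score']
```

When `score` finishes and its status flips to DONE, the same enablement query picks up `report` on the next poll; that is the domino effect.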
periodically check out processes to run on the client
or start anything manually
view logs, waiting status, results
copy to next month - also versioning
Some wanted more
submit arbitrary process for execution and dynamic growth
moved every system into the new framework
just laid back and watched the nodes going green :)
Multiple instances
Allowed cycles for recurring tasks
Continuous web scraping
DB transformations
Continuous scoring and weekly model rebuild
No fun for me!