GitHub - scipipe/scipipe: Robust, flexible and resource-efficient pipelines using Go and the commandline

Robust, flexible and resource-efficient pipelines using Go and the commandline

Project links: Documentation & Main Website | Issue Tracker | Chat

Why SciPipe?

Intuitive: SciPipe works by flowing data through a network of channels and processes
Flexible: Wrapped command-line programs can be combined with processes in Go
Convenient: Full control over how your files are named
Efficient: Workflows are compiled to binary code that runs fast
Parallel: Pipeline parallelism between processes as well as task parallelism for multiple inputs, making efficient use of multiple CPU cores
Supports streaming: Stream data between programs to avoid wasting disk space
Easy to debug: Use available Go debugging tools or just println()
Portable: Distribute workflows as Go code or as self-contained executable files
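
To make the task-parallelism point concrete, here is a minimal sketch in plain Go. It does not use SciPipe and is not a description of SciPipe's internals: runTasks and maxConcurrent are hypothetical names, and a buffered channel is used as a semaphore to cap how many tasks run at once.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// runTasks applies work to every input, running at most maxConcurrent
// calls at the same time. Hypothetical helper, not part of SciPipe.
func runTasks(inputs []string, maxConcurrent int, work func(string) string) []string {
	results := make([]string, len(inputs))
	sem := make(chan struct{}, maxConcurrent) // buffered channel used as a semaphore
	var wg sync.WaitGroup

	for i, in := range inputs {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrent tasks are already running
		go func(i int, in string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when the task finishes
			results[i] = work(in)    // each task writes only to its own index
		}(i, in)
	}
	wg.Wait()
	return results
}

func main() {
	out := runTasks([]string{"a.txt", "b.txt", "c.txt"}, 2, strings.ToUpper)
	fmt.Println(out)
}
```

Because each result is stored at the index of its input, the output order matches the input order even though the tasks may finish in any order.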

Project updates

Jan 2020: New screencast: "Hello World" scientific workflow in SciPipe
May 2019: The SciPipe paper published open access in GigaScience: SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines
Nov 2018: Scientific study using SciPipe: Predicting off-target binding profiles with confidence using Conformal Prediction
Slides: Presentation on SciPipe and more at Go Stockholm Conference
Blog post: Provenance reports in Scientific Workflows - going into details about how SciPipe is addressing provenance.
Blog post: First production workflow run with SciPipe

Introduction

SciPipe is a library for writing Scientific Workflows, sometimes also called "pipelines", in the Go programming language.

When you need to run many commandline programs that depend on each other in complex ways, SciPipe helps by making the process of running these programs flexible, robust and reproducible. SciPipe also lets you restart an interrupted run without over-writing already produced output and produces an audit report of what was run, among many other things.

SciPipe is built on the proven principles of Flow-Based Programming (FBP) to achieve maximum flexibility, productivity and agility when designing workflows. Compared to plain dataflow, FBP has the benefit that processes are fully self-contained, so that a library of re-usable components can be created and plugged into new workflows ad hoc.

Similar to other FBP systems, SciPipe workflows can be likened to a network of assembly lines in a factory, where items (files) are flowing through a network of conveyor belts, stopping at different independently running stations (processes) for processing, as depicted in the picture above.
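
The assembly-line analogy can be sketched in plain Go, without SciPipe. In this illustration the "conveyor belts" are channels and the "stations" are goroutines; the function names produce, transform and collect are made up for the example and are not SciPipe API.

```go
package main

import (
	"fmt"
	"strings"
)

// produce plays the role of a first station: it puts items on a conveyor
// belt (a channel) and closes the belt when it is done.
func produce(items []string) <-chan string {
	out := make(chan string, 2) // small bounded buffer between stations
	go func() {
		defer close(out)
		for _, it := range items {
			out <- it
		}
	}()
	return out
}

// transform is a second, independently running station that processes
// each item as it arrives and passes it on.
func transform(in <-chan string) <-chan string {
	out := make(chan string, 2)
	go func() {
		defer close(out)
		for it := range in {
			out <- strings.ToUpper(it)
		}
	}()
	return out
}

// collect drains the final belt into a slice.
func collect(in <-chan string) []string {
	var res []string
	for it := range in {
		res = append(res, it)
	}
	return res
}

func main() {
	fmt.Println(collect(transform(produce([]string{"hello", "world"}))))
}
```

Each station knows only about its own in- and out-channels, which is what allows stations to be reused and rewired into other networks.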

SciPipe was initially created for problems in bioinformatics and cheminformatics, but works equally well for any problem involving pipelines of commandline applications.

Project status: SciPipe is pretty stable now, and only very minor API changes might still occur. We have successfully used SciPipe in a handful of both real and experimental projects, and it has had occasional use outside the research group as well.

Known limitations

There are still a number of missing good-to-have features for workflow design. See the issue tracker for details.
There is not (yet) support for the Common Workflow Language.

Installing

For full installation instructions, see the installation page. For quick getting-started steps, you can do:

Download and install Go
Run the following commands to install the scipipe Go library (don't miss the trailing dots!) and create a Go module for your script:

go install github.com/scipipe/scipipe/...@latest
go mod init myfirstworkflow-module

Hello World example

Let's look at an example workflow to get a feel for what writing workflows in SciPipe looks like:

package main

import (
	// Import SciPipe, aliased to sp
	sp "github.com/scipipe/scipipe"
)

func main() {
	// Init workflow and max concurrent tasks
	wf := sp.NewWorkflow("hello_world", 4)

	// Initialize processes, and file extensions
	hello := wf.NewProc("hello", "echo 'Hello ' > {o:out|.txt}")
	world := wf.NewProc("world", "echo $(cat {i:in}) World > {o:out|.txt}")

	// Define data flow
	world.In("in").From(hello.Out("out"))

	// Run workflow
	wf.Run()
}

To create a file with a similar simple example, you can run:

scipipe new hello_world.go

Running the example

Let's put the code in a file named hello_world.go and run it.

First you need to make sure that the dependencies (SciPipe in this case) are installed in your local Go module. You can do this with:

go mod tidy

Then you can go ahead and run the workflow:

$ go run hello_world.go
AUDIT 2018/07/17 21:42:26 | workflow:hello_world | Starting workflow (Writing log to log/scipipe-20180717-214226-hello_world.log)
AUDIT 2018/07/17 21:42:26 | hello | Executing: echo 'Hello' > hello.out.txt
AUDIT 2018/07/17 21:42:26 | hello | Finished: echo 'Hello' > hello.out.txt
AUDIT 2018/07/17 21:42:26 | world | Executing: echo $(cat ../hello.out.txt) World > hello.out.txt.world.out.txt
AUDIT 2018/07/17 21:42:26 | world | Finished: echo $(cat ../hello.out.txt) World > hello.out.txt.world.out.txt
AUDIT 2018/07/17 21:42:26 | workflow:hello_world | Finished workflow (Log written to log/scipipe-20180717-214226-hello_world.log)

Let's check which files SciPipe has generated:

$ ls -1 hello*
hello.out.txt
hello.out.txt.audit.json
hello.out.txt.world.out.txt
hello.out.txt.world.out.txt.audit.json

As you can see, it has created the files hello.out.txt and hello.out.txt.world.out.txt, and an accompanying .audit.json file for each of them.

Now, let's check the output of the final resulting file:

$ cat hello.out.txt.world.out.txt
Hello World

Now we can rejoice that it contains the text "Hello World", exactly as a proper Hello World example should :)

Now, these were a little long and cumbersome filenames, weren't they? SciPipe gives you very good control over how to name your files, if you don't want to rely on the automatic file naming. For example, we could set the first filename to a static one, and then use the first name as a basis for the file name for the second process, like so:

package main

import (
	// Import the SciPipe package, aliased to 'sp'
	sp "github.com/scipipe/scipipe"
)

func main() {
	// Init workflow with a name, and max concurrent tasks
	wf := sp.NewWorkflow("hello_world", 4)

	// Initialize processes and set output file paths
	hello := wf.NewProc("hello", "echo 'Hello ' > {o:out}")
	hello.SetOut("out", "hello.txt")

	world := wf.NewProc("world", "echo $(cat {i:in}) World >> {o:out}")
	world.SetOut("out", "{i:in|%.txt}_world.txt")

	// Connect network
	world.In("in").From(hello.Out("out"))

	// Run workflow
	wf.Run()
}

In the {i:in... part, we are re-using the file path from the file received on the in-port named 'in', and then running a Bash-style trim-from-end command on it to remove the .txt extension.
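
As a rough illustration of what that trim-from-end step does to the path, here is a plain Go sketch. It is not SciPipe's implementation; worldPath is a hypothetical helper that mimics the pattern "{i:in|%.txt}_world.txt" using strings.TrimSuffix from the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// worldPath mimics the path pattern "{i:in|%.txt}_world.txt": it strips a
// trailing ".txt" from the input path (if present) and appends "_world.txt".
// Hypothetical helper for illustration only.
func worldPath(inPath string) string {
	return strings.TrimSuffix(inPath, ".txt") + "_world.txt"
}

func main() {
	fmt.Println(worldPath("hello.txt")) // prints "hello_world.txt"
}
```

Like Bash's ${var%.txt}, the trim only removes the suffix when the path actually ends with it; a path such as data.csv would be left unchanged before the new ending is appended.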

Now, if we run this, the file names get a little cleaner:

$ ls -1 hello*
hello.txt
hello.txt.audit.json
hello_world.go
hello_world.txt
hello_world.txt.audit.json

The audit logs

Finally, we can have a look at one of the audit files that were created:

$ cat hello_world.txt.audit.json
{
    "ID": "99i5vxhtd41pmaewc8pr",
    "ProcessName": "world",
    "Command": "echo $(cat hello.txt) World \u003e\u003e hello_world.txt.tmp/hello_world.txt",
    "Params": {},
    "Tags": {},
    "StartTime": "2018-06-15T19:10:37.955602979+02:00",
    "FinishTime": "2018-06-15T19:10:37.959410102+02:00",
    "ExecTimeNS": 3000000,
    "Upstream": {
        "hello.txt": {
            "ID": "w4oeiii9h5j7sckq7aqq",
            "ProcessName": "hello",
            "Command": "echo 'Hello ' \u003e hello.txt.tmp/hello.txt",
            "Params": {},
            "Tags": {},
            "StartTime": "2018-06-15T19:10:37.950032676+02:00",
            "FinishTime": "2018-06-15T19:10:37.95468214+02:00",
            "ExecTimeNS": 4000000,
            "Upstream": {}
        }
    }
}

Each such audit-file contains a hierarchic JSON-representation of the full workflow path that was executed in order to produce this file. On the first level is the command that directly produced the corresponding file, and then, indexed by their filenames, under "Upstream", there is a similar chunk describing how all of its input files were generated. This process will be repeated in a recursive way for large workflows, so that, for each file generated by the workflow, there is always a full, hierarchic, history of all the commands run - with their associated metadata - to produce that file.
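
To show how this recursive structure can be consumed, here is a small sketch in plain Go using only the standard library. The auditEntry struct, parseAudit and history are hypothetical names introduced for this example (they are not SciPipe types or functions), and the struct covers only three of the fields shown in the audit file above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// auditEntry models a subset of the audit JSON fields. Hypothetical type
// for illustration; it is not SciPipe's own data structure.
type auditEntry struct {
	ProcessName string
	Command     string
	Upstream    map[string]auditEntry // keyed by input file name
}

// parseAudit decodes one audit JSON document.
func parseAudit(data string) (auditEntry, error) {
	var e auditEntry
	err := json.Unmarshal([]byte(data), &e)
	return e, err
}

// history walks the Upstream entries recursively and returns the commands
// in execution order: upstream commands first, the final command last.
func history(e auditEntry) []string {
	names := make([]string, 0, len(e.Upstream))
	for name := range e.Upstream {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output

	var cmds []string
	for _, name := range names {
		cmds = append(cmds, history(e.Upstream[name])...)
	}
	return append(cmds, e.ProcessName+": "+e.Command)
}

// sample is an abbreviated copy of the audit file shown above.
const sample = `{
  "ProcessName": "world",
  "Command": "echo $(cat hello.txt) World \u003e\u003e hello_world.txt.tmp/hello_world.txt",
  "Upstream": {
    "hello.txt": {
      "ProcessName": "hello",
      "Command": "echo 'Hello ' \u003e hello.txt.tmp/hello.txt",
      "Upstream": {}
    }
  }
}`

func main() {
	e, err := parseAudit(sample)
	if err != nil {
		panic(err)
	}
	for _, line := range history(e) {
		fmt.Println(line)
	}
}
```

The JSON decoder turns the \u003e escapes back into ">", so the printed commands read as they were executed, with the "hello" command listed before the "world" command that depends on it.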

You can find many more examples in the examples folder in the GitHub repo.

For more information about how to write workflows using SciPipe, and much more, see the SciPipe website (scipipe.org)!

More material on SciPipe

See a poster on SciPipe, presented at the e-Science Academy in Lund, on Oct 12-13 2016.
See slides from a recent presentation of SciPipe for use in a Bioinformatics setting.
The architecture of SciPipe is based on a flow-based-programming-like pattern in pure Go, presented in this and this blog post on Gopher Academy.

Citing SciPipe

If you use SciPipe in academic or scholarly work, please cite the following paper as source:

Lampa S, Dahlö M, Alvarsson J, Spjuth O. SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines. GigaScience 8, 5 (2019). DOI: 10.1093/gigascience/giz044

Acknowledgements

SciPipe is very heavily dependent on the proven principles from Flow-Based Programming (FBP), as invented by John Paul Morrison. From Flow-Based Programming, SciPipe uses the ideas of separate network (workflow dependency graph) definition, named in- and out-ports, sub-networks/sub-workflows and bounded buffers (already available in Go's channels) to make writing workflows as easy as possible.
This library has also been much influenced/inspired by the GoFlow library by Vladimir Sibirov.
Thanks to Egon Elbre for helpful input on the design of the internals of the pipeline, and processes, which greatly simplified the implementation.
This work is financed by faculty grants and other financing for the Pharmaceutical Bioinformatics group of the Dept. of Pharmaceutical Biosciences at Uppsala University, and by the Swedish Research Council through the Swedish National Bioinformatics Infrastructure Sweden.
Supervisor for the project is Ola Spjuth.

Related tools

Find below a few tools that are more or less similar to SciPipe and that are worth checking out before deciding on what tool fits you best (in approximate order of similarity to SciPipe):
