Jump to content

Dataflow programming

From Wikipedia, the free encyclopedia

Incomputer programming,dataflow programmingis aprogramming paradigmthat models a program as adirected graphof the data flowing between operations, thus implementingdataflowprinciples and architecture.[1]Dataflowprogramming languagesshare some features offunctional languages,and were generally developed in order to bring some functional concepts to a language more suitable for numeric processing. Some authors use the termdatastreaminstead ofdataflowto avoid confusion with dataflow computing ordataflow architecture,based on an indeterministic machine paradigm. Dataflow programming was pioneered byJack Dennisand his graduate students at MIT in the 1960s.

Considerations[edit]

Traditionally, a program is modelled as a series of operations happening in a specific order; this may be referred to as sequential,[2]: p.3  procedural,[3] control flow[3](indicating that the program chooses a specific path), orimperative programming.The program focuses on commands, in line with thevon Neumann[2]: p.3 vision of sequential programming, where data is normally "at rest".[3]: p.7 

In contrast, dataflow programming emphasizes the movement of data and models programs as a series of connections. Explicitly defined inputs and outputs connect operations, which function likeblack boxes.[3]: p.2 An operation runs as soon as all of its inputs become valid.[4]Thus, dataflow languages are inherently parallel and can work well in large, decentralized systems.[2]: p.3 [5] [6]

State[edit]

One of the key concepts in computer programming is the idea ofstate,essentially a snapshot of various conditions in the system. Most programming languages require a considerable amount of state information, which is generally hidden from the programmer. Often, the computer itself has no idea which piece of information encodes the enduring state. This is a serious problem, as the state information needs to be shared across multiple processors inparallel processingmachines. Most languages force the programmer to add extra code to indicate which data and parts of the code are important to the state. This code tends to be both expensive in terms of performance, as well as difficult to read or debug.Explicit parallelismis one of the main reasons for the poor performance ofEnterprise Java Beanswhen building data-intensive, non-OLTPapplications.[citation needed]

Where a sequential program can be imagined as a single worker moving between tasks (operations), a dataflow program is more like a series of workers on anassembly line,each doing a specific task whenever materials are available. Since the operations are only concerned with the availability of data inputs, they have no hidden state to track, and are all "ready" at the same time.

Representation[edit]

Dataflow programs are represented in different ways. A traditional program is usually represented as a series of text instructions, which is reasonable for describing a serial system which pipes data between small, single-purpose tools that receive, process, and return. Dataflow programs start with an input, perhaps thecommand lineparameters, and illustrate how that data is used and modified. The flow of data is explicit, often visually illustrated as a line or pipe.

In terms of encoding, a dataflow program might be implemented as ahash table,with uniquely identified inputs as the keys, used to look up pointers to the instructions. When any operation completes, the program scans down the list of operations until it finds the first operation where all inputs are currently valid, and runs it. When that operation finishes, it will typically output data, thereby making another operation become valid.

For parallel operation, only the list needs to be shared; it is the state of the entire program. Thus the task of maintaining state is removed from the programmer and given to the language'sruntime.On machines with a single processor core where an implementation designed for parallel operation would simply introduce overhead, this overhead can be removed completely by using a different runtime.

Incremental updates[edit]

Some recent dataflow libraries such asDifferential/TimelyDataflow have usedincremental computingfor much more efficient data processing.[1][7][8]

History[edit]

A pioneer dataflow language wasBLODI(BLOck DIagram), published in 1961 byJohn Larry Kelly, Jr.,Carol Lochbaum andVictor A. Vyssotskyfor specifyingsampled data systems.[9]A BLODI specification of functional units (amplifiers, adders, delay lines, etc.) and their interconnections was compiled into a single loop that updated the entire system for one clock tick.

In a 1966 Ph.D. thesis,The On-line Graphical Specification of Computer Procedures,[10]Bert Sutherlandcreated one of the first graphical dataflow programming frameworks in order to make parallel programming easier. Subsequent dataflow languages were often developed at the largesupercomputerlabs. POGOL, an otherwise conventional data-processing language developed atNSA,compiled large-scale applications composed of multiple file-to-file operations, e.g. merge, select, summarize, or transform, into efficient code that eliminated the creation of or writing to intermediate files to the greatest extent possible.[11]SISAL,a popular dataflow language developed atLawrence Livermore National Laboratory,looks like most statement-driven languages, but variables should beassigned once.This allows thecompilerto easily identify the inputs and outputs. A number of offshoots of SISAL have been developed, includingSAC,Single Assignment C,which tries to remain as close to the popularC programming languageas possible.

The United States Navy funded development of ACOS and SPGN (signal processing graph notation) starting in the early 1980s. This is in use on a number of platforms in the field today.[12]

A more radical concept isPrograph,in which programs are constructed as graphs onscreen, and variables are replaced entirely with lines linking inputs to outputs. Incidentally, Prograph was originally written on theMacintosh,which remained single-processor until the introduction of theDayStar Genesis MPin 1996.[citation needed]

There are many hardware architectures oriented toward the efficient implementation of dataflow programming models.[vague]MIT's tagged token dataflow architecture was designed byGreg Papadopoulos.[undue weight?discuss]

Data flow has been proposed[by whom?]as an abstraction for specifying the global behavior of distributed system components: in thelive distributed objectsprogramming model,distributed data flowsare used to store and communicate state, and as such, they play the role analogous to variables, fields, and parameters in Java-like programming languages.

Languages[edit]

Dataflow programming languages include:

Libraries[edit]

  • Apache Beam:Java/Scala SDK that unifies streaming (and batch) processing with several execution engines supported (Apache Spark, Apache Flink, Google Dataflow etc.)
  • Apache Flink:Java/Scala library that allows streaming (and batch) computations to be run atop a distributed Hadoop (or other) cluster
  • Apache Spark
  • SystemC:Library for C++, mainly aimed at hardware design.
  • TensorFlow:A machine-learning library based on dataflow programming.

See also[edit]

References[edit]

  1. ^abSchwarzkopf, Malte (7 March 2020)."The Remarkable Utility of Dataflow Computing".ACM SIGOPS.Retrieved31 July2022.
  2. ^abcJohnston, Wesley M.; J.R. Paul Hanna; Richard J. Millar (March 2004)."Advances in Dataflow Programming Languages"(PDF).ACM Computing Surveys.36:1–34.doi:10.1145/1013208.1013209.S2CID5257722.Retrieved15 August2013.
  3. ^abcdeWadge, William W.; Edward A. Ashcroft (1985).Lucid, the Dataflow Programming Language(illustrated ed.). Academia Press.ISBN9780127296500.Retrieved15 August2013.
  4. ^ab"Dataflow Programming Basics".Getting Started with NI Products.National Instruments Corporation.Retrieved15 August2013.
  5. ^Harter, Richard."Data Flow languages and programming - Part I".Richard Harter's World.Archived fromthe originalon 8 December 2015.Retrieved15 August2013.
  6. ^"Why Dataflow Programming Languages are Ideal for Programming Parallel Hardware".Multicore Programming Fundamentals Whitepaper Series.National Instruments Corporation.Retrieved15 August2013.
  7. ^McSherry, Frank; Murray, Derek; Isaacs, Rebecca; Isard, Michael (5 January 2013)."Differential dataflow".Microsoft.Retrieved31 July2022.
  8. ^"Differential Dataflow".Timely Dataflow. 30 July 2022.Retrieved31 July2022.
  9. ^John L. Kelly Jr.; Carol Lochbaum; V. A. Vyssotsky (1961). "A block diagram compiler".Bell System Tech. J.40(3): 669–678.doi:10.1002/j.1538-7305.1961.tb03236.x.
  10. ^Sutherland, William Robert(January 1966).The on-line graphical specification of computer procedures(PhD thesis).MIT.hdl:1721.1/13474.Retrieved2022-08-25.
  11. ^Gloria Lambert (1973). "Large scale file processing: POGOL".POPL '73: Proceedings of the 1st annual ACM SIGACT-SIGPLAN symposium on Principles of programming languages.ACM.pp. 226–234.
  12. ^Underwater Acoustic Data Processing, Y.T. Chan

External links[edit]