The Web Actor Programming Model Whitepaper [DRAFT]

This whitepaper describes a new concept for building serverless microapps called Actors, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud.

By Jan Čurn, Marek Trunkát, Ondra Urban, and the Apify team.

Version 0.99 (September 2024)

Introduction

This whitepaper introduces Actors, a new kind of serverless program for general-purpose, language-agnostic computing and automation jobs (also known as agents, functions, apps, ...). The main goal of Actors is to make it easy for developers to build and ship reusable software tools that are also easy for others to run and integrate. For example, Actors are useful for building web scrapers, crawlers, automations, and AI agents.

Background

Actors were first introduced by Apify in late 2017, as a way to easily build, package, and ship web scraping and web automation tools to customers. Over the years, Apify has kept developing the concept and has applied it successfully to thousands of real-world use cases in many business areas, well beyond the domain of web scraping.

Building on this experience, we're releasing this whitepaper to introduce the philosophy of Actors to the public and receive your feedback on it. Our hope is to make the Actor programming model an open standard, which will help the community build and ship reusable software automation tools more effectively, as well as encourage new implementations of the model in other programming languages.

The goal of this whitepaper is to be the north star showing what the Actor programming model is and what its implementations should support. Currently, the most complete implementation of the Actor model is provided by the Apify platform, with SDKs for Node.js and Python, and a command-line interface (CLI). Beware that the frameworks might not yet implement all the features of the Actor programming model described in this whitepaper. This is a work in progress.

Overview

Actors are serverless programs running in the cloud. They can perform anything from simple actions such as filling out a web form or sending an email, to complex operations such as crawling an entire website, or removing duplicates from a large dataset. Actors can persist their state and be restarted, and thus they can run as short or as long as necessary, from seconds to hours, even infinitely.

Basically, Actors are programs packaged as Docker images, which accept a well-defined JSON input, perform an action, and optionally produce a well-defined JSON output.

Actors have the following elements:

  • Dockerfile, which specifies where the Actor's source code is, how to build it, and how to run it.
  • Documentation in the form of a README.md file.
  • Input and output schemas that describe what input the Actor requires and what results it produces.
  • Access to an out-of-the-box storage system for Actor data, results, and files.
  • Metadata such as the Actor name, description, author, and version.

The documentation and the input/output schemas make it possible for people to easily understand what the Actor does, enter the required inputs either in the user interface or via the API, and integrate the results of the Actor into their other workflows. Actors can easily call and interact with each other, which makes it possible to build more complex systems on top of simple ones.

Apify Actor diagram

Apify platform

Actors can be published on the Apify platform, which automatically generates a rich website with documentation and a practical user interface, in order to encourage people to try the Actor right away. The Apify platform takes care of securely hosting the Actors' Docker containers and scaling the computing, storage, and network resources as needed, so neither Actor developers nor the users need to deal with the infrastructure. It just works.

The Apify platform provides an open API, a cron-style scheduler, webhooks, and integrations with services such as Zapier or Make, which make it easy for users to integrate Actors into their existing workflows. Additionally, Actor developers can set a price tag for the usage of their Actors, and thus earn income and have an incentive to keep developing and improving the Actor for the users. For details, see Monetization.

Basic concepts

This section describes core features of Actors, what they are good for, and how Actors differ from other serverless computing systems.

Input

Each Actor accepts an input object, which tells it what it should do. The object is passed in JSON format, and its properties play a role similar to command-line arguments when running a program in a UNIX-like operating system.

For example, an input object for an Actor bob/screenshot-taker can look like this:

{
    "url": "https:// example",
    "width": 800
}

The input object represents a standardized way for the caller to control the Actor's activity, whether starting it via the API, the user interface, the CLI, or the scheduler. The Actor can access the value of the input object using the Get input function.

In order to specify what kind of input object an Actor expects, the Actor developer can define an Input schema file.

The input schema is used by the system to generate the user interface and API examples, and to simplify integrations with external systems.

Example of auto-generated Actor input UI

Screenshot Taker Input UI

Run environment

Actors run within an isolated Docker container with access to a local file system and network, and they can perform arbitrary computing activity or call external APIs. The standard output of the Actor's program (stdout and stderr) is printed out and logged, which is useful for development and debugging.

To inform the users about the progress, Actors might set a status message, which is then displayed in the user interface and is also available via API.

Running Actors can also launch a web server, which is assigned a unique local or public URL to receive HTTP requests. For example, this is useful for messaging and interaction between Actors, for running request-response REST APIs, or for providing a full-featured website.

Actors can store their working data or results in specialized storages called Key-value store and Dataset, from which the data can be easily exported using the API or integrated into other Actors.

Output

While the input object provides a standardized way to invoke Actors, Actors can also generate an output object, which is a standardized way to display, consume, and integrate Actors' results.

The Actor results are typically fully available only after the Actor run finishes, but the consumers of the results might want to access partial results during the run. Therefore, Actors don't generate the output object in their code. Instead, they define an Output schema file, which contains instructions on how to generate such an output object automatically.

You can define what the Actor output looks like using the Output schema file. The system uses this information to automatically generate an immutable JSON file, which tells users where to find the results produced by the Actor. The output object is stored by the system in the Actor run object under the output property, and returned via API immediately when the Actor is started, without the need to wait for it to finish or generate the actual results. This is useful for automatically generating UI previews of the results, API examples, and integrations.

The output object is similar to the input object, as it contains properties and values. For example, for the bob/screenshot-taker Actor, the output object can look like this:

{
    "screenshotUrl": "https://api.apify /v2/key-value-stores/skgGkFLQpax59AsFD/records/screenshot.jpg",
    "productImages": "https://api.apify /v2/key-value-stores/skgGkFLQpax59AsFD/records/product*.jpg",
    "productDetails": "https://api.apify /datasets/9dFknjkxxGkspwWd/records?fields=url,name",
    "productExplorer": "https://bob--screenshot.apify.actor/product-explorer",
    // or this with live view
    "productExplorer": "https://13413434.runs.apify.net/product-explorer"
}

Storage

The Actor system provides two specialized storages that can be used by Actors for storing files and results: Key-value store and Dataset, respectively. For each Actor run, the system automatically creates so-called default storages of both these types in an empty state and makes them readily available to the Actor.

Alternatively, a caller can request reusing existing storage when starting a new Actor run. This is similar to redirecting standard input in UNIX, and it is useful if you want an Actor to operate on an existing Key-value store or Dataset instead of creating a new one.

Besides these so-called default storages, which are created or linked automatically, Actors are free to create new storages or access existing ones, either by ID or by a name that can be assigned to them (e.g. bob/screenshots). The input schema file and output schema file provide special support for referencing these storages, in order to simplify linking the output of one Actor to the input of another. The storages are also accessible externally through an API and SDK, for example, to download results when the Actor finishes.

Note that the Actors are free to access any other external storage system through a third-party API, e.g. an SQL database or a vector database.

Key-value store

The Key-value store is a simple data storage that is used for saving and reading files or data records. Each record is identified by a unique text key and holds data associated with a MIME content type. Key-value stores are ideal for saving things like screenshots, web pages, or PDFs, or for persisting the state of Actors, e.g. as a JSON file.
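
The record model described here (a unique text key plus data tagged with a MIME content type) can be sketched in a few lines of Python. The `InMemoryKeyValueStore` class below is a hypothetical stand-in used only for illustration, not the platform's storage API, and it assumes that values saved without an explicit content type are serialized to JSON.

```python
import json


class InMemoryKeyValueStore:
    """Hypothetical in-memory stand-in for a Key-value store (illustration only)."""

    def __init__(self):
        # Maps record key -> (raw bytes, MIME content type)
        self._records = {}

    def set_value(self, key, value, content_type=None):
        if content_type is None:
            # Assumption: values without an explicit content type are stored as JSON
            data = json.dumps(value).encode("utf-8")
            content_type = "application/json; charset=utf-8"
        elif isinstance(value, str):
            data = value.encode("utf-8")
        else:
            data = bytes(value)
        self._records[key] = (data, content_type)

    def get_value(self, key):
        record = self._records.get(key)
        if record is None:
            return None
        data, content_type = record
        # JSON records are parsed back into objects, everything else stays binary
        if content_type.startswith("application/json"):
            return json.loads(data.decode("utf-8"))
        return data


store = InMemoryKeyValueStore()
store.set_value("my-state", {"pages_crawled": 45})
store.set_value("screenshot.png", b"\x89PNG", content_type="image/png")
```

Reading `my-state` back returns the original dictionary, while `screenshot.png` comes back as raw bytes, which is the distinction the content type is there to carry.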

Each Actor run is associated with a default empty Key-value store, which is created exclusively for the run, or alternatively with an existing Key-value store if requested by the user on Actor start. The Actor input is stored as a JSON file in the default Key-value store under the key defined by the ACTOR_INPUT_KEY environment variable (usually INPUT). The Actor can read this input object using the Get input function.

The Actor can read and write records to key-value stores using the API. For details, see Key-value store access.

The Actor can define a schema for the Key-value store to ensure files stored in it conform to certain rules. For details, see Storage schema files.

Dataset

The Dataset is an append-only storage that allows you to store a series of data objects such as results from web scraping, crawling, or data processing jobs. You or your users can then export the Dataset to formats such as JSON, CSV, XML, RSS, Excel, or HTML.

The Dataset represents a store for structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Because the Dataset is append-only, you can only add new records to it; you cannot modify or remove existing records. Typically, it is used to store an array or collection of results, such as a list of products or web pages.
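
To make the table analogy concrete, here is a minimal sketch of an append-only collection that can be exported to CSV. The `AppendOnlyDataset` class and its method names are hypothetical and exist only for this illustration; a real Dataset lives on the platform and is accessed through the API or SDK.

```python
import csv
import io


class AppendOnlyDataset:
    """Hypothetical in-memory model of a Dataset: rows can be added, never changed."""

    def __init__(self):
        self._items = []

    def push_data(self, item):
        # Store a copy so later changes by the caller cannot alter the stored row
        self._items.append(dict(item))

    def to_csv(self):
        # Columns are the attribute names, in the order they were first seen
        fieldnames = []
        for item in self._items:
            for name in item:
                if name not in fieldnames:
                    fieldnames.append(name)
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(self._items)
        return buffer.getvalue()


dataset = AppendOnlyDataset()
dataset.push_data({"name": "Chair", "price": 49})
dataset.push_data({"name": "Table", "price": 199})
csv_text = dataset.to_csv()
```

The class deliberately exposes no update or delete method, mirroring the append-only contract described above.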

The Actor can define a schema for the Dataset to ensure objects stored in it conform to certain rules. For details, see Storage schema files.

Integrations

Actors are designed for interoperability. Thanks to the input and output schemas, it is easy to connect Actors with external systems, be it directly via REST API, Node.js or Python clients, CLI, or no-code automations. From the schema files, the system can automatically generate API documentation and an OpenAPI specification, and validate inputs and outputs, which simplifies integration with other systems.

Furthermore, Actors can interact with each other, for example start another Actor, attach Webhooks to process the results, or Metamorph into another Actor to have it finish the work.

What Actors are not

Actors are best suited for compute operations that take an input, perform an isolated job for a user, and potentially produce some output.

For long-running jobs, Actor execution might be migrated from one server to another, which makes Actors unsuitable for running dependable storage workloads such as SQL databases.

As Actors are based on Docker, it takes a certain amount of time to spin up the container and launch its main process. Doing this for every small HTTP transaction (e.g. API call) is not efficient, even for highly optimized Docker images. The Standby mode enables running an Actor as a web server, to process small API requests more efficiently.

Philosophy

Actors are inspired by the UNIX philosophy from the 1970s, adapted to the age of the cloud:

  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.
  2. Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input.
  3. Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.
  4. Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.

The UNIX philosophy is arguably one of the most important software engineering paradigms which, together with other favorable design choices of UNIX operating systems, ushered in the computer and internet revolution. By combining smaller parts that can be developed and used independently (programs), it suddenly became possible to build, manage, and gradually evolve ever more complex computing systems. Even today's modern mobile devices are effectively UNIX-based machines that run a lot of programs interacting with each other, and provide a terminal which looks very much like early UNIX terminals. In fact, the terminal is just another program.

The UNIX-style programs represent a great way to package software for usage on a local computer. The programs can be easily used stand-alone, but also in combination and in scripts in order to perform much more complex tasks than an individual program ever could, which in turn can be packaged as new programs.

The idea of Actors is to bring the benefits of UNIX-style programs from a local computer into a cloud environment, where programs run on multiple computers communicating over a network that is subject to latency and partitioning, where there is no global atomic filesystem, and where programs are invoked via API calls rather than system calls.

Each Actor should do just one thing and do it well. Actors can be used stand-alone, as well as combined or scripted into more complex systems, which in turn can become new Actors. Actors provide a simple user interface and documentation to help users interact with them.

UNIX programs vs. Actors

The following table shows equivalents of key concepts of UNIX programs and Actors.

| UNIX programs | Actors |
| --- | --- |
| Command-line options | Input object |
| Read stdin | No direct equivalent, you can read from a dataset specified in the input. |
| Write to stdout | Push results to dataset, set Actor status |
| Write to stderr | No direct equivalent, you can write errors to log, set error status message, or push failed dataset items into an "error" dataset. |
| File system | Key-value store |
| Process identifier (PID) | Actor run ID |
| Process exit code | Actor exit code |

Design principles

  • Each Actor should do just one thing, and do it well.
  • Optimize for the users of the Actors, help them understand what the Actor does, easily run it, and integrate.
  • Also optimize for interoperability, to make it ever easier to connect Actors with other systems. Expect that the objects you work with might contain additional, not-yet-known fields.
  • Keep the API as simple as possible and write great documentation, so that Actors can be built and used by >90% of software developers, even ones using no-code tools (yes, that's also software development!).

Relation to the Actor model

Note that Actors are only loosely related to the Actor model known from computer science. According to Wikipedia:

The Actor model in computer science is a mathematical model of concurrent computation that treats Actor as the universal primitive of concurrent computation. In response to a message it receives, an Actor can: make local decisions, create more Actors, send more messages, and determine how to respond to the next message received. Actors may modify their own private state, but can only affect each other indirectly through messaging (removing the need for lock-based synchronization).

While the theoretical Actor model is conceptually very similar to "our" Actor programming model, this similarity is rather coincidental. Our primary focus was always on practical software engineering utility, not an implementation of a formal mathematical model.

For example, our Actors do not provide any standard message passing mechanism, but they can communicate with each other directly via HTTP requests (see live-view web server), manipulate each other's operation via the Apify platform API (e.g. abort another Actor), or affect each other by sharing some internal state or storage. Actors do not have any formal restrictions and can access whichever external systems they want, thus going beyond the formal mathematical Actor model.

Why the name "Actor"

In movies and theater, an actor is someone who gets a script and plays a role according to that script. Our Actors also perform an act on someone's behalf, using a provided script. They work well with Puppeteers and Playwrights. And they are related to the Actor model known from computer science.

To make it clear Actors are not people, the letter "A" is capitalized.

Installation and setup

Below are steps to start building Actors in various languages and environments.

Running on the Apify platform

You can develop and run Actors in Apify Console without installing any software locally. Just create a free account, and start building Actors in an online IDE.

Node.js

The most complete implementation of the Actor system is provided by the Apify SDK for Node.js, via the apify NPM package. The package contains everything that you need to start building Actors locally. You can install it in your Node.js project by running:

$ npm install apify

Python

To build Actors in Python, simply install the Apify SDK for Python, via the apify PyPI package, into your project:

$ pip3 install apify

Command-line interface (CLI)

For local development of Actors and management of the Apify platform, it is handy to install the Apify CLI. You can install it via Homebrew:

$ brew install apify-cli

or via the apify-cli Node.js package:

$ npm install -g apify-cli

You can confirm the installation succeeded and log in to the Apify platform by running:

$ apify login

The Apify CLI provides two commands: apify and actor.

The apify command lets you interact with the Apify platform, for example run an Actor, push a deployment of an Actor to the cloud, or access storages. For details, see Local development.

The actor command is meant to be used from within an Actor at runtime, to implement the Actor's functionality in a shell script. For details, see Actorizing existing code.

To get help for a specific command, run:

$ apify help <command>
$ actor help <command>

Actor programming interface

The commands described in this section are expected to be called from within the context of a running Actor, either in a local environment or on the Apify platform.

The Actor runtime system passes the context via environment variables, such as APIFY_TOKEN or ACTOR_RUN_ID, which are used by the SDK or CLI to interact with the runtime.

Initialization

The SDKs provide convenience methods to initialize the Actor and handle its result. During initialization, the SDK loads environment variables, checks configuration, prepares to receive system events, and optionally purges previous state from local storage.

Node.js

In Node.js, the Actor is initialized by calling the init() method. It should be paired with an exit() method, which terminates the Actor. Use of exit() is not required, but recommended. For more information, go to Exit Actor.

import { Actor } from 'apify';

await Actor.init();

const input = await Actor.getInput();
console.log(input);

await Actor.exit();

An alternative way of initializing the Actor is with a main() function. This is useful in environments where the latest JavaScript syntax and top-level awaits are not supported. The main function is only syntactic sugar for init() and exit(). It calls init() before it executes its callback and exit() after the callback resolves.

import { Actor } from 'apify';

Actor.main(async () => {
    const input = await Actor.getInput();
    // ...
});

Python

import asyncio
from apify import Actor


async def main():
    async with Actor:
        input = await Actor.get_input()
        print(input)

asyncio.run(main())

CLI

No initialization is needed. The process exit terminates the Actor, and the process exit code determines whether it succeeded or failed.

$ actor set-status-message "My work is done, friend"
$ exit 0

UNIX equivalent

int main(int argc, char *argv[]) {
    ...
}

Get input

Get access to the Actor input object passed by the user. It is parsed from a JSON file, which is stored by the system in the Actor's default key-value store. Usually the file is called INPUT, but the exact key is defined in the ACTOR_INPUT_KEY environment variable.

The input is an object with properties. If the Actor defines the input schema, the input object is guaranteed to conform to it. For details, see Input.
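
As a rough illustration of the lookup described above, the following standard-library sketch resolves the record key from ACTOR_INPUT_KEY (falling back to INPUT) and parses the record as JSON. The `load_actor_input` helper and the `<key>.json` file layout in a local directory are assumptions made for this example; they are not part of the Actor specification or of any SDK.

```python
import json
import os
import tempfile
from pathlib import Path


def load_actor_input(store_dir):
    """Read the input record from a directory standing in for the default key-value store."""
    # The record key comes from the environment and usually equals "INPUT"
    key = os.environ.get("ACTOR_INPUT_KEY", "INPUT")
    # Assumption for this sketch: records are kept as "<key>.json" files
    input_path = Path(store_dir) / f"{key}.json"
    if not input_path.exists():
        return None
    return json.loads(input_path.read_text(encoding="utf-8"))


# Usage: write a sample input record, then read it back
with tempfile.TemporaryDirectory() as store_dir:
    key = os.environ.get("ACTOR_INPUT_KEY", "INPUT")
    sample = '{"option1": "aaa", "option2": 456}'
    (Path(store_dir) / f"{key}.json").write_text(sample, encoding="utf-8")
    actor_input = load_actor_input(store_dir)
```

In a real Actor, the SDK functions shown in this section do this work for you, so a helper like this is only useful for understanding where the input comes from.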

Node.js

const input = await Actor.getInput();
console.log(input);

// prints: { "option1": "aaa", "option2": 456 }

Python

input = await Actor.get_input()
print(input)

CLI

# Emits a JSON object, which can be parsed e.g. using the "jq" tool
$ actor get-input | jq

> { "option1": "aaa", "option2": 456 }

UNIX equivalent

$ command --option1=aaa --option2=bbb
int main(int argc, char *argv[]) {}

Key-value store access

Write and read arbitrary files using a storage called Key-value store. When an Actor starts, by default it is associated with a newly created key-value store, which only contains one file with the input of the Actor (see Get input).

The user can override this behavior and specify another key-value store or input key when running the Actor.

Node.js

// Save objects to the default key-value store
await Actor.setValue('my_state', { something: 123 }); // (stringified to JSON)
await Actor.setValue('screenshot.png', buffer, { contentType: 'image/png' });

// Get record from the default key-value store, automatically parsed from JSON
const value = await Actor.getValue('my_state');

// Access another key-value store by its name
const store = await Actor.openKeyValueStore('screenshots-store');
const imageBuffer = await store.getValue('screenshot.png');

Python

# Save object to store (stringified to JSON)
await Actor.set_value('my-state', {'something': 123})

# Save binary file to store with content type
await Actor.set_value('screenshot', buffer, content_type='image/png')

# Get object from store (automatically parsed from JSON)
state = await Actor.get_value('my-state')

UNIX

$ echo "hello world" > file.txt
$ cat file.txt

Push results to dataset

Larger results can be saved to an append-only object storage called Dataset. When an Actor starts, by default it is associated with a newly created empty default dataset. The Actor can create additional datasets or access existing datasets created by other Actors, and use those as needed.

Note that Datasets can optionally be equipped with a schema that ensures only certain kinds of objects are stored in them. See Dataset schema file for more details.

Node.js

// Append result object to the default dataset associated with the run
await Actor.pushData({
    someResult: 123,
});

// Append result object to a specific named dataset
const dataset = await Actor.openDataset('bob/poll-results-2019');
await dataset.pushData({ someResult: 123 });

Python

# Append result object to the default dataset associated with the run
await Actor.push_data({'some_result': 123})

# Append result object to a specific named dataset
dataset = await Actor.open_dataset('bob/poll-results-2019')
await dataset.push_data({'some_result': 123})

CLI

# Push data to default dataset, in JSON format
$ echo '{ "someResult": 123 }' | actor push-data --json
$ actor push-data --json='{ "someResult": 123 }'
$ actor push-data [email protected]

# Push data to default dataset, in text format
$ echo "someResult=123" | actor push-data
$ actor push-data someResult=123

# Push to a specific dataset in the cloud
$ actor push-data --dataset=bob/election-data someResult=123

# Push to dataset on local system
$ actor push-data --dataset=./my_dataset someResult=123

UNIX equivalent

printf("Hello world\tColumn 2\tColumn 3");

Exit Actor

When the main Actor process exits (i.e. the Docker container stops running), the Actor run is considered finished and the process exit code is used to determine whether the Actor has succeeded (exit code 0 leads to status SUCCEEDED) or failed (exit code not equal to 0 leads to status FAILED). In this case, the platform sets the status message to a default value like Actor exit with exit code 0, which is not very descriptive for the users.
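
The exit-code rule can be illustrated with a short sketch that launches a child process and maps its exit code to a run status. The `run_status_from_exit_code` helper is a hypothetical name used only here, and the child Python process merely stands in for an Actor's main process; on the platform, this mapping is performed by the system rather than by your code.

```python
import subprocess
import sys


def run_status_from_exit_code(exit_code):
    """Map a process exit code to a run status: 0 means success, anything else failure."""
    return "SUCCEEDED" if exit_code == 0 else "FAILED"


# A child process stands in for the Actor's main process in this sketch
ok_run = subprocess.run([sys.executable, "-c", "import sys; sys.exit(0)"])
bad_run = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])

ok_status = run_status_from_exit_code(ok_run.returncode)
bad_status = run_status_from_exit_code(bad_run.returncode)
```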

An alternative and preferred way to exit an Actor is using the exit function in the SDK, as shown below. This has two advantages:

  • You can provide a custom status message for users to tell them what the Actor achieved. On error, try to explain to users what happened and, most importantly, how they can fix the error. This greatly improves user experience.
  • The system emits the exit event, which can be listened to and used by various components of the Actor to perform a cleanup, persist state, etc. Note that the caller of exit can specify how long the system should wait for all exit event handlers to complete before closing the process, using the timeoutSecs option. For details, see System Events.

Node.js

// Actor will finish with 'SUCCEEDED' status
await Actor.exit('Succeeded, crawled 50 pages');

// Exit right away without calling `exit` handlers at all
await Actor.exit('Done right now', { timeoutSecs: 0 });

// Actor will finish with 'FAILED' status
await Actor.exit('Could not finish the crawl, try increasing memory', { exitCode: 1 });

// ... or nicer way using this syntactic sugar:
await Actor.fail('Could not finish the crawl, try increasing memory');

// Register a handler to be called on exit.
// Note that the handler has `timeoutSecs` to finish its job
Actor.on('exit', ({ statusMessage, exitCode, timeoutSecs }) => {
    // Perform cleanup...
});

Python

# Actor will finish in 'SUCCEEDED' state
await Actor.exit('Generated 14 screenshots')

# Actor will finish in 'FAILED' state
await Actor.exit('Could not finish the crawl, try increasing memory', exit_code=1)
# ... or nicer way using this syntactic sugar:
await Actor.fail('Could not finish the crawl, try increasing memory')

CLI

# Actor will finish in 'SUCCEEDED' state
$ actor exit
$ actor exit --message "Email sent"

# Actor will finish in 'FAILED' state
$ actor exit --code=1 --message "Couldn't fetch the URL"

UNIX equivalent

exit(1);

Environment variables

Actors have access to standard process environment variables. The Apify platform uses environment variables prefixed with ACTOR_ to pass the Actors information about the execution context.

| Environment variable | Description |
| --- | --- |
| ACTOR_ID | ID of the Actor. |
| ACTOR_RUN_ID | ID of the Actor run. |
| ACTOR_BUILD_ID | ID of the Actor build. |
| ACTOR_BUILD_NUMBER | A string representing the version of the current Actor build. |
| ACTOR_TASK_ID | ID of the saved Actor task. |
| ACTOR_DEFAULT_KEY_VALUE_STORE_ID | ID of the key-value store where the Actor's input and output data are stored. |
| ACTOR_DEFAULT_DATASET_ID | ID of the dataset where you can push the data. |
| ACTOR_DEFAULT_REQUEST_QUEUE_ID | ID of the request queue that stores and handles requests that you enqueue. |
| ACTOR_INPUT_KEY | The key of the record in the default key-value store that holds the Actor input. Typically it's INPUT, but it might be something else. |
| ACTOR_MEMORY_MBYTES | Indicates the size of memory allocated for the Actor run, in megabytes (1,000,000 bytes). It can be used by Actors to optimize their memory usage. |
| ACTOR_STARTED_AT | Date when the Actor was started, in ISO 8601 format. For example, 2022-01-02T03:04:05.678. |
| ACTOR_TIMEOUT_AT | Date when the Actor will time out, in ISO 8601 format. |
| ACTOR_EVENTS_WEBSOCKET_URL | Websocket URL where the Actor may listen for events from the Actor platform. See System events for details. |
| ACTOR_WEB_SERVER_PORT | TCP port on which the Actor can start an HTTP server to receive messages from the outside world, either as Live view web server or in the Standby mode. |
| ACTOR_WEB_SERVER_URL | A unique hard-to-guess URL under which the current Actor run's web server is accessible from the outside world. See Live view web server section for details. |
| ACTOR_STANDBY_URL | A general public URL under which the Actor can be started and its web server accessed in the Standby mode. |
| ACTOR_MAX_PAID_DATASET_ITEMS | A maximum number of results that will be charged to the user using a pay-per-result Actor. |
| ACTOR_MAX_TOTAL_CHARGE_USD | The maximum amount of money in USD an Actor can charge its user. See Charging money for details. |
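
Since all of these values arrive as strings, an Actor written without an SDK has to convert them itself. The sketch below reads three of the variables listed above into typed fields; the `RunContext` dataclass and the `read_run_context` and `parse_iso8601` helpers are hypothetical names introduced for this example, and the handling of a trailing "Z" is an assumption about how timestamps might be formatted.

```python
import os
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RunContext:
    run_id: Optional[str]
    memory_mbytes: Optional[int]
    started_at: Optional[datetime]


def parse_iso8601(value):
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset first
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def read_run_context(env=os.environ):
    """Convert selected ACTOR_ variables from strings into typed values."""
    memory = env.get("ACTOR_MEMORY_MBYTES")
    started = env.get("ACTOR_STARTED_AT")
    return RunContext(
        run_id=env.get("ACTOR_RUN_ID"),
        memory_mbytes=int(memory) if memory else None,
        started_at=parse_iso8601(started) if started else None,
    )


# Usage with explicit sample values instead of the real process environment
context = read_run_context({
    "ACTOR_RUN_ID": "example-run-id",
    "ACTOR_MEMORY_MBYTES": "1024",
    "ACTOR_STARTED_AT": "2022-01-02T03:04:05.678",
})
```

Passing the environment as a parameter keeps the helper easy to exercise locally, where the ACTOR_ variables are typically not set.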

The Actor developer can also define custom environment variables that are then passed to the Actor process, either in a local development environment or on the Apify platform. These variables are defined in the Actor file at .actor/actor.json using the environmentVariables directive, or manually in the user interface in Apify Console.

The environment variables can be set as secure in order to protect sensitive data such as API keys or passwords. The value of a secure environment variable is encrypted and can only be retrieved by the Actors during their run, but not outside the runs. Furthermore, values of secure environment variables are omitted from the log.

Node.js

For convenience, rather than using environment variables directly, we provide a Configuration class that allows reading and updating the Actor configuration.

const token = Actor.config.get('token');

// use different token
Actor.config.set('token', 's0m3n3wt0k3n');

CLI

$ echo "$ACTOR_RUN_ID started at $ACTOR_STARTED_AT"

UNIX equivalent

$ echo $ACTOR_RUN_ID

Actor status

Each Actor run has a status (the status field), which indicates its stage in the Actor's lifecycle. The status can be one of the following values:

| Status | Type | Description |
| --- | --- | --- |
| READY | initial | Started but not allocated to any worker yet |
| RUNNING | transitional | Executing on a worker |
| SUCCEEDED | terminal | Finished successfully |
| FAILED | terminal | Run failed |
| TIMING-OUT | transitional | Timing out now |
| TIMED-OUT | terminal | Timed out |
| ABORTING | transitional | Being aborted by a user or system |
| ABORTED | terminal | Aborted by a user or system |
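
Code that waits for a run to finish usually only needs to know whether a status is terminal. A minimal sketch of that check, using the statuses and types from the table above, might look as follows; the `RunStatus` enum and `is_terminal` helper are hypothetical names for this example and not part of any SDK.

```python
from enum import Enum


class RunStatus(Enum):
    READY = "READY"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    TIMING_OUT = "TIMING-OUT"
    TIMED_OUT = "TIMED-OUT"
    ABORTING = "ABORTING"
    ABORTED = "ABORTED"


# Statuses marked "terminal" in the table: the run will not change state again
TERMINAL_STATUSES = frozenset({
    RunStatus.SUCCEEDED,
    RunStatus.FAILED,
    RunStatus.TIMED_OUT,
    RunStatus.ABORTED,
})


def is_terminal(status):
    """Return True if the given status string (or RunStatus member) is terminal."""
    # RunStatus(...) raises ValueError for strings that are not known statuses
    return RunStatus(status) in TERMINAL_STATUSES
```

For example, a polling loop could keep fetching the run's status field until `is_terminal` returns true.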

Additionally, the Actor run has a status message (the statusMessage field), which contains text informing users what the Actor is currently doing, and thus greatly improves their user experience.

When an Actor exits, the status message is either automatically set to some default text (e.g. "Actor finished with exit code 1"), or to a custom message; see Exit Actor for details.

While the Actor is running, it should periodically update the status message as follows, to keep users informed and happy. The function can be called as often as necessary, because the SDK only invokes the API if the status message has changed. This simplifies its usage.
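
The "only call the API on change" behavior can be pictured with a small sketch. This is not the SDK's actual implementation; `StatusMessageReporter` is a hypothetical class, and its `send` callback stands in for whatever API call would deliver the message.

```python
class StatusMessageReporter:
    """Forward a status message only when it differs from the last one sent."""

    def __init__(self, send):
        # `send` is a hypothetical callback standing in for the API call
        self._send = send
        self._last_message = None

    def set_status_message(self, message):
        if message == self._last_message:
            return False  # Unchanged, so no API call is made
        self._send(message)
        self._last_message = message
        return True


# Usage: collect the "API calls" in a list to see which messages get through
sent = []
reporter = StatusMessageReporter(sent.append)
for crawled in (45, 45, 46):
    reporter.set_status_message(f"Crawled {crawled} of 100 pages")
```

With this pattern, calling the function inside a tight crawl loop stays cheap, because repeated identical messages never reach the network.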

Node.js

awaitActor.setStatusMessage('Crawled 45 of 100 pages');

// Setting status message to other Actor externally is also possible
awaitActor.setStatusMessage('Everyone is well',{actorRunId:123});

Python

await Actor.set_status_message('Crawled 45 of 100 pages')

CLI

$ actor set-status-message "Crawled 45 of 100 pages"
$ actor set-status-message --run=[RUN_ID] --token=X "Crawled 45 of 100 pages"
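
The "call it as often as you like" behavior can be pictured as a thin wrapper that remembers the last message and skips redundant API calls. This is a minimal sketch of that idea, not the SDK's actual implementation; StatusReporter and the send callable are hypothetical.

```python
class StatusReporter:
    """Forwards a status message only when it differs from the previous one."""

    def __init__(self, send):
        # `send` stands in for whatever performs the real API request (assumption).
        self._send = send
        self._last_message = None

    def set_status_message(self, message):
        """Return True if the message was sent, False if it was a duplicate."""
        if message == self._last_message:
            return False
        self._send(message)
        self._last_message = message
        return True


if __name__ == "__main__":
    reporter = StatusReporter(print)
    for done in (45, 45, 46):
        reporter.set_status_message(f"Crawled {done} of 100 pages")
```

With such a wrapper, a crawler can update the message inside its main loop without worrying about API traffic.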

Convention: The end user of an Actor should never need to look into the log to understand what happened, e.g. why the Actor failed. All necessary information must be set by the Actor in the status message.

System events

Actors are notified by the system about various events such as a migration to another server, an abort operation triggered by another Actor, or the CPU being overloaded.

Currently, the system sends the following events:

| Event name | Payload | Description |
|---|---|---|
| cpuInfo | { isCpuOverloaded: Boolean } | The event is emitted approximately every second and indicates whether the Actor is using the maximum of available CPU resources. If that's the case, the Actor should not add more workload. For example, this event is used by the AutoscaledPool class. |
| migrating | N/A | Emitted when the Actor running on the Apify platform is going to be migrated to another worker server soon. You can use it to persist the state of the Actor and abort the run, to speed up migration. See Migration to another server. |
| aborting | N/A | When a user aborts an Actor run on the Apify platform, they can choose to abort gracefully to allow the Actor some time before getting killed. This graceful abort emits the aborting event, which the SDK uses to gracefully stop running crawls and which you can use to do your own cleanup as well. |
| persistState | { isMigrating: Boolean } | Emitted in regular intervals (by default 60 seconds) to notify all components of Apify SDK that it is time to persist their state, in order to avoid repeating all work when the Actor restarts. This event is automatically emitted together with the migrating event, in which case the isMigrating flag is set to true. Otherwise the flag is false. Note that the persistState event is provided merely for user convenience; you can achieve the same effect using setInterval() and listening for the migrating event. |

In the future, the event mechanism might be extended to custom events and messages enabling communication between Actors.

Under the hood, Actors receive the system events by connecting to a web socket address specified by the ACTOR_EVENTS_WEBSOCKET_URL environment variable. The system sends messages in JSON format in the following structure:

{
    // Event name
    name: String,

    // Time when the event was created, in ISO format
    createdAt: String,

    // Optional object with payload
    data: Object,
}

Note that some events (e.g. persistState) are not sent by the system via the web socket, but generated virtually on the Actor SDK level.
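
To make the message structure concrete, here is a small sketch that parses one such JSON message into a typed object. It covers only the three fields described above; the SystemEvent class and parse_event function are hypothetical names, and the sample message in the usage section is made up.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SystemEvent:
    name: str
    created_at: datetime
    data: Optional[dict]


def parse_event(raw_message):
    """Parse a JSON event message with name, createdAt and optional data fields."""
    message = json.loads(raw_message)
    # A trailing "Z" is rewritten so that datetime.fromisoformat() accepts it
    # on Python versions older than 3.11.
    created_at = datetime.fromisoformat(message["createdAt"].replace("Z", "+00:00"))
    return SystemEvent(name=message["name"], created_at=created_at, data=message.get("data"))


if __name__ == "__main__":
    sample = '{"name": "cpuInfo", "createdAt": "2024-09-01T12:00:00.000Z", "data": {"isCpuOverloaded": true}}'
    print(parse_event(sample))
```

The web socket connection itself is left out on purpose; any client library that yields text frames could feed parse_event.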

Node.js

// Add event handler
const handler = (data) => {
    if (data.isCpuOverloaded) console.log('Oh no, we need to slow down!');
};
Actor.on('systemInfo', handler);

// Remove all handlers for a specific event
Actor.off('systemInfo');

// Remove a specific event handler
Actor.off('systemInfo', handler);

Python

from apify import Actor, Event

# Add event handler
async def handler(data):
    if data.cpu_info.is_overloaded:
        print('Oh no, we need to slow down!')

Actor.on(Event.SYSTEM_INFO, handler)

# Remove all handlers for a specific event
Actor.off(Event.SYSTEM_INFO)

# Remove a specific event handler
Actor.off(Event.SYSTEM_INFO, handler)

UNIX equivalent

signal(SIGINT, handle_sigint);
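
The on/off semantics used in the examples above (remove one handler, or all handlers for an event) can be illustrated with a tiny in-process emitter. This is a simplified, synchronous sketch for illustration only; the EventEmitter class is hypothetical and is not how the SDK is implemented.

```python
from collections import defaultdict


class EventEmitter:
    """Minimal synchronous emitter with on(), off() and emit()."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, name, handler):
        self._handlers[name].append(handler)

    def off(self, name, handler=None):
        if handler is None:
            # No handler given: remove all handlers for this event.
            self._handlers.pop(name, None)
        elif handler in self._handlers.get(name, []):
            self._handlers[name].remove(handler)

    def emit(self, name, data=None):
        # Iterate over a copy so that handlers may call off() while being invoked.
        for handler in list(self._handlers.get(name, [])):
            handler(data)


if __name__ == "__main__":
    emitter = EventEmitter()
    emitter.on("systemInfo", lambda data: print("overloaded:", data["isCpuOverloaded"]))
    emitter.emit("systemInfo", {"isCpuOverloaded": True})
```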

Get memory information

Get information about the total and available memory of the Actor’s container or local system. For example, this is useful to auto-scale a pool of workers used for crawling large websites.

Node.js

const memoryInfo = await Actor.getMemoryInfo();

UNIX equivalent

# Print memory usage of programs
$ ps -a

Start another Actor

An Actor can start other Actors, if it has permission.

It can override the default dataset or key-value store, e.g. to forward the data to another named dataset that will be consumed by the other Actor.

The call operation waits for the other Actor to finish, while the start operation returns immediately.

Node.js

// Start Actor and return a Run object
const run = await Actor.start(
    'apify/google-search-scraper', // name of the Actor to start
    { queries: 'test' }, // input of the Actor
    { memory: 2048 }, // run configuration
);

// Start Actor and wait for it to finish
const run2 = await Actor.call(
    'apify/google-search-scraper',
    { queries: 'test' },
    { memory: 2048 },
);

CLI

# On stdout, the commands emit Actor run object (in text or JSON format),
# we shouldn't wait for finish, for that it should be e.g. "execute"
$ apify call apify/google-search-scraper queries='test\ntest2' \
  countryCode='US'
$ apify call --json apify/google-search-scraper '{ "queries": }'
$ apify call [email protected] --json apify/google-search-scraper
$ apify call --memory=1024 --build=beta apify/google-search-scraper
$ apify call --output-record-key=SCREENSHOT apify/google-search-scraper

# Pass input from stdin
$ cat input.json | actor call apify/google-search-scraper --json

# Call local actor during development
$ apify call file:../some-dir someInput='xxx'

Slack

It will also be possible to run Actors from the Slack app. The following command starts the Actor, and then prints the messages to a Slack channel.

/apify start bob/google-search-scraper startUrl=afff

API

[POST] https://api.apify /v2/actors/apify~google-search-scraper/run

[POST|GET] https://api.apify /v2/actors/apify~google-search-scraper/run-sync?
token=rWLaYmvZeK55uatRrZib4xbZs&
outputRecordKey=OUTPUT&
returnDataset=true

UNIX equivalent

# Run a program in the background
$ command <arg1>, <arg2>, … &

// Spawn another process
posix_spawn();

Metamorph

🪄 This is the most magical Actor operation. It replaces the running Actor's Docker image with another Actor, similarly to the UNIX exec command. It is used for building new Actors on top of existing ones. You simply define the input schema and write a README for a specific use case, and then delegate the work to another Actor.

The target Actor inherits the default storages used by the calling Actor. The target Actor input is stored to the default key-value store, under a key such as INPUT-2 (the actual key is passed via the ACTOR_INPUT_KEY environment variable). Internally, the target Actor can recursively metamorph into another Actor.

An Actor can metamorph only into Actors whose output schema is compatible with that of the main Actor, in order to ensure logical and consistent outcomes for users. If the output schema of the target Actor is not compatible, the system should throw an error.

Node.js

await Actor.metamorph(
    'bob/web-scraper',
    { startUrls: ["https:// example"] },
    { memoryMbytes: 4096 },
);

CLI

$ actor metamorph bob/web-scraper startUrls=http://example
$ actor metamorph [email protected] --json --memory=4096 \
bob/web-scraper

UNIX equivalent

$ exec /bin/bash

Attach webhook to an Actor run

Run another Actor or an external HTTP API endpoint after the Actor run finishes or fails.

Node.js

await Actor.addWebhook({
    eventType: ['ACTOR.RUN.SUCCEEDED', 'ACTOR.RUN.FAILED'],
    requestUrl: 'http://api.example?something=123',
    payloadTemplate: `{
        "userId": {{userId}},
        "createdAt": {{createdAt}},
        "eventType": {{eventType}},
        "eventData": {{eventData}},
        "resource": {{resource}}
    }`,
});
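
The payload template above is a JSON-like string with {{variable}} placeholders. One plausible way to expand such a template is to replace each placeholder with the JSON encoding of its value, as sketched below. This interpretation of the template syntax is an assumption, and render_payload is a hypothetical helper, not a platform API.

```python
import json
import re

# Matches placeholders such as {{userId}} or {{ eventData }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_.]*)\s*\}\}")


def render_payload(template, variables):
    """Replace every {{name}} placeholder with the JSON-encoded value of that variable."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Unknown template variable: {name}")
        return json.dumps(variables[name])

    return _PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = '{"userId": {{userId}}, "eventType": {{eventType}}}'
    print(render_payload(template, {"userId": "u1", "eventType": "ACTOR.RUN.SUCCEEDED"}))
```

Encoding values as JSON means strings get quoted and objects get embedded as-is, so the rendered payload stays valid JSON as long as the template is.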

CLI

$ actor add-webhook \
  --event-types=ACTOR.RUN.SUCCEEDED,ACTOR.RUN.FAILED \
  --request-url=https://api.example \
  --payload-template='{ "test": 123 }'

$ actor add-webhook --event-types=ACTOR.RUN.SUCCEEDED \
  --request-actor=apify/send-mail \
  --memory=4096 --build=beta \
  [email protected]

# Or maybe have a simpler API for self-actor?
$ actor add-webhook --event-types=ACTOR.RUN.SUCCEEDED --request-actor=apify/send-mail

UNIX equivalent

# Execute commands sequentially, based on their status
$ command1; command2     # (command separator)
$ command1 && command2   # ("andf" symbol)
$ command1 || command2   # ("orf" symbol)

Abort another Actor

Abort itself or another Actor running on the Apify platform. Aborting an Actor changes its status to ABORTED.

Node.js

await Actor.abort({ statusMessage: 'Your job is done, friend.', actorRunId: 'RUN_ID' });

CLI

$ actor abort --run-id RUN_ID

UNIX equivalent

# Terminate a program
$ kill <PID>

Reboot the Actor

Sometimes, an Actor might get into some error state from which it's not safe or possible to recover, e.g. an assertion error or a web browser crash. Rather than crashing and potentially failing the user job, the Actor can reboot its own Docker container and continue work from its previously persisted state.

Normally, if an Actor crashes, the system restarts its container too, but if that happens too often in a short period of time, the system might completely abort the Actor run. The reboot operation can be used by the Actor developer to indicate this is a controlled operation, not considered by the system as a crash.

Node.js

await Actor.reboot();

Python

await Actor.reboot()

CLI

$ actor reboot

Actor web server

An Actor can launch an HTTP web server that is exposed to the outer world to handle requests. This enables Actors to provide a custom HTTP API to integrate with other systems, to provide a web application for human users, to show Actor run details, diagnostics, charts, or to run an arbitrary web app.

The port on which the Actor can launch the web server is specified by the ACTOR_WEB_SERVER_PORT environment variable.

Once the web server is started, it is exposed to the public internet on a live view URL identified by the ACTOR_WEB_SERVER_URL environment variable, for example:

https://hard-to-guess-identifier.runs.apify.net

The live view URL has a unique hostname, which is practically impossible to guess. This lets you keep the web server hidden from the public, yet accessible from the external internet to the parties with whom you share the URL.

Node.js

const express = require('express');
const app = express();

app.get('/', (req, res) => {
    res.send('Hello World!');
});

app.listen(process.env.ACTOR_WEB_SERVER_PORT, () => {
    console.log(`Example live view web server running at ${process.env.ACTOR_WEB_SERVER_URL}`);
});

Standby mode

The Standby mode lets Actors run in the background and respond to incoming HTTP requests, like a web or API server.

Starting an Actor run requires launching a Docker container, and so it comes with a performance penalty, sometimes many seconds for large images. For batch jobs this penalty is negligible, but for quick request-response interactions it becomes inefficient. The Standby mode lets developers run Actors as web servers to run jobs that require quick response times.

To use the Standby mode, start an HTTP web server at the ACTOR_WEB_SERVER_PORT TCP port, and process HTTP requests.

The Actor system publishes a Standby Actor's web server at the URL reported in the ACTOR_STANDBY_URL environment variable, and will automatically start or abort an Actor run as needed by the volume of HTTP requests or system load. The external Standby public URL can look like this:

https://bob--screenshot-taker.apify.actor

Unlike the live view URL reported in the ACTOR_WEB_SERVER_URL environment variable, the Standby URL is the same for all runs of the Actor, and it's intended to be publicly known. The Actor system can perform authentication of the requests going to the Standby URL using API tokens.

Currently, the specific Standby mode settings, authentication options, or OpenAPI schema are not part of this Actor specification, but they might be introduced in the future as new settings in the actor.json file.

Migration to another server

The Actors can be migrated to another host server from time to time, especially the long-running ones. When the migration is imminent, the system sends the Actor the migrating system event, so that it can persist its state to storages. All executed writes to the default Actor storage are guaranteed to be persisted before the migration. After the migration, the Actor is restarted on a new host. It can restore its custom state from the storages again.
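
The persist-and-restore cycle around a migration can be sketched as follows. A plain dict stands in for the default key-value store, and the CrawlState class and the CRAWL_STATE key are hypothetical names used only for illustration.

```python
import json


class CrawlState:
    """Keeps a progress counter that survives a restart via a key-value store."""

    STATE_KEY = "CRAWL_STATE"  # hypothetical record key

    def __init__(self, store):
        # `store` is a dict-like stand-in for the default key-value store (assumption).
        self._store = store
        saved = store.get(self.STATE_KEY)
        # Restore the previous progress if an earlier run on another host saved it.
        self.processed = json.loads(saved)["processed"] if saved else 0

    def on_migrating(self, _event_data=None):
        """Handler for the migrating event: write the current state to the store."""
        self._store[self.STATE_KEY] = json.dumps({"processed": self.processed})


if __name__ == "__main__":
    store = {}
    state = CrawlState(store)
    state.processed = 45
    state.on_migrating()              # the system announced a migration
    restarted = CrawlState(store)     # the Actor starts again on the new host
    print(restarted.processed)
```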

Charging money

To run an Actor on the Apify platform or another cloud platform, a user typically needs to pay to cover the computing costs. Additionally, the platforms are free to introduce other monetization mechanisms, such as charging the users a fixed monthly fee for "renting" the Actor, or a variable fee for the number of results produced by the Actor. These charging mechanisms are beyond the scope of this whitepaper.

On top of these external monetization systems, Actors provide a built-in monetization system that enables developers to charge users variable amounts per event, e.g. based on the number of returned results, complexity of the input, or the cost of external APIs used internally by the Actor.

The Actor can dynamically charge the current user a specific amount of money by calling the charge function. Users of Actors can limit the maximum amount to be charged by the Actor using the maxTotalChargeUsd run option, which is then passed to the Actor using the ACTOR_MAX_TOTAL_CHARGE_USD environment variable. The Actor can call the charge function as many times as necessary, but once the total sum of charged credits would exceed this maximum limit, the invocation of the function throws an error.

When a paid Actor subsequently starts another paid Actor, the charges performed by the subsequent Actors are taken from the calling Actor's allowance. This enables an Actor economy, where Actors hierarchically pay other Actors or external APIs to perform parts of the job.

Node.js

Charge the current user of the Actor a specific amount of USD:

const chargeInfo = await Actor.charge({
    eventName: 'gpt-4o-token',
    count: 1000,
    chargePerEventUsd: 0.0001,
});

Specify the maximum amount you're willing to pay when starting an Actor:

const run = await Actor.call(
    'bob/analyse-images',
    { imageUrls: ['https:// example /image.png'] },
    {
        // By default this is 0, hence Actors cannot charge users unless they explicitly allow that.
        maxTotalChargeUsd: 5,
    },
);
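
The limit semantics described above (a charge fails once the running total would exceed the maximum) can be modeled with a small tracker. This is a sketch of the bookkeeping only, not the platform's billing code; ChargeTracker and ChargeLimitExceeded are hypothetical names, and returning the running total is an arbitrary choice for the example.

```python
from decimal import Decimal


class ChargeLimitExceeded(Exception):
    """Raised when a charge would push the total above the allowed maximum."""


class ChargeTracker:
    def __init__(self, max_total_charge_usd=0):
        # Decimal avoids floating-point drift when summing many small charges.
        self.max_total = Decimal(str(max_total_charge_usd))
        self.charged = Decimal("0")

    def charge(self, event_name, count, charge_per_event_usd):
        """Record a charge and return the new running total in USD."""
        amount = Decimal(str(charge_per_event_usd)) * count
        if self.charged + amount > self.max_total:
            raise ChargeLimitExceeded(
                f"Charging {amount} USD for '{event_name}' would exceed the limit of {self.max_total} USD"
            )
        self.charged += amount
        return self.charged


if __name__ == "__main__":
    tracker = ChargeTracker(max_total_charge_usd=5)
    print(tracker.charge("gpt-4o-token", count=1000, charge_per_event_usd=0.0001))
```

With the default limit of 0, the first non-zero charge already raises, which mirrors the "Actors cannot charge users unless they explicitly allow that" rule.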

Rules for building Actors with variable charging

If your Actor is charging users, you need to make sure at the earliest time possible that the Actor is being run with sufficient credits with respect to its input. If the maximum credits specified by the ACTOR_MAX_TOTAL_CHARGE_USD environment variable are not sufficient for the Actor's operation with respect to the input (e.g. the user is requesting too many results for too little money), the Actor must fail immediately with an explanatory error status message for the user, and must not charge the user anything.

You also must charge the users only after you have incurred the costs, not before. If the Actor fails in the middle or is aborted, the users only need to be charged for results they actually received. Nothing will make users of your Actors angrier than charging them for something they didn't receive.
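
The first rule amounts to a pre-flight check before any work is done. A minimal sketch, assuming a simple per-result price: the check_budget helper, its parameters, and the wording of the message are illustrative assumptions.

```python
from decimal import Decimal


def check_budget(requested_results, price_per_result_usd, max_total_charge_usd):
    """Return None if the budget covers the request, else an explanatory status message."""
    required = Decimal(str(price_per_result_usd)) * requested_results
    available = Decimal(str(max_total_charge_usd))
    if required > available:
        return (
            f"Requested {requested_results} results would cost {required} USD, "
            f"but this run allows at most {available} USD. "
            "Lower the number of results or raise the maximum charge."
        )
    return None


if __name__ == "__main__":
    # In a real Actor, the limit would come from ACTOR_MAX_TOTAL_CHARGE_USD.
    message = check_budget(requested_results=1000, price_per_result_usd=0.01, max_total_charge_usd=5)
    print(message or "Budget is sufficient")
```

If a message is returned, the Actor would set it as the status message and fail before charging anything.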

Actor definition files

The Actor system uses several special files that define Actor metadata, documentation, instructions on how to build and run it, input and output schema, etc.

These files MUST be stored in the .actor directory placed in the Actor's top-level directory. The entire .actor directory should be added to the source control. The only required files are the Actor file and the Dockerfile; all other files are optional.

The Actor definition files are used by the CLI (e.g. by the apify push and apify run commands), as well as when building Actors on the Apify platform. The motivation to place the files into a separate directory is to keep the source code repository tidy and to prevent interactions with other source files, in particular when creating an Actor from pre-existing software repositories.

Actor file

This is the main definition file of the Actor, and it always must be present at .actor/actor.json. This file has JSON format and contains a single object, whose properties define the main features of the Actor and link to all other necessary files.

For details, see the Actor file specification page.

Example Actor file at .actor/actor.json

{
    "actorSpecification": 1,
    "name": "screenshot-taker",
    "title": "Screenshot Taker",
    "description": "Take a screenshot of any URL",
    "version": "0.0",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}
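
To show how the Actor file links to the other definition files, the sketch below resolves the referenced paths. It assumes that relative paths are interpreted relative to the directory containing actor.json, which matches the typical locations mentioned in this document but is not stated as a rule here; resolve_definition_files is a hypothetical helper.

```python
import json
from pathlib import PurePosixPath


def resolve_definition_files(actor_json_text, actor_dir=".actor"):
    """Map the file-reference properties of an Actor file to repository paths."""
    definition = json.loads(actor_json_text)
    base = PurePosixPath(actor_dir)
    resolved = {}
    # Properties that this document describes as references to other files.
    for key in ("input", "output", "dockerfile", "readme"):
        value = definition.get(key)
        if isinstance(value, str):
            # Assumption: paths are relative to the directory holding actor.json.
            resolved[key] = str(base / value)
    return resolved


if __name__ == "__main__":
    example = '{"name": "screenshot-taker", "input": "./input_schema.json", "dockerfile": "./Dockerfile"}'
    print(resolve_definition_files(example))
```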

Dockerfile

This file contains instructions for the system how to build the Actor's Docker image and how to run it. Actors are started by running their Docker image, both locally using the apify run command as well as on the Apify platform.

The Dockerfile is referenced from the Actor file using the dockerfile directive, and is typically stored at .actor/Dockerfile.

Note that paths in Dockerfile are always specified relative to the Dockerfile's location. Learn more about Dockerfiles in the official Docker reference.

Example Dockerfile of an Actor

# Specify the base Docker image. You can read more about
# the available images at https://crawlee.dev/docs/guides/docker-images
# You can also use any other image from Docker Hub.
FROM apify/actor-node-playwright-chrome:22-1.46.0 AS builder

# Copy just package.json and package-lock.json
# to speed up the build using Docker layer cache.
COPY --chown=myuser package*.json ./

# Install all dependencies. Don't audit to speed up the installation.
RUN npm install --include=dev --audit=false

# Next, copy the source files using the user set
# in the base image.
COPY --chown=myuser . ./

# Install all dependencies and build the project.
# Don't audit to speed up the installation.
RUN npm run build

# Create final image
FROM apify/actor-node-playwright-firefox:22-1.46.0

# Copy just package.json and package-lock.json
# to speed up the build using Docker layer cache.
COPY --chown=myuser package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && (npm list --omit=dev --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version \
    && rm -r ~/.npm

# Install all required Playwright dependencies for Firefox
RUN npx playwright install firefox

# Copy built JS files from builder image
COPY --from=builder --chown=myuser /home/myuser/dist ./dist

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, quick build will be really fast
# for most source file changes.
COPY --chown=myuser . ./

# Run the image. If you know you won't need headful browsers,
# you can remove the XVFB start script for a micro perf gain.
CMD ./start_xvfb_and_run_cmd.sh && ./run_protected.sh npm run start:prod --silent

README

The README file contains Actor documentation written in Markdown. It should contain a clear explanation of what the Actor does and how to use it. The README file is used to generate the Actor's public web page on Apify, among other things.

The README file is referenced from the Actor file using the readme property, and is typically stored at .actor/README.md.

Good documentation makes good Actors. Learn more about how to write great READMEs for SEO.

Input schema file

Actors accept an input JSON object on start, whose schema can be defined by the input schema file. This file is referenced in the Actor file (.actor/actor.json) as the input property. It is a standard JSON Schema file with our extensions, and it is typically stored at .actor/input_schema.json.

The input schema file defines properties accepted by Actor on input. It is used by the system to:

  • Validate the passed input JSON object on Actor run, so that Actors don't need to perform input validation and error handling in their code.
  • Render user interface for Actors to make it easy for users to run and test them manually.
  • Generate Actor API documentation and integration code examples on the web or in CLI, making Actors easy to integrate for users.
  • Simplify integration of Actors into automation workflows such as Zapier or Make, by providing smart connectors that smartly pre-populate and link Actor input properties.

For details, see Actor input schema file specification.

This is an example of the input schema file for the bob/screenshot-taker Actor:

{
    "actorInputSchemaVersion": 1,
    "title": "Input schema for Screenshot Taker Actor",
    "description": "Enter a web page URL and it will take its screenshot with a specific width",
    "type": "object",
    "properties": {
        "url": {
            "title": "URL",
            "type": "string",
            "editor": "textfield",
            "description": "URL of the webpage"
        },
        "width": {
            "title": "Viewport width",
            "type": "integer",
            "description": "Width of the browser window.",
            "default": 1200,
            "minimum": 1,
            "unit": "pixels"
        }
    },
    "required": [
        "url"
    ]
}
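
To illustrate what validating the input against such a schema involves, here is a deliberately small sketch that handles only the required, type, default, and minimum keywords used in the example above. It is not the platform's validator; prepare_input is a hypothetical helper, and applying defaults during validation is an assumption.

```python
_PYTHON_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict, "array": list}


def prepare_input(schema, raw_input):
    """Validate raw_input against a small subset of the input schema and fill defaults."""
    errors = []
    result = dict(raw_input)

    for name in schema.get("required", []):
        if name not in result:
            errors.append(f"missing required field '{name}'")

    for name, spec in schema.get("properties", {}).items():
        if name not in result:
            if "default" in spec:
                result[name] = spec["default"]
            continue
        value = result[name]
        expected = _PYTHON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so it is rejected explicitly for integers.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if expected is not None and (not isinstance(value, expected) or is_bool_as_int):
            errors.append(f"field '{name}' must be of type {spec['type']}")
        elif "minimum" in spec and isinstance(value, (int, float)) and value < spec["minimum"]:
            errors.append(f"field '{name}' must be >= {spec['minimum']}")

    if errors:
        raise ValueError("; ".join(errors))
    return result


if __name__ == "__main__":
    schema = {
        "properties": {
            "url": {"type": "string"},
            "width": {"type": "integer", "default": 1200, "minimum": 1},
        },
        "required": ["url"],
    }
    print(prepare_input(schema, {"url": "https://example.com"}))
```

Because this check happens before the Actor code runs, the Actor itself can assume that url is present and that width is a positive integer.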

Output schema file

Similarly to input, Actors can generate an output JSON object, which links to their results. The Actor output schema file defines what such an output object looks like, including types of its properties and description. This file is referenced in the Actor file (.actor/actor.json) as the output property. It is a standard JSON Schema file with our extensions, and it is typically stored at .actor/output_schema.json.

The output schema describes how the Actor stores its results, and it is used by other systems to:

  • Generate API documentation for users of Actors to figure where to find results.
  • Publish OpenAPI specification to make it easy for callers of Actors to figure where to find results.
  • Enable integrating Actors with external systems and automated workflows.

For details, see Actor output schema file specification.

This is an example of the output schema file for the bob/screenshot-taker Actor:

{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema for Screenshot Taker Actor",
    "description": "The URL to the resulting screenshot",
    "properties": {

        "currentProducts": {
            "type": "$defaultDataset",
            "views": ["productVariants"]
        },

        "screenshotUrl": {
            "type": "$defaultKeyValueStore",
            "keys": ["screenshot.png"],
            "title": "Product page screenshot"
        },

        "productExplorer": {
            "type": "$defaultWebServer",
            "title": "API server"
        }
    }
}

Storage schema files

Both the main Actor file and the input and output schema files can additionally reference schema files for specific storages. These files have custom JSON-based formats, see:

These storage schemas are used to ensure that stored objects or files fulfil specific criteria, their fields have certain types, etc. On the Apify platform, the schemas can be applied to the storages directly, without Actors.

Note that all the storage schemas are weak, in the sense that if the schema doesn't define a property, such a property can be added to the storage and have an arbitrary type. Only properties explicitly mentioned by the schema are validated. This is an important feature which allows extensibility. For example, a data deduplication Actor might require on input datasets that have a uuid: String field in objects, but it does not care about other fields.
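
The "weak" validation idea can be sketched in a few lines: only declared fields are checked, and anything else passes through untouched. The declared_fields mapping below is a simplified stand-in for a real storage schema (whose format is not covered here), and validate_weak is a hypothetical helper. Treating a declared field as required follows the deduplication example above and is an assumption.

```python
def validate_weak(item, declared_fields):
    """Check only the declared fields of a stored object; undeclared fields are ignored.

    declared_fields maps a field name to the Python type it must have.
    Returns a list of problems, empty if the item is acceptable.
    """
    problems = []
    for name, expected_type in declared_fields.items():
        if name not in item:
            problems.append(f"missing field '{name}'")
        elif not isinstance(item[name], expected_type):
            problems.append(f"field '{name}' must be {expected_type.__name__}")
    return problems


if __name__ == "__main__":
    # The deduplication example: only `uuid` is constrained, other fields are free-form.
    print(validate_weak({"uuid": "a1b2", "price": 12.5, "tags": ["x"]}, {"uuid": str}))
    print(validate_weak({"uuid": 42}, {"uuid": str}))
```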

Backward compatibility

If the .actor/actor.json file is missing, the system falls back to the legacy mode, and looks for apify.json, Dockerfile, README.md and INPUT_SCHEMA.json files in the Actor's top-level directory instead. This behavior might be deprecated in the future.
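
The fallback rule can be expressed as a simple lookup. The sketch below only decides which mode applies and which files were found; the locate_definition helper and the shape of its return value are illustrative assumptions.

```python
from pathlib import Path

# Legacy file names looked up in the Actor's top-level directory.
LEGACY_FILES = ("apify.json", "Dockerfile", "README.md", "INPUT_SCHEMA.json")


def locate_definition(actor_root):
    """Return the definition mode ('actor' or 'legacy') and the files found for it."""
    root = Path(actor_root)
    actor_file = root / ".actor" / "actor.json"
    if actor_file.is_file():
        return {"mode": "actor", "files": [actor_file]}
    # Fall back to the legacy layout when .actor/actor.json is missing.
    legacy = [root / name for name in LEGACY_FILES if (root / name).is_file()]
    return {"mode": "legacy", "files": legacy}


if __name__ == "__main__":
    print(locate_definition("."))
```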

Development

Actors can be developed locally, using a git integration, or in a web IDE. The SDK is currently available for Node.js, Python, and CLI.

Local development

TODO (Adam): Explain basic workflow with apify - create, run, push etc. Move the full local support for Actors to ideas (see https://github /apify/actor-specs/pull/7/files#r794681016)

apify run - starts the Actor using the Dockerfile referenced from .actor/actor.json or the Dockerfile in the Actor top-level directory (if the first is not present)

Deployment to Apify platform

The apify push CLI command takes information from the .actor directory and builds an Actor on the Apify platform, so that you can run it remotely.

TODO (Adam): Show code example

Continuous integration and delivery

The source code of the Actors can be hosted on external source control systems like GitHub or GitLab, and integrated to CI/CD pipelines. The implementation details, as well as details of the Actor build and version management process, are beyond the scope of this whitepaper.

Actorizing existing code

You can repackage many existing software repositories as an Actor by creating the .actor/ directory with the Actor definition files, and providing a Dockerfile with instructions on how to run the software.

The actor CLI command can be used from the Dockerfile's RUN script to transform the Actor JSON input into the configuration of the software, usually passed via command-line arguments, and then store the Actor output results. For example:

TODO (Adam): Code examples of Dockerfile with "actor" command

The actor init CLI command can automatically generate the .actor directory and configuration files:

$ actor init

The command works on a best-effort basis, creating necessary configuration files for the specific programming language and libraries.

Actorization of existing code gives developers an easy way to give their code a presence in the cloud in the form of an Actor, so that users can easily try it without having to install and manage it locally.

Sharing and publishing

Once an Actor is developed, the Actor platform lets you share it with other specific users, and decide whether you want to make its source code open or closed.

You can also publish the Actor for anyone to use on a marketplace like Apify Store. The Actor will get its public landing page like https://apify /bob/screenshot-taker, showing its README, description of inputs, outputs, API examples, etc. Once published, your Actor is automatically exposed to organic traffic of users and potential customers.

Apify Actor Store

Monetization

To build a SaaS product, one usually needs to:

  1. Develop the product
  2. Write documentation
  3. Find and buy a domain name
  4. Set up a website
  5. Set up cloud infrastructure where it runs and scales
  6. Handle payments, billing, and taxes
  7. Marketing (content, ads, SEO, ...)
  8. Sales (demos, procurement, ...)

Building software as an Actor and deploying it to the Apify platform changes this to:

  1. Develop the Actor
  2. Write README
  3. Publish Actor on Apify Store

Packaging your software as Actors makes it faster to launch new small SaaS products and then earn income on them, using various monetization options, e.g. fixed rental fee, payment per result, or payment per event (see Charging money). The monetization gives developers an incentive to further develop and maintain their Actors.

Actors provide a new way for software developers like you to monetize their skills, bringing the creator economy model to SaaS.

For more details, read our essay Make passive income developing web automation Actors.

Future work

The goal of this whitepaper is to introduce the Actors philosophy and programming model to the public, to receive feedback, and to open the way for making Actors an open standard. To create an open standard, however, there is more work, including:

  • Finalize specification of all the schema files, including output and storage schema files.
  • Clearly separate what is part of the standard and what is left to the discretion of the implementations.
  • Define a standardized low-level HTTP REST API interface to the Actor system, to separate "frontend" and "backend" Actor programming model implementations. For example, if somebody wants to build support for the Actor programming model in Rust, they should only need to write a Rust "frontend" translating the commands to HTTP API calls, rather than having to implement the entire system. Equally, if one decides to develop a new Actor "backend", all existing client libraries for Rust or other languages should work with it.
