StreamY — An enterprise-scale workflow management solution at Orbital Insight

Sirish Mandalika
14 min read · Mar 16, 2021

An enterprise data analytics and reporting platform typically runs data pipelines with long and complex jobs, spanning many services, programs, tools, and scripts that all interact together. These jobs need to run on an ad-hoc basis, have dependencies on other existing datasets, and have other jobs that depend on them. This quickly becomes a tangled mesh of compute- and memory-intensive processes, leading to a maintenance nightmare, instability, and poor performance, and it calls for a scalable and optimized workflow management solution. While there is a plethora of open source solutions available to solve these problems, they may not fit everyone's needs.

In this article, we will introduce and take a peek under the hood of StreamY — Orbital Insight's in-house workflow management solution. We will also describe the architectural patterns behind such solutions, and considerations for companies that choose to build a more customizable, simple, and elegant solution without reinventing the wheel.

Typical goals of a workflow management solution

Before we dive into what makes up an enterprise-scale workflow management solution, let us look at the goals, or basic requirements, of a workflow management solution:

  • Simplified dependency management in an extremely complex distributed system setup — Consider an application that is broken down into a defined sequence of processing steps during the design phase. Most real-world problems will encounter dependencies and sub-dependencies, and the complexity multiplies when the processing is spread across a truly distributed system and a microservices-based architecture. Therefore, one of the most important objectives of a workflow management solution is to abstract out this complexity and present a very simple interface to the users.
  • Ability to dynamically fan out jobs — Another common need, born out of complex business logic, is the ability to create new jobs on the fly. This is an important feature that obviates the need for the user to define all the jobs statically and exhaustively. Instead, the workflow management solution is intelligent enough to fan out these new "child" jobs to complete the processing.
  • Ability to support resource-intensive ad-hoc jobs — Along with the underlying application logic, users often have to worry about resource requirements such as memory, CPU, and I/O. The ability to support such resource-intensive jobs becomes an essential feature of a workflow management system.
  • Highly scalable and performant — This is an almost implicit, must-have requirement. It means the ability to service requests at high throughput (high requests per second) with very low latency, or turnaround time, to complete the workflow (including the intermediary steps). Internally, this implies optimized use of infrastructure resources, i.e., high CPU utilization and well-distributed memory and I/O consumption.
  • Robust exception handling and self-healing — Complex business logic, even when simplified by a workflow management solution, will surely run into failed states. Handling such exceptions at the application and infrastructure level is a mandatory requirement. A strongly preferable add-on is the ability to recover from such failed states automatically.
  • Alerting and auditing — The ability to notify StreamY users when a workflow fails or gets stuck, so they can take the necessary actions. It should also support basic bookkeeping and auditing, e.g., how many workflows have been started? How many completed successfully, and how many failed? This feedback keeps the system transparent and gives users more confidence in what is being executed and how.
  • Programmatic support — An implicit requirement, but worth pointing out: the solution needs a simple and developer-friendly interface through APIs or SDKs. A workflow management solution is typically embedded within the larger application logic, so integrating it through these interfaces is a must-have.
  • Rich user experience for easy understanding: Finally, and equally important, is a rich UI for any (technical as well as non-technical) personnel to look at the dashboard of the system. It should give a glimpse of the state of each workflow, with the ability to manually kick-start workflows if needed (an advanced requirement).

What does Orbital Insight really do? Use Case

Before we proceed to the technical section of this blog, it is important to understand the high-level business case for Orbital Insight and what it actually does.

Orbital Insight’s mission is to understand what is happening on and to the Earth, by applying artificial intelligence and geospatial analytics to datasets ranging from satellite, aerial & high-altitude imagery, to location data from connected devices & vehicles. Our GO platform helps customers harness vast amounts of data through automated processing and analysis of raw imagery and location data, at a scale that would be impossible to do manually. GO is used by customers in the public sector to provide situational analysis about areas and sites of strategic importance, and commercially to understand supply chains and economic activity.

Typical questions answered by GO revolve around three things that its customers, typically GIS analysts, are interested in: "when", "where", and "what". "When" refers to the time range in days, weeks, months, or years. "Where" refers to the location on Earth; this could be a point in terms of latitude/longitude or an area defined by a bounded polygon, also known as the Area of Interest (AOI). "What" is at the core of the use case: it refers to the type of analysis the customer wants to run, or in other words the algorithm that defines the analysis. To get these right, Orbital Insight's platform aggregates, analyzes, and combines data from many different sources, including satellite imagery, anonymized device location data, and other geospatial sensor data.

To understand the nature and scale of the problem, let us consider a typical foot traffic project example. A user may seek to analyze all the foot traffic at the hospitals in the state of Washington. The result of such a project will be a time series of all the foot traffic, aggregated and normalized across the different hospitals, within a time window. Figure 1 below illustrates the result.

Figure 1: GO platform UI screenshot

Let us understand this further in terms of the steps taken by the user, along with the inputs and outputs to Orbital Insight's GO platform. The GO user first creates a project to capture their requirements for geospatial analysis. This newly created project consists of the Areas of Interest (AOIs), the polygon-shaped geographic bounding boxes that indicate the areas to be analyzed. The user also specifies the temporal requirements, or the time range for which the analysis is sought. Once created, the project is 'submitted' along with all the parameters for analysis. The output of this project is a time series of either raw or normalized foot traffic. In terms of the above example, the left side of Figure 1 shows the number of AOIs along with the AOI listing, the right side displays the map of the selected AOI, and at the bottom is the actual result with the time series graphed out.

So what exactly is the foot traffic data shown in the result above? The simple answer is that this data is anonymized cell phone pings in that AOI and in that time period. Orbital Insight is not in the business of collecting this data directly; rather, it leverages a range of data providers to send the data into its system.

This analysis can be applied to any kind of AOI; common examples include malls and retail outlets. The analysis not only reveals the foot traffic trends across these malls, it can also answer analytical queries such as how many devices were tracked across different malls (also known as cross-shopping loyalty), or where the visitors of a mall live and work.

Example workflow — Steps for analyzing a Mall’s foot traffic

All of the steps in the previous section (receiving the inputs, processing the user's data, and sending back the results) therefore call for building an end-to-end workflow pipeline. This pipeline sits at the core of the GO platform. Let us look at the details of one such workflow that analyzes a mall's foot traffic; a minimal code sketch of the fan-out pattern it uses follows the list of steps.

  1. Batching the AOIs. This step fetches AOI information (latitude, longitude, size of AOI, etc.) for the list of all AOIs in question.
  2. Splitting the time range. The overall project time period is split into regular time chunks so that the project scales well. These chunks essentially define a "job".
  3. All the data about each job is then packaged as an object, and a specialized service is asynchronously invoked to fetch the unique device count for the batch of AOIs.
  4. Owing to the size of the fetch, and to further optimize the processing, the service itself divides the job into various tasks.
  5. The first component of this service, known as the "preparer", has the sole purpose of further chunking the job into weekly slices. It pushes these tasks asynchronously to the next component, called the "worker".
  6. The worker is the real doer and runs the query on the underlying database to retrieve the unique device counts. There may be many parallel workers doing the fetch at the same time. Once the data is gathered, it is time for the final component of this service, the "completer", to take over.
  7. The completer's responsibility is to gather all the data from the tasks completed by the workers in the previous steps and send it back to the main process, or pipeline.
  8. The pipeline stitches all of the jobs back together into the full time range.
  9. Depending on whether the request is for raw or normalized device counts, the appropriate service is invoked.
  10. Depending on the request, the pipeline may perform additional advanced processing, such as:
    • running the algorithm that derives the cross-shopping loyalty for each of the devices resulting from the step above;
    • running an algorithm that calculates the distance visitors are drawn from, also known as the trade area;
    • executing an algorithm to derive the home and work location data of the visitors;
    • executing an algorithm to derive a visitor income chart using an external data source.
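
To make the fan-out described above concrete, here is a minimal, self-contained Python sketch of the preparer, worker, and completer stages. It uses a thread pool purely for illustration; the function names and the stubbed query are assumptions, not StreamY's actual internals, where each stage runs as a separate asynchronous component.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def preparer(aoi_id, start, end):
    """Chunk the job's time range into weekly task slices."""
    tasks, cursor = [], start
    while cursor < end:
        tasks.append((aoi_id, cursor, min(cursor + timedelta(days=7), end)))
        cursor += timedelta(days=7)
    return tasks

def worker(task):
    """Run the device-count query for one weekly slice (stubbed out here)."""
    aoi_id, week_start, _week_end = task
    return {"aoi": aoi_id, "week": week_start.isoformat(), "count": 0}  # real DB query goes here

def completer(results):
    """Gather the per-week results and hand them back to the pipeline."""
    return sorted(results, key=lambda r: r["week"])

if __name__ == "__main__":
    tasks = preparer("mall-123", date(2021, 1, 1), date(2021, 3, 1))
    with ThreadPoolExecutor(max_workers=8) as pool:  # many workers fetch in parallel
        results = list(pool.map(worker, tasks))
    print(completer(results))
```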

Introducing StreamY

The need for a scalable, fast, and resilient workflow management solution that can run such advanced workflow pipelines gave rise to StreamY. Put simply, StreamY is Orbital Insight's Kubernetes-native workflow management platform to programmatically create, monitor, execute, schedule, and manage workflows.

StreamY terminology (Facets of StreamY)

Here's an initial glossary of terms that will be useful to understand as we describe StreamY's inner workings in detail:

  • Workflow Service: This is the user-facing entry point to using StreamY
  • Workflow: It is a Directed Acyclic Graph (DAG) based set of instructions for how to orchestrate the execution of a generic workload.
  • Vertex: A node within a workflow that serves as a “job” template.
  • Trigger: A directed edge between two vertices in a workflow which “triggers” job creation.
  • The Root Vertex: The unique DAG vertex that is not the "target" of any trigger.
  • WorkflowInstance: A “run/execution” following the instructions set in a workflow.
  • Vertex Job: A “run/execution” based on the vertex job template.
  • Event: A message associated with a vertex job that may match one or more triggers.
  • The Root Vertex Job: The first job in a workflow instance is created from the root vertex when the workflow instance is started.

Building blocks of StreamY

  • Directed Acyclic Graph (DAG)
    All the StreamY requests can be expressed as a set of vertices and connected edges, forked and joined without any cycles. Each job in the workflow is treated as a Vertex. The edge of the graph that connects a source vertex to the target vertex is the event or outcome from the source vertex.
  • Event Queues
    StreamY requests use queue-based event handlers (pollers) to scale well. The main function of these pollers is to ensure high throughput and a decoupled design.
  • Kubernetes-based Resource Management
    Kubernetes-based resource management helps the solution dynamically allocate memory and CPU according to each request's requirements. This capability keeps StreamY scalable even for the most ad-hoc requests (a brief illustration follows this list).
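
As a brief illustration of Kubernetes-based resource management, the sketch below (using the official Kubernetes Python client) builds a pod spec whose CPU and memory requests are sized per job. The function, image name, and resource values are hypothetical and only show the general pattern, not how StreamY's executor is actually implemented.

```python
from kubernetes import client

def pod_for_job(job_id: str, image: str, cpu: str, memory: str) -> client.V1Pod:
    """Build a pod spec whose CPU/memory are sized per job (illustrative only)."""
    resources = client.V1ResourceRequirements(
        requests={"cpu": cpu, "memory": memory},
        limits={"cpu": cpu, "memory": memory},
    )
    container = client.V1Container(name="actor", image=image, resources=resources)
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"vertex-job-{job_id}"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )

# e.g. a memory-hungry ad-hoc job next to a small one (hypothetical image and sizes)
big = pod_for_job("42", "registry.example.com/foot-traffic-actor:latest", "4", "16Gi")
small = pod_for_job("43", "registry.example.com/foot-traffic-actor:latest", "500m", "1Gi")
```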

How to kick off a StreamY workflow (with code samples)

Everything in StreamY starts with a workflow service and an "actor". While the former is the user-facing entry point, the "actor" refers to the actual processing logic — e.g., fetching the raw device count from the previous example could be a piece of code wrapped up in an actor object.

The other important point to note is that StreamY is completely based on Kubernetes, so it needs dockerized containers to run the workflow. Lastly, the entire workflow is a set of instructions expressed as a directed acyclic graph (DAG), with vertices as jobs and defined triggers that form the edges of this graph.

Following are the main steps to programmatically invoke StreamY.

Step 1: Define the workflow — The DAG-based workflow is a JSON file that contains all the metadata needed to execute the workflow: the vertices (job templates) and the triggers that connect them.
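
A minimal sketch of what such a definition might contain is shown below as a Python dict (which would be serialized to JSON). The exact schema is internal to StreamY, so the field names here (name, vertices, triggers, actor, image) are assumptions for illustration only.

```python
# Hypothetical workflow definition; StreamY's real JSON schema is internal and may differ.
example_workflow = {
    "name": "mall_foot_traffic",
    "vertices": [
        {"name": "batch_aois", "actor": "WorkerSvcActor",
         "image": "registry.example.com/foot-traffic-actor:latest"},
        {"name": "fetch_device_counts", "actor": "WorkerSvcActor",
         "image": "registry.example.com/foot-traffic-actor:latest"},
        {"name": "normalize_counts", "actor": "WorkerSvcActor",
         "image": "registry.example.com/foot-traffic-actor:latest"},
    ],
    "triggers": [
        # A directed edge: when the source vertex emits this event,
        # a job is created from the target vertex.
        {"source": "batch_aois", "event": "batch_ready", "target": "fetch_device_counts"},
        {"source": "fetch_device_counts", "event": "counts_ready", "target": "normalize_counts"},
    ],
}
```

Note that "batch_aois" is the root vertex here, since it is not the target of any trigger.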

Step 2: Implement "WorkerSvcActor" — the standard type of Actor, which runs in a dedicated docker container.
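
A rough sketch of what such an actor could look like follows. The streamy_client interfaces are internal, so the class shape, method names, and event emission shown here are assumptions rather than the library's actual API.

```python
# Hypothetical actor sketch; the real streamy_client base class and hooks may differ.
class FetchDeviceCountsActor:
    """Wraps the processing logic for one vertex job."""

    def run(self, job_payload: dict) -> dict:
        aoi_ids = job_payload["aoi_ids"]
        start, end = job_payload["start"], job_payload["end"]
        counts = {aoi: self._query_device_count(aoi, start, end) for aoi in aoi_ids}
        # Returning an event name plus data lets a trigger match it and fan out the next jobs.
        return {"event": "counts_ready", "data": counts}

    def _query_device_count(self, aoi_id: str, start: str, end: str) -> int:
        raise NotImplementedError("the database query goes here")
```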

Step 3: Dockerize — Build at least one docker image that contains "Streamy-client" (which comes with the StreamY package) along with the Actor.

Step 4: Register — Now we need to build the payload that the WorkflowSvc.register_workflow endpoint expects and invoke register. The simplest way to do this is using the streamy_client.utils.builder module, as shown below.
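
A hedged sketch of that registration call is shown below. Only the streamy_client.utils.builder module and the WorkflowSvc.register_workflow endpoint names come from the steps above; the builder helper, client class, and payload handling are assumptions for illustration.

```python
# Hypothetical registration sketch; only the module and endpoint names come from the text above.
from streamy_client.utils import builder            # assumed import path
from streamy_client import WorkflowSvcClient        # assumed client class

payload = builder.build_workflow(example_workflow)  # assumed builder helper
client = WorkflowSvcClient(
    base_url="https://streamy.internal.example.com",   # assumed deployment URL
    token="<token issued by the separate user service>",
)
response = client.register_workflow(payload)
workflow_id = response["workflow_id"]  # keep this around for starting instances later
```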

Note that authentication is handled by a user service that runs separately. At this point, StreamY's UI component can display the newly registered workflow.

Step 5: Invoke the Endpoint.

Now we need to invoke the WorkflowSvc.start_workflow_instance endpoint.

a) You first need the workflow_id that your workflow was assigned when you called register. If you have not kept it, you can query for the latest version of your workflow by workflow_name:
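
For example, a lookup along these lines, continuing from the registration sketch above (the call name and parameters are assumed rather than taken from the real streamy_client API):

```python
# Hypothetical lookup by name; the real endpoint and parameters may differ.
workflow = client.get_latest_workflow(workflow_name="mall_foot_traffic")
workflow_id = workflow["workflow_id"]
```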

b) Then actually invoke the endpoint, similar to the sample code below:
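
A minimal sketch of that call, assuming a payload shape (the workflow_id plus parameters for the root vertex job) that may not match StreamY's actual API:

```python
# Hypothetical invocation of the WorkflowSvc.start_workflow_instance endpoint.
instance = client.start_workflow_instance(
    workflow_id=workflow_id,
    parameters={                      # parameters consumed by the root vertex job
        "aoi_ids": ["mall-123", "mall-456"],
        "start": "2021-01-01",
        "end": "2021-03-01",
    },
)
print("Started workflow instance:", instance["workflow_instance_id"])
```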

That's it! Now we should be able to see the progress of this workflow in the StreamY UI and let it run to completion.

Peek under the hood

Figure 2: StreamY DAG process flow

So what goes on under the hood of the StreamY engine? Let us take a look.

As described in the previous sections, StreamY is a DAG-based workflow management solution that leverages Kubernetes APIs to execute the logic. This DAG-based request-response flow is broken down into the following sequence of steps (a simplified, single-process sketch of the loop follows the list):

  1. When the first request is made to StreamY, it creates or starts a workflow instance. (In reality, there could be multiple requests leading to the creation of multiple workflow instances)
  2. As shown in the figure above, the newly created instance picks the very first job that seeds the entire DAG processing. This first job is called the root job, and it is pushed into a queue called the vertex job queue. At the same time, the instance also creates an event poller component. We will see how this poller helps in the processing in the later steps.
  3. Moving on, a global-scale poller then pulls this job out of the vertex job queue and dispatches it to a vertex job executor. This executor is a thin wrapper over the Kubernetes API that creates a pod to run the actor defined in the workflow.
  4. This results in the generation of an outcome or an ‘event’. The event is pushed or published into the event queue.
  5. Now the event poller, created in step 2, processes this event as it is subscribed to the event queue. Here the poller checks for a match with user-defined triggers in the workflow. If a match is found, the associated job is pushed back into the vertex job queue.
  6. Steps 3 to 5 are repeated until there are no more jobs in the queue.
  7. The event-based processing, combined with the Kubernetes-based management underneath, helps scale the workflow processing.
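
Stripped of Kubernetes and the real queueing infrastructure, the loop above can be summarized in a few lines of Python. This is a single-process sketch for intuition only; the queue names, payload shapes, and execute_actor stub are illustrative, while in StreamY the pollers and executors are separate, horizontally scaled components.

```python
from queue import Queue

def execute_actor(job):
    """Stand-in for the Kubernetes-backed vertex job executor."""
    return {"source": job["vertex"], "name": "done", "data": {}}

def run_workflow_instance(workflow, root_job):
    """Single-process sketch of StreamY's queue-driven dispatch loop."""
    vertex_job_queue, event_queue = Queue(), Queue()
    vertex_job_queue.put(root_job)                        # step 2: seed with the root job

    while not vertex_job_queue.empty():
        job = vertex_job_queue.get()                      # step 3: poller pulls the next job
        event_queue.put(execute_actor(job))               # steps 3-4: run the actor, publish its event

        while not event_queue.empty():                    # step 5: event poller matches triggers
            event = event_queue.get()
            for trigger in workflow["triggers"]:
                if trigger["source"] == event["source"] and trigger["event"] == event["name"]:
                    vertex_job_queue.put({"vertex": trigger["target"], "input": event["data"]})

# Example run (purely illustrative):
run_workflow_instance(
    {"triggers": [{"source": "batch_aois", "event": "done", "target": "fetch_device_counts"}]},
    {"vertex": "batch_aois", "input": {}},
)
```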

Physical Design

Figure 3: StreamY Physical Design

Here is another view of StreamY, in terms of its physical design. The actual service consists of a user-facing workflow service that is accessed via an API, along with a Streamy DB that stores metadata and system records. The workflow service talks to two other internal services: the worker service and the event service. The worker service does the actual processing of vertex jobs and is a wrapper over the Kubernetes API. The event service is a wrapper over the NATS-based messaging service.

Build vs Open Source

One obvious question while choosing a workflow management solution for Orbital Insight was the choice between building in-house and adopting open source. Please note that while the decision to create an in-house solution for workflow management was primarily based on the following factors, these were historical and may need to be re-evaluated for future use cases:

  • Heavy satellite imagery use case: A good chunk of Orbital Insight's jobs are based on the processing of satellite images with complex ingestion workflows. These are non-standard workflows and involve a lot of dynamic forking. The sheer complexity made the case in favor of building something more native to Orbital Insight instead of an off-the-shelf product that might need retrofitting to support the use case.
  • Ad-hoc resource requirements: The requests for insights are very ad-hoc in nature and require dynamic resource allocation, meaning the ability to scale vertically and horizontally based on the request. Once again, this was not well supported by any of the known market solutions.
  • Need to support integration with GPUs: To run all the computer vision algorithms, GPU integration was a requirement. Many open-source workflow management systems did not support this feature back when StreamY was chosen.

Conclusion

This blog described the inner workings of StreamY, Orbital Insight's homegrown workflow management system that caters to Orbital Insight's use cases. Various architectural patterns, including DAGs, event queues, and Kubernetes-based APIs, have been used to build it. This has worked well for the current scale of Orbital Insight's platform. In the future, though, we will continue to evaluate the choice between improving the homegrown solution, i.e., StreamY, and adopting a suitable open source market leader to meet the needs of a growing scale.
