An OGSA-DAI example: workflows and activities

OGSA-DAI is about data access, integration, transformation and delivery using data workflows and activities. Here's an example that captures the essence of OGSA-DAI in a page or two...

An example OGSA-DAI workflow

Important points to note are:

  • A workflow consists of a number of units called activities. An activity is an individual unit of work in the workflow and performs a well-defined data-related task such as running an SQL query, performing a data transformation or delivering data.
  • Activities are connected. The outputs of activities connect to the inputs of other activities.
  • Data flows from activities to other activites and this is in one direction only.
  • Different activities may output data in different formats and may expect their input data in different formats. Transformation activities can transform data between these formats.
  • Workflows, and therefore OGSA-DAI itself, do not just carry out data access, but also data updates, transformations and delivery.

Data integration and workflows

These examples show how workflows in OGSA-DAI can be used to integrate data from multiple data sources. In the first example two SQL queries are executed on two separate databases. The results of the first query are then transformed in some way. This transformed data is then joined in some way with the results from the second query. The joined data is then delivered by some means.

Data integration using a single workflow and OGSA-DAI server

In this second example we see how data exposed by multiple OGSA-DAI servers can be transported and integrated. A workflow is sent to an OGSA-DAI server. This runs an SQL query and makes the data available to other OGSA-DAI servers. A second workflow is then sent which pulls the data from the first OGSA-DAI server while simultaneously running an SQL query. The data pulled from the first server and the SQL query results from the second server are then joined in some way and delivered by some means.

Data integration using two workflows and two OGSA-DAI servers

This final example shows the use of an important OGSA-DAI extension, DQP (distributed query processing). The client submits a workflow which interacts with a "virtual database". As far as the client is concerned this is an actual database on which they can run queries. However this database is in fact a federation of a number of actual physical databases exposed via OGSA-DAI servers. When a client submits their queries the OGSA-DAI DQP engine parses their queries and forwards these to the other OGSA-DAI servers as required. DQP also marshals and joins the results of these queries into a single unified result for the client.

Using OGSA-DAI DQP

Activities

An activity is a well-defined workflow unit with a specific name. They can be dropped into a deployed OGSA-DAI server without requiring any recompilation or recoding of OGSA-DAI.

Example activities include:

  • SQLQuery - Execute an SQL query on a relational database.
  • ListDirectory - List the files in a directory.
  • XSLTransform - Execute an XSL transform on an XML document.
  • DeliverToFTP - Deliver data to an FTP server.

OGSA-DAI has a comprehensive and consistent set of standard activities for data access and integration.

Activity inputs, outputs and blocks

An activity can have 0 or more named inputs and 0 or more named outputs. Blocks of data flow from an activity's output into another activity's input. Activity inputs may be optional or required. If optional then a default value may be adopted.

Activity inputs and outputs

In an activity, there is no distinction between inputs and parameters. OGSA-DAI workflows use the notion of input literals which are used by clients to provide a parameter. It is up to the client whether an input value to an activity is provided by them or is provided via the output of another activity in the workflow.

All the required inputs of an activity must be connected to the output of another activity or have an associated input literal and all the outputs of an activity.

Activity inputs, input literals and outputs

Activity inputs expect blocks of specific types and activity outputs output blocks of specific types.

Activity input and output types

The following block types are those used by OGSA-DAI activities (this does not preclude the use of application-specific or other block types):

  • Java's basic types - Object, String, Integer, Long, Double, Number, Date, Boolean.
  • Binary types - char[], byte[], Clob, Blob.
    • BLOBs obtained from databases are stored as BLOB objects and references to entire BLOBs are passed between activities.
    • Byte arrays are typically data obtained from FTP or databases.
    • All binary data processing activities provided in OGSA-DAI can handle these representations.
  • Tuple - OGSA-DAI representation of a row of relational data which has one element per column in the row.
  • TupleMetadata - OGSA-DAI representation of a the meta data for a collection of Tuples. This includes column names and types for each field.
  • MetadataWrapper - OGSA-DAI wrapper for any object to be treated as meta-data, This allows the use of application-specific metadata within OGSA-DAI - individual activities handle metadata blocks as they see fit. TupleMetadata is an example of something that would be held in a MetadataWrapper.
  • OGSA-DAI list - a list groups related blocks together. Special blocks are used to mark the beginning and the end of a list.
    • Lists are useful for keeping related blocks together.
    • Lists are not a single block like a Tuple or a MetaDataWrapper. Rather, they consist of a ListBegin block, a sequence of related blocks, and a ListEnd block. Multiple activities can operate on different parts of an OGSA-DAI list simultaneously which can aid efficiency.
    • For example SQLQuery can dynamically take any number of SQL query expressions as input. Lists allow differentiation between the results of each query by grouping the tuples belonging to each query result together.
    • Activities define the granularity of their inputs and outputs, that is, whether they expect lists and the level of nesting of these lists (e.g. lists of lists, lists of lists of lists...)

OGSA-DAI does not validate the connections between activities to ensure the objects written to an output are compatible with those expected at a connected input. Errors like this will be detected by the activities themselves when the workflow they are involved in is executed.

Activity inputs and outputs and lists

Activities and iteration

When executing, an activity iterates. It is provided one data block on each input. So, for example, given an activity with inputs FirstName and Surname is provided with the following blocks of input:

InputData Value Stream
FirstName "Ally.txt", "Amy.txt", "Mike.txt"
Surname "Hume.txt", "Krause.txt", "Jackson.txt"

On the first iteration the activity would receive "Ally.txt" on input FirstName and "Hume.txt" on input Surname. On the second iteration the activity would receive "Amy.txt" on FirstName and "Krause.txt" on Surname.

In the presence of lists iteration becomes more complex. Suppose we have:

InputData Value Stream
Member ListBegin, "Debbie", "Susanna", "Michael", "Vicki" ListEnd, ListBegin, "Sting", "Stewart Copeland", "Andy Summers", ListEnd,
Band "The Bangles", "The Police",

In this case, on the first iteration the activity gets a list iterator for the Members input and the value "The Bangles" for the Band input. On the next iteration, it gets another list iterator for the Members input and the value "The Police" for the Band input.

What an activity does when it gets a list iterator is implementation-specific. So, for example when our activity gets the iterator for the second list from Members and value "The Police" for the Band input it could either:

  • Get every value in the Members list and do something with this and the Band value, e.g. dump to a file the value "Sting, Stewart Copeland, Andy Summers are in The Police".
  • Get each element in the Members list in turn and do something with this and the Band value, e.g. dump to a file the value "Sting is in The Police" then "Stewart Copeland is in The Police" then "Andy Summers is in The Police". This is the more typical approach for handling lists.

Activities check their inputs and will ensure they are in synch. For example, in the above if the list containing the members of the Police arrived on the Members input but no associated value arrived on the Band input then the activity would raise an error.

If the client wants to repeat the same value multiple times it can be inconvenient to repeat the value. OGSA-DAI provides a ControlledRepeat activity that repeats values.

Workflow portion showing use of ControlledRepeat

An example activity - ObtainFromFTP

As an example of an activity, consider ObtainFromFTP which pulls a file from an FTP server. The inputs and outputs of this activity along with their types is summarised in the following table:

Inputs/OutputsNameTypeOptional
InputfilenameStringno - we cannot obtain a file unless we know its name
InputhostStringno - we cannot obtain a file unless we know where to get it from
InputpassiveModeBooleanyes - a default value of false can be used unless the client wants to override it
Outputdata[ byte[] ]no - no activity outputs can be optional

When the ObtainFromFTP activity starts, it validates that input blocks obtained for each input are instances of the required type and raises an error if this is not the case.

Note how the FTP activity outputs a list of byte arrays. This allows a file to be streamed in smaller arrays rather than in one massive array holding the entire file contents. The use of a list allows these smaller arrays to be kept together and their relationship preserved. This allows downstream activities to operate on one of the byte arrays already received while the ObtainFromFTP activity is still receiving the rest of the file from the FTP server.

As an example suppose the ObtainFromFTP activity received the following inputs:

InputData Value Stream
filename "Ally.txt", "Amy.txt", "Mike.txt"
host "my.ftp.host", "your.ftp.host", "their.ftp.host"

then it would iterate three times. On each iteration it would obtain, one file from one FTP server and output the file as a list of byte arrays. Across the three iterations three lists of byte arrays would result, one for each filename and host pair. The output would be as follows:

OutputData Value Stream
data ListBegin, byte[], byte[], ...,ListEnd, ListBegin,, byte[], byte[], ...,ListEnd, ListBegin,, byte[], byte[], ...,ListEnd,

Activities and resources

Some activities can be targeted at OGSA-DAI resources. These are termed resource-specific activities. The activity interacts with the resource. The most common type of resource with which activities interact are types of data resources. OGSA-DAI data resources are components which abstract actual databases (or other data resources) into an OGSA-DAI compliant form. Examples of activities that interact with data resources include the SQLQuery activity which interacts with relational data resources and XMLListCollections activity which interacts with XMLDB data resources.

Activities can be defined to interact with any type of OGSA-DAI resource, e.g. there are activities for populating OGSA-DAI data sources (WriteToDataSource) or dumping state to or retrieving state from OGSA-DAI sessions (e.g. ObtainFromSession and DeliverToSession).

Activities and resources

Not all activities need to be targeted at a resource however. For a wide variety of transformation, factory and delivery activities there will be no target resource. Examples of such generic activities include: TupleToCSV, ObtainFromFTP, DeliverToFTP, CreateResourceGroup, XSLTransform and many more.

For much more detailed information, please take at look at the What is OGSA-DAI? section of the online User Guide.

News