Data Lineage

An Essential Step in Data Processing

As William Pollard said, "Information is a source of learning. But unless it is organised, processed, and available to the right people in a format for decision making, it is a burden, not a benefit." In any project it is essential to distribute information to all parties involved. Reproducible project process information is essential to all participants at every working level. Information should be organised and presented in a standardised, easy-to-use way, which is possible using data lineage.

Project information is expected to contain answers to essential questions such as: how, what, when, where, why and who? The answers to these questions relate to several project processes in which individual processing steps are executed using datasets and user parameters to create new datasets. In the metadata standard ISO19115 for geo-datasets, data lineage encompasses these processing steps and parameters.

 

Data Lineage

A series of steps can only be stored when each intermediate processing step is identified. Without an independent source to describe these processing steps, the lineage chain is easily broken. This leads to difficulties, since not all datasets can be generated automatically, which would guarantee their presence in the project lineage information. Little attention has been paid recently to this important step in reproducing data processes. We therefore propose a new approach to modelling and guaranteeing project lineage information, using the integral zoning of agricultural development in North Brabant (Netherlands) as an example (Figure 1).
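
A broken lineage chain can be illustrated with a small check. The following is a minimal sketch in Python, not part of the original project: it assumes the recorded processing steps are listed in execution order, and it reports every input dataset that is neither a declared source nor the output of an earlier recorded step. The class name, function name, the step 'C1b Overlay' and all dataset names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecordedStep:
    """One documented processing step (hypothetical structure)."""
    name: str
    inputs: list[str]
    outputs: list[str]


def find_undocumented_datasets(steps, source_datasets):
    """Return inputs that no recorded step produced and that are not sources.

    Assumes `steps` is given in execution order.
    """
    known = set(source_datasets)
    missing = []
    for step in steps:
        for dataset in step.inputs:
            if dataset not in known and dataset not in missing:
                missing.append(dataset)
        known.update(step.outputs)
    return missing


# Illustrative data only: the second step uses a dataset with no recorded origin.
steps = [
    RecordedStep("C1a Create Buffer Zones", ["PS agriculture"], ["T buffer zones"]),
    RecordedStep("C1b Overlay", ["T buffer zones", "T soil classes"], ["PR first concept"]),
]
gaps = find_undocumented_datasets(steps, ["PS agriculture"])
```

Here `gaps` names the dataset whose creation was never documented, which is the point at which the chain breaks.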

UML Diagram

Workflow reproducibility and documentation of the processes performed are of the utmost importance. Two workflow modelling levels are involved: the first is at project process level (where the main processing activities are addressed) and the second is the detailed description of a processing activity (where individual datasets and processing steps are modelled). This zoning project has a number of sequential processes, where each project process has input and output data. The resulting chain of data and processes can easily be modelled in a Unified Modelling Language (UML) diagram, the standard method for complex workflows in ICT projects.

 

Choosing the appropriate UML diagrams is essential. UML has two main groups of diagrams: structure and behaviour. A mandatory part of any ISO (International Organization for Standardization) standard within the Technical Committee 211 (TC211) community is a UML class (structure) diagram. The metadata standard (ISO19115) class diagram can be applied to any dataset. However, project workflow processes (dynamic by nature) must also be modelled. UML behaviour diagrams are intended for this purpose, of which the most suitable is the UML activity diagram. A new approach is to have these two modelling diagrams side by side and linked to one another using lineage.

Project Processes

The North Brabant zoning project has five project processes. These are presented in a simplified form in Figure 2, where each project process is considered as one activity and has both input and output data. The accumulation of data and execution of the processes are simultaneous; i.e. the amount of metadata increases during the project. This increase also corresponds to increased decision-maker interest in metadata as the project progresses, and both are represented by the growth in arrow thickness between processes in Figure 2. Lineage therefore becomes all the more important as the project advances, since it guarantees that the required metadata is available. The increase in metadata can be organised by adding a lineage activity between project processes. A lineage activity can inherit the accumulated metadata information from the related process, and may also be used to organise specific metadata at project level.
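
The idea of a lineage activity that inherits accumulated metadata can be sketched as a simple linked structure. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: each lineage activity holds its own metadata and a reference to the preceding one, so the accumulated record grows with every project process. The class name is hypothetical, and the second and third process names and their metadata are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineageActivity:
    """Lineage metadata attached to one project process (hypothetical structure)."""
    process_name: str
    metadata: dict[str, str]
    previous: Optional["LineageActivity"] = None

    def accumulated(self):
        """Return (process, metadata) pairs for this and all earlier processes."""
        earlier = self.previous.accumulated() if self.previous else []
        return earlier + [(self.process_name, self.metadata)]


# Only the first process name comes from the article; the others are placeholders.
first = LineageActivity("1 First Concept: 1:25000", {"dateTime": "15-3-2000"})
second = LineageActivity("process 2 (placeholder)", {"note": "placeholder"}, previous=first)
third = LineageActivity("process 3 (placeholder)", {"note": "placeholder"}, previous=second)

counts = [len(a.accumulated()) for a in (first, second, third)]
```

The growing values in `counts` correspond to the thickening arrows between processes: each later lineage activity carries everything recorded before it.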

 

Lineage Activity in UML

The simple project figure can easily be translated into a UML activity model by introducing the lineage activity. This is depicted in Figure 3, where the white boxes represent datasets or, in UML models, objects. The small chain symbol on an object indicates that it is a composite object; in this case study a composite object represents more than one dataset. The blue boxes represent processes or, in UML models, activities. The small chain symbol in each activity implies that it is a structured activity, containing a detailed activity model of several processing steps and datasets. Each activity headed ‘LineageMetadata’ contains lineage information at project processing level, which is where answers to questions (e.g. why, what, when) regarding a project process are addressed.
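
The diagram elements described here map naturally onto nested data structures. The sketch below, in Python, is only an illustration of that mapping: a composite object groups several datasets, and a structured activity groups several processing steps and carries its own lineage metadata. The class names are hypothetical, and the two member dataset names are placeholders; the composite object, activity and step names are taken from the article.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str


@dataclass
class CompositeObject:
    """A UML object that stands for more than one dataset."""
    name: str
    datasets: list[Dataset] = field(default_factory=list)


@dataclass
class Activity:
    name: str


@dataclass
class StructuredActivity:
    """A UML activity that contains a detailed model of processing steps."""
    name: str
    steps: list[Activity] = field(default_factory=list)
    lineage_metadata: dict[str, str] = field(default_factory=dict)


starting_data = CompositeObject(
    "01 Starting Project Data",
    [Dataset("dataset A (placeholder)"), Dataset("dataset B (placeholder)")],
)
first_concept = StructuredActivity(
    "1 First Concept: 1:25000",
    steps=[Activity("C1a Create Buffer Zones")],
    lineage_metadata={"dateTime": "15-3-2000"},
)
```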

 

An additional advantage of the UML model is that it can also function as a general workflow model for the project. This highly comprehensible overview can be stored in a format which is easily exchanged between participating parties.

 

Table 1: Lineage metadata for the project process ‘1 First Concept: 1:25000’.

| Metadata | Make First Concept |
|---|---|
| description | First concept combines basic datasets with criteria for limited (extensive) agricultural use. |
| rationale | Criteria for class values to be given to intensity types of agricultural use are provided by a set of separate datasets. These datasets are combined in ‘01 Starting Project Data’ and ‘Input First Concept 1:25,000’. All parameters and values are found in activity ‘1 First Concept: 1:25000’. |
| processor | 1 First Concept: 1:25000; Responsible party department |
| dateTime | 15-3-2000 |
| parameter1-name | Void |
| parameter1-value | Void |
| parameter2-name | Void |
| parameter2-value | Void |

Zoning Project

The ISO19115 metadata standard on lineage is used to format the accumulated process information and is applied to entire datasets. Figure 4, the UML class model, depicts how lineage is related to the data source inputs and the processing steps used. Lineage relationships to data subsets or individual objects are not taken into account, because the zoning project was executed with entire datasets. Table 1 lists the content of the lineage metadata model relating to the process ‘1 First Concept: 1:25000’ (Figure 3). As an example, this project process activity is also modelled into a detailed view of individual processing steps in Figure 5. The ISO lineage class model is again used to describe the operational process steps necessary to create new data.
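
The relationship between lineage, sources and processing steps can be written down as a small class model. The following Python sketch is an assumption-laden simplification, not the ISO19115 schema itself: the field names follow the rows of Table 1, the `parameters` field mirrors the parameter rows of that table, and the class names are chosen for readability. The values are taken from Table 1; the ‘Void’ parameters are represented as an empty dictionary.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Source:
    """A dataset used as input to a processing step."""
    description: str


@dataclass
class ProcessStep:
    """One lineage process step, with fields named after the rows of Table 1."""
    description: str
    rationale: str
    processor: str
    date_time: date
    parameters: dict[str, str] = field(default_factory=dict)
    sources: list[Source] = field(default_factory=list)


@dataclass
class Lineage:
    """Lineage of a whole dataset: the steps and sources that produced it."""
    process_steps: list[ProcessStep] = field(default_factory=list)
    sources: list[Source] = field(default_factory=list)


inputs = [Source("01 Starting Project Data"), Source("Input First Concept 1:25,000")]
make_first_concept = ProcessStep(
    description="First concept combines basic datasets with criteria for "
                "limited (extensive) agricultural use.",
    rationale="Criteria for class values are provided by a set of separate datasets.",
    processor="1 First Concept: 1:25000; Responsible party department",
    date_time=date(2000, 3, 15),  # 15-3-2000 in Table 1 (day-month-year)
    sources=inputs,
)
lineage = Lineage(process_steps=[make_first_concept], sources=inputs)
```

Keeping the sources on both the step and the lineage record reflects the class model's two relationships: a dataset's lineage lists its sources, and each step lists the sources it consumed.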

The project process ‘First Concept’ is used as an example and presented in a UML activity model in Figure 5. Each lineage metadata activity represents a processing step. The symbols in Figure 5 define the same activity elements as in Figure 3 (datasets and processes). The added grey box with the folded corner is reserved for additional model comments. The prefixes PS, PR and T represent different levels of dataset importance in relation to the zoning project: PS specifies a project-source dataset, PR represents project-result datasets created, and T specifies temporary data that is only needed within this processing activity. The C prefix to processing steps specifies computations in the first project process. The lineage part of ISO19115 is used again to describe each processing step. As an example, ‘C1a Create Buffer Zones’ is described in Table 2. The questions (e.g. how, where and when) are again addressed, but at the level of individual processing steps.
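
The prefix convention lends itself to a simple lookup. The sketch below is a hypothetical Python helper, assuming that the prefix is the run of capital letters at the start of a label, as in ‘C1a Create Buffer Zones’. How dataset labels are actually written in Figure 5 is not stated in the text, so the dataset labels used here are invented for illustration.

```python
import re

# Meanings as given in the article for the zoning project.
PREFIX_MEANINGS = {
    "PS": "project-source dataset",
    "PR": "project-result dataset",
    "T": "temporary dataset",
    "C": "computation (processing step)",
}


def classify(label: str) -> str:
    """Classify a Figure 5 label by its leading capital-letter prefix.

    Assumption: the prefix is the run of capitals at the start of the label.
    """
    match = re.match(r"[A-Z]+", label)
    prefix = match.group(0) if match else ""
    return PREFIX_MEANINGS.get(prefix, "unknown")


# 'C1a Create Buffer Zones' is from the article; the other labels are invented.
labels = ["C1a Create Buffer Zones", "PS1 example", "PR1 example", "T2 example"]
roles = {label: classify(label) for label in labels}
```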

Concluding Remarks

The approach presented shows that lineage-oriented UML models are very useful for documenting and describing the reproducibility of project workflow processes, both at project activity and detailed activity level. By introducing these two modelling levels the lineage becomes more orderly and accessible. It was sufficient to apply UML activity and class models to express both project processes and detailed processing steps. Employing UML composite objects for multiple datasets and structured activities for a series of processing steps proved very effective for modelling this complex project. Storing project information in these models allows project lineage questions to be answered. The ISO lineage standard, combined with the UML models, supports all aspects referred to by Pollard. The lineage workflow information shows how the data was processed and organised, and the resulting lineage model is useful for decision-makers in organising, processing and distributing data.

 

Table 2: Lineage metadata for the processing step ‘C1a Create Buffer Zones’.

| Metadata | C1a Create Buffer Zones |
|---|---|
| description | Part of ‘First Concept’. Buffer zones are created to extend the area of the primary class of agriculture, A. |
| rationale | The distance of 250m relates to provincial regulations concerning agricultural influence zones. |
| processor | Custom Buffer AML ArcInfo JS v6; Responsible party department |
| dateTime | 1-11-1999 |
| parameter1-name | Buffer distance |
| parameter1-value | 250m |
| parameter2-name | A_AREA_ZONES |
| parameter2-value | 1 |

Acknowledgements

Thanks to Paul Jansen of Geonovum, Marcel de Rink of ESRI, Wies Vullings of the Centre for Geo-Information, and Coen Wessels of Nexpri, all of The Netherlands, for their support and contributions.

 
