September 2009, Volume 23, Issue 9
Data Lineage
11/08/2009
An Essential Step in Data Processing
As William Pollard said, "Information is a source of learning. But unless it is organised, processed, and available to the right people in a format for decision making, it is a burden, not a benefit." In any project it is essential to distribute information to all parties involved, and participants at every working level need reproducible information about the project processes. Information should be organised and presented in a standardised, easy-to-use way, which is possible using data lineage.
John Stuiver, Wageningen University, Netherlands, and Joep Crompvoets, Katholieke Universiteit Leuven, Belgium
Project information is expected to answer essential questions: how, what, when, where, why and who? The answers relate to project processes in which individual processing steps are executed using datasets and user parameters to create new datasets. In ISO 19115, the metadata standard for geo-datasets, data lineage encompasses these processing steps and parameters.
Data Lineage
A series of steps can only be stored when each intermediate processing step is identified. Without an independent source to describe these processing steps, the lineage chain is easily broken. This leads to difficulties: not all datasets can be generated automatically, so their presence in the project lineage information is not guaranteed. Little attention has been paid recently to this important step in reproducing data processes. We therefore propose a new approach to modelling and guaranteeing project lineage information, using the integral zoning of agricultural development in North Brabant (Netherlands) as an example (Figure 1).
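The broken-chain problem can be read as a simple graph check: every dataset used as input should be either a declared source or the output of a recorded processing step. The following is a minimal Python sketch of that reading, not the authors' method; the class, function and dataset names are hypothetical and do not come from the zoning project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingStep:
    """One recorded processing step (hypothetical structure)."""
    name: str
    inputs: tuple
    output: str


def find_broken_links(sources, steps):
    """Return inputs that are neither a declared source nor the output of a recorded step."""
    known = set(sources) | {step.output for step in steps}
    missing = []
    for step in steps:
        for dataset in step.inputs:
            if dataset not in known and dataset not in missing:
                missing.append(dataset)
    return missing


# Hypothetical example: "nature_mask" was created by hand and never documented.
steps = [
    ProcessingStep("buffer", ("farms",), "farm_buffers"),
    ProcessingStep("overlay", ("farm_buffers", "nature_mask"), "zones"),
]
print(find_broken_links({"farms"}, steps))  # ['nature_mask']
```

An empty result means every input can be traced back to a source or a recorded step; any name returned marks a point where the chain is broken.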
UML Diagram
Workflow reproducibility and documentation of the processes performed are of the utmost importance. Two workflow modelling levels are involved: the first is the project process level (where the main processing activities are addressed) and the second is the detailed description of a processing activity (where individual datasets and processing steps are modelled). The zoning project has a number of sequential processes, each with input and output data. The resulting chain of data and processes can easily be modelled in a Unified Modelling Language (UML) diagram, the standard method for complex workflows in ICT projects.
Choosing the appropriate UML diagrams is essential. UML has two main groups of diagrams: structure and behaviour. A mandatory part of any ISO (International Organisation for Standardisation) standard within the Technical Committee 211 (TC211) community is a UML class (structure) diagram. The class diagram of the metadata standard (ISO 19115) can be applied to any dataset. However, project workflow processes, which are dynamic by nature, must also be modelled. UML behaviour diagrams are intended for this purpose, and the most suitable of them is the UML activity diagram. The new approach is to place these two kinds of diagram side by side and link them to one another using lineage.
Project Processes
The North Brabant zoning project has five project processes. These are presented in simplified form in Figure 2, where each project process is considered as one activity and has both input and output data. Data accumulates as the processes are executed, so the amount of metadata increases during the project. This increase corresponds to growing decision-maker interest in metadata as the project progresses; both are represented by the increasing arrow thickness between processes in Figure 2. The need for lineage to guarantee the required metadata therefore grows as the project advances. The increase in metadata can be organised by adding a lineage activity between project processes. A lineage activity can inherit the accumulated metadata from the related process, and may also be used to organise specific metadata at project level.
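The idea of a lineage activity that inherits accumulated metadata can be sketched in a few lines of Python. This is an illustrative assumption about how such accumulation might be represented, not a description of the project's software; the class name, the second process name and the metadata entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineageActivity:
    """Lineage activity placed after a project process (hypothetical structure)."""
    after_process: str
    records: list = field(default_factory=list)


def chain_lineage(process_metadata):
    """Build one lineage activity per process, each inheriting all metadata so far.

    process_metadata is an ordered list of (process_name, metadata_entries).
    """
    accumulated = []
    activities = []
    for name, entries in process_metadata:
        accumulated = accumulated + list(entries)
        activities.append(LineageActivity(name, list(accumulated)))
    return activities


# Hypothetical metadata entries for two of the project processes.
activities = chain_lineage([
    ("First Concept", ["source datasets listed", "buffer distances recorded"]),
    ("Second process", ["review comments recorded"]),
])
print([len(a.records) for a in activities])  # [2, 3]
```

The growing record counts mirror the thickening arrows in Figure 2: each later lineage activity carries everything recorded before it.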
Lineage Activity in UML
The simple project figure can easily be translated into a UML activity model by introducing the lineage activity. This is depicted in Figure 3, where the white boxes represent datasets or, in UML terms, objects. The small chain symbol indicates a composite object, which in this case study represents more than one dataset. The blue boxes represent processes or, in UML terms, activities. The small chain symbol in each activity indicates a structured activity, containing a detailed activity model of several processing steps and datasets. Each activity headed ‘LineageMetadata’ contains lineage information at project process level, which is where questions (e.g. why, what, when) regarding a project process are answered.
An additional advantage of the UML model is that it can also function as a general workflow model for the project. This highly comprehensible overview can be stored in a format which is easily exchanged between participating parties.
Zoning Project
The lineage part of the ISO 19115 metadata standard is used to format the accumulated process information and is applied to entire datasets. Figure 4, the UML class model, depicts how lineage is related to the data source inputs and the processing steps used (lineage relationships to data subsets or individual objects are not taken into account). The zoning project was executed with entire datasets, not with individual objects or data subsets. Table 1 lists the content of the lineage metadata model relating to the process ‘1 First Concept: 1:25000’ (Figure 3). As an example, this project process activity is also modelled as a detailed view of individual processing steps in Figure 5. The ISO lineage class model is again used to describe the operational process steps necessary to create new data.
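The relationship between lineage, sources and process steps can be sketched as plain Python data classes. The field names below loosely follow the ISO 19115 lineage classes (LI_Lineage, LI_ProcessStep, LI_Source) but are heavily simplified; the mapping of fields to the why, when and who questions is an interpretation, and the example values are hypothetical rather than taken from Table 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    """Simplified stand-in for LI_Source."""
    description: str
    scale_denominator: Optional[int] = None


@dataclass
class ProcessStep:
    """Simplified stand-in for LI_ProcessStep."""
    description: str                   # what was done, and how
    rationale: Optional[str] = None    # why
    date_time: Optional[str] = None    # when (ISO 8601 string)
    processor: Optional[str] = None    # who
    sources: List[Source] = field(default_factory=list)


@dataclass
class Lineage:
    """Simplified stand-in for LI_Lineage."""
    statement: Optional[str] = None
    process_steps: List[ProcessStep] = field(default_factory=list)
    sources: List[Source] = field(default_factory=list)


def unanswered(step):
    """List the questions for which the step records no answer."""
    questions = {"why": step.rationale, "when": step.date_time, "who": step.processor}
    return [question for question, value in questions.items() if not value]


# Hypothetical process step with only the rationale filled in.
step = ProcessStep(
    description="Create first concept map",
    rationale="Provide a basis for discussion",
    sources=[Source("Farm locations", scale_denominator=25000)],
)
lineage = Lineage(statement="Illustrative lineage record", process_steps=[step])
print(unanswered(step))  # ['when', 'who']
```

A check such as `unanswered` shows which lineage questions remain open before a dataset is passed on to the next project process.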
The project process ‘First Concept’ is used as an example and presented as a UML activity model in Figure 5. Each lineage metadata activity represents a processing step. The symbols in Figure 5 denote the same activity elements as in Figure 3 (datasets and processes). The added grey box with the folded corner is reserved for additional model comments. The prefixes PS, PR and T represent different levels of dataset importance in relation to the zoning project: PS specifies a project-source dataset, PR a project-result dataset, and T temporary data needed only within this processing activity. The C prefix on processing steps specifies computations in the first project process. The lineage part of ISO 19115 is again used to describe each processing step. As an example, ‘C1a Create Buffer Zones’ is described in Table 2. The questions (e.g. how, where and when) are again addressed, but at the level of individual processing steps.
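The prefix convention lends itself to a small lookup. The sketch below assumes that every label starts with its prefix followed by a digit, as in ‘C1a Create Buffer Zones’; that label format is inferred from this single example, and the dataset labels in the usage lines are hypothetical.

```python
import re

# Prefix meanings as described in the text.
PREFIX_ROLES = {
    "PS": "project source",
    "PR": "project result",
    "T": "temporary",
    "C": "computation",
}


def role_of(label):
    """Return the role encoded in a label's prefix (assumed format: prefix, then a digit)."""
    match = re.match(r"([A-Z]+)\d", label)
    if match is None or match.group(1) not in PREFIX_ROLES:
        raise ValueError(f"unrecognised prefix in label: {label!r}")
    return PREFIX_ROLES[match.group(1)]


def persistent_datasets(labels):
    """Keep only labels for datasets that outlive the processing activity."""
    return [label for label in labels if role_of(label) in ("project source", "project result")]


print(role_of("C1a Create Buffer Zones"))  # computation

# Hypothetical dataset labels.
labels = ["PS1 Farms", "T1 Buffers", "PR1 Zones"]
print(persistent_datasets(labels))  # ['PS1 Farms', 'PR1 Zones']
```

Filtering out the T-prefixed labels leaves the datasets whose lineage has to be preserved at project level.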
Concluding Remarks
The approach presented shows that lineage-oriented UML models are very useful for documenting project workflow processes reproducibly, both at project activity level and at detailed activity level. Introducing these two modelling levels makes the lineage more orderly and accessible. UML activity and class models were sufficient to express both project processes and detailed processing steps. Employing UML composite objects for multiple datasets and structured activities for a series of processing steps proved very effective for modelling this complex project. Storing project information in these models allows project lineage questions to be answered. The ISO lineage standard combined with the UML models supports all aspects referred to by Pollard. The lineage workflow information shows how the data could be processed and organised, and the resulting lineage model is useful for decision-makers in organising, processing and distributing data.
Acknowledgements
Thanks to Paul Jansen of Geonovum, Marcel de Rink of ESRI, Wies Vullings of the Centre for Geo-Information, and Coen Wessels of Nexpri, all of The Netherlands, for their support and contributions.
Biography of the Author(s)
John Stuiver has more than 25 years' experience in GIS at the Laboratory of Geo-Information Science and Remote Sensing at Wageningen University, and has participated in both commercial and fundamental research projects. He has been a member of the Dutch delegation to ISO/TC211 since 2007. Email: john...@wur.nl
Joep Crompvoets is associate professor at the Public Management Institute of the Katholieke Universiteit Leuven (Belgium) and lecturer at the Laboratory of Geo-Information Science and Remote Sensing at Wageningen University (Netherlands). He specialises in development and research into GIS and SDIs.