Senior Design Project
Systems Science Engineering

Fall 2007 

Andrea Martinez
ESE 499 Senior Design Project:
Automating the Registration Process of Air Quality Emissions Datasets
Washington University, St. Louis

Project Supervisor:

Stefan Falke
Research Assistant Professor
Washington University Department of Energy, Environmental and Chemical Engineering

The Center for Air Pollution Impact and Trend Analysis (CAPITA), a research division of the Energy, Environmental & Chemical Engineering Department at Washington University, conducts air quality research and disseminates its data through DataFed and the DataFed Wiki. To distribute its geospatial metadata and improve public access to it, CAPITA must register the metadata with geospatial databases. In the past, CAPITA registered its metadata manually, but in order to register all of its air quality emissions datasets and keep them up to date, it has moved toward an automatic registration process. The purpose of this research project has been to develop that automatic dataset registration process by creating an eXtensible Stylesheet Language (XSL) stylesheet that automatically transforms the metadata into the appropriate content standard format.

Creating an automatic registration process for the air quality emissions datasets has been a collaborative effort.  My part in this project has been to design an eXtensible Stylesheet Language Transformations (XSLT) Stylesheet that converts metadata into the Federal Geographic Data Committee (FGDC) format, and then to register the transformed metadata at GeoSpatial One Stop (GOS), a web-based portal that provides public access to maps, data, and other geospatial services. The FGDC supports the Content Standard for Digital Geospatial Metadata (CSDGM), which is the US Federal Metadata Standard. An image map, available here, gives more information regarding CSDGM and its required elements.

My task was to create an XSLT Stylesheet to transform a DataFed Wiki XML document into an FGDC-compatible XML document.  With XSLT, an XML document can be transformed into another XML document by rearranging, sorting, adding, and removing elements to produce the desired output structure.  The operation of the entire process is shown below in figure 1.  The existing XML document (Source document) is typically parsed before it is passed to the XSLT Processor.  The Source document and the XSLT Stylesheet are both supplied to the XSLT Processor; the Stylesheet describes how the Source document is to be transformed, and the Processor applies those rules to produce the Result document.

Figure 1. Operation of an XSLT Processor
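
The processing model in Figure 1 can be illustrated with a small sketch. It is written in Python with the standard library rather than in XSLT, so it is only a stand-in for what a stylesheet template does: a mapping table plays the role of the Stylesheet, and a function plays the role of the Processor. The source element names (dataset, name, summary, provider) are hypothetical and are not taken from the DataFed Wiki; the target paths use CSDGM short names under a metadata root.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; these element names are illustrative only.
SOURCE_XML = """<dataset>
  <name>Example Emissions Inventory</name>
  <summary>Annual emissions by county.</summary>
  <provider>Example Agency</provider>
</dataset>"""

# Stand-in for the Stylesheet: hypothetical source tag -> CSDGM path.
MAPPING = {
    "provider": "idinfo/citation/citeinfo/origin",
    "name": "idinfo/citation/citeinfo/title",
    "summary": "idinfo/descript/abstract",
}


def ensure_path(root, path):
    """Return the element at `path` below `root`, creating missing levels."""
    node = root
    for tag in path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    return node


def transform(source_xml):
    """Stand-in for the Processor: build a Result tree from the Source."""
    source = ET.fromstring(source_xml)
    result = ET.Element("metadata")
    for src_tag, dest_path in MAPPING.items():
        value = source.findtext(src_tag)
        if value is not None:  # copy only values present in the Source
            ensure_path(result, dest_path).text = value
    return result


result = transform(SOURCE_XML)
print(ET.tostring(result, encoding="unicode"))
```

Because every value in the Result tree comes from the Source document, nothing in this sketch is hard-coded except the mapping itself, which is the property the Stylesheet design aims for.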



Developing a feedback control loop flowchart of the system was very helpful in the design and testing process.  The flowchart, shown below in figure 2, contains two feedback loops.  The feedback ensures that the controlled variable (Result Document) is measured and that the measurement is used to influence its value.  Once the Result Document has been validated and its errors measured, that information is used to update the Stylesheet and the DataFed Wiki, and the transformation is re-run to obtain a better Result Document.  An advantage of a closed-loop feedback system is that errors are observed after every run, which makes them easier to control and adjust and therefore more likely to be reduced.



Figure 2. Feedback Control Loop Flowchart of System
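
The loop in Figure 2 can be sketched as a transform, validate, revise cycle. The Python below is a toy model under stated assumptions: in the project the validation was performed by an external service and the revisions were made by hand, whereas here validate and revise are hypothetical stand-ins, the "stylesheet" is just a list of element names it emits, and the REQUIRED list is illustrative.

```python
# Illustrative set of elements the validator expects (an assumption).
REQUIRED = ("title", "abstract", "purpose")


def transform(stylesheet):
    """Toy transformation: the result contains whatever the stylesheet emits."""
    return {name: "value" for name in stylesheet}


def validate(result):
    """Toy validator: report each required element missing from the result."""
    return [name for name in REQUIRED if name not in result]


def revise(stylesheet, errors):
    """Toy revision: fix one reported error per pass."""
    return stylesheet + [errors[0]]


def run_feedback_loop(stylesheet, max_rounds=10):
    """Repeat transform -> validate -> revise until no errors remain."""
    history = []  # number of errors measured on each pass
    for _ in range(max_rounds):
        errors = validate(transform(stylesheet))
        history.append(len(errors))
        if not errors:
            break
        stylesheet = revise(stylesheet, errors)
    return stylesheet, history


final_stylesheet, error_history = run_feedback_loop(["title"])
print(final_stylesheet, error_history)
```

The error history makes the closed-loop behaviour visible: each pass measures the Result Document, and that measurement drives the next revision.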

Two important criteria had to be considered when designing the Stylesheet: the number of errors and the number of extracted elements. It was important to have as few errors as possible, ideally none, to ensure that the Result Document conformed to CSDGM. These errors were reported when the Result Document was validated using the Geospatial Metadata Validation Service (GMVS). Also, since the goal was to make the Stylesheet as generic as possible, it was just as important to extract every element from the Source Document and for the Stylesheet not to contain any hard-coded values. The number of errors and the number of extracted values were used as the validation measures in the feedback loop of the system, with the goal of reducing the errors and increasing the extracted values. The Result Document was judged against these two criteria to determine how well it performed and what revisions needed to be applied to the Stylesheet and the DataFed Wiki (Source Document) in order to improve it.
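
The two criteria can be written down as a small calculation. The sketch below assumes that the extraction percentage is the share of metadata elements whose values are extracted from the Source Document rather than hard-coded in the Stylesheet; the function names and the counts in the example are hypothetical and are not the project's actual figures.

```python
def extraction_percentage(extracted, hard_coded):
    """Share of metadata elements extracted rather than hard-coded, in percent."""
    total = extracted + hard_coded
    if total == 0:
        raise ValueError("no metadata elements to measure")
    return 100.0 * extracted / total


def goal_met(error_count, extracted, hard_coded):
    """The stated goal: zero validation errors and 100% extraction."""
    return error_count == 0 and hard_coded == 0 and extracted > 0


# Hypothetical counts, chosen only to illustrate the arithmetic.
pct = extraction_percentage(extracted=3, hard_coded=1)
print(pct, goal_met(error_count=1, extracted=3, hard_coded=1))
```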

The current Stylesheet was obtained through the revision of previous Stylesheets. In total, there were five Stylesheets that embodied significant changes made after validation; the most current Stylesheet is therefore known as Stylesheet 5.  In addition to the earlier changes, this Stylesheet was modified so that it can be linked to three sources (instead of one) in order to extract more metadata elements.  Applying the validation techniques yielded the following results: one error, consisting of a single missing metadata element, and extraction of 91.6% of the metadata elements.  These results are very near the goal of zero errors and 100% extraction.  The missing metadata element is of minimal severity, since it is mandatory only if applicable and, in our case, does not apply to the dataset.  The remaining 8.4% of the metadata elements are still hard-coded because of the complexity of two specific elements: Theme Keyword and Attribute Label.  Both elements are specific to each dataset and can occur once or multiple times, which makes it hard to keep the Stylesheet generic.  Work on allowing multiple selections per metadata element is currently in progress.  Until those changes are made, these two metadata elements will need to remain hard-coded in the Stylesheet.
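
To illustrate what handling a repeating element involves, the sketch below emits one CSDGM themekey element for every keyword found in a source document. It is one possible approach under stated assumptions, not the solution adopted in the project: it is written in Python rather than XSLT, the source element names (dataset, keyword) are hypothetical, and the thesaurus value "None" is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical source with a repeating element; names are illustrative only.
SOURCE_XML = """<dataset>
  <keyword>emissions</keyword>
  <keyword>air quality</keyword>
</dataset>"""


def build_theme(source_xml, thesaurus="None"):
    """Build a <theme> block with one <themekey> per source keyword."""
    source = ET.fromstring(source_xml)
    theme = ET.Element("theme")
    ET.SubElement(theme, "themekt").text = thesaurus  # placeholder thesaurus
    for keyword in source.findall("keyword"):  # zero, one, or many keywords
        ET.SubElement(theme, "themekey").text = keyword.text
    return theme


theme = build_theme(SOURCE_XML)
print(ET.tostring(theme, encoding="unicode"))
```

Iterating over however many keywords the source provides means the same code serves datasets with one keyword or several, which is the generality the hard-coded values currently prevent.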

All five of the Stylesheets contain significant changes, which were identified through the validation techniques and applied to the Stylesheet and the DataFed Wiki via the feedback loop. The progression of the Stylesheets toward the goal of zero errors and 100% extraction is shown below in figure 3 as a bar chart. Columns 1-5 correspond to Stylesheets 1-5, and column 6 indicates the goal. The bars in the front indicate the total number of errors, and the bars in the back indicate the extraction percentage. The chart shows that Stylesheet 5 has nearly achieved the goal. For more information regarding Stylesheets 1-4, please view the Final Report in PDF form here.

Figure 3. Comparison of Results for Stylesheets 1-5

In my senior design project, I designed an XSLT Stylesheet that facilitates the dataset registration process. I also used important Systems Engineering concepts to develop testing methods. The development of the flowchart and the use of its feedback loops were beneficial in validating the Result Document and then updating the Stylesheet to achieve improvements. This testing scheme allowed me to adjust the design of the Stylesheet and determine whether the goal of zero errors and 100% extraction was being met.

Stylesheet 5 produced very favorable results and nearly achieved the goal. Although it is not 100% generic, it still simplifies and speeds up the registration process considerably, and valid postings on GOS can be made with it. The remaining step, to be developed by the lead programmer, is to determine how to allow Theme Keyword and Attribute Label to take multiple values. As soon as these changes are made to the Stylesheet or the DataFed Wiki, my Stylesheet can be used to automatically transform any source XML document into an FGDC-compatible XML document. The result document can then be registered at GOS, where it will be scheduled for harvesting, completing the automatic registration process.

For more information, please view the following:
Stylesheet 1
Stylesheet 2
Stylesheet 3
Stylesheet 4
Stylesheet 5
Current DataFed Wiki Source XML Document
Result XML Document (using Stylesheet 5)



*Image in the top right corner obtained from