Ecological Metadata Language Increases Research Capability

Issue: Network News Spring 2002, Vol. 15 No. 1
Section: Network News

Ecological Metadata Language (EML) is a metadata standard developed by the ecological community for the ecological discipline. This article provides a brief background on EML and discusses its possible benefits for LTER sites and for ecological research in general. EML is based on prior work done by the LTER Information Managers Committee and the Ecological Society of America Future of Long Term Data Sets (FLED) Committee (Michener et al. 1997, Ecological Applications).

EML was originally developed at the National Center for Ecological Analysis and Synthesis (NCEAS) and is now a community-based project with an increasingly broad list of players. EML is implemented as a series of XML (eXtensible Markup Language) document types that can be used in a modular and extensible manner to document ecological data. Each EML module is designed to describe one logical part of the total metadata that should be included with any ecological dataset.

Development of EML is an ongoing process that requires comment and input from the community it will serve. An ecological metadata standard has broad implications for LTER research. Currently each site within the LTER Network is responsible for its own metadata management system. These systems have developed over time as each site’s needs have dictated, which has led to heterogeneity in site metadata content, format, and storage systems. Examples range from text residing in a file on a computer to complicated centralized relational database management systems with customized interfaces. This heterogeneity makes it extremely difficult to develop software tools for cross-site searching, sharing, integration, and analysis of metadata and data. EML helps solve this problem by providing a standard format for metadata content so that software tools can work together seamlessly.

EML could also be used as a guide by information managers and researchers for their own site- or research-specific metadata needs. EML version 2.0 will be finalized at a workshop at the Sevilleta Research Station in April 2002 and released to the community for review as a standard soon after.

The ability of sites to generate EML-standard metadata makes possible the creation of general tools to support ecological research (Figure 1). A list of tools and capabilities made possible by the availability of EML metadata was identified during a January 2002 workshop aimed at facilitating the implementation of EML at LTER sites. For the individual researcher, EML makes possible:

  • Access to on-line analytical engines that integrate a variety of analytical tools, such as SAS and MATLAB, with point-and-click access to LTER data from multiple sites
  • Seamless, automated preparation for analysis of downloaded data from LTER sites that are generating EML
  • The ability to search, browse and locate available data using sophisticated search engines
  • Automated production of customized data input forms that check data for errors or inconsistencies as data is being entered, either in the laboratory or on palmtop computers in the field
  • Use of metadata development as a tool for research design. Research design tools can have their parameters specified based on metadata.

For the individual site, EML makes possible:

  • Use of generic software for metadata and data management, which may include:
    • Sophisticated programs for tracking changes in data
    • Easy-to-use metadata entry forms
    • Powerful search, query, and retrieval interfaces
    • Tools for managing model and application data
    • Tools for providing data in a variety of forms (e.g., spreadsheet, statistical package, GIS)
  • Tools for production of alternative forms of metadata that allow the site to easily participate in national databases and clearinghouses (e.g., Global Change Master Directory, Mercury, National Biological Information Infrastructure, ISO TC211)
  • Reductions in software and personnel costs through use of generic software that may have been produced by other projects

For the LTER Network as a whole, EML makes possible:

  • Improved facilitation of intersite synthesis by standardizing procedures for use of data from different sites
  • Development of software useful to multiple LTER sites
  • Easier development of Network Information System Modules that regularly integrate data from multiple LTER sites

All of these tools and capabilities have three things in common. First, they will improve our ability to conduct cutting-edge ecological research such as the recent cross-site primary production comparison (Knapp, A. and M.D. Smith 2001, Science 291: 481-484). Currently, large amounts of time are required to process and integrate data that originate at different sites, or even within the same site. Reducing the time required to standardize data months or years after it has been collected will facilitate research at larger scales of time and space. Second, because of the extreme heterogeneity of metadata and data at sites, it is virtually impossible to create tools without LTER site participation in the generation of EML. Only at the site does the expertise exist to translate site metadata into alternative forms (e.g., EML). Once created, EML documents allow for consistent exchange, so that tools can be developed on-site, according to site needs, and still access metadata and data from any site that is generating EML. Finally, the standardization of LTER metadata representation in EML is a foundation we can build on. Erecting our structures atop it, i.e., generating the EML-based metadata documents and the tools that use them, will come at a price, demanding additional work and imagination by information managers and software developers at LTER sites and elsewhere.

Additional information on EML is available at: http://www.ecoinformatics.org

Individual Research

A sudden beep from her palmtop computer brought Shirley Wright, graduate student, back from her musings on the role caterpillar frass could play in the local nitrogen cycle. Looking at her display, she saw an error message that the pH value in the data record she was entering was out of the specified bounds. Shirley had specified the valid range for pH values when she prepared the metadata for her data set weeks before. She had estimated an appropriate pH range by querying an online analysis engine for pH values from data collected on the same soil type. Someone told her that the data for that analysis came from three different LTER sites and required an analytical system that integrated GIS software and statistical packages, but she didn’t need to know the details because the user-friendly web interface hid unnecessary complexities. The actual analysis had been done on a computer at some other site, but again Shirley did not need to be immersed in the technology to use it. Although time consuming, the preparation of the metadata for Shirley’s study had been a useful exercise. It allowed her to think through which variables would need to be measured and what units of measurement should be used. Preparing the metadata ahead of the study:

  1. Gave her a customized data input program for her palmtop.
  2. Automatically generated quality-checking features (like the one that is now beeping at her).

A sudden cold gust of wind reminds Shirley that it might be important to add a variable for snow depth to her dataset. With this tool, all she needs to do is add a new variable for snow to the metadata and download a new input module to her wirelessly connected palmtop. The module used to gather snow depth metadata running on Shirley’s palmtop is a collaborative effort between the information manager at her site and information managers at two other sites. The three of them can collaborate on modules that can be used by all sites because the modules are built upon EML, which is standard across all sites. However, they aren’t there to help her with her current problem: the beeping palmtop. Why the beep? Inspection of the pH value indicates that Shirley had mistakenly entered two ‘8’s instead of one. Having a measured pH of 88.5 could definitely have caused problems in the model she is working on. Glad the program caught it now! - JP
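The out-of-range check in Shirley's story can be sketched in a few lines of Python. This is only an illustration of the idea, not an existing tool: the XML layout, the attribute names, the bounds (3.0 to 9.5 for pH), and the functions `load_bounds` and `check_record` are all hypothetical, and the element names are deliberately simplified rather than taken from the EML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified metadata for Shirley's dataset. The element and
# attribute names are illustrative only; they are NOT the EML schema.
METADATA_XML = """
<dataset>
  <attribute name="pH" unit="dimensionless" minimum="3.0" maximum="9.5"/>
  <attribute name="snow_depth" unit="centimeter" minimum="0" maximum="500"/>
</dataset>
"""


def load_bounds(metadata_xml):
    """Return {variable name: (minimum, maximum)} read from the metadata."""
    root = ET.fromstring(metadata_xml)
    bounds = {}
    for attr in root.findall("attribute"):
        low = float(attr.get("minimum"))
        high = float(attr.get("maximum"))
        bounds[attr.get("name")] = (low, high)
    return bounds


def check_record(record, bounds):
    """Return one error message per value that falls outside its bounds."""
    errors = []
    for name, value in record.items():
        if name not in bounds:
            continue  # no bounds declared for this variable
        low, high = bounds[name]
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside the valid range {low} to {high}")
    return errors


bounds = load_bounds(METADATA_XML)
print(check_record({"pH": 88.5}, bounds))  # the mistyped value is flagged
print(check_record({"pH": 8.5}, bounds))   # the intended value passes
```

Because the bounds live in the metadata rather than in the program, adding the snow depth variable only means adding one more entry to the metadata; the checking code itself does not change.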

Collaborative Studies

Things have not gone well for Dr. Publish R. Parish today. The working group he assembled to study the effects of caterpillar frass on soil nitrogen levels is spending all its time dealing with issues of data formatting and unit conversions, leaving no time for real analyses. Moreover, a rift is developing in the group between people who want to analyze the data using Excel and those who want to use MATLAB. Unfortunately, with 24 sites to deal with, it is excruciating enough to import the data into one or the other. Doing both is out of the question! However, things are starting to look up. His new graduate student, Shirley Wright, has been helping the participants search the LTER Data Catalog, which is based around the EML metadata standard. Workshop participants are able to access data from all of the LTER sites using a common interface, one that gives them the option of receiving the data either as an Excel spreadsheet or a MATLAB file, among other options.

They don’t even need to worry about the fact that some of the underlying data was originally archived as comma-separated ASCII, some as tab-separated ASCII, and some using column formatting. Because that formatting information is stored in the metadata, automated programs written at different LTER sites can read the data in its raw form and transform it into the forms requested by his group. Shirley is now trying out an online data integration engine. It allows her to identify equivalent variables in different datasets through a “point and click” process. Possible “matches” are identified based on similar units, and she can read the specific methods for each variable to see whether they should, or should not, be matched with one another. Once she has identified the appropriate matches, a new dataset (along with its own set of metadata and citations for the original datasets used) will be produced in the format she requests. It was never like this in the old days! - JP
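The idea that formatting information stored in the metadata lets one program read differently formatted files can be sketched as follows. This is a minimal sketch under stated assumptions: the metadata dictionaries, their keys (`format`, `delimiter`, `columns`, `names`), the sample values, and the function `read_table` are all hypothetical stand-ins for whatever an EML-aware tool would actually read from a site's metadata.

```python
import csv
import io


def read_table(raw_text, metadata):
    """Parse raw text into a list of row dicts, guided only by the metadata."""
    names = metadata["names"]
    if metadata["format"] == "delimited":
        reader = csv.reader(io.StringIO(raw_text), delimiter=metadata["delimiter"])
        rows = [row for row in reader if row]  # skip blank lines
    elif metadata["format"] == "fixed":
        # Each column is given as a (start, end) character range.
        rows = [
            [line[start:end].strip() for start, end in metadata["columns"]]
            for line in raw_text.splitlines()
            if line.strip()
        ]
    else:
        raise ValueError(f"unknown format: {metadata['format']}")
    return [dict(zip(names, row)) for row in rows]


# Three hypothetical sites archiving the same two variables in different layouts.
sources = [
    ("2002-01-15,6.8\n",
     {"format": "delimited", "delimiter": ",", "names": ["date", "pH"]}),
    ("2002-01-15\t5.9\n",
     {"format": "delimited", "delimiter": "\t", "names": ["date", "pH"]}),
    ("2002-01-15   6.2\n",
     {"format": "fixed", "columns": [(0, 10), (10, 17)], "names": ["date", "pH"]}),
]

combined = []
for raw_text, metadata in sources:
    combined.extend(read_table(raw_text, metadata))
print(combined)
```

The reader never inspects the file to guess its layout; everything it needs comes from the metadata, which is why the same code can serve every site that describes its data in a common way.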


* Note: The tools and capabilities described in these boxes do not yet exist, although in a few cases prototypes are under construction. All rely on the development of EML-formatted metadata to provide the common platform for sharing metadata among sites. Additional development will be required before information from EML data sets can be used to directly access the data itself.

Acknowledgements: The author thanks participants in the January 2002 Metadata Workshop at CAP LTER for such stimulating discussion of the wealth of possibilities opened by the development and implementation of EML. Patty Sprott and Owen Eddins made constructive comments and additions to the manuscript. However, any errors or absurdities remain the author’s. Particular thanks go to Matt Jones, Peter McCartney and others who have worked so hard on developing the EML standard.