KDI Update: Creating the Knowledge Network for Biocomplexity

Issue: Network News Fall 2000, Vol. 13 No. 2
Section: Network News

Imagine that your research program could be considerably more efficient than it already is. Imagine that you could have access to a data catalog containing twice as much—or maybe ten or a hundred times as much—data as you could collect over your entire career. Imagine what new questions you would ask. Imagine having access to eager collaborators who want to share data and ideas and work together toward unraveling complex biological questions.

This is the premise of the Knowledge Network for Biocomplexity project, a collaborative project between the National Center for Ecological Analysis and Synthesis (NCEAS), the Long Term Ecological Research Network Office, Texas Tech University, and the San Diego Supercomputer Center (SDSC). In 1999, these organizations were funded jointly to develop the Knowledge Network for Biocomplexity (KNB), a depository and retrieval system for ecological data. The KNB will facilitate scientific research in ecology and related disciplines. The KNB project has three main components:

  1. Informatics research and development
  2. Large-scale Ecological and Biogeographical research
  3. Education, Outreach, and Training

Informatics research and development

Work on the KNB has laid the groundwork for the infrastructure, for the software tools, and for building a community of users. The next step (beginning in early 2001) involves installing user interfaces (software) at sites where data exist. These sites include your own computer as well as servers at NCEAS, LTER sites, OBFS sites, and various universities. A major component of the project involves gaining the interest and support of researchers (like you) so that you will enter your data and register them with the network.

Ecological data reside in many different formats, media, and locations, and currently have no centralized index. Finding those data, retrieving them, and analyzing them are often time-consuming and inefficient tasks. The KNB solves many of these problems by providing a set of data management tools (software) that serves as a single portal to all the data that reside within its network of databases. The KNB contains a sophisticated engine for searching a catalog of metadata to find specific types of data. The software allows users to search metadata by title, abstract, author, keyword, and full text, among many other possibilities. It also contains components that facilitate the creation of metadata and tie the metadata and data together, allowing users to view many types of actual data online. These software tools allow scientists not only to identify appropriate data for a project but also to retrieve and, eventually, analyze them easily with the same tools. The first release of the software, created at the NCEAS, is scheduled for January 2001.
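To make the idea of a fielded metadata search concrete, here is a minimal sketch in Python. It is not the KNB search engine: the `MetadataRecord` class, the `search` function, and the two catalog entries are hypothetical and exist only for illustration, and "full text" is approximated by searching every field of a record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetadataRecord:
    """Hypothetical, simplified metadata entry (not the KNB schema)."""
    title: str
    abstract: str
    authors: List[str]
    keywords: List[str] = field(default_factory=list)


def search(catalog, term, fields=("title", "abstract", "authors", "keywords")):
    """Return records in which `term` occurs, ignoring case, in any chosen field."""
    needle = term.lower()
    hits = []
    for record in catalog:
        for name in fields:
            value = getattr(record, name)
            # List-valued fields (authors, keywords) are joined into one string.
            text = " ".join(value) if isinstance(value, list) else value
            if needle in text.lower():
                hits.append(record)
                break  # one matching field is enough for this record
    return hits


# Invented example entries, used only to exercise the function.
CATALOG = [
    MetadataRecord(
        title="Example plant species richness survey",
        abstract="Hypothetical record: plot-level species counts.",
        authors=["A. Researcher"],
        keywords=["biodiversity"],
    ),
    MetadataRecord(
        title="Example aboveground biomass harvest",
        abstract="Hypothetical record: annual primary productivity estimates.",
        authors=["B. Scientist"],
        keywords=["productivity", "biomass"],
    ),
]

for hit in search(CATALOG, "productivity"):
    print(hit.title)
```

Restricting the `fields` argument, for example `search(CATALOG, "researcher", fields=("authors",))`, corresponds to an author-only search; leaving it at the default searches everything in the record.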

In order to give researchers ultimate control over how—and by whom—their data are used, the software that provides data management tools will contain a feature allowing researchers to set different levels of access to their data by users or groups. For example, you could specify unlimited access for anyone using your local server (where the data actually reside) but allow read-only privileges to people who access your data from a remote server.
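One way to picture such access levels is the small Python sketch below. It is an assumption-laden illustration, not the KNB implementation: the `Access` levels, the `DataSetPolicy` class, and the group names `local-users` and `remote-users` are all hypothetical, and "unlimited access" is modeled simply as the highest level.

```python
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical access levels, ordered from least to most permissive."""
    NONE = 0
    READ = 1
    WRITE = 2  # stands in for "unlimited access" in this sketch


class DataSetPolicy:
    """Maps user or group names to an access level for one data set."""

    def __init__(self, default=Access.NONE):
        self.default = default
        self.rules = {}  # principal (user or group name) -> Access

    def grant(self, principal, level):
        self.rules[principal] = level

    def level_for(self, user, groups=()):
        # A user receives the most permissive level granted to them
        # directly or through any group they belong to.
        levels = [self.rules[p] for p in (user, *groups) if p in self.rules]
        return max(levels, default=self.default)

    def can(self, user, needed, groups=()):
        return self.level_for(user, groups) >= needed


# Mirrors the example in the text: full access locally, read-only remotely.
policy = DataSetPolicy()
policy.grant("local-users", Access.WRITE)
policy.grant("remote-users", Access.READ)

print(policy.can("alice", Access.WRITE, groups=["local-users"]))
print(policy.can("bob", Access.WRITE, groups=["remote-users"]))
```

Defaulting to `Access.NONE` means that anyone not covered by a rule sees nothing, which keeps control with the researcher who owns the data.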

The Storage Resource Broker (SRB), software developed at the SDSC, is one of the computer technologies that ties all the database servers together. This technology enables the KNB to share data that may be in multiple formats (e.g., comma- or tab-delimited text, SAS files, Excel spreadsheets, Matlab files) across multiple operating systems (e.g., UNIX, Windows, NT, MacOS). The LTER Network Office is directing the effort to install the SRB on various servers where the data currently reside. The first round of installations includes servers at the six LTER Network sites that provided data for an NCEAS Working Group project (see below) that explored the relationships between biodiversity and primary productivity.

Research efforts to examine large-scale Ecology and Biogeography

The relationship between biodiversity and productivity is receiving considerable attention in the current ecological literature. This is because the results from experimental and observational studies often conflict with theoretically predicted patterns. Our hope is that synthetic studies that make use of data sets from various sources can provide a more complete understanding of the “fit” between predicted and observed patterns. The primary goal of the NCEAS Working Group was to reconstruct, document, and analyze data about biodiversity and productivity from six LTER sites. Site representatives provided data that they had discovered, retrieved, and manipulated (i.e., secondary data). These data then were synthesized to examine relationships between productivity and diversity across sites.

The LTER program has an established policy on making data available on the Internet. Thus, researchers hoped that the original, unmanipulated data (i.e., primary data) from each LTER site could be identified and retrieved. In addition to providing a validation exercise for the KNB, this process would lead to the discovery of new data on biodiversity and productivity from the six LTER sites. However, several shortcomings of this approach to synthetic research soon became apparent. First, metadata (documentation about a data set), when available, were not consistent, even within the LTER Network. Second, an efficient means of searching metadata does not exist. Third, current metadata documents often do not contain enough information to assess the validity of comparing data from multiple sites. These findings demonstrate the considerable obstacles to conducting synthetic analyses of biodiversity issues, and they reinforce the necessity of developing the KNB.

The KNB should allow researchers to conduct synthetic ecological research in a more efficient and thorough manner. First-year work on the project has involved developing a data set to use for testing and validating the KNB—putting the software through its paces—as well as initiating further research on the relationship between biodiversity and productivity. Validation of the KNB will be based partially on the results from this study and on the previous study done by the aforementioned Working Group. Approximately half of the original and manipulated LTER data that were used for the original study have been retrieved and documented in a standardized format. The data used for this validation exercise are dispersed among six LTER sites, reside on multiple servers running different operating systems, and exist in various formats. This makes retrieving all the data from this study an excellent test of the data management tools developed by the Informatics team. It also tests how well those tools make the KNB accessible to researchers.

Storage Resources (Ecological Data)
  • Scattered among numerous storage sites (e.g., LTER sites, OBFS stations, other data archives)
  • Stored in numerous formats
Storage Resource Broker
  • Software that resides at storage sites
  • Allows and controls access to storage resources
Metadata Catalog
  • Describes the available data resources
  • In standard (XML) format
Desktop Client Software
  • Contains tools for creating and editing metadata
  • Allows users to query all data in hypothesis format
  • Identifies, retrieves, and integrates all relevant data
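As a rough illustration of what a metadata catalog entry in XML might look like, the Python sketch below builds a small record and reads it back. The element names (`dataset`, `title`, `creator`, `keywords`, `keyword`, `format`) and the sample values are assumptions made for this example only; they are not the metadata standard used by the KNB.

```python
import xml.etree.ElementTree as ET


def build_metadata(title, creator, keywords, data_format):
    """Build a hypothetical XML metadata record describing one data set."""
    root = ET.Element("dataset")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "creator").text = creator
    keyword_list = ET.SubElement(root, "keywords")
    for word in keywords:
        ET.SubElement(keyword_list, "keyword").text = word
    ET.SubElement(root, "format").text = data_format
    return root


def to_xml(root):
    """Serialize the record to a text string."""
    return ET.tostring(root, encoding="unicode")


def keywords_of(xml_text):
    """Parse a serialized record and return its keywords."""
    return [element.text for element in ET.fromstring(xml_text).iter("keyword")]


# Invented example values, used only to show the round trip.
record = build_metadata(
    title="Example productivity data set",
    creator="A. Researcher",
    keywords=["biodiversity", "productivity"],
    data_format="tab-delimited text",
)
xml_text = to_xml(record)
print(xml_text)
print(keywords_of(xml_text))
```

Because the record is plain structured text, the same document can be created by desktop software, stored in a catalog, and searched by another program without any of them needing to know how the underlying data file is formatted.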

Because data retrieval, even within the LTER network, can be laborious and inefficient, we sincerely hope that the KNB will encourage participating researchers to work together to document data with appropriate metadata and to register the data with a KNB server. By sharing data and associated metadata, researchers can leave a permanent and endlessly useful legacy of data for generations of scientists to come.

Once the KNB is operational and populated with data, researchers will have access to data and expertise never before available. For example, scientists will be able to find data to use for a web-based pilot project—a project that might otherwise cost thousands of dollars or an entire field season. Researchers can use the KNB to advertise for collaborators who have expertise critical to a project’s success, to increase sample sizes, to do time-series analyses, to gain new information about relevant subjects, etc. Ultimately, the array of research projects facilitated by the KNB will be limited by:

  1. The amount and types of data in the network
  2. Your own imagination

The KNB will require the cooperation and input of an entire network of researchers, including you. By entering your data, you—the researchers—are helping to build the KNB, and are helping yourselves and your colleagues to:

  1. Increase research efficiency
  2. Gain a better understanding of large-scale biological patterns and processes
  3. Enjoy collaborations with other ecologists
  4. Identify other subject areas in need of further research

Plans for Education, Outreach, and Training

Also in development is a nationwide (and, potentially, international) outreach program to train graduate students to use the KNB to facilitate their own research. Beginning in January 2001, scientists at the NCEAS will coordinate a web-linked series of multi-institution graduate student research training seminars focused on multi-scale patterns of species richness and productivity. Students will use the KNB and the data management software on their local computers to design and complete a research project. They will engage in online collaborations among their various institutions. The seminars will culminate with collaborative Working Group meetings at the NCEAS. Targeting students early in their careers will foster computer-based research skills that will enable them to tackle complex ecological questions.

The KNB team is seeking collaborators and contributors who want to become part of this network-based research endeavor, an effort that involves more than 20 people working on state-of-the-art Informatics research and development, large-scale Ecological and Biogeographical research, and Education, Outreach, and Training. If you want to contribute to the informatics research and development, contribute data, participate in the seminar activities, or otherwise collaborate on this project, contact:

Informatics Collaboration:
Matt Jones (jones@nceas.ucsb.edu)

Biocomplexity Collaboration:
Stephen Cox (stephen.cox@TTU.EDU)

Seminar Collaboration:
Elizabeth Sandlin (sandlin@nceas.ucsb.edu)

For more information, please visit the website: http://www.nceas.ucsb.edu/kdi