Synthesizing large datasets

Issue: Network News Fall 2010, Vol. 23 No. 2

LTER information managers and graduate students collaborate on an international research project

Two graduate students and four information managers from the US Long Term Ecological Research (LTER) Network joined researchers from Malaysia, Taiwan, Vietnam, and Thailand in the “Second Analytical Workshop on Dynamic Plot Database Application and Tool Design” that took place July 18-23, 2010 in Kuala Lumpur, Malaysia. The workshop brought together scientists and information managers from the Center for Tropical Forest Science (CTFS), East Asia Pacific-ILTER and the US LTER to collaboratively analyze long-term datasets collected by CTFS while monitoring forest plots in Taiwan, Malaysia, Singapore, Panama, Japan, Puerto Rico and the US.

The workshop had two major goals:

  1. To “field-test” advanced informatics tools and approaches built around the Ecological Metadata Language (EML), the “R” statistical language, and the Kepler Scientific Workflow system
  2. To attempt comparative ecological analyses of data from multiple international plots to identify broad patterns

Two analyses were initiated at the workshop: Jennifer Holm, a graduate student at the University of Virginia, led a working group that examined how forest biodiversity differs between plots, while University of New Hampshire graduate student Matt Vadeboncoeur’s working group explored how spatial structure varies across sites.

The workshop also gave Malaysian scientists engaged in the Malaysian Ecological Research Network (MyERNET) hands-on training in the three informatics tools. MyERNET, an initiative of the Forestry Research Institute of Malaysia, is using metadata-based approaches pioneered by the Partnership for Biodiversity Informatics (which includes the US LTER) to increase data archiving and reuse.

Meei-ru Jeng of the Taiwan Ecological Research Network (TERN) led a session that included presentations by information managers from the US, Taiwan, and Malaysia.

In addition to the formal workshop sessions, participants had the rare opportunity to tour the Pasoh Forest during a two-day field trip. The site is the oldest mapped large forest plot (50 ha) in Asia and hosts a very large and sophisticated forest flux tower and canopy walkway system. They also had a custom tour of the new National Botanical Garden and a traditional Malaysian dinner at a historical museum.

The CTFS forest plot datasets are large and complex; for instance, over 400,000 individual trees from 817 species have been monitored at a large forest plot in Pasoh, Malaysia. Synthesizing the multiple forest plot datasets proved quite a challenge, because they differed in the detail of their taxonomic data, in how the status of stems was designated (live, dead, main, secondary), and in the structure of the data files. However, the workshop participants took advantage of EML and the Kepler Workflow System to make data processing steps more efficient.
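To illustrate the kind of harmonization this involves, here is a minimal Python sketch that maps site-specific stem records onto one shared layout. Everything in it is an assumption made for illustration: the site labels, column names, status codes, and species placeholders are hypothetical and are not the actual CTFS designations, which this article does not list.

```python
# Hypothetical per-site status codes mapped to a shared vocabulary.
# The real CTFS codes are not given in the article; these are placeholders.
STATUS_MAPS = {
    "site_a": {"A": "live", "D": "dead"},
    "site_b": {"alive": "live", "dead": "dead"},
}

# Hypothetical per-site codes for whether a stem is the main or a secondary stem.
STEM_MAPS = {
    "site_a": {"M": "main", "S": "secondary"},
    "site_b": {"1": "main", "2": "secondary"},
}


def harmonize(records, site):
    """Convert one site's stem records to a common set of fields."""
    status_map = STATUS_MAPS[site]
    stem_map = STEM_MAPS[site]
    harmonized = []
    for rec in records:
        harmonized.append({
            "site": site,
            # Normalize case and whitespace so species labels compare cleanly.
            "species": rec["species"].strip().lower(),
            # Codes missing from the map are flagged rather than guessed.
            "status": status_map.get(rec["status"], "unknown"),
            "stem": stem_map.get(rec["stem"], "unknown"),
        })
    return harmonized


site_a = harmonize([{"species": " Sp1 ", "status": "A", "stem": "M"}], "site_a")
site_b = harmonize([{"species": "sp2", "status": "dead", "stem": "2"}], "site_b")
combined = site_a + site_b
```

Keeping the per-site differences in lookup tables, rather than in the processing logic, means a new plot can be added by supplying its own mappings.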

EML metadata were used to automatically write R statistical programs for reading the data, which were then customized to resolve dataset-specific idiosyncrasies. Once a workflow was generated in Kepler, it was reused, often with only a few modifications, on other datasets. Although the learning curves for the R statistical language and Kepler were steep, workflow reuse ultimately accelerated data processing.
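The idea of generating a reader program from metadata can be sketched as follows. This is not the tool used at the workshop; it is a minimal Python illustration that assumes a heavily simplified, un-namespaced EML fragment. The file name, table name, and attributes are hypothetical, and real EML documents carry far more detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified EML fragment (real EML is namespaced and richer).
EML_SNIPPET = """
<dataTable>
  <entityName>plot_census</entityName>
  <physical>
    <objectName>plot_census.csv</objectName>
    <dataFormat><textFormat>
      <numHeaderLines>1</numHeaderLines>
      <simpleDelimited><fieldDelimiter>,</fieldDelimiter></simpleDelimited>
    </textFormat></dataFormat>
  </physical>
  <attributeList>
    <attribute><attributeName>tag</attributeName><storageType>string</storageType></attribute>
    <attribute><attributeName>species</attributeName><storageType>string</storageType></attribute>
    <attribute><attributeName>dbh</attributeName><storageType>float</storageType></attribute>
  </attributeList>
</dataTable>
"""

# Assumed mapping from storage types in the metadata to R column classes.
R_CLASSES = {"string": "character", "float": "numeric", "integer": "integer"}


def eml_to_r_reader(eml_xml):
    """Return the text of an R statement that reads the described table."""
    table = ET.fromstring(eml_xml)
    entity = table.findtext("entityName")
    file_name = table.findtext("physical/objectName")
    delimiter = table.findtext(".//fieldDelimiter")
    header_lines = int(table.findtext(".//numHeaderLines"))

    names, classes = [], []
    for attr in table.iter("attribute"):
        names.append(attr.findtext("attributeName"))
        # Fall back to character columns for storage types we do not recognize.
        classes.append(R_CLASSES.get(attr.findtext("storageType"), "character"))

    r_names = ", ".join('"{}"'.format(n) for n in names)
    r_classes = ", ".join('"{}"'.format(c) for c in classes)
    return (
        '{} <- read.table("{}", sep = "{}", skip = {}, header = FALSE,\n'
        "    col.names = c({}),\n"
        "    colClasses = c({}))\n"
    ).format(entity, file_name, delimiter, header_lines, r_names, r_classes)


r_script = eml_to_r_reader(EML_SNIPPET)
print(r_script)
```

The generated R statement is only a starting point; as described above, it would still need hand edits to handle each dataset's idiosyncrasies.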

The workflows have turned out to be an effective mechanism for collaboration between workshop members, who are currently completing analyses from their home institutions across Asia and the US. For instance, workflows created in Virginia can be revised in Taiwan and run in Malaysia.

Working groups are also developing additional data resources on the climate and ecology of the forest plot locations to facilitate the comparisons, and expect to submit a manuscript describing the results of their analyses by early 2011.

Funding for participation in the workshop for Matt Vadeboncoeur (HBR), Jennifer Holm (VCR), and information managers Kristin Vanderbilt (SEV), John Porter (VCR), Don Henshaw (AND), and Eda Melendez-Colom (LUQ) was provided by the National Science Foundation through an International Supplement to the Sevilleta LTER. The Forestry Research Institute of Malaysia (FRIM) generously covered all participants’ local costs in Malaysia.

Before the workshop, Chau-Chin Lin and the TERN information management team hosted Matt and Jennifer in Taiwan and provided them with additional training in ecoinformatics tools.