A usable simulation model archive: Does it really exist?

Issue: Network News Winter 2014, Vol. 27 No. 4
Editor's Comment: An earlier version of this article, published under the heading "Proposed best practices for archiving simulation models," was a concept summary meant for discussion and not for publication. We take this opportunity to apologize to our readers for that inadvertent publication and to correct the error by publishing the correct version.

At some point in their career everyone probably wonders what their legacy will be. That is certainly the case for the two of us, as we near the end of our careers as ecologists. While this might seem a narcissistic obsession, there is a practical side as well for everyone else approaching the end of their career. One of the hallmarks of humans as a species is the ability to pass knowledge, skills, technology, and experience from one generation to the next. So it is sensible to wonder what will be passed along from one scientific generation to the next. As scientists, we hope to pass along critical concepts, facts, technologies, samples, and observations; if this does not happen then each generation must start over and scientific progress will be delayed because old findings have to be rediscovered.

However, it is not enough to want to pass along a legacy; we must also put in place mechanisms to enable this process. In thinking about the various parts of our own legacies, we are confident that the mechanisms are well established for some of the parts we wish to pass along. For example, almost anything appearing in a journal article or book, be it data, concepts, or knowledge, will be archived in various libraries, online databases, and other well-established and redundant systems such as JSTOR (or Journal Storage, a digital library), Questia (an online commercial digital library of books and articles), SCOPUS (a bibliographic database of abstracts and citations of academic journal articles), the Web of Science, and the Library of Congress. Ecological data can be entered, documented, and retrieved using several systems such as CDIAC (Carbon Dioxide Information Analysis Center), the Network Information System (NIS) of the U.S. Long Term Ecological Research (LTER) Network, and the Ecological Society of America’s Ecological Archives, all of which are already in place but will continue to evolve. The ability to pass along physical samples is less certain, but systems to archive, curate, and retrieve such material are widespread, ranging from classical organism collections in museums to site-based systems such as the Hubbard Brook LTER program’s Sample Archive. For such materials the main issue is how to take advantage of these systems given one’s location and type of sample. What we have not been able to find is a system to enable legacies for ecological simulation models (Thornton et al. 2005). The rest of this commentary describes what we think such a system would look like and the steps required to develop it. Before describing this system, let us first be clear about why we think simulation models are in particular need of attention.

There are many forms of models used in ecology. Analog models (e.g., microcosms) are generally not saved, although some are no doubt in museums and many are described in publications. Since analog models have largely been replaced by digital models, there is probably little need to be concerned about the ability to recreate them outside of an educational setting.

Conceptual (e.g., N-saturation model of Aber et al. 1989) and analytical (e.g., NEE model of Shaver et al. 2007) models are usually relatively simple and well documented in publications. The same is usually true for empirical models, as the main purpose of many publications is to use data to develop these relationships. While the data used to develop empirical models are often inadequately presented in publications, there are at least systems (as noted above) to store everything from raw to clean, processed data.

Simulation models are generally more complicated than analytical ones, and although they are described in publications, there is generally not enough room to do this fully; hence much of the information needed to use or recreate them exists outside the publication system. Moreover, unlike either an analog or analytical model, digital simulation models are generally not simple to recreate and, given the limited description in publications, may be impossible to recreate exactly from that source alone. This is unfortunate because simulation models have become a way to synthesize ecological knowledge, explore integrative hypotheses, and analyze complex systems.

As they reflect relatively new ways to think about ecological systems, failing to pass simulation models from one generation to the next is potentially an extremely unfortunate situation that could slow progress in ecological sciences. In a sense it is similar to every generation having to reinvent the elemental analyzer (a scientific instrument used to determine the chemical elements in a sample) or some other critical piece of technology that we currently take for granted. 

Developing a system to usefully document, archive, and retrieve ecological simulation models will involve considerable thought. Part of the complexity of this effort is reflected in the fact that simulation models are really an amalgamation of concepts, hypotheses, data, and technology. Fortunately, parts of other systems can be reused and modified to create this new model archiving system. For example, data are usually used to drive simulation models and data are a primary model output. Documentation of these model-related data can take advantage of existing systems (such as LTER’s NIS) that document, archive, and retrieve data.

Model parameters can also be described using these systems. However, additional information on the source, transformation (parameters are often derived from data, and this process needs to be described), uncertainty, and other aspects of model parameters needs to be included. It is also useful to understand how sensitive a model is to changes in the variables and parameters that drive it. While some of this information may be described in publications, the detailed examinations of sensitivity often undertaken by model developers are rarely documented. It would be useful for future users if a sensitivity analysis were part of every simulation model’s documentation (e.g., Grimm et al. 2014).
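As a minimal sketch of what such parameter documentation might contain, the Python example below records the source, derivation, and uncertainty of each parameter and runs a simple one-at-a-time sensitivity analysis. Everything in it is an assumption made for illustration: the `ParameterRecord` fields, the `toy_model` function (a stand-in exponential decay model, not any published model), and the parameter values are hypothetical, and a real analysis would likely use a more thorough sensitivity method.

```python
import json
import math
from dataclasses import asdict, dataclass


@dataclass
class ParameterRecord:
    """Documentation for one model parameter (field names are illustrative)."""
    name: str
    value: float
    units: str
    source: str         # where the value came from (citation, dataset, estimate)
    derivation: str     # how the underlying data were transformed into the value
    uncertainty: float  # e.g., one standard deviation, in the same units


def toy_model(params):
    """Hypothetical stand-in model: litter mass remaining after 10 years of decay."""
    return params["initial_mass"] * math.exp(-params["decay_rate"] * 10.0)


def one_at_a_time_sensitivity(model, params, fraction=0.1):
    """Relative change in output per relative change in each parameter."""
    baseline = model(params)
    result = {}
    for name, value in params.items():
        perturbed = dict(params)
        perturbed[name] = value * (1.0 + fraction)  # perturb one parameter only
        result[name] = ((model(perturbed) - baseline) / baseline) / fraction
    return result


# Hypothetical parameter values and provenance, for illustration only.
records = [
    ParameterRecord("initial_mass", 100.0, "g m-2", "hypothetical field survey",
                    "mean of plot samples", 8.0),
    ParameterRecord("decay_rate", 0.2, "yr-1", "hypothetical litterbag study",
                    "slope of ln(mass) against time", 0.03),
]
params = {record.name: record.value for record in records}
sensitivities = one_at_a_time_sensitivity(toy_model, params)

# A single JSON document that could be archived alongside the model code.
documentation = json.dumps(
    {"parameters": [asdict(record) for record in records],
     "sensitivity": sensitivities},
    indent=2,
)
print(documentation)
```

Storing the sensitivity results in the same document as the parameter provenance keeps the two together, so a future user can see at a glance which poorly constrained parameters matter most.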

As simulation models are developed it is not unusual to have multiple versions, and while it may not be practical to save every version, those that represent significant milestones of development (such as a major change in functionality or publication of a key analysis) should be archived. Fortunately, conventions developed for other forms of software development can be used. While storing the computer code (i.e., source and executable files) may not be challenging per se, one should not assume that code created in one operating system environment will automatically be usable in another, or that the code will run under some future operating system. Therefore, in addition to the code itself, it may be necessary to archive the operating system in which that code was developed, which in turn might mean also physically saving hardware able to run that operating system.
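One way to record which files and which computing environment belong to an archived milestone is a manifest stored with the code. The sketch below is only an illustration using the Python standard library; the `build_manifest` function and the manifest field names are hypothetical, and a manifest of this kind records the environment but does not by itself preserve it.

```python
import hashlib
import platform
import sys
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir, version, milestone):
    """Describe one archived model version (field names are illustrative)."""
    model_dir = Path(model_dir)
    files = {
        path.relative_to(model_dir).as_posix(): sha256_of(path)
        for path in sorted(model_dir.rglob("*"))
        if path.is_file()
    }
    return {
        "version": version,
        "milestone": milestone,  # e.g., "publication of a key analysis"
        "environment": {
            "operating_system": platform.system(),
            "os_release": platform.release(),
            "machine": platform.machine(),
            "python_version": sys.version.split()[0],
        },
        "files": files,  # relative path -> checksum, to detect later corruption
    }
```

Calling `build_manifest("path/to/model", "1.0", "first published analysis")` (the path, version label, and milestone text here are placeholders) returns a dictionary that can be written to a file with `json.dump` and archived with the source and executable files.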

Alternatively, new technology now allows the development of virtual machines to simulate one operating system and associated hardware on another. These virtual machines can be more easily passed from one generation of system to the next.

Finally, the main reason for archiving simulation models is to use them again in the future. This would be greatly helped by also archiving input and output data, which can be used to test whether the recreated model is acting as expected and can serve as a template for formatting new parameter and driver files to be used with the archived model.
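A check of this kind could be as simple as rerunning the model on the archived input and comparing the result with the archived output. The Python sketch below assumes, purely for illustration, that the output is a two-column CSV table with columns named `year` and `biomass`; the column names, the data values, and the tolerance are hypothetical.

```python
import csv
import io
import math

# Hypothetical archived output, stored with the model when it was archived.
ARCHIVED_OUTPUT = """year,biomass
1,10.0
2,12.5
3,15.1
"""


def read_series(text):
    """Parse CSV text with 'year' and 'biomass' columns into a dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["year"]): float(row["biomass"]) for row in reader}


def compare_outputs(archived, recreated, rel_tol=1e-6):
    """Return (year, archived, recreated) for every value that disagrees."""
    mismatches = []
    for year in sorted(set(archived) | set(recreated)):
        old = archived.get(year)
        new = recreated.get(year)
        if old is None or new is None or not math.isclose(old, new, rel_tol=rel_tol):
            mismatches.append((year, old, new))
    return mismatches


archived = read_series(ARCHIVED_OUTPUT)

# Stand-in for output from a rerun of the recreated model; year 3 differs.
recreated = read_series("year,biomass\n1,10.0\n2,12.5\n3,15.4\n")

for year, old, new in compare_outputs(archived, recreated):
    print(f"year {year}: archived {old}, recreated {new}")
```

A relative tolerance is used rather than exact equality because floating-point results can differ slightly between compilers, operating systems, and hardware even when the model code is unchanged.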

We envision an archival system where not only is the full information needed to recreate the model available, but the model itself is available and usable under any future operating system. Imagine, for example, being able to rerun the original Botkin et al. (1972) JABOWA simulations for Hubbard Brook from our computers without having to recode the model. Bear in mind that the original JABOWA was developed on punch cards for an IBM mainframe computer, a system not currently available to anyone outside of a museum. In the archive we envision, the JABOWA code and a simulated IBM operating system would be archived on a virtual machine so that the model could be rerun not only with the original input files but with any newly created input file in the same format.

We have few illusions as to the challenges to be faced in developing the proposed system, and while the subject has previously been noted by others (Thornton et al. 2005), little appears to have been done to address it. This indicates to us that perhaps one of the biggest challenges is to convince scientists and funding agencies to recognize the need for such a system and to understand that it will be different from what currently exists. We may be mistaken, but in our conversations with others we got the impression that there is widespread belief that such a system already exists (how could it not?) or that current systems for data are sufficient. However, we are not convinced that a general system exists, and we suspect that the systems that do exist are not sufficient without some modification.

Another challenge is that a system to document, archive, and retrieve simulation models will cost time and resources in development and in use. Those using data systems will understand that proper documentation and archiving of data can add 25-35 per cent in effort to a project, which in a fixed-budget world means fewer publications and presentations. We would expect similar costs for a simulation model system, and unless scientists and funding agencies accept these costs in the form of lower short-term productivity, developers of simulation models will be reluctant to bear them. That would be very unfortunate, since failure to accept these short-term costs will likely come at the expense of long-term productivity.

References

Aber, JD, KJ Nadelhoffer, P Steudler, and JM Melillo. 1989. Nitrogen saturation in northern forest ecosystems. BioScience 39: 378-386.

Botkin, DB, JF Janak, and JR Wallis. 1972. Some ecological consequences of a computer model of forest growth. Journal of Ecology 60: 849-873.

Grimm, V, J Augusiak, A Focks, BM Frank, F Gabsi, ASA Johnston, C Liu, BT Martin, M Meli, V Radchuk, P Thorbek, and SF Railsback. 2014. Towards better modelling and decision support: Documenting model development, testing, and analysis using TRACE. Ecological Modelling 280: 129-139.

Shaver, GR, LE Street, EB Rastetter, MT van Wijk, and M Williams. 2007. Functional convergence in regulation of net CO2 flux in heterogeneous tundra landscapes in Alaska and Sweden. Journal of Ecology 95: 802-817.

Thornton, PE, RB Cook, BH Braswell, BE Law, HH Shugart, BT Rhyne, and LA Hook. 2005. Archiving numerical models of biogeochemical dynamics. Eos, Transactions American Geophysical Union 86: 431-431.