Table 1: Features of reproducible scientific computing in the cloud
Data sharing

Traditional challenges:
• Large data sets difficult to share over standard internet connections; can require substantial technical resources to obtain and store.
• Public data sets change frequently. Difficult to archive and share entire data repositories used for analyses.

Cloud-computing solutions:
• Large data sets can be stored as 'omnipresent' resources in the cloud. Easily copied and accessed directly from any point in the cloud.
• 'Snapshots' of large public data sets can be rapidly copied, archived and referenced.

Software and applications

Traditional challenges:
• Reproducibility of results often requires replication of the precise software environment (that is, operating system, software and configuration settings) under which the original analysis was conducted. Specific versions of software or programming-language interpreters often required for reproducibility.
• Analyses typically conducted by several types of software or scripts executed in a precise sequence across one or several systems as part of an analysis pipeline. Only the individual programs or scripts are usually provided with published results. Substantial technical resources typically required to recreate the pipeline used in the original analysis.
• Standard software packages cannot serve all the needs of a scientific domain. Investigators develop nonstandard software and computational pipelines to facilitate computational analysis exceeding the capabilities of common tools.

Cloud-computing solutions:
• Computer systems are virtualized in the cloud, allowing them to be replicated wholesale without concern for the underlying hardware. Snapshots of a fully configured system or group of systems used in analysis can be rapidly archived as digital machine images. System machine images can be copied and shared with others in the cloud, allowing reconstitution of the precise system configuration used for the original analysis.
• System images can be preconfigured with common and customized software and tools in a standardized fashion to facilitate common tasks in a scientific domain (e.g., assembly of genome sequences from DNA sequencer data). Preconfigured images can be shared as public resources to promote reproducibility and follow-up studies.

System and technical

Traditional challenges:
• Substantial computational resources might be required to replicate an analysis. Original computational analyses requiring several hundred processors to complete becoming more common. Reproducibility limited to those with requisite computational resources.
• Substantial technical support often required to reproduce a computational analysis and to replicate the software and system configuration required by the analysis. Prevents reproducibility by nontechnical investigators lacking substantial IT support.

Cloud-computing solutions:
• Cloud-based computational resources can be scaled up in a dynamic fashion to provide necessary computational resources. Investigators can create large computational clusters on demand and disperse upon analysis completion.
• Complete digital representations of a computational pipeline can be shared as machine images along with deployment scripts that can be executed by nontechnical users to reconstitute a complete computational pipeline.

Access and preservation

Traditional challenges:
• Grant-funded software and data repositories often disappear from the public domain after funding is discontinued or the maintainers abandon the project. Leads to loss of access by dependent users and loss of public investment into the resource.

Cloud-computing solutions:
• Software, code and data from grant-funded projects can be archived and provided as publicly accessible resources in the cloud. Economies of scale in the cloud allow for active preservation of grant-funded resources for many years past funding for nominal cost.
• Cloud-computing providers already show a willingness to host public scientific data sets at no cost.








