Features of reproducible scientific computing in the cloud

Table 1: Features of reproducible scientific computing in the cloud
Data sharing

Traditional challenges:
• Large data sets difficult to share over standard internet connections; can require substantial technical resources to obtain and store.
• Public data sets change frequently. Difficult to archive and share entire data repositories used for analyses.

Cloud-computing solutions:
• Large data sets can be stored as 'omnipresent' resources in the cloud. Easily copied and accessed directly from any point in the cloud.
• 'Snapshots' of large public data sets can be rapidly copied, archived and referenced.

Software and applications

Traditional challenges:
• Reproducibility of results often requires replication of the precise software environment (that is, operating system, software and configuration settings) under which the original analysis was conducted. Specific versions of software or programming-language interpreters often required for reproducibility.
• Analyses typically conducted by several types of software or scripts executed in a precise sequence across one or several systems as part of an analysis pipeline. Only the individual programs or scripts are usually provided with published results. Substantial technical resources typically required to recreate the pipeline used in the original analysis.
• Standard software packages cannot serve all the needs of a scientific domain. Investigators develop nonstandard software and computational pipelines to facilitate computational analysis exceeding the capabilities of common tools.

Cloud-computing solutions:
• Computer systems are virtualized in the cloud, allowing them to be replicated wholesale without concern for the underlying hardware. Snapshots of a fully configured system or group of systems used in analysis can be rapidly archived as digital machine images. System machine images can be copied and shared with others in the cloud, allowing reconstitution of the precise system configuration used for the original analysis.
• System images can be preconfigured with common and customized software and tools in a standardized fashion to facilitate common tasks in a scientific domain (e.g., assembly of genome sequences from DNA sequencer data). Preconfigured images can be shared as public resources to promote reproducibility and follow-up studies.

System and technical

Traditional challenges:
• Substantial computational resources might be required to replicate an analysis. Original computational analyses requiring several hundred processors to complete becoming more common. Reproducibility limited to those with requisite computational resources.
• Substantial technical support often required to reproduce a computational analysis and to replicate the software and system configuration required by the analysis. Prevents reproducibility by nontechnical investigators lacking substantial IT support.

Cloud-computing solutions:
• Cloud-based computational resources can be scaled up in a dynamic fashion to provide necessary computational resources. Investigators can create large computational clusters on demand and disperse upon analysis completion.
• Complete digital representations of a computational pipeline can be shared as machine images along with deployment scripts that can be executed by nontechnical users to reconstitute a complete computational pipeline.

Access and preservation

Traditional challenges:
• Grant-funded software and data repositories often disappear from the public domain after funding is discontinued or the maintainers abandon the project. Leads to loss of access by dependent users and loss of public investment into the resource.

Cloud-computing solutions:
• Software, code and data from grant-funded projects can be archived and provided as publicly accessible resources in the cloud. Economies of scale in the cloud allow for active preservation of grant-funded resources for many years past funding for nominal cost.
• Cloud-computing providers already show a willingness to host public scientific data sets at no cost.

Rise of nonsense statistics in everything from adverts to voting

Seife coins the term “proofiness” to refer to the misuse of numbers, deliberate or otherwise. He dubs the simplest quantitative sins “fruit-packing”. These include: “cherry-picking” the data, as he says Al Gore did when describing climate change in An Inconvenient Truth; “comparing apples to oranges”, as economics pundits do when they neglect to adjust for price inflation; and “apple-polishing”, as when advertisers use graphics to mislead.

Have customers, not audiences

The basic problem is that these new-media companies don’t really have customers; they have audiences.

Story of a dog, nice very nice

Check out this website I found at i.imgur.com

Storage prices versus DNA sequencing costs


The blue squares describe the historic cost of disk prices in megabytes per US dollar. The long-term trend (blue line, which is a straight line here because the plot is logarithmic) shows exponential growth in storage per dollar with a doubling time of roughly 1.5 years. The cost of DNA sequencing, expressed in base pairs per dollar, is shown by the red triangles. It follows an exponential curve (yellow line) with a doubling time slightly slower than disk storage until 2004, when next generation sequencing (NGS) causes an inflection in the curve to a doubling time of less than 6 months (red line). These curves are not corrected for inflation or for the 'fully loaded' cost of sequencing and disk storage, which would include personnel costs, depreciation and overhead.
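To make the gap between the two trends concrete, here is a minimal sketch of the arithmetic, assuming a 1.5-year doubling time for disk storage and treating the post-2004 sequencing doubling time as exactly 0.5 years. The passage only says "less than 6 months", so 0.5 is an upper bound, and the three-year window and the function name `growth_factor` are chosen purely for illustration.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth over `years` for a quantity that doubles
    every `doubling_time_years` (simple exponential model)."""
    return 2.0 ** (years / doubling_time_years)


# Assumed values: 1.5 years is the disk doubling time quoted in the passage;
# 0.5 years is an upper bound on the post-2004 sequencing doubling time.
DISK_DOUBLING_YEARS = 1.5
SEQUENCING_DOUBLING_YEARS = 0.5
WINDOW_YEARS = 3.0  # illustrative window, not taken from the passage

disk_gain = growth_factor(WINDOW_YEARS, DISK_DOUBLING_YEARS)
sequencing_gain = growth_factor(WINDOW_YEARS, SEQUENCING_DOUBLING_YEARS)

print(f"Megabytes per dollar grow {disk_gain:.0f}x in {WINDOW_YEARS:.0f} years")
print(f"Base pairs per dollar grow {sequencing_gain:.0f}x in {WINDOW_YEARS:.0f} years")
print(f"Sequencing outpaces storage by {sequencing_gain / disk_gain:.0f}x")
```

Under these assumptions, three years buys 4 times as much disk per dollar but 64 times as many base pairs per dollar, so sequence output per dollar outruns storage per dollar by a factor of 16 over that window.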

Filed under: Big data

Hadoop is beatable

The problem with simple batch processing tools like MapReduce and Hadoop is that they are just not powerful enough in any one of the dimensions of the big data space that really matters. If you need complex joins or ACID requirements, SQL beats Hadoop easily. If you have realtime requirements, Cloudscale beats Hadoop by three or four orders of magnitude. If you have supercomputing requirements, MPI or BSP beat Hadoop easily. If you have graph computing requirements, Google's Pregel beats Hadoop by orders of magnitude. If you need interactive analysis of web-scale data sets, then Google's Dremel architecture beats Hadoop by orders of magnitude. If you need to incrementally update the analytics on a massive data set continuously, as Google now have to do on their index of the web, then an architecture like Percolator beats Hadoop easily.

Filed under: Big data, Hadoop, MapReduce

Newton's Law of Citation

More references mean more citations, according to an analysis of papers published in Science.

Filed under: Science