A High-memory Supercomputer for Proteomics, Text Mining and Microbiome Research
Biography Overview We request funds to purchase an integrated supercomputer to unite 5 highly productive and collaborative laboratories with complementary expertise in the microbiome, proteomics, text mining, and supercomputing, and to extend these capabilities to the broader NIH-funded biomedical research community via cloud and web applications. The critical shared need not met by other systems on campus, unavailable in commercial clouds, and oversubscribed at national labs, is for a system that can run jobs that require high memory (8-32 GB/core) and long duration (>2 weeks wall-time), and is optimized for high-IO tasks that saturate network or storage on other systems. The system will consist of 128 servers, each using 2x8-core 2.93GHz Intel Sandybridge CPUs. 20 large-memory nodes will each have 512GB of RAM (32GB/core), and 100 compute nodes will each have 128GB of RAM (8GB/core). These 120 nodes will each use two 10Gbps Ethernet ports bonded together for a 20Gbps/node (2.5GB/s) connection to the rest of the system, and each node will have 2.4TB raw high- performance local storage. The total aggregate performance of these local disks is over 36GB/s sustained (>300MB/s per node). The remaining 8 nodes will be used for administration, support for advanced software tools and infrastructure, and user interaction. A central high-performance Lustre parallel file system will provide 1.15PB of usable scratch space and sustain 36GB/s to the 128 clients. An archival system of 4 drives/300 tapes will sustain >1GB/s aggregate (accounting for compression), provide 450TB of raw capacity, store ~4.5 PB of user data, and scale to 5x this size. The system, valued at $4.5 million but quoted at $2 million by HP due to the strategic importance of this partnership, will be housed in a state-of-the art machine room in the new Jennie Smoly Caruthers Biotechnology Building on the Boulder campus (opening Feb 2012), and connect to the rest of the campus at 40Gbps. The system will be a key enabling technology for key scientific areas where data growth is exponential and current systems on campus are end-of-life, solely dedicated to other purposes, or optimized for other tasks. The major users will use the instrument largely for time-consuming one-time tasks such as parameter optimization for microbiome and genome assembly workflows, building knowledgebases, and performing simulations and database searches that will provide resources that are re-used by much broader user communities (hundreds of collaborators; thousands of end users) who lack supercomputing access. One key innovative aspect of this proposal is configuration of part of the system as an academic cloud, which will allow us to pilot workflows that can later be deployed by diverse users on commercial clouds (e.g. Amazon EC2) and academic clouds (e.g. Magellan and DIAG) once those clouds are upgraded. The system will also build a broad expertise base in high-performance computing in the life sciences through outreach to promising new faculty and trainees on NIH training grants, and collaborations with new users of the Sequencing Core. The proposed instrument will thus have a profound impact on NIH-funded research.
Time
|