Thursday, November 17, 2005

21st Century Research: How to Handle Massive Amounts of Data

Think about this: In about two years, if all goes according to plan, the Large Hadron Collider (LHC) will be commissioned and running at CERN, the European particle physics facility near Geneva. One of the experiments, the Compact Muon Solenoid (CMS), will collect 225 MB of data each second and will run for the equivalent of 115 full days in 2008. That adds up to roughly 2 petabytes of data (enough to fill about three million CDs!). So goes 21st century scientific research. The question, of course, is where and how does one store such massive amounts of data, let alone distribute it to the hundreds or even thousands of scientists, postdocs, and graduate students who need it for their individual projects?
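To sanity-check that arithmetic, here is a back-of-the-envelope sketch using the figures above and a nominal 700 MB per CD:

    # Rough CMS data volume for 2008, from the numbers quoted above.
    rate_mb_per_s = 225                  # MB collected each second
    live_days = 115                      # equivalent full days of running
    seconds = live_days * 24 * 3600      # about 9.9 million seconds
    total_mb = rate_mb_per_s * seconds   # total data collected, in MB
    total_pb = total_mb / 1e9            # 1 petabyte = 10^9 MB
    cds = total_mb / 700                 # assuming ~700 MB per CD
    print(f"{total_pb:.1f} PB, roughly {cds / 1e6:.1f} million CDs")
    # prints: 2.2 PB, roughly 3.2 million CDs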

Back around 1999, physicists and computer scientists began developing prototype systems known as 'grid computing' in order to handle large datasets. A single lab or university almost certainly will not have the capability to store and efficiently make use of datasets like the one described above for CMS, so it is necessary to expand on the collaborative efforts of the past for analyses of all kinds. Grid computing gets its name from the analogy with an electrical grid. When you turn on a light in your house, you don't really have any idea where on the grid the power is coming from, just as long as you get it. The grid is a sort of black box, an enormous network of wires, transformers, power stations, and so on that collectively feeds power when and where it is needed on a large scale. Data and information grids act in a similar manner.

In a sense this means that there will be virtual organizations and collaborations. A member of a particular grid network will request the information they need at any time from their own computer, and the grid will retrieve it from wherever it happens to be stored. All members of the grid network will have such capabilities. Large datasets can presumably be broken up and stored at any number of sites all over the world, and the grid architecture and software will be able to quickly grab any requested files, regardless of which local computer network holds them. Such a system of virtual organizations is essential because of the size and cost of many scientific projects that either exist now or are planned for the future. To put it in perspective, Microsoft's Tony Hey said, "It's no exaggeration to say that we'll collect more data in the next five years than we have in all of human history."
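To make the idea concrete, here is a rough sketch of what "grab the file from wherever it lives" might look like. The replica catalog, site names, and file names below are invented for illustration; this is not any real grid middleware:

    # Illustrative only: a "replica catalog" maps a logical file name to the
    # sites that happen to hold copies of it.
    import random

    REPLICA_CATALOG = {
        "cms/run2008/events_0042.dat": ["cern.ch", "fnal.gov", "in2p3.fr"],
        "cms/run2008/events_0043.dat": ["fnal.gov", "desy.de"],
    }

    def transfer_from(site, logical_name):
        # Stand-in for a real wide-area file transfer.
        return f"<contents of {logical_name} fetched from {site}>"

    def fetch(logical_name):
        """Return the file, wherever on the grid it happens to be stored."""
        sites = REPLICA_CATALOG.get(logical_name)
        if not sites:
            raise FileNotFoundError(logical_name)
        site = random.choice(sites)   # real middleware would pick by load and locality
        return transfer_from(site, logical_name)

    # A member of the grid simply asks for the file by name; which continent
    # it comes from is invisible to them.
    print(fetch("cms/run2008/events_0042.dat"))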

The data being collected comes both from physical experiments, such as CMS or the experiments at the Fermilab national lab outside Chicago, and from computer-based simulations. Much of the complicated and dense simulation data comes from programs run on modern supercomputers and other parallel-processing networks. Large databases, such as the protein databases and other information from the Human Genome Project, can already be accessed on the traditional Internet, and these and similar databases are only expected to grow over time. Large national and international collaborations that share data are not new, and from my own experience at Fermilab, certain types of science cannot be done without such efforts. Grid computing is the natural extension of, and solution to, the ever-increasing flow of information from such collaborations.

An interesting extension of collaborative work comes in the form of increasing amounts of multidisciplinary research. More research areas are overlapping, and with grid computing there is an expectation that biologists, chemists, geologists, physicists, astronomers, theorists, and so on will be able to access each other's data and tools, as well as share research methods and models to enable new breakthroughs. With growing work being done on complex systems and emergent behavior, for instance, such sharing will likely be necessary to expand the field.

How grid computing works depends on four layers of the network. The foundation is the physical network that links all the members and resources on the grid; the resources themselves (computers, storage, instruments) are the second layer. Each successive layer depends on the ones below it. On top of the resources sits the middleware, the software that makes the grid work and hides its complexity from users. The top layer is the applications software that users actually see and access on their computers. It is like the application icons on your computer desktop: click on one, and software that you don't see and take for granted does its thing to open the program or application you want. The big difference from your current computer is that a computer that is a member of the grid will have icons for applications or data that, when you need them and click on them, will not grab the program off your hard drive, but rather pull it from whatever computer on the grid has it...and that computer may be on the other side of the world. This all happens automatically.
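Here is a toy picture of those four layers. The class names and the (deliberately dumb) scheduling choice are my own illustration, not how any actual grid software is organized:

    # Toy sketch of the four grid layers; everything here is invented.
    class Network:                        # layer 1: physical links between sites
        def send(self, site, request):
            return f"result of '{request}' computed at {site}"

    class Resources:                      # layer 2: machines, storage, instruments
        sites = ["cern.ch", "fnal.gov", "anl.gov"]

    class Middleware:                     # layer 3: hides the grid's complexity
        def __init__(self, network, resources):
            self.network, self.resources = network, resources
        def run(self, request):
            site = self.resources.sites[0]   # real middleware schedules intelligently
            return self.network.send(site, request)

    class Application:                    # layer 4: what the user actually clicks on
        def __init__(self, middleware):
            self.middleware = middleware
        def analyze(self, dataset):
            return self.middleware.run(f"analyze {dataset}")

    app = Application(Middleware(Network(), Resources()))
    print(app.analyze("cms/run2008"))     # the user never sees layers 1-3

The point is simply that the user's application only ever talks to the middleware; which site's resources do the work, and over which network links, is decided underneath.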

As people gain experience with the relatively small grid networks that currently exist, larger and more complex grids will naturally form. It will be an Internet on steroids, and as always with technology, we cannot imagine all the ways in which it will be used. One potential problem, of course, is the security of enormous datasets, so new types of encryption and anti-virus software will likely be developed as well to maintain the integrity of a grid whose membership changes frequently and to which new information is added daily, much like what already happens on the traditional Internet.

5 comments:

DDW said...

The Large Hadron Collider is a perfect example of Boondoggle Science: obscenely expensive and dangerous. Will it produce mini black holes, or a "strangelet"? No physicist I know has adequately defended this grotesquery.

http://tinyurl.com/as784

DW

Mark Vondracek said...

DW,

There was a worry about mini black holes a few years back when RHIC (the Relativistic Heavy Ion Collider) turned on. It smashes heavy nuclei together at nearly the speed of light, and at those speeds the concern was that the mass-energy density could be great enough to produce a mini black hole. Theorists worked on it and were absolutely convinced this would not happen, and we're still here.

I think you are referring to superstring calculations as far as the LHC is concerned. First, since I am not familiar with your background: string theories have nice concepts, but they have no real scientific credibility yet, in the sense that they have not produced any physical, realistic predictions and hence have not been tested at all. They are more of a mathematical philosophy right now. I'm fairly certain that the more accepted theories relevant to smashing protons together, such as QCD, have been checked and simulated under LHC conditions for the past decade, and I for one have not heard of any concerns regarding mini black holes.

As for costs, yes, it is tremendously expensive. But the contributing governments and institutions obviously must have thought of it as a valid and important investment, otherwise why commit to it? The reasons they considered it a good investment are not necessarily that scientists may find the Higgs boson, or evidence for supersymmetry, or just plain new physics such as what constitutes dark matter or other exotic states (those are the physicists' reasons). Instead, governments and investors likely looked at the history of other labs doing similar work, such as SLAC and Fermilab, and the older CERN experiments and facilities. They more than pay for themselves in terms of new technologies, computing systems, data acquisition systems, and industrial, medical, and engineering applications and methods. And one never knows what any new scientific discoveries will lead to.

We have the Internet, superconducting magnet technology (which led to, for instance, MRI), fast electronics and computing networks, new computer languages and applications, new medical treatments for cancer, new laser and fiber-optic technology and applications, as well as communication among the world's scientists that has had off-shoot effects for international cooperation among government and private-sector officials...all because of investments in big science. Who knows what spin-offs the world will get from the LHC and other major scientific investments and collaborations. I suppose we'd never know if we never tried some of these types of projects.

Mark Vondracek said...

Hi Chris,

Thanks for the added details. Hawking radiation and the evaporative nature of mini black holes are likely the reason we don't see any of the minis that could have been abundant at the time of the Big Bang.

I had a feeling you might like the philosophy comment. ;-)

Any new predictions from the string community? I'm out of touch with the literature.

Mark

Daniel S. said...

It's interesting to read this post, because I'm most likely going to be involved in some massive simulation project or other this coming summer when I go work at one of the national labs (probably Los Alamos). Of course, when you're working at Los Alamos, they don't need no stinkin' collaboration. They can handle the teraflop calculations on their own, thank you. ;)

-Daniel Summerhays.

Mark Vondracek said...

Hi Dan,

Let's just hope the security is better at Los Alamos these days. I'm sure there are still a number of collaborators, since a number of civilian consultants are used even at weapons facilities. If they do have the capacity on site for the data loads we're considering, it helps to have an in to the largest R&D budget on the planet (good ol' DoD). :-)

Let me know how it's going, Dan.