Think about this: in about two years, if all goes according to plan, the Large Hadron Collider (LHC) will be commissioned and running at CERN, the European particle physics facility near Geneva. One of its experiments, the Compact Muon Solenoid (CMS), will be able to collect 225 MB of data each second and will run for the equivalent of 115 full days in 2008. That adds up to roughly 2 petabytes of data (enough to fill millions of CDs!). So goes 21st century scientific research. The question, of course, is where and how one stores such massive amounts of data, let alone distributes it to the hundreds or even thousands of scientists, postdocs, and graduate students who need it for their individual projects.
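For the curious, the petabyte figure is easy to check; the little Python sketch below just multiplies out the numbers quoted above, using decimal units (1 PB = 10^9 MB).

```python
# Back-of-the-envelope check of the CMS numbers quoted above:
# a sustained 225 MB/s over the equivalent of 115 full running days.
rate_mb_per_s = 225
seconds_per_day = 24 * 60 * 60
running_days = 115

total_mb = rate_mb_per_s * seconds_per_day * running_days
total_pb = total_mb / 1e9  # 1 PB = 10**9 MB in decimal units

print(f"{total_pb:.1f} PB")  # roughly 2.2 PB, i.e. about 2 petabytes
```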
Back around 1999, physicists and computer scientists began developing prototype systems known as 'grid computing' in order to handle large datasets. Single labs or universities almost certainly will not have the capability to store, let alone efficiently make use of, datasets like the ones mentioned above for CMS, so it is necessary to expand on the collaborative efforts of the past for analyses of all kinds. Grid computing gets its name from the analogy to an electrical grid. When you turn on a light in your house, you don't really have any idea where on the grid the power is coming from, just as long as you get it. The grid is a sort of black box, an enormous network of wires, transformers, power stations, and so on that collectively feeds power wherever it is needed, on a large scale. Data and information grid computing networks act in a similar manner.
In a sense, this means there will be virtual organizations and collaborations. Members of a particular grid network will request the information they need at any time from their own computers, and whatever they requested will be pulled from wherever it is stored on the grid. All members of the grid network will have such capabilities. Large datasets can presumably be broken up and stored at any number of sites all over the world, and the grid architecture and software will be able to quickly grab any files that are requested, regardless of which local computer network holds them. Such a system of virtual organizations is essential because of the size and cost of many scientific projects that either already exist or are planned for the future. To put it in perspective, Microsoft's Tony Hey said, "It's no exaggeration to say that we'll collect more data in the next five years than we have in all of human history."
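To make that concrete, here is a minimal sketch, in Python, of the kind of "replica catalog" lookup a grid might perform. The dataset and site names are invented for illustration and are not taken from any real grid software; the point is simply that the user names the data, and the grid decides where the bytes come from.

```python
# Toy replica catalog: a dataset name maps to every site that holds a
# copy, so a request can be routed without the user knowing where the
# bytes live. All dataset and site names below are purely illustrative.
replica_catalog = {
    "cms/2008/muon-sample-001": ["fnal.gov", "cern.ch", "in2p3.fr"],
    "cms/2008/muon-sample-002": ["cern.ch", "desy.de"],
}

def locate(dataset: str) -> str:
    """Pick the site a request for this dataset would be routed to."""
    sites = replica_catalog.get(dataset)
    if not sites:
        raise KeyError(f"no replica of {dataset!r} registered on the grid")
    # A real grid would weigh load, network distance, and so on;
    # this sketch simply takes the first registered replica.
    return sites[0]

print(locate("cms/2008/muon-sample-001"))  # -> fnal.gov
```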
The data being collected includes both data from physical experiments, such as CMS or the experiments at Fermilab, the national laboratory outside Chicago, and simulation data from computer-based experiments. Much of the complicated, dense simulation data comes from programs run on modern supercomputers and other parallel-processing networks. Already, large databases can be accessed over the traditional Internet, such as protein databases and other information from the Human Genome Project, and these and similar databases are only expected to grow over time. Large national and international collaborations that share data are not new, and from my own experience at Fermilab, certain types of science cannot be done without such efforts. Grid computing is the natural extension of, and solution to, the ever-increasing flow of information from such collaborations.
An interesting extension of collaborative work comes in the form of increasing amounts of multidisciplinary research. More research areas are overlapping, and with grid computing there is an expectation that biologists, chemists, geologists, physicists, astronomers, theorists, and so on will be able to access each other's data and tools, as well as share research methods and models to enable new breakthroughs. With the growing body of work on complex systems and emergent behavior, for instance, such sharing will likely be necessary for those fields to expand.
How grid computing works depends on four layers. The foundation is the physical network that links everything on the grid; the resources themselves, the computers and storage that hold data and programs, make up the second layer. Each successive layer depends on the ones below it. On top of the resources sits the middleware, the software that makes the grid work and hides its complexity from users. The top layer is the applications software that users actually see and access on their computers. This is like the application icons on your computer desktop: click on one, and software you don't see and take for granted does its thing to open the program or application you want. The big difference from your current computer is that a computer that is a member of the grid will have icons for applications or data that, when you need them and click on them, will not grab the program off your hard drive, but rather pull it from whatever computer on the grid has it...and that computer may be on the other side of the world. This all happens automatically.
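Here is a toy sketch of that layered picture, again with invented class and site names rather than any real grid middleware: the application asks for a file by name, the middleware figures out which resource on the grid holds it, and the user never has to know.

```python
# Minimal sketch of the four-layer picture: resources (storage sites)
# sit on the network, middleware hides where files actually live, and
# the application layer just asks for a file by name. All class and
# site names are invented for illustration.
class Resource:
    """A storage site on the grid (the resource layer)."""
    def __init__(self, name, files):
        self.name = name
        self.files = files  # filename -> contents

    def read(self, filename):
        return self.files[filename]


class Middleware:
    """Finds and fetches files, hiding the grid's complexity."""
    def __init__(self, resources):
        self.resources = resources

    def open(self, filename):
        for site in self.resources:
            if filename in site.files:
                print(f"(middleware) fetching {filename} from {site.name}")
                return site.read(filename)
        raise FileNotFoundError(filename)


# The application layer: the user "clicks the icon" and gets the data,
# whether it lives down the hall or on the other side of the world.
grid = Middleware([
    Resource("cern.ch", {"event-data.root": b"...binary events..."}),
    Resource("fnal.gov", {"calibration.db": b"...calibration tables..."}),
])
data = grid.open("calibration.db")  # transparently pulled from fnal.gov
```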
As people gain experience with the relatively small grid networks that currently exist, larger and more complex grids will naturally form. It will be an Internet on steroids, and as always with technology, we cannot imagine all the ways in which it will be used. One potential problem, of course, is the security of enormous datasets, so new types of encryption and anti-virus software will likely be developed as well to maintain the integrity of a grid whose membership changes frequently and to which new information is added daily, much like what already happens on the traditional Internet.