Research possibilities to pursue in peer-to-peer networking

This keynote speech was delivered on January 30, 2002 to the workshop Collaborative Computing in Higher Education: Peer-to-Peer and Beyond, sponsored by Internet2 and SURA.

Abstract: Universities, and particularly Internet2 researchers, can contribute to research on various peer-to-peer topics and can potentially benefit from the use of peer-to-peer technologies. This talk covers interesting applications of the peer-to-peer concept (distributed searches, affinity groups, workflow control, and collaboration) and lays out some research possibilities in the area (naming/identity and resource discovery, service delivery, security, routing, metadata). It also attacks the unfounded anxiety over bandwidth usage and tries to explain why peer-to-peer applications tend to arouse so much positive energy among users.

  1. How I got interested in peer-to-peer
  2. Sample uses for peer-to-peer
    1. Finding relevant documents
    2. Affinity groups
    3. Workflow control
    4. Unstructured collaboration
  3. Legal concerns
  4. Technical issues in peer-to-peer
    1. How to find peers
    2. Delivering services
    3. Security
    4. Architecture
    5. Routing
    6. Metadata
    7. Bandwidth issues
  5. Social impact and public excitement

Peer-to-peer is a venerable and far-reaching concept that has received a new impetus and a striking visibility in the past year. I’m thrilled to present some of my observations before an audience of academic administrators and professionals, because when peer-to-peer first burst upon the attention of the greater public, universities played a crucial role—you tried to stop it.

Yes, when Napster started reaching huge audiences, universities got alarmed at the increase in network bandwidth usage and many blocked Napster from their networks. This says a lot about university administrators, and I’ll return to the issue after I explain some network architecture issues.

Anyone delivering a speech in these times of terror and war, particularly a speech with implications for society and for technical infrastructure, seems called upon to address the audience’s quite reasonable preoccupation with the precariousness of modern life and the balance of world power. I have little to offer that will help you set aside or assimilate thoughts of the attacks of September 11. I’ll make two observations related to peer-to-peer.

First, with the overhead that will have to be expended on greater security and on fighting terrorism, researchers and ordinary businesses are going to have to get along on less than we’ve had before for quite some time. We’re going to have to learn to do more with fewer resources. I hope peer-to-peer can help us do this.

Second, when Internet researchers currently think about privacy and network security, they operate in a situation where “we” (the researchers and the people they respect) are in charge; where the bad guys are snipers stuck in the hills, pummeling us with viruses, web bugs, port scans, and unsolicited email. But in the future, the situation may be reversed. People who want to control our actions and limit what the public can do may be in charge of the Internet, and it may be we who will be up in the hills. Peer-to-peer will help to avoid this cataclysmic change, and will help us cope if it comes true.

Academic environments are ideal for experimenting with peer-to-peer and benefiting from peer-to-peer. You have an open attitude toward information, well-educated staff who can adapt to new tools, a variety of projects that require information exchange, and a willingness to expend time and effort in order to save money.

The Internet2 project, in particular, overcomes many barriers that are holding back the deployment of peer-to-peer products in current corporate environments. Internet2 is a good test bed for basic research that can benefit peer-to-peer.

When people ask me whether peer-to-peer is really anything new and whether the term has any value, I say, “Sure it’s new, and sure the term has value, because these systems have created all kinds of new problems.” Perhaps these problems will be solved by Sun or by the Peer-to-Peer Working Group organized by Intel, but perhaps they’ll be solved by Internet2 researchers—I can only ask you to try.

How I got interested in peer-to-peer

Here’s why I presume to address you on this subject. As a book editor for O’Reilly & Associates, I’m used to shepherding books about configuring a network, providing network services to users, and so on. As a sideline, I also write position papers and articles on Internet policy and related social issues.

I wasn’t too interested in Napster when it hit the newspaper headlines because it rehashed old technologies; nor was I sympathetic to its legal problems because it was clearly guilty of vicarious and contributory copyright infringement. I first got interested in new networking technologies when I heard of Freenet and Gnutella. Talking to other people at O’Reilly, I put these together with various technologies that CEO Tim O’Reilly and other people in the company were looking at, and we came up with an idea for a book called something uninspiring like “Collaborative Networking” or “Distributed Processing.” When the term “peer-to-peer” became popular, there could no longer be any doubt about the book’s title.

Sample uses for peer-to-peer

Let me describe for you a few sample uses for peer-to-peer to show its breadth and its potential value. If you haven’t internalized the idea of peer-to-peer yet, I hope these examples can make it stick.

Finding relevant documents

File-sharing was the first peer-to-peer application to be widely noticed, and now it’s gone commercial. Traditional distributed file systems like NFS and AFS already provide virtual naming systems and allow an administrator to hide the physical location of a document. A new generation of file systems such as OceanStore or OpenCola’s Folders go even farther by replicating the files and breaking them into multiple pieces to make downloading faster. Such systems also tend to encrypt files and provide digital signatures. While OceanStore is an extremely ambitious research project that demands substantial disk space and bandwidth, I see no reason why the more modest innovations offered by peer-to-peer companies won’t become standard features within a few years.

Several companies, such as Thinkstream and Jibe, let you search for material on other people’s and organizations’ computers through structured queries. Jibe claims that employees spend four to eight hours a week searching for files within their own organizations; products like these try to eliminate that waste. They basically tie together databases in different organizations; their big trick is to provide a common kind of search data—metadata—that will work for all these inconsistent and disparate systems. They create a kind of affinity based on shared metadata. O’Reilly has a system like this called Meerkat, which uses a standard called RSS to let news organizations notify users of articles they might be interested in.
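The RSS idea is simple enough to sketch in a few lines: each publisher exposes a small XML file of titles and links, and an aggregator matches that metadata against a user's interests. Here is a minimal illustration, with an invented feed (real aggregators like Meerkat pull RSS from many publishers):

```python
# A minimal sketch of RSS-style metadata matching; the feed below is
# invented for illustration.
import xml.etree.ElementTree as ET

RSS = """<rss version="0.91"><channel>
  <title>Example News</title>
  <item><title>Peer-to-peer naming schemes</title>
        <link>http://example.org/p2p-naming</link></item>
  <item><title>Campus bandwidth report</title>
        <link>http://example.org/bandwidth</link></item>
</channel></rss>"""

def matching_items(rss_text, keyword):
    """Return links of items whose title metadata mentions the keyword."""
    root = ET.fromstring(rss_text)
    return [item.findtext("link")
            for item in root.iter("item")
            if keyword.lower() in item.findtext("title").lower()]

print(matching_items(RSS, "peer-to-peer"))
```

The whole trick is that the search runs against the shared metadata (titles, links, categories), not against the disparate underlying systems.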

Improved metadata would mean finding more people and organizations with which you have an affinity. Developing metadata systems and putting them to work without placing too much of a burden on the user is an issue worthy of further research. How can standardized tags emerge from community efforts—is there a grass-roots process? How do you attach metadata with minimal effort by the user (because the effort most users are willing to invest ranges from minimal to zero)?

Affinity groups

OpenCola’s Folders product is even more convenient. It monitors what files you ask for and what other people use the system to ask for. If it finds you are like other people, because you show interest in the same files they do, the system starts looking for new files that you ought to be interested in. It copies documents automatically into a special folder on your computer without even asking you—collaborative filtering gone mad. Whereas the metadata-driven services I mentioned, Thinkstream, Jibe, and Meerkat, created controlled affinity, Folders fosters a discovered affinity.
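Discovered affinity of this kind can be sketched with a toy collaborative filter: users whose request histories overlap are judged similar, and files popular among your neighbors are suggested to you. The users, files, and threshold below are all invented, and Folders' actual algorithm was surely more sophisticated:

```python
# A toy sketch of discovered affinity: similarity is the Jaccard
# overlap between sets of requested files. All data is invented.
def jaccard(a, b):
    """Overlap between two sets of requested files, from 0 to 1."""
    return len(a & b) / len(a | b)

requests = {
    "alice": {"kernel.pdf", "jazz.mp3", "thesis.ps"},
    "bob":   {"kernel.pdf", "jazz.mp3", "blues.mp3"},
    "carol": {"recipes.txt"},
}

def suggest(user, threshold=0.3):
    """Suggest files held by sufficiently similar users."""
    mine = requests[user]
    suggestions = set()
    for other, theirs in requests.items():
        if other != user and jaccard(mine, theirs) >= threshold:
            suggestions |= theirs - mine  # what my affinity group has that I lack
    return suggestions

print(suggest("alice"))  # bob is similar, so his blues.mp3 is suggested
```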

Workflow control

Consilient (now defunct, unfortunately) was one of the subtlest and most ambitious products in peer-to-peer. It let a programmer set up a workflow, saying, “This document has to go to the dean for approval before heading to finance for processing, and then to the legal department,” and so on. The Consilient workflow tried to make sure each person did what he or she was supposed to do, while also enabling you to change the course of the workflow on the fly. It allowed concurrent processing by different people, when that was appropriate, and let people see the latest completed steps based on their role.

For people interested in such things, Consilient ran on a small Java Virtual Machine and interacted with its users using Java Server Pages. I pull out these buzzwords just to show that there was no magic involved.

Unstructured collaboration

Whereas Consilient stressed structure and accountability, Groove lets everybody do their thing in self-contained groups. Groove implements an old concept, cooperative work environments, but in an extremely flexible manner that doesn’t depend on any central coordinating server. I can edit an outline with my authors, watching every keystroke, and simultaneously carry on a secure chat that is visible to all the authors and me, but only to the authors and me. I can stop halfway and pick up the work from home. It’s a system that promotes freedom and flexibility. Furthermore, it provides a basic infrastructure for real-time synchronization and security, which any program written to its API can take advantage of.

Legal concerns

Clearly, I would like you to experiment with developing and using peer-to-peer systems. Many of you have already created peer-to-peer applications, whether or not you applied this term to them.

Researchers tend to be interested in peer-to-peer’s possibilities. By contrast, administrators of the university and its computing services associate peer-to-peer with the file-sharing programs that have become the IT departments’ biggest problem. The flood of file exchanges is judged not for its fertile possibilities but for its destructive impact, soaking up bandwidth and bringing legal concerns to the surface.

Peer-to-peer is a groundswell that goes far beyond file sharing, as I have shown. But its tendency to give power to the end-users will raise further legal questions as the field develops.

Peer-to-peer advances the key premise that new value comes from sharing information and building on it. Naturally it comes up against copyright issues, a problem that I don’t trivialize because, after all, I work for a publisher.

Most changes that affect businesses are social as well as technological. Businesses that try to hold back technological change find themselves at odds with society as a whole, as is proven by the various digital copyright battles going on now.

These companies are stuck in regressive defense mechanisms because of sheer panic—the same risk that our whole country now faces in its reaction to terrorism—and the lack of creative adaptivity will bring the copyright holders low in the end. Recently they tipped their hand, introducing a bill into Congress that would utterly halt normal technological evolution and try to freeze current social relations in hardware. (Governments have tried to do things like this before; see the final chapter of Jared Diamond’s Guns, Germs, and Steel.) But in contradiction to their fears, social change is usually slow enough that there is plenty of time for an entrepreneurial business to adapt. Physical music records will continue to be in demand for decades to come; the same goes for physical books and other media.

Of course, this doesn’t make it OK to throw cease-and-desist orders in the trash can. You certainly have to understand your legal responsibilities. If students or staff are using your computers to share material that’s copyrighted by someone else, you have legal liability. The much-maligned Digital Millennium Copyright Act (DMCA) actually makes your life easier on this specific issue, because it provides a procedure your system administrators can go through to protect you—but few universities understand the procedure. So learn more about the law.

Content producers may ask you to go beyond what you are legally required to do. They may approach you with various studio-friendly initiatives and tell you that you have a moral obligation to help them restrict digital distribution. Your answer to them should be, “No.” Say to them, “Your industry is going through a historic upheaval, and it is up to you to figure out how you’re going to deal with it. It’s not our job.” However, as university administrators, you do have a responsibility to protect the privacy of your students, faculty, and staff. You have an obligation to protect their freedom of speech to the extent allowed by law.

So you should watch out for copyright holders snooping around your networks or trying to suppress activities that are legal. Don’t think I’m arbitrarily speculating. The Recording Industry Association of America has already declared that they might try to crack into computer systems to prevent the transmission of unauthorized music files, and they were level-headed enough during discussions of the PATRIOT Act to offer an amendment that specifically exempted copyright holders from anti-hacking provisions. (They didn’t succeed.)

Technical issues in peer-to-peer

Let’s return to the happier subject of research topics: what peer-to-peer can do for you, and what you can do for peer-to-peer.

Research into distributed applications and infrastructure applies very widely. Centralized systems are evolving toward decentralization as they grow larger and scale upward. A well-known example is how the hosts file on the Internet became the Domain Name System; a more recent example concerns Web caching and the use of Akamai by large sites with high bandwidth demands. You might have heard that Akamai’s founder and CTO, Daniel C. Lewin, was tragically lost on one of the hijacked planes on September 11; one observer pointed out that the rush to news sites after the tragedies proved the importance of his company’s technology.

So centralized systems evolve toward decentralization. In an intriguing, complementary movement, decentralized or peer-to-peer systems are evolving toward centralization, also in response to growth and the need to scale upward. Gnutella now has superpeers, Freenet provides gateways, JXTA Search creates a hierarchy of servers, and so on.

My colleague Jon Orwant, a Perl expert, statistician, and technology researcher, described peer-to-peer as a mindset rather than a single technology. Tim O’Reilly called it a meme, with similar connotations. In addition to the exciting technologies it spawns, peer-to-peer is a recognition that power exists at the edges of the network. The power has been there for a long time. It is currently harnessed in distributed denial-of-service attacks, unsolicited email, and web bugs, but a peer-to-peer mindset can help us harness it in more constructive ways. Researchers can ask themselves, “Why am I taking work or data from the endpoints and centralizing it? Leaving the work or data at the endpoints probably requires more coordination and consistency checking, but could it be more efficient? Even better, could it open up more creative possibilities on the part of the users?”

Some of the activities I’ve seen on the Internet2 Web site under the middleware directory touch on the problems that centralized as well as peer-to-peer projects face. It would be great for Internet2 developers to remember the peer-to-peer aspects of whatever they are researching, and its potential applications to peer-to-peer. For a start, consider the possibility of symmetric exchanges over all your protocols and infrastructure. Here are some more specific topics.

How to find peers

First, I’ll talk about naming and resource discovery. The only systems where you don’t care about names are systems where you want to be anonymous. Gnutella and Freenet are famous for this characteristic, of course, and they have achieved something incredibly ground-breaking and mind-expanding: they provide content independent of its location. Other, later peer-to-peer systems have built on this innovation, which allows lots of good things. But most systems need to find particular individuals or repositories for information—they need identification and resource discovery.

They achieve this through a shameless lapse away from decentralization. Identities are stored in a strictly centralized repository, as in instant messaging services. Some products, like XDegrees, Jibe, and Redmind, do some fancy distribution and breaking up of the namespace. The good old Domain Name System does this, in fact.
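One common way of breaking up a namespace, DNS-style, is to hash each name and assign it deterministically to one of several directory nodes. The server names below are hypothetical, and real products layer much more on top, but the core idea fits in a few lines:

```python
# A sketch of partitioning a flat namespace across index servers:
# every node computes the same hash, so lookups for a given name
# always go to the same directory without one central registry.
# The server names are invented placeholders.
import hashlib

SERVERS = ["index-0.example.net", "index-1.example.net",
           "index-2.example.net", "index-3.example.net"]

def home_server(name):
    """Deterministically map a peer or resource name to an index server."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return SERVERS[digest[0] % len(SERVERS)]

print(home_server("alice"))
```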

The Gartner Group speaks of a virtual name space for peer-to-peer. I don’t know what makes these names less real and more virtual than any other names. I think what the Gartner Group means is that these name spaces—instant messaging, Napster, and so on—tend to spring up ad hoc and opportunistically. This seems to me a weakness of current peer-to-peer systems, not a strength.

IPv6 will definitely help. It will, we hope, bring users’ systems out into the open, eliminating the current Network Address Translation system that hides the users. But IPv6 is not enough to solve peer-to-peer’s addressing problem. First, we can’t wait until IPv6 is deployed in the larger world. Second, it is naive to think that every device will have a fixed, permanent address when IPv6 is deployed. To do so would overwhelm the world’s routers; one of the major benefits advertised for IPv6 in fact is that it makes renumbering easier. Finally, what we really want is names rather than numbers anyway. When I ask you to visit my Web site, I don’t ask you to type 209.204.146.22 into your browser. Furthermore, I may log in from many places—work, home, a mobile phone, a train station—and I’m still me even though my address is different.

Identification and resource discovery is therefore one of the great problems you can work on in Internet2. I would like answers to the question: “What combination of centralization and decentralization works best for a particular application and information architecture?”

Delivering services

Partly because so many services are already offered through Web HTML forms and CGI, and partly because firewalls block any data not sent through port 80, the chief method of service delivery for the next couple years will be Web services using HTTP and probably either XML-RPC or SOAP. These protocols and the programs that handle them are probably not the most efficient or flexible way to handle peer-to-peer communications.
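To make this concrete, here is roughly what an XML-RPC request looks like on the wire; the method name and parameters are invented, but the marshalling is standard, and Python's standard library can both build and serve such calls:

```python
# A sketch of an XML-RPC request body; the method "fetchFile" and its
# arguments are hypothetical. This XML would travel inside an HTTP
# POST, typically over port 80, which is why firewalls pass it.
import xmlrpc.client

payload = xmlrpc.client.dumps(("peer-42", "thesis.ps"),
                              methodname="fetchFile")
print(payload)
```

The verbosity of the XML envelope around two short parameters hints at why these protocols are probably not the most efficient vehicle for peer-to-peer traffic.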

Network administrators are leery of programs that undermine their security efforts by sending arbitrary application traffic through port 80, and I feel the same way. I’d rather have users come to their network administrators and say, “Here’s a really exciting application, and it needs its own port. Will you open a port for me?” And I’d like administrators to say “Yes” (assuming the application is a useful one). That’s the reason for my giving this speech.

The following figure shows what standards are currently available for peer-to-peer, and roughly how they fit together.

Network Stacks for Peer-to-Peer Applications:

  +----------------------------------------+-----------------------------------+
  | Web services (SOAP, XML-RPC,           | Top layers of JXTA services       |
  | UDDI, WSDL, etc.)                      | (identification, presence, etc.)  |
  +----------------------------------------+-----------------------------------+
  | HTTP or BEEP                           | JXTA                              |
  +----------------------------------------+-----------------------------------+
  | TCP or SCTP                                                                |
  +----------------------------------------------------------------------------+

Some protocols you can explore include JXTA, of course, and all the research projects using it. The SCTP transport-level protocol is interesting for peer-to-peer because it is more flexible than TCP in supporting multiple independent streams of messages. On top of SCTP, you can run another new and intriguing protocol, BEEP. This application-layer protocol was designed by Marshall Rose and colleagues with peers specifically in mind. It supports multiple channels within a single session, and maintains an egalitarian view of the two correspondents: each correspondent can issue requests on one channel while satisfying requests on the other.

Security

Security is the bogey man invoked by many people who want to debunk peer-to-peer. I’m not sure why there’s so much hysteria around the supposed security problems of peer-to-peer; most systems—and certainly commercial systems—are perfectly up-to-date on encryption, digital signatures, digests, and other standard elements of network security.

Naturally, whenever you start up a program that accepts requests and exchanges data with someone else’s computer, you are essentially running a server. You are therefore at risk if you administer the server improperly, like the people who went on cable networks and turned on Windows file-sharing. You are also at risk from buffer overflows and similar program bugs. I have to point out, though, that these risks are currently hypothetical. No reports have been publicized of unwanted file sharing or buffer overflow exploits in the peer-to-peer world. And while you have reason to be gravely concerned when someone pokes a hole into your Web server—particularly if it also gives the intruder access to a database, mail server, etc.—you need to worry a lot less if somebody gets access to a random user’s PC that may contain nothing more important than his leisure and fitness class homework.

I believe that part of the security concern arose because the most famous peer-to-peer systems, like Napster and Freenet, are marvelously open and uncontrolled. To people who are unused to disruptive technologies, open and uncontrolled must mean insecure.

If peer-to-peer were inherently insecure, it would not be used by the McAfee company to distribute updates to their virus detection software. McAfee ASaP is a service provided to large companies to let them distribute updates quickly throughout their organizations. Instead of making ten thousand individuals contact the McAfee Web site—a sure recipe for network overloads—a few initial systems contact the McAfee site, and they pass on the software to other systems in a chain. This is called rumor technology, and is a form of peer-to-peer; the same architecture used by such content delivery networks and streaming media distributors as AllCast. When you’re fighting viruses, you’re clearly concerned with security, and McAfee’s use of a partially peer-to-peer system is a stunning endorsement of peer-to-peer’s security. Their rumor technology is not only more efficient than routine Web downloads, but more secure. Employees of each company have to go outside their corporate network only a few times to get the software. Most of the networking takes place inside the corporate network, presumably protected by a firewall and the general LAN architecture.
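The efficiency of rumor-style distribution is easy to see in a toy simulation: only a few seed machines contact the vendor's site, and everyone else receives the update from a peer inside the LAN. The node counts and fanout below are invented for illustration:

```python
# A toy simulation of "rumor" distribution. Seeds fetch the update
# externally; each node that has it then passes it to `fanout`
# peers per round until everyone is covered.
def distribute(total_nodes, seeds, fanout):
    have_update = list(range(seeds))   # the seeds fetched externally
    external_fetches = seeds
    frontier = list(have_update)
    while len(have_update) < total_nodes:
        next_frontier = []
        for node in frontier:
            for _ in range(fanout):
                if len(have_update) == total_nodes:
                    break
                new = len(have_update)          # next node lacking the update
                have_update.append(new)
                next_frontier.append(new)
        frontier = next_frontier
    return external_fetches, len(have_update)

ext, covered = distribute(total_nodes=10000, seeds=3, fanout=2)
print(ext, covered)  # 3 external downloads cover all 10000 machines
```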

But peer-to-peer systems have to deal with the same security problems as traditional systems. There’s denial of service, where computers can become overloaded with requests or with data. There’s authentication, so you know who’s sending you data, and there are larger trust issues. A centralized public-key infrastructure (PKI) is not necessarily any more robust than the peer-to-peer solution known as a “web of trust.” I would not be surprised if authentication and trust become the greatest success of peer-to-peer. Eventually we may all move to adopt the web of trust as our preferred form of PKI.
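The web-of-trust idea can be sketched simply: instead of one central certificate authority, each user signs the keys of people they know, and you trust a key if a path of signatures connects you to it. The signature graph below is invented:

```python
# A sketch of web-of-trust reachability: trust is the existence of a
# signature path from me to the target, within a hop limit. The
# users and signatures are invented for illustration.
from collections import deque

signed_by = {                 # who has signed whose key
    "alice": {"bob", "carol"},
    "bob":   {"dave"},
    "carol": set(),
    "dave":  set(),
}

def trusts(source, target, max_hops=3):
    """Is target's key reachable from source through signatures?"""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        user, hops = queue.popleft()
        if user == target:
            return True
        if hops < max_hops:
            for nxt in signed_by.get(user, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return False

print(trusts("alice", "dave"))   # True: alice -> bob -> dave
print(trusts("dave", "alice"))   # False: signatures are one-way
```

Real webs of trust (PGP's, for instance) weight paths and require multiple signatures, but the decentralized principle is the same.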

Architecture

There’s lots and lots of room for research projects in architecture. What’s the best structure to impose on the mass of internetworked computers for each combination of application and environment?

Routing

If architecture represents the static view of communications in a peer-to-peer system, routing represents a dynamic view of the same communications. As we can see from Freenet and Gnutella, many peer-to-peer systems impose an extra level of routing at the application layer, far above the routing done at the network or IP layer.

If you have a choice of many correspondents, you may want to retrieve data from the one that you can reach at the lowest cost. If you can get what you want from someone on your LAN, you certainly prefer that to going outside onto the Internet. And if you have to go out to the Internet, you may get better service by choosing someone who shares your ISP and the communications facilities it controls, rather than someone whom you have to reach by passing through many far-flung access points. Peer-to-peer applications that handle location with such flexibility are not doing their own routing—yet—but they do know the architecture of their own networks.
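Location-aware peer selection of this kind reduces to a cost comparison. The cost tiers below are invented placeholders for whatever measure of distance an application might use:

```python
# A sketch of choosing the cheapest peer among several that hold the
# same file. Peers on the LAN beat peers on the same ISP, which beat
# far-flung Internet hosts; the costs and peer names are invented.
COST = {"same-lan": 1, "same-isp": 5, "far-internet": 20}

peers_with_file = [
    ("peer-a", "far-internet"),
    ("peer-b", "same-lan"),
    ("peer-c", "same-isp"),
]

def best_peer(candidates):
    """Pick the peer whose network location carries the lowest cost."""
    return min(candidates, key=lambda p: COST[p[1]])

print(best_peer(peers_with_file))  # ('peer-b', 'same-lan')
```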

The routing issue really becomes critical when you set up a hardware peer-to-peer system. The consortium CANARIE in Canada is doing just that. Universities purchase optical fiber links so that they can connect directly to each other and get out from under the thumb of an expensive telecom carrier. They can then send data very cheaply over a series of high-bandwidth links. It’s the old Usenet decked out in a new, twenty-first century guise. But you have to choose where you route your data carefully.

So routing may no longer be submerged at the network layer. Perhaps peer-to-peer applications will become aware of network topology and control their own routing. Or perhaps a new layer will be added just beneath the applications to handle this function.

Metadata

I have already mentioned metadata as an area for research. Metadata includes the kinds of categories people search for, what scales they use to measure one resource against another—or in a social sense, what brings people together.

Jabber and RDF are particularly promising ways to deploy metadata, but communities must somehow agree on tags. Then applications that exploit their potential need to be developed.

Bandwidth issues

And finally, bandwidth issues, one of the fundamental features of Internet2. I’ve saved this for last among my technical topics, because the popularity of file-sharing systems on college campuses and the negative reaction of many system administrators deserve a good share of time.

By decentralizing data and therefore redirecting users so they download data directly from other users’ computers, Napster reduced the load on its servers to the point where it could cheaply support tens of millions of users. The same principle is used in many commercial peer-to-peer systems; I just mentioned it in relation to McAfee ASaP. In short, peer-to-peer can distribute not only files, but the burden of supporting network connections. The overall bandwidth required on the Internet remains the same as in centralized systems, but bottlenecks are eliminated at central sites—and equally importantly, at their ISPs.

How much bandwidth does a simple peer-to-peer system like Napster save? Let’s look at some rough estimates made by a company called CenterSpan, which makes a peer-to-peer content sharing system called C-Star. They estimated that, if you put together Napster and the various Gnutella systems and all the knock-offs, you’d see about three billion songs traded every month—sounds like a high number, but it’s been replicated elsewhere and could be pretty accurate. If you delivered all those songs from a central server, you’d need 25,000 T1 lines costing 25 million dollars a month. (The burden could be even worse if connections are bursty.) Peer-to-peer has to be more efficient.
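That arithmetic is easy to check. Assuming an average song size of about 4 MB (my assumption; CenterSpan's inputs weren't published), the back-of-the-envelope numbers land right around the 25,000-T1 figure:

```python
# Checking CenterSpan's rough estimate: delivering 3 billion songs a
# month from a central server, assuming ~4 MB per song (an
# illustrative assumption).
songs_per_month   = 3e9
bytes_per_song    = 4 * 1024**2        # ~4 MB per MP3
seconds_per_month = 30 * 24 * 3600

bits_per_second = songs_per_month * bytes_per_song * 8 / seconds_per_month
t1_capacity = 1.544e6                  # a T1 line carries 1.544 Mbps
t1_lines = bits_per_second / t1_capacity

print(round(bits_per_second / 1e9), "Gbps aggregate")
print(round(t1_lines), "T1 lines")     # roughly 25,000
```

And that is a sustained average; as the line above notes, bursty demand would require even more headroom.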

Many network administrators will now protest that Napster was a bandwidth hog and overloaded their campus networks. This is not a problem caused by peer-to-peer: the load would have been just as bad had all those students exchanged files over FTP or some other protocol. Music files are just plain big, and if it suddenly became the hippest thing on the planet to exchange PowerPoint presentations or hundred-page PostScript files (like term papers), the load would be just the same.

I’m not surprised that colleges would complain about Napster bandwidth requirements, because I hear the same wringing of hands over education in general. I hear there are too many applicants to top colleges. Excuse me, but wouldn’t it be good to educate more students? Instead of saying there are too many applicants, why don’t you work on increasing the availability of high-quality course offerings? I know you don’t have tenure-track positions for all the people awarded doctorates, but it’s not your job to offer everyone a position; it’s your job to educate them.

College administrators have fallen into the same rut as telephone companies that are slow to roll out high-bandwidth lines, or the recording industry that is shutting down Napster. These institutions all find it more profitable to manage scarcity than to offer abundance.

I’ll apply the same reasoning now to Napster. The reason tens of millions of people used it is that it opened up the wonderful universe of music. Napster was much, much more than a free source of popular tunes; it represented exploration, a striving to know the unknown, a widening of cultural horizons. Yes, I know most of the stuff traded over Napster was junk, but much of what Beethoven wrote was also junk; just ask any musicologist. The point is that you need to cut a wide swath to encourage new experiences and new sounds. Universities should be excited by the spirit of curiosity shown by Napster. It was a flowering of cultural opportunities never before seen in the world, and that’s why there were so many downloads. Let’s provide bandwidth for the material people want instead of complaining that they want it.

I know you have to pay for bandwidth, so you have to charge for bandwidth too, somehow. I’m sure your users like to have things cost-free, and the next best alternative to cost-free is flat-rate. If you have to move to some kind of chokepoints or metered pricing, I don’t have a right to criticize you, but I’d like to offer a couple points of comparison for you to consider first.

One frequently made observation is that Internet access and use are much greater in the United States, where local calls are priced at flat rates, than in most countries where local calls are metered. Observers tend to conclude that flat-rate pricing encourages experimentation, and suggest that many innovative uses of the Internet arose within this environment of experimentation. The whole phenomenon called “surfing” was a critical phase in the growth of the World Wide Web.

A second fascinating historical point of comparison is the New York City subway system. When it was opened near the beginning of the 1900s, the city’s leaders had to decide whether to base prices on how far people were riding, or to charge a nickel for every ride regardless of distance. They chose the latter, flat-rate system. Historians believe this led to the rapid spread of New Yorkers out of Manhattan and into the surrounding boroughs, creating a richer and more thriving city.

Social impact and public excitement

Peer-to-peer excites people because they can participate and make a difference. Even something as impersonal as SETI@home—where users downloaded software that performed calculations in the background—attracted millions of volunteers; and many said they did it because they felt like they were part of something. Just think how much more sense of ownership and pride can evolve around systems where you share ideas and content that have personal meaning to you.

University professors are already feeling anxious about students who share personal notes on the Web. Some professors have tried to force students to remove these notes—all the while uttering the same irrelevant shibboleths that plague peer-to-peer now: the professors claim to be worried about quality, they think they own intellectual property rights to the ideas they put forward in class—not true!—and so forth. Everybody knows the professors are just scared to death at having their work exposed to scrutiny.

Peer-to-peer will up the ante even further. We have no idea what students will think of sharing next. Their experiments should be welcomed, because they will make the university more transparent and force professors to teach better. Remember what happened—excuse me if this is trite—when Alexander Fleming discovered a foreign mold in one of his petri dishes. Instead of throwing it out, he started to research it, and discovered penicillin.

There’s still a lot of innovation left for computer technology. The next time you need rapid interaction, efficient data sharing, and the combined processing of inputs from many different sources, I urge you to look at what peer-to-peer can offer.