Peer-to-peer and self-organizing algorithms

Márk Jelasity, Fall Semester, 2007, Szeged, Hungary



Abstract

The course is about distributed algorithms and systems that are fully decentralized, extremely scalable, and fault tolerant. We will see techniques for organizing millions of independent components to perform useful functions while avoiding bottlenecks or single points of failure. The need for studying such algorithms has emerged in many fields independently. Most recently, P2P systems, and especially file sharing protocols, have triggered considerable research effort in this direction. In other fields, such as artificial intelligence (through multi-agent systems) and networking (routing protocols, Usenet, etc.), similar problems have long been studied, not to mention seemingly unrelated fields such as biological self-organization, sociology, and so on.


Introduction

Fully decentralized systems and algorithms have recently become widely visible and mainstream due to P2P applications. However, decentralization and emergence have long been topics of investigation in various fields of research. We review the history of decentralized and self-organizing algorithms to better understand their various motivations. Finally, we discuss some issues central to the course, such as the role of cooperation, dynamism, scale, and the topology of the communication network.

Reading material


Complex Networks

Large, distributed, self-managing systems inevitably involve complex networks, either explicitly designed or, when the system is not centrally controlled, emergent and often unexpected. We review some of the basic models of complex networks and their main properties.
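
As a concrete taste of such models, here is a minimal sketch of preferential attachment (the Barabási-Albert model), one source of the power-law degree distributions discussed in this unit. The edge-list representation and all names are illustrative choices, not part of any standard library.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Sketch: Barabasi-Albert preferential attachment, returned as an edge list.
    public class BarabasiAlbert {

        // n nodes in total; every new node attaches to m distinct earlier nodes.
        public static List<int[]> generate(int n, int m, Random rnd) {
            List<int[]> edges = new ArrayList<int[]>();
            // One entry per edge endpoint, so uniform sampling from this list
            // is exactly degree-proportional sampling.
            List<Integer> endpoints = new ArrayList<Integer>();
            for (int i = 0; i <= m; i++)        // seed graph: clique on m+1 nodes
                for (int j = i + 1; j <= m; j++) {
                    edges.add(new int[] { i, j });
                    endpoints.add(i);
                    endpoints.add(j);
                }
            for (int v = m + 1; v < n; v++) {
                List<Integer> chosen = new ArrayList<Integer>();
                while (chosen.size() < m) {     // m distinct degree-biased targets
                    int u = endpoints.get(rnd.nextInt(endpoints.size()));
                    if (u != v && !chosen.contains(u)) chosen.add(u);
                }
                for (int u : chosen) {
                    edges.add(new int[] { v, u });
                    endpoints.add(v);
                    endpoints.add(u);
                }
            }
            return edges;
        }
    }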

Reading material

Lecture slides


Search: unstructured networks

The first real decentralized file sharing networks did not pay much attention to overlay network topology: for example, Gnutella used a relaxed approach that allowed for self-organizing overlay construction and flooding-based communication. Since no strict overlay topology is enforced, such systems are often termed unstructured, although this terminology is somewhat misleading. Motivated by the initial success of Gnutella (and its subsequent evolution), we review the strengths and weaknesses of working with unstructured networks and the techniques one can apply there, including search protocols and different ways of adapting the system (replication, topology adaptation) that can substantially improve search performance. The overall conclusion is that (enhanced) unstructured systems are a viable and simple alternative, with the sole drawback of not being able to deal with rare (unpopular) items.
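
To fix ideas, here is a toy sketch of TTL-limited flooding over an in-memory graph. It captures which nodes a Gnutella-style query can reach, though not the duplicate messages a real flood also generates, and none of the wire protocol; all names are illustrative.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch: TTL-limited flooding. Returns the set of nodes the query reaches.
    // Assumes every node has an entry in the adjacency map.
    public class Flooding {

        public static Set<Integer> flood(Map<Integer, List<Integer>> graph,
                                         int start, int ttl) {
            Set<Integer> reached = new HashSet<Integer>();
            Deque<int[]> queue = new ArrayDeque<int[]>(); // {node, remaining TTL}
            reached.add(start);
            queue.add(new int[] { start, ttl });
            while (!queue.isEmpty()) {
                int[] cur = queue.poll();
                if (cur[1] == 0) continue; // TTL spent: stop forwarding here
                for (int nb : graph.get(cur[0]))
                    if (reached.add(nb))   // forward only to nodes not yet reached
                        queue.add(new int[] { nb, cur[1] - 1 });
            }
            return reached;
        }
    }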

Reading material

Lecture slides


Search: structured networks

Problems with search protocols in unstructured networks include their inability to locate rare items and, sometimes, their lack of scalability. Partly to address these issues, and partly to support applications other than search, the distributed hash table (DHT) abstraction has been introduced, which supports the storage of (key,value) pairs and their lookup based on the key. DHTs are based on self-organizing overlay networks in which the neighborhood relations are more strictly controlled than in unstructured networks. We will overview the DHT abstraction and the best-known DHT protocols, and take a look at how they can be utilized for search. Finally, we conclude the discussion of search protocols by discussing approaches that combine the best parts of the unstructured and DHT worlds.
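
The abstraction itself is small. The following centralized toy (one process, no messages, no failures) illustrates only the Chord-style successor rule that assigns each key to a node on the identifier ring; all names are illustrative.

    import java.util.TreeMap;

    // Sketch: the (key,value) interface of a DHT, with the Chord successor rule
    // computed centrally. Assumes at least one node has been added.
    public class ToyDht<V> {

        // node id -> the (key,value) pairs that node is responsible for
        private final TreeMap<Long, TreeMap<Long, V>> ring =
            new TreeMap<Long, TreeMap<Long, V>>();

        public void addNode(long nodeId) {
            ring.put(nodeId, new TreeMap<Long, V>());
        }

        // The first node id clockwise from the key, wrapping around the ring.
        private long successor(long key) {
            Long id = ring.ceilingKey(key);
            return id != null ? id : ring.firstKey();
        }

        public void put(long key, V value) {
            ring.get(successor(key)).put(key, value);
        }

        public V get(long key) {
            return ring.get(successor(key)).get(key);
        }
    }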

Reading material

Lecture slides


Cooperative Content Distribution

Locating content through search is only one side of the story. Content (large files, audio, video) also has to be actually delivered to the user. Besides search, content delivery is quickly becoming a fundamental issue on the Internet and in P2P networks. In particular, this is one of the applications where P2P protocols represent mainstream technology (for example, BitTorrent). Here decentralization and cooperation are the best way to go for reasons of scalability and robustness, and their role is not simply to avoid legal trouble. We overview some of the basic approaches to P2P content delivery, both for BitTorrent-style distribution of large files and for streaming media.
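
One idea at the heart of BitTorrent-style swarming is rarest-first chunk selection. The sketch below, with bit sets standing in for the chunk bitfields that neighbors advertise, shows the selection rule in isolation; it is an illustration, not BitTorrent's actual implementation.

    import java.util.BitSet;
    import java.util.List;

    // Sketch: rarest-first selection. "have" is our chunk bitfield; "neighbors"
    // holds the bitfields our neighbors advertised.
    public class RarestFirst {

        // Returns the missing chunk held by the fewest neighbors, or -1 if no
        // neighbor has anything we miss.
        public static int pick(BitSet have, List<BitSet> neighbors, int numChunks) {
            int best = -1;
            int bestCount = Integer.MAX_VALUE;
            for (int c = 0; c < numChunks; c++) {
                if (have.get(c)) continue;          // already downloaded
                int count = 0;
                for (BitSet nb : neighbors)
                    if (nb.get(c)) count++;
                if (count > 0 && count < bestCount) {
                    bestCount = count;
                    best = c;
                }
            }
            return best;
        }
    }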

Reading material

Lecture slides


Techniques for Hiding

Decentralized networks have a controversial application: hiding various activities and information from observers. We review two branches of this theme: botnets and anonymity networks. In the former, criminals apply P2P technology to avoid detection of their activity and identity. In the latter, specialized P2P networks offer anonymity to users who wish to access Internet services that are indecent, prohibited, or otherwise problematic. In this case the emphasis and intent are "white-hat" applications, such as protecting free speech.

Reading material

Lecture slides


Home Projects

Students are required to do one home project. This will typically involve performing simulation experiments with the PeerSim simulator, unless the project is a real implementation of a particular protocol. Some possible topics for projects are listed below, but other topics are also acceptable (ask me if you have your own project idea). The PeerSim simulator is discussed in class; in addition, you can find documentation on the PeerSim homepage. Try to solve problems on your own, but if you really get stuck, ask me.

Presentation of the projects

Graduate students will write a paper about their project in which they discuss the related literature, the applied research methodology (what was tested and how, why exactly that way, and what the goal was), and the actual results. The work is expected to show signs of independent analytical and critical thinking.

Undergraduate students will present their work verbally, with the help of a few slides (a short presentation). They are expected to show a clear understanding of the functioning and the importance of a selected protocol.

Some possible projects

All projects (except implementation projects) will use PeerSim. Both the cycle-based and event-based models are acceptable, but if the cycle-based model is used, some arguments are needed to justify the simplification. Truly excellent projects will compare the two models.

  1. Simulate the evolution of the early unstructured Gnutella network. Implement the join protocol and the failure detection mechanism. Test your implementation in different scenarios (network size, churn, maybe different up-time distributions, etc.). Test the resulting network from the point of view of graph properties (path length, clustering, degree distribution, connectivity, robustness to massive node failure). Do you find the same properties that researchers found in the real network?

    Hint: in PeerSim you have all the components for generating the required statistics and for simulating churn; what you need to implement is the rather simple join protocol (probably as a NodeInitializer; see the sketch after this list) and the maintenance protocol (probably with the CDProtocol interface, even in the event-driven model).

  2. Pick at least two search protocols designed for unstructured networks (random walk (self-avoiding, degree-biased, etc.), expanding ring, etc.), implement them, and compare them on at least two different topologies, including a power-law topology and a k-out random graph topology. Do the comparison as a function of the popularity of the searched item. Test at least the mean time to find the searched item, the total number of messages in the whole network, and load balancing (for example, the maximal and minimal number of messages processed by a single node).

    Hint: you need to implement only the search protocols, because the topologies are readily available in PeerSim; a sketch of a self-avoiding walk is given after this list. When collecting data about the experiment, you might need to count the number of messages processed by each node, and use standard PeerSim components from the vector package to collect statistics.

  3. Implement a simple random walk search algorithm, and implement at least two replication strategies: path replication and owner replication, as learned in class. Start with a uniform distribution of search items and a power-law distribution of queries for these items. Test the theoretical result we have seen in class, which predicts the distribution of the number of replicas in the system. Do you see square-root and proportional replication emerging? Is the average performance of square-root replication indeed better? How about for the most popular items only? Test your system on both k-out random graphs and power-law graphs. Do you see the same result on both graphs? Why?

    Hint: you need to implement only the search and replication protocols, because the topologies are readily available in PeerSim. Also, you might need to implement a custom Control component that iterates through the network to collect statistics; a sketch of such an observer is given after this list. Use IncrementalStats or IncrementalFreq cleverly.

  4. Implement the join protocol of a distributed hash table, such as Chord or Pastry. Test the resulting DHT, as a function of the protocol parameters, from the point of view of the average hop count while routing random IDs to their destination, and also test whether routing is successful.

    Hint: if you use the event-based engine, your work is much more valuable, and you can play with the speed at which nodes are added, etc. If you use the cycle-based model, then the joins can be modeled as completely non-overlapping, and you can implement them entirely within a NodeInitializer (that is, all traffic generated by the join is finished when the initializer returns). This way it is impossible to test what happens if nodes are added too fast, but you can still test the resulting network. A sketch of the greedy routing step itself is given after this list.

  5. Compare different coding and chunk selection strategies for BitTorrent-style distribution of large files by implementing at least two strategies. Pick one of the experiments in this paper by Gkantsidis and Rodriguez and reproduce it. Most importantly, test whether the results hold for a larger network size (at least 10,000 nodes), larger neighborhoods (around 30-40, as in BitTorrent), realistic traces (realistic node capacity distributions and join/leave times), and a more faithful implementation of BitTorrent techniques such as the endgame mode. Do you agree with the conclusions of the paper? Do you find any contradictions?

    Hint: You do not actually have to implement network coding; you only need to track the coefficients (see the coefficient-tracking sketch after this list). Use a compact representation of the chunk set and the coefficients to save memory. If you use the cycle-based engine, take care to model chunk download time by not allowing a chunk to make more than one hop per cycle.

  6. Implement a client for one of the popular P2P networks, for example, a BitTorrent client. Demonstrate that you can use your client to participate in the respective P2P network.

    Hint: Do not get lost in details such as optimizations and the user interface; focus on the crucial functionality of the client. First make sure you can join the network and participate, and then, if you have time left, work on optimizations and other details. (A sketch of bencode parsing, the very first step, is given after this list.)
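

Code sketches for the projects

The sketches below are starting points, not given components of the course: they assume the PeerSim 1.0 API where stated, and every other name (classes, configuration keys, helpers) is an illustrative assumption.

For project 1, the join protocol could be written as a NodeInitializer roughly as follows. The configuration keys and the wiring policy (connecting to random live nodes, standing in for the hosts a real client learns from a host cache) are assumptions.

    import peersim.config.Configuration;
    import peersim.core.CommonState;
    import peersim.core.Linkable;
    import peersim.core.Network;
    import peersim.core.Node;
    import peersim.dynamics.NodeInitializer;

    // Sketch: wires a newly created node to a few random live nodes.
    public class GnutellaJoin implements NodeInitializer {

        private final int pid;    // protocol id of the Linkable overlay (config key assumed)
        private final int degree; // number of initial links (config key assumed)

        public GnutellaJoin(String prefix) {
            pid = Configuration.getPid(prefix + ".protocol");
            degree = Configuration.getInt(prefix + ".degree");
        }

        public void initialize(Node n) {
            Linkable overlay = (Linkable) n.getProtocol(pid);
            while (overlay.degree() < degree) {
                Node peer = Network.get(CommonState.r.nextInt(Network.size()));
                if (peer != n && peer.isUp() && !overlay.contains(peer)) {
                    overlay.addNeighbor(peer);
                    // Gnutella links are symmetric, so add the reverse edge too.
                    ((Linkable) peer.getProtocol(pid)).addNeighbor(n);
                }
            }
        }
    }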
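
For project 2, it may help to fix the walk logic in plain Java before porting it into a PeerSim protocol class. Here is a sketch of a self-avoiding random walk over an in-memory adjacency list; all names are illustrative.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;

    // Sketch: self-avoiding random walk; prefers unvisited neighbors and falls
    // back to a visited one when stuck. Returns the hop count, or -1 on failure.
    public class RandomWalkSearch {

        public static int search(Map<Integer, List<Integer>> graph, int start,
                                 Set<Integer> holders, int maxHops, Random rnd) {
            Set<Integer> visited = new HashSet<Integer>();
            int current = start;
            for (int hop = 0; hop <= maxHops; hop++) {
                if (holders.contains(current)) return hop; // found a replica
                visited.add(current);
                List<Integer> neighbors = graph.get(current);
                if (neighbors == null || neighbors.isEmpty()) return -1; // dead end
                List<Integer> fresh = new ArrayList<Integer>();
                for (int nb : neighbors)
                    if (!visited.contains(nb)) fresh.add(nb);
                List<Integer> pool = fresh.isEmpty() ? neighbors : fresh;
                current = pool.get(rnd.nextInt(pool.size()));
            }
            return -1; // walk length exhausted
        }
    }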
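
For project 3, the statistics-collecting Control mentioned in the hint could look like this. It assumes the PeerSim 1.0 API; ReplicationProtocol and its replicaCount() method stand for a hypothetical protocol class of your own.

    import peersim.config.Configuration;
    import peersim.core.Control;
    import peersim.core.Network;
    import peersim.util.IncrementalFreq;

    // Hypothetical protocol interface of your own; not part of PeerSim.
    interface ReplicationProtocol { int replicaCount(); }

    // Sketch: walks the whole network and histograms per-node replica counts.
    public class ReplicaObserver implements Control {

        private final int pid; // protocol id (config key assumed)

        public ReplicaObserver(String prefix) {
            pid = Configuration.getPid(prefix + ".protocol");
        }

        public boolean execute() {
            IncrementalFreq freq = new IncrementalFreq();
            for (int i = 0; i < Network.size(); i++) {
                ReplicationProtocol p =
                    (ReplicationProtocol) Network.get(i).getProtocol(pid);
                freq.add(p.replicaCount());
            }
            System.out.println(freq); // one histogram per observation round
            return false;             // false: keep the simulation running
        }
    }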
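
For project 4, it pays to isolate the routing decision from the messaging. A Chord-style greedy next-hop choice on a 32-bit identifier ring could look like the following plain-Java sketch; the ring size and all names are assumptions.

    // Sketch: the greedy routing step of a Chord-style DHT on a 32-bit ring.
    public class ChordRouting {

        static final long RING = 1L << 32; // identifier space size (assumption)

        // Clockwise distance from a to b on the ring.
        static long dist(long a, long b) {
            return Math.floorMod(b - a, RING);
        }

        // Among our known links, pick the one closest to the key without
        // passing it; returning self means we are the closest node we know.
        static long nextHop(long self, long[] links, long key) {
            long best = self;
            for (long link : links)
                if (dist(link, key) < dist(best, key)) best = link;
            return best;
        }
    }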
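
For project 5, tracking coefficients without payloads amounts to rank bookkeeping. The sketch below does this over GF(2) with bit sets, which keeps the idea visible in a few lines; note that the paper's random linear coding works over a larger field, so treat this only as a simplified starting point.

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    // Sketch: per-node bookkeeping for network coding over GF(2). Only the
    // coefficient vectors are stored; payloads are never simulated.
    public class CodedBuffer {

        private final List<BitSet> basis = new ArrayList<BitSet>();
        private final int numChunks; // rank() == numChunks means decodable

        public CodedBuffer(int numChunks) {
            this.numChunks = numChunks;
        }

        // Gaussian elimination step: returns true iff the incoming coded
        // block is innovative (linearly independent of what we hold).
        public boolean receive(BitSet coefficients) {
            BitSet v = (BitSet) coefficients.clone();
            for (BitSet row : basis) {
                int pivot = row.nextSetBit(0); // leading coefficient of row
                if (v.get(pivot)) v.xor(row);  // cancel it out of v
            }
            if (v.isEmpty()) return false;     // dependent: not innovative
            basis.add(v);
            return true;
        }

        public int rank() { return basis.size(); }

        public boolean canDecode() { return rank() == numChunks; }
    }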
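
For project 6, the first concrete step is decoding bencoded data (.torrent files and tracker responses). Below is a minimal decoder sketch; for brevity it turns byte strings into Java Strings via ISO-8859-1 (which maps bytes one-to-one), but a real client must keep fields like "pieces" as raw bytes.

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: minimal bencode decoder (integers, strings, lists, dictionaries).
    public class Bencode {

        private final byte[] data;
        private int pos = 0;

        public Bencode(byte[] data) { this.data = data; }

        public Object decode() {
            char c = (char) data[pos];
            if (c == 'i') {                       // integer: i<digits>e
                pos++;
                int end = find('e');
                long value = Long.parseLong(ascii(pos, end));
                pos = end + 1;
                return value;
            } else if (c == 'l') {                // list: l<items>e
                pos++;
                List<Object> list = new ArrayList<Object>();
                while ((char) data[pos] != 'e') list.add(decode());
                pos++;
                return list;
            } else if (c == 'd') {                // dictionary: d<key value ...>e
                pos++;
                Map<String, Object> map = new LinkedHashMap<String, Object>();
                while ((char) data[pos] != 'e') map.put((String) decode(), decode());
                pos++;
                return map;
            } else {                              // string: <length>:<bytes>
                int colon = find(':');
                int len = Integer.parseInt(ascii(pos, colon));
                String s = new String(data, colon + 1, len, StandardCharsets.ISO_8859_1);
                pos = colon + 1 + len;
                return s;
            }
        }

        private int find(char c) {
            int i = pos;
            while ((char) data[i] != c) i++;
            return i;
        }

        private String ascii(int from, int to) {
            return new String(data, from, to - from, StandardCharsets.US_ASCII);
        }
    }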

