Peer-to-peer and self-organizing algorithms

Márk Jelasity, Fall Semester, 2007, Szeged, Hungary



Abstract

The course is about distributed algorithms and systems that are fully decentralized, extremely scalable, and fault tolerant. We will see techniques for organizing millions of independent components to perform useful functions while avoiding bottlenecks or single points of failure. The need for studying such algorithms has emerged in many fields independently. Most recently, P2P systems, and especially file sharing protocols, have triggered considerable research effort in this direction. In other fields, such as artificial intelligence (through multi-agent systems) and networking (routing protocols, Usenet, etc.), similar problems have long been studied, not to mention seemingly unrelated fields such as biological self-organization, sociology, and so on.


Introduction

Fully decentralized systems and algorithms have recently become widely visible and mainstream due to P2P applications. However, decentralization and emergence have long been topics of investigation in various fields of research. We review the history of decentralized and self-organizing algorithms to better understand their various motivations. Finally, we discuss some issues central to the course, such as the role of cooperation, dynamism, scale, and the topology of the communication network.

Reading material


Complex Networks

Large, distributed, self-managing systems inevitably involve complex networks, either explicitly designed or, when the system is not centrally controlled, emergent and often unexpected. We review some of the basic models of complex networks and their main properties.
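
As a concrete taste of such models, here is a minimal sketch of preferential attachment (the Barabási-Albert model), one source of the power-law degree distributions discussed in this unit. The edge-list representation and all names are illustrative choices, not part of any standard library.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Sketch: Barabasi-Albert preferential attachment, returned as an edge list.
    public class BarabasiAlbert {

        // n nodes in total; every new node attaches to m distinct earlier nodes.
        public static List<int[]> generate(int n, int m, Random rnd) {
            List<int[]> edges = new ArrayList<int[]>();
            // One entry per edge endpoint, so uniform sampling from this list
            // is exactly degree-proportional sampling.
            List<Integer> endpoints = new ArrayList<Integer>();
            for (int i = 0; i <= m; i++)        // seed graph: clique on m+1 nodes
                for (int j = i + 1; j <= m; j++) {
                    edges.add(new int[] { i, j });
                    endpoints.add(i);
                    endpoints.add(j);
                }
            for (int v = m + 1; v < n; v++) {
                List<Integer> chosen = new ArrayList<Integer>();
                while (chosen.size() < m) {     // m distinct degree-biased targets
                    int u = endpoints.get(rnd.nextInt(endpoints.size()));
                    if (u != v && !chosen.contains(u)) chosen.add(u);
                }
                for (int u : chosen) {
                    edges.add(new int[] { v, u });
                    endpoints.add(v);
                    endpoints.add(u);
                }
            }
            return edges;
        }
    }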

Reading material

Lecture slides


Search: unstructured networks

The first real decentralized file sharing networks did not pay much attention to overlay network topology: for example, Gnutella used a relaxed approach that allowed for self-organizing overlay construction and flooding-based communication. Since no strict overlay topology is enforced, such systems are often termed unstructured, although this terminology is somewhat misleading. Motivated by the initial success of Gnutella (and its subsequent evolution), we review the strengths and weaknesses of working with unstructured networks and the techniques one can apply there, including search protocols and different ways of adapting the system (replication, topology adaptation) that can substantially improve search performance. The overall conclusion is that (enhanced) unstructured systems are a viable and simple alternative, with the sole drawback of not being able to deal with rare (unpopular) items.
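
To fix ideas, here is a toy sketch of TTL-limited flooding over an in-memory graph. It captures which nodes a Gnutella-style query can reach, though not the duplicate messages a real flood also generates, and none of the wire protocol; all names are illustrative.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch: TTL-limited flooding. Returns the set of nodes the query reaches.
    // Assumes every node has an entry in the adjacency map.
    public class Flooding {

        public static Set<Integer> flood(Map<Integer, List<Integer>> graph,
                                         int start, int ttl) {
            Set<Integer> reached = new HashSet<Integer>();
            Deque<int[]> queue = new ArrayDeque<int[]>(); // {node, remaining TTL}
            reached.add(start);
            queue.add(new int[] { start, ttl });
            while (!queue.isEmpty()) {
                int[] cur = queue.poll();
                if (cur[1] == 0) continue; // TTL spent: stop forwarding here
                for (int nb : graph.get(cur[0]))
                    if (reached.add(nb))   // forward only to nodes not yet reached
                        queue.add(new int[] { nb, cur[1] - 1 });
            }
            return reached;
        }
    }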

Reading material

Lecture slides


Search: structured networks

Problems with search protocols in unstructured networks include their inability to locate rare items and, sometimes, their lack of scalability. Partly to address these issues, and partly to support applications other than search, the distributed hash table (DHT) abstraction has been introduced, which supports the storage of (key,value) pairs and their lookup based on the key. DHTs are based on self-organizing overlay networks in which the neighborhood relations are more strictly controlled than in unstructured networks. We will overview the DHT abstraction and the best-known DHT protocols, and take a look at how they can be utilized for search. Finally, we conclude the discussion of search protocols by discussing approaches that combine the best parts of the unstructured and DHT worlds.
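
The abstraction itself is small. The following centralized toy (one process, no messages, no failures) illustrates only the Chord-style successor rule that assigns each key to a node on the identifier ring; all names are illustrative.

    import java.util.TreeMap;

    // Sketch: the (key,value) interface of a DHT, with the Chord successor rule
    // computed centrally. Assumes at least one node has been added.
    public class ToyDht<V> {

        // node id -> the (key,value) pairs that node is responsible for
        private final TreeMap<Long, TreeMap<Long, V>> ring =
            new TreeMap<Long, TreeMap<Long, V>>();

        public void addNode(long nodeId) {
            ring.put(nodeId, new TreeMap<Long, V>());
        }

        // The first node id clockwise from the key, wrapping around the ring.
        private long successor(long key) {
            Long id = ring.ceilingKey(key);
            return id != null ? id : ring.firstKey();
        }

        public void put(long key, V value) {
            ring.get(successor(key)).put(key, value);
        }

        public V get(long key) {
            return ring.get(successor(key)).get(key);
        }
    }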

Reading material

Lecture slides


Cooperative Content Distribution

Locating content through search is only one side of the story. Content (large files, audio, video) also has to be actually delivered to the user. Besides search, content delivery is quickly becoming a fundamental issue on the Internet and in P2P networks. In particular, this is one of the applications where P2P protocols represent mainstream technology (for example, BitTorrent). Here decentralization and cooperation are the best way to go for reasons of scalability and robustness, and their role is not simply to avoid legal trouble. We overview some of the basic approaches to P2P content delivery, both for BitTorrent-style distribution of large files and for streaming media.
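
One idea at the heart of BitTorrent-style swarming is rarest-first chunk selection. The sketch below, with bit sets standing in for the chunk bitfields that neighbors advertise, shows the selection rule in isolation; it is an illustration, not BitTorrent's actual implementation.

    import java.util.BitSet;
    import java.util.List;

    // Sketch: rarest-first selection. "have" is our chunk bitfield; "neighbors"
    // holds the bitfields our neighbors advertised.
    public class RarestFirst {

        // Returns the missing chunk held by the fewest neighbors, or -1 if no
        // neighbor has anything we miss.
        public static int pick(BitSet have, List<BitSet> neighbors, int numChunks) {
            int best = -1;
            int bestCount = Integer.MAX_VALUE;
            for (int c = 0; c < numChunks; c++) {
                if (have.get(c)) continue;          // already downloaded
                int count = 0;
                for (BitSet nb : neighbors)
                    if (nb.get(c)) count++;
                if (count > 0 && count < bestCount) {
                    bestCount = count;
                    best = c;
                }
            }
            return best;
        }
    }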

Reading material

Lecture slides


Techniques for Hiding

Decentralized networks have a controversial application: hiding various activities and information from observers. We review two branches of this theme: botnets and anonymity networks. In the former, criminals apply P2P technology to avoid detection of their activity and identity. In the latter, specialized P2P networks offer anonymity to users who wish to access Internet services that are indecent, prohibited, or otherwise problematic. In this case the emphasis and intent are "white-hat" applications, such as protecting free speech.

Reading material

Lecture slides


Home Projects

Students are required to do one home project. This will typically involve performing simulation experiments with the PeerSim simulator, unless the project is a real implementation of a particular protocol. Some possible topics for projects are listed below, but other topics are also acceptable (ask me if you have your own project idea). The PeerSim simulator is discussed in class; in addition, you can find documentation on the PeerSim homepage. Try to solve problems on your own, but if you really get stuck, ask me.

Presentation of the projects

Graduate students will write a paper about their project in which they discuss the related literature, the applied research methodology (what was tested and how, why exactly that way, and what the goal was), and the actual results. The work is expected to show signs of independent analytical and critical thinking.

Undergraduate students will present their work verbally, with the help of a few slides (a short presentation). They are expected to show a clear understanding of the functioning and the importance of a selected protocol.

Some possible projects

All projects (except implementation projects) will use PeerSim. Both the cycle-based and event-based models are acceptable, but if the cycle-based model is used, some arguments are needed to justify the simplification. Truly excellent projects will compare the two models.

  1. Simulate the evolution of the early unstructured Gnutella network. Implement the join protocol and the failure detection mechanism. Test your implementation in different scenarios (network size, churn, maybe different up-time distributions, etc.). Test the resulting network from the point of view of graph properties (path length, clustering, degree distribution, connectivity, robustness to massive node failure). Do you find the same properties that researchers found in the real network?

    Hint: in PeerSim you have all the components for generating the required statistics and for simulating churn; what you need to implement is the rather simple join protocol (probably as a NodeInitializer; see the sketch after this list) and the maintenance protocol (probably with the CDProtocol interface, even in the event-driven model).

  2. Pick at least two search protocols designed for unstructured networks (random walk (self-avoiding, degree-biased, etc.), expanding ring, etc.), implement them, and compare them on at least two different topologies, including a power-law topology and a k-out random graph topology. Do the comparison as a function of the popularity of the searched item. Test at least the mean time to find the searched item, the total number of messages in the whole network, and load balancing (for example, the maximal and minimal number of messages processed by a single node).

    Hint: you need to implement only the search protocols, because the topologies are readily available in PeerSim; a sketch of a self-avoiding walk is given after this list. When collecting data about the experiment, you might need to count the number of messages processed by each node, and use standard PeerSim components from the vector package to collect statistics.

  3. Implement a simple random walk search algorithm, and implement at least two replication strategies: path replication and owner replication, as learned in class. Start with a uniform distribution of search items and a power-law distribution of queries for these items. Test the theoretical result we have seen in class, which predicts the distribution of the number of replicas in the system. Do you see square-root and proportional replication emerging? Is the average performance of square-root replication indeed better? How about for the most popular items only? Test your system on both k-out random graphs and power-law graphs. Do you see the same result on both graphs? Why?

    Hint: you need to implement only the search and replication protocols, because the topologies are readily available in PeerSim. Also, you might need to implement a custom Control component that iterates through the network to collect statistics; a sketch of such an observer is given after this list. Use IncrementalStats or IncrementalFreq cleverly.

  4. Implement the join protocol of a distributed hash table, such as Chord or Pastry. Test the resulting DHT, as a function of the protocol parameters, from the point of view of the average hop count while routing random IDs to their destination, and also test whether routing is successful.

    Hint: if you use the event-based engine, your work is much more valuable, and you can play with the speed at which nodes are added, etc. If you use the cycle-based model, then the joins can be modeled as completely non-overlapping, and you can implement them entirely within a NodeInitializer (that is, all traffic generated by the join is finished when the initializer returns). This way it is impossible to test what happens if nodes are added too fast, but you can still test the resulting network. A sketch of the greedy routing step itself is given after this list.

  5. Compare different coding and chunk selection strategies for BitTorrent-style distribution of large files by implementing at least two strategies. Pick one of the experiments in this paper by Gkantsidis and Rodriguez and reproduce it. Most importantly, test whether the results hold for a larger network size (at least 10,000 nodes), larger neighborhoods (around 30-40, as in BitTorrent), realistic traces (realistic node capacity distributions and join/leave times), and a more faithful implementation of BitTorrent techniques such as the endgame mode. Do you agree with the conclusions of the paper? Do you find any contradictions?

    Hint: You do not actually have to implement network coding; you only need to track the coefficients (see the coefficient-tracking sketch after this list). Use a compact representation of the chunk set and the coefficients to save memory. If you use the cycle-based engine, take care to model chunk download time by not allowing a chunk to make more than one hop per cycle.

  6. Implement a client for one of the popular P2P networks, for example, a BitTorrent client. Demonstrate that you can use your client to participate in the respective P2P network.

    Hint: Do not get lost in details such as optimizations and the user interface; focus on the crucial functionality of the client. First make sure you can join the network and participate, and then, if you have time left, work on optimizations and other details. (A sketch of bencode parsing, the very first step, is given after this list.)
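

Code sketches for the projects

The sketches below are starting points, not given components of the course: they assume the PeerSim 1.0 API where stated, and every other name (classes, configuration keys, helpers) is an illustrative assumption.

For project 1, the join protocol could be written as a NodeInitializer roughly as follows. The configuration keys and the wiring policy (connecting to random live nodes, standing in for the hosts a real client learns from a host cache) are assumptions.

    import peersim.config.Configuration;
    import peersim.core.CommonState;
    import peersim.core.Linkable;
    import peersim.core.Network;
    import peersim.core.Node;
    import peersim.dynamics.NodeInitializer;

    // Sketch: wires a newly created node to a few random live nodes.
    public class GnutellaJoin implements NodeInitializer {

        private final int pid;    // protocol id of the Linkable overlay (config key assumed)
        private final int degree; // number of initial links (config key assumed)

        public GnutellaJoin(String prefix) {
            pid = Configuration.getPid(prefix + ".protocol");
            degree = Configuration.getInt(prefix + ".degree");
        }

        public void initialize(Node n) {
            Linkable overlay = (Linkable) n.getProtocol(pid);
            while (overlay.degree() < degree) {
                Node peer = Network.get(CommonState.r.nextInt(Network.size()));
                if (peer != n && peer.isUp() && !overlay.contains(peer)) {
                    overlay.addNeighbor(peer);
                    // Gnutella links are symmetric, so add the reverse edge too.
                    ((Linkable) peer.getProtocol(pid)).addNeighbor(n);
                }
            }
        }
    }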
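
For project 2, it may help to fix the walk logic in plain Java before porting it into a PeerSim protocol class. Here is a sketch of a self-avoiding random walk over an in-memory adjacency list; all names are illustrative.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;

    // Sketch: self-avoiding random walk; prefers unvisited neighbors and falls
    // back to a visited one when stuck. Returns the hop count, or -1 on failure.
    public class RandomWalkSearch {

        public static int search(Map<Integer, List<Integer>> graph, int start,
                                 Set<Integer> holders, int maxHops, Random rnd) {
            Set<Integer> visited = new HashSet<Integer>();
            int current = start;
            for (int hop = 0; hop <= maxHops; hop++) {
                if (holders.contains(current)) return hop; // found a replica
                visited.add(current);
                List<Integer> neighbors = graph.get(current);
                if (neighbors == null || neighbors.isEmpty()) return -1; // dead end
                List<Integer> fresh = new ArrayList<Integer>();
                for (int nb : neighbors)
                    if (!visited.contains(nb)) fresh.add(nb);
                List<Integer> pool = fresh.isEmpty() ? neighbors : fresh;
                current = pool.get(rnd.nextInt(pool.size()));
            }
            return -1; // walk length exhausted
        }
    }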
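
For project 3, the statistics-collecting Control mentioned in the hint could look like this. It assumes the PeerSim 1.0 API; ReplicationProtocol and its replicaCount() method stand for a hypothetical protocol class of your own.

    import peersim.config.Configuration;
    import peersim.core.Control;
    import peersim.core.Network;
    import peersim.util.IncrementalFreq;

    // Hypothetical protocol interface of your own; not part of PeerSim.
    interface ReplicationProtocol { int replicaCount(); }

    // Sketch: walks the whole network and histograms per-node replica counts.
    public class ReplicaObserver implements Control {

        private final int pid; // protocol id (config key assumed)

        public ReplicaObserver(String prefix) {
            pid = Configuration.getPid(prefix + ".protocol");
        }

        public boolean execute() {
            IncrementalFreq freq = new IncrementalFreq();
            for (int i = 0; i < Network.size(); i++) {
                ReplicationProtocol p =
                    (ReplicationProtocol) Network.get(i).getProtocol(pid);
                freq.add(p.replicaCount());
            }
            System.out.println(freq); // one histogram per observation round
            return false;             // false: keep the simulation running
        }
    }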
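
For project 4, it pays to isolate the routing decision from the messaging. A Chord-style greedy next-hop choice on a 32-bit identifier ring could look like the following plain-Java sketch; the ring size and all names are assumptions.

    // Sketch: the greedy routing step of a Chord-style DHT on a 32-bit ring.
    public class ChordRouting {

        static final long RING = 1L << 32; // identifier space size (assumption)

        // Clockwise distance from a to b on the ring.
        static long dist(long a, long b) {
            return Math.floorMod(b - a, RING);
        }

        // Among our known links, pick the one closest to the key without
        // passing it; returning self means we are the closest node we know.
        static long nextHop(long self, long[] links, long key) {
            long best = self;
            for (long link : links)
                if (dist(link, key) < dist(best, key)) best = link;
            return best;
        }
    }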
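
For project 5, tracking coefficients without payloads amounts to rank bookkeeping. The sketch below does this over GF(2) with bit sets, which keeps the idea visible in a few lines; note that the paper's random linear coding works over a larger field, so treat this only as a simplified starting point.

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    // Sketch: per-node bookkeeping for network coding over GF(2). Only the
    // coefficient vectors are stored; payloads are never simulated.
    public class CodedBuffer {

        private final List<BitSet> basis = new ArrayList<BitSet>();
        private final int numChunks; // rank() == numChunks means decodable

        public CodedBuffer(int numChunks) {
            this.numChunks = numChunks;
        }

        // Gaussian elimination step: returns true iff the incoming coded
        // block is innovative (linearly independent of what we hold).
        public boolean receive(BitSet coefficients) {
            BitSet v = (BitSet) coefficients.clone();
            for (BitSet row : basis) {
                int pivot = row.nextSetBit(0); // leading coefficient of row
                if (v.get(pivot)) v.xor(row);  // cancel it out of v
            }
            if (v.isEmpty()) return false;     // dependent: not innovative
            basis.add(v);
            return true;
        }

        public int rank() { return basis.size(); }

        public boolean canDecode() { return rank() == numChunks; }
    }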
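
For project 6, the first concrete step is decoding bencoded data (.torrent files and tracker responses). Below is a minimal decoder sketch; for brevity it turns byte strings into Java Strings via ISO-8859-1 (which maps bytes one-to-one), but a real client must keep fields like "pieces" as raw bytes.

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: minimal bencode decoder (integers, strings, lists, dictionaries).
    public class Bencode {

        private final byte[] data;
        private int pos = 0;

        public Bencode(byte[] data) { this.data = data; }

        public Object decode() {
            char c = (char) data[pos];
            if (c == 'i') {                       // integer: i<digits>e
                pos++;
                int end = find('e');
                long value = Long.parseLong(ascii(pos, end));
                pos = end + 1;
                return value;
            } else if (c == 'l') {                // list: l<items>e
                pos++;
                List<Object> list = new ArrayList<Object>();
                while ((char) data[pos] != 'e') list.add(decode());
                pos++;
                return list;
            } else if (c == 'd') {                // dictionary: d<key value ...>e
                pos++;
                Map<String, Object> map = new LinkedHashMap<String, Object>();
                while ((char) data[pos] != 'e') map.put((String) decode(), decode());
                pos++;
                return map;
            } else {                              // string: <length>:<bytes>
                int colon = find(':');
                int len = Integer.parseInt(ascii(pos, colon));
                String s = new String(data, colon + 1, len, StandardCharsets.ISO_8859_1);
                pos = colon + 1 + len;
                return s;
            }
        }

        private int find(char c) {
            int i = pos;
            while ((char) data[i] != c) i++;
            return i;
        }

        private String ascii(int from, int to) {
            return new String(data, from, to - from, StandardCharsets.US_ASCII);
        }
    }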

