Discovery

Discovery answers the big questions about a network:

What peers exist on the network?
How are the peers organized around their capabilities?
What uniquely identifies a peer?
How does a peer exchange data with another peer?

Every P2P technology must answer these questions. Unfortunately for the Java developer, not all P2P technologies do so successfully. Worse yet, some P2P technologies are closed and proprietary, or they hard-code a single implementation where open technology could otherwise be used.

Although many P2P techniques exist to build peers, three types of peers have emerged as popular designs:

Simple peers
Rendezvous peers
Router peers

A simple peer is designed to be an endpoint that offers functions and data to peers making requests. Simple peers have the least responsibility of all three peer types. They usually reside outside a general network, and possibly behind a firewall or Network Address Translation (NAT) router. Simple peers are not expected to handle communication on behalf of other peers, or to serve information that they don't directly consume themselves.

Rendezvous peers provide a dating service in which peers discover other peers and peer resources like data and functions. All three types of peers issue discovery queries to rendezvous peers, and a rendezvous peer usually also caches the results of previous requests. When a rendezvous peer lives behind a firewall, it must have the ability to communicate through the firewall to other peers.

Router peers provide a mechanism for peers to communicate through firewalls and NAT routers. A router peer tunnels peer requests across a network. The information needed to use a router peer is enough to remove the need for the Domain Name System (DNS), and it supports dynamic IP addressing.

Let's look at a simple example of the three peers in action. Imagine using a P2P client that looks for magazine articles on human genomics. The user initiates a search for the articles with a simple peer. The peer sends a discovery query to all its known simple peers and rendezvous peers. Each rendezvous peer that receives the query checks whether it has data the simple peer is looking for. If so, the rendezvous peer might return a discovery response message containing advertisements from other peers that are stored in its cache. The rendezvous peer will also likely pass the same query along to its list of known peers.
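
The exchange above can be sketched in Java. This is a minimal, hypothetical model rather than the API of any P2P toolkit: the names `RendezvousSketch`, `publish`, `addKnownPeer`, and `query` are invented for illustration, an advertisement is reduced to a plain string, and matching is a simple keyword test.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a rendezvous peer: a cache of advertisements
// plus a list of known peers to which queries are passed along.
class RendezvousSketch {
    private final String name;
    private final List<String> cache = new ArrayList<>();
    private final List<RendezvousSketch> knownPeers = new ArrayList<>();

    RendezvousSketch(String name) {
        this.name = name;
    }

    // Stores an advertisement received from some other peer.
    void publish(String advertisement) {
        cache.add(advertisement);
    }

    void addKnownPeer(RendezvousSketch peer) {
        knownPeers.add(peer);
    }

    // Returns cached advertisements containing the keyword, then passes
    // the query on to known peers. The visited set stops a query from
    // circulating forever when peers know each other.
    Set<String> query(String keyword, Set<String> visited) {
        Set<String> matches = new LinkedHashSet<>();
        if (!visited.add(name)) {
            return matches; // this peer has already handled the query
        }
        for (String advertisement : cache) {
            if (advertisement.contains(keyword)) {
                matches.add(advertisement);
            }
        }
        for (RendezvousSketch peer : knownPeers) {
            matches.addAll(peer.query(keyword, visited));
        }
        return matches;
    }
}
```

A simple peer would call `query("genomics", new HashSet<>())` on each rendezvous peer it knows and merge the returned advertisements. A real system would send messages over the network instead of making direct method calls.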

Although we have described three different types of peers, in real-world P2P applications each peer might include a combination of the functions described in simple, rendezvous, and router peers. Let's look at how peers discover data, functions, and services using a variety of P2P techniques.

Router Peers and Dynamic Networks

P2P technology expects to find a network filled with firewalls, dynamic addresses, and changing peer locations. P2P provides a loose coupling of peers, so the P2P network remains functional even when parts of the real network break. Three P2P discovery techniques have become popular in this environment:

Broadcast: Sends a discovery request to every network node that is reachable
Selective broadcast: Sends a discovery request to selected network nodes, chosen by established heuristics
Adaptive broadcast: Sends a discovery request to network nodes chosen by heuristics and constrained by rules

These techniques will be joined, modified, and abandoned over time as new ways to dynamically form a network are identified. The following are some of the areas of study from which P2P technology innovations might spring:

Transport: How do transport services such as broadcast, multicast, and unicast messaging relate to discovery?
Radius: How is the discovery horizon established and maintained?
Frequency of broadcast: How often should discovery messages be broadcast to populate the network?
Discovery protocol: What information should be defined in a discovery protocol?
Discovery roles: Do all peers participate equally in the discovery process? Do all peers have the same broadcast role?

Broadcasts

Traditionally, broadcast messages have been sent by devices that deal with network routing or data packet exchange at a low level, such as routers. Broadcast messages on IP networks contain a special address reserved for broadcasting: every bit of the network and host parts of the address is set to one (255.255.255.255, or hex FFFFFFFF). This indicates to the network layer that the packet is addressed to every device on the subnet, as seen in Figure 6.1.

Figure 6.1. Broadcasts try to reach all nodes on the subnet.

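
As a small illustration of the all-ones rule, the sketch below computes a broadcast address by setting every host bit of an IPv4 address to one. It assumes a subnet-directed broadcast with a prefix length supplied by the caller, which goes slightly beyond the all-ones address (FFFFFFFF) described in the text; the class and method names are hypothetical.

```java
// Hypothetical helper: derives a broadcast address from an IPv4 address
// and the number of network bits in its subnet.
class BroadcastAddressSketch {

    static String broadcastAddress(String ipv4, int prefixLength) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4 || prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("expected dotted IPv4 and prefix 0-32");
        }
        // Pack the four octets into the low 32 bits of a long.
        long address = 0;
        for (String part : parts) {
            address = (address << 8) | Integer.parseInt(part);
        }
        // Ones in every host-bit position, zeros in the network part.
        long hostMask = (1L << (32 - prefixLength)) - 1;
        long broadcast = address | hostMask;
        return ((broadcast >> 24) & 0xFF) + "." + ((broadcast >> 16) & 0xFF) + "."
                + ((broadcast >> 8) & 0xFF) + "." + (broadcast & 0xFF);
    }
}
```

With a prefix length of 0 every bit is a host bit, so the result is 255.255.255.255, the all-ones address mentioned above.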

In a P2P context, broadcasting might sound like IP multicasting, but it isn't. P2P technology operates mostly at the application layer. The actual method for moving a broadcast message across the Internet might use multicasting or a number of other techniques that we will explore next.

Transport: Multicast Versus Unicast Messaging

Multicast messaging is often compared to radio or TV broadcasts, in the sense that only those who have tuned their receivers to a particular frequency receive the information. Only the channels selected are heard. The sender sends the information without knowledge of the number of receivers.

In contrast, when you send a packet and there is only one sender and one recipient, this is referred to as unicast. A unicast transmission is by definition point-to-point. Unicast can be used to send identical information to many different destinations; however, this involves replicating data, and is not the most efficient transport.

Multicast addresses are in the Class D range, 224.0.0.0 through 239.255.255.255. Multicast messaging uses this range of addresses to define multicast groups, as shown in Table 6.1.

Table 6.1. IPv4 Address Classifications
IP Address Classification	Address Range
Class A	`0.0.0.0` - `127.255.255.255`
Class B	`128.0.0.0` - `191.255.255.255`
Class C	`192.0.0.0` - `223.255.255.255`
Multicast (Class D)	`224.0.0.0` - `239.255.255.255`
Reserved	`240.0.0.0` - `255.255.255.255`
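
The classification in Table 6.1 depends only on the first octet of the address, which a short Java sketch can make concrete. The class and method names are hypothetical, and treating every first octet of 240 and above as reserved is an assumption of this sketch.

```java
// Hypothetical helper that maps an IPv4 address to its row in Table 6.1
// by looking only at the first octet.
class AddressClassSketch {

    static String classify(String ipv4) {
        int dot = ipv4.indexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("not a dotted IPv4 address: " + ipv4);
        }
        int first = Integer.parseInt(ipv4.substring(0, dot));
        if (first < 0 || first > 255) {
            throw new IllegalArgumentException("octet out of range: " + first);
        }
        if (first <= 127) return "Class A";
        if (first <= 191) return "Class B";
        if (first <= 223) return "Class C";
        if (first <= 239) return "Multicast (Class D)";
        return "Reserved"; // assumption: 240 and above
    }
}
```

In production code the Class D check does not need to be written by hand: `java.net.InetAddress.isMulticastAddress()` reports whether an address falls in the multicast range.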

Note

You can find all the reserved multicast addresses at http://www.iana.org/assignments/multicast-addresses.

Multicasting has produced mixed results in applications that require a number of machines in a distributed group to receive the same data, such as conferencing, group mail, news distribution, and network management. Multicasting lacks a control protocol, which makes it unsuitable for large, reliable, and sustained transmissions, and it often fails to deliver 100% of its data to everyone listening. Even so, multicasting appears to be well suited to P2P discovery, because peers on a P2P network do not require the synchronization of data among the peers and can tolerate the occasional lost message. Figure 6.2 shows multicasting being used in P2P networks for discovery.

Figure 6.2. Multicasting goes beyond simple subnet penetration, but it requires that receivers listen on a specific "channel." The underlying network supports the transport services.


Multicast advantages include the following:

Decreased network utilization: Reduces the number of messages required by eliminating redundant packets and decreasing the number of point-to-point connections that must be established.
Resource discovery: Both discovery and multicasting assume that a sender is transmitting to an unknown number of peers without knowledge of their location.
Dynamic participation: Multicasting provides flexibility in joining and leaving a group. This membership flexibility supports the transient behavior of peers.
Multimedia support: Multimedia transmission continues to increase in popularity and consumes a significant amount of bandwidth. This is one area where network optimization is of paramount importance. Multicasting can be used to transmit multimedia data to receiving stations that compress the transmission and then deliver it to destination nodes, rather than using point-to-point connections for all destinations.

Unfortunately, multicasting is not implemented everywhere. Hardware, specifically routers, often blocks multicast traffic from penetrating corporate networks or traversing ISPs. Firewalls and NAT devices often not only block multicast traffic but also constrain traffic in general to well-controlled choke points (ports). As a result, additional means of discovery are generally required in scalable P2P networks.

Radius of Broadcast

Broadcast packets need a mechanism to avoid bouncing around the network forever. This can happen when invalid addressing or routing information is delivered with a packet. The time-to-live (TTL) parameter, an 8-bit field in the IP packet header, has been defined to address this issue. It ensures that packets cannot traverse the network endlessly. Each packet has a TTL value, which is a counter that is decremented every time the packet passes through a hop, such as a router between networks.

In the example in Figure 6.3, the TTL parameter is set to 4, and the broadcast request needs to make five hops (pass through five routers) to make it to the nearest peer. Peer-2 will never "hear" the broadcast request, and Peer-1 will never "know" about Peer-2 through this route. The packet will be discarded when the TTL count reaches zero.

Figure 6.3. Time-to-live parameters define the extent to which a packet can travel across the network. Routers typically decrement the TTL value of the packet as it passes through the router. When it reaches zero, the packet is discarded.


When a peer receives a request, it looks at the TTL value. If the value is greater than 1, it decrements the value and transfers the request to the destination address or the next hop. If the value is 1 or less, it discards the message. In this respect, the P2P network is providing a layer of control that "overlays" the network layer.
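
The forwarding rule just described is small enough to sketch directly. The Java below is an illustration only: `TtlSketch`, `relay`, and `reaches` are hypothetical names, and a request is reduced to its TTL counter.

```java
// Hypothetical sketch of TTL handling at a relaying peer.
class TtlSketch {

    // Applies the rule at one relaying peer: a request whose TTL is
    // greater than 1 is forwarded with the TTL reduced by one; otherwise
    // it is discarded. Returns the new TTL, or -1 if discarded.
    static int relay(int ttl) {
        return ttl > 1 ? ttl - 1 : -1;
    }

    // Returns true if a request sent with initialTtl is still alive after
    // being relayed across the given number of hops.
    static boolean reaches(int initialTtl, int hops) {
        int ttl = initialTtl;
        for (int i = 0; i < hops; i++) {
            ttl = relay(ttl);
            if (ttl < 0) {
                return false; // discarded before reaching the destination
            }
        }
        return true;
    }
}
```

Under this rule `reaches(4, 5)` is false, which matches the Figure 6.3 scenario: a request sent with a TTL of 4 is discarded before it can cross five hops.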

Frequency of Broadcast

Most systems that use broadcast techniques place some control on the frequency of the broadcast. For instance, when a peer activates, it sends a discovery message on the local subnet and waits a predetermined time before sending another discovery request. If no response is returned within that time interval, a subsequent request will be sent. In effect, the peer has started to poll the network. If responses are returned, the peer builds a map, or view of the peer network. This is important, because the peer view is probably very different from the physical view. The map reflects the peers that responded to the discovery request.

As peers enter and leave the network, they must be able to update their view. One approach is to go into a heartbeat mode of polling. The peer periodically sends a discovery request. As responses are received, the map is updated. During the polling process, some peers might no longer be available. Peers that stop responding should eventually be removed from the map, for example by expiring entries that have not been refreshed for some interval; once an entry is no longer referenced, the Java garbage collector reclaims it. New peers that respond will be added to the map. A simple ping map contains the list of peers that have responded to discovery requests. The ping map can be as simple as a list of active IP addresses, as in Table 6.2.

Table 6.2. Ping Map
IP Addresses
`172.16.1.3`
`172.16.1.4`
`12.239.129.4`
…
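
One way to keep such a ping map current is to timestamp each response and drop peers that stay silent for too long. The Java sketch below assumes that approach; the timeout value, the class name `PingMapSketch`, and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical ping map: IP address -> time of the last discovery response.
class PingMapSketch {
    private final Map<String, Long> lastSeen = new LinkedHashMap<>();
    private final long timeoutMillis; // assumed silence limit

    PingMapSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called when a discovery response arrives from the given address.
    void recordResponse(String ipAddress, long nowMillis) {
        lastSeen.put(ipAddress, nowMillis);
    }

    // Called on each heartbeat: removes peers that have been silent for
    // longer than the timeout.
    void expire(long nowMillis) {
        lastSeen.values().removeIf(seen -> nowMillis - seen > timeoutMillis);
    }

    // The current view of the peer network.
    Set<String> activePeers() {
        return lastSeen.keySet();
    }
}
```

A heartbeat loop would send a discovery request, call `recordResponse` for each reply, and then call `expire` with the current time. Because nothing is persisted, the map is still rebuilt from scratch on every activation.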

The ping map, which might also be viewed as a peer routing table, is built from scratch each time the peer activates. In this model, the peer does not implement the notion of memory. In other words, each time the peer activates, it invokes the discovery process and collects a new image of the peer network. This approach is unable to deal with many of the problems inherent in P2P networks. For instance:

Dynamic IP assignment: The history of peer interaction is limited to the current IP assignment.
Size and scale of network: A design in which every peer maintains its own map of connections cannot scale.
Reputation and trust issues: No history of past peer interactions is possible.
Equitable resource allocation: No controls are placed on resource utilization.
Security in general.

The identity of the peer is directly mapped (implicitly) to the IP address. If a peer changes its IP address, it is considered a new member of the network. A history of prior interactions is not possible.

The ping map can be extended to include the notion of identity, which resolves some of the problems. Persistence or memory of the peer network becomes more viable and attractive with identity. This approach requires each peer to have a unique ID. Once generated, the ID is fixed for the lifetime of the peer. When a discovery request is received, the responding peer returns its IP address (which might be different) and its unique ID (which never changes). This assumes that peers have a consistent method to generate unique IDs (see Table 6.3). ID collision occurs if two peers generate the same ID. Inconsistent ID representation (integer, String, UUID, and so on) causes identification problems throughout the network. Clearly, control mechanisms are required even with this simple approach.

Table 6.3. Ping Map with Peer Identity
IP Address:Port	Unique ID
`172.16.1.3`:	ABCD-3456-2345-DEFA
`172.16.1.4`:	DECF-5432-5643-EFDA
`12.239.129.4`:	DCDD-1324-7654-DEAC
…	…
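
Keying the map by unique ID instead of by address is what lets a peer keep its identity across address changes. The Java sketch below shows the idea under stated assumptions: it uses `java.util.UUID` to generate IDs, which is only one of the representations the text mentions, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical ping map keyed by peer identity rather than IP address.
class IdentityPingMapSketch {
    // unique ID -> most recently reported address
    private final Map<String, String> addressById = new LinkedHashMap<>();

    // Assumption: a random UUID serves as the peer's fixed unique ID.
    static String newPeerId() {
        return UUID.randomUUID().toString();
    }

    // Records a discovery response. Returns true if the ID was already
    // known, meaning a returning peer rather than a new member.
    boolean recordResponse(String peerId, String ipAddress) {
        return addressById.put(peerId, ipAddress) != null;
    }

    String addressOf(String peerId) {
        return addressById.get(peerId);
    }

    int size() {
        return addressById.size();
    }
}
```

When a known peer responds from a new address, the entry is updated in place, so any history attached to the ID would survive the change.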

Selective Broadcast

Instead of sending a discovery request to every peer on the network, peers are selected based on heuristics such as quality of service, content availability, or trust relationships.

Trust relationships are commonly used when specific peers act as relays or routers to the peer network. Usually the trusting peer is seeded with the IP address of the trusted peer. This is the technique used by JXTA router and rendezvous peers. The trusted peer has some knowledge of the network and is publicly available.

Selective broadcast requires that you maintain historical information on peer interactions, peer roles, peer identity, and so on. It begins to extend the ping and identity map concept to include the following:

Peer discovery roles: Peers have special roles to enable discovery. Not all peers are equal.
Past performance metadata: A historical record of peer performance is maintained. This includes availability metrics, as well as environmental metadata.
Environmental metadata: Includes additional information on the peers' capabilities, such as bandwidth, disk space, and processing power (see Table 6.4).

Selective broadcast systems are much more scalable than simple broadcast networks. Instead of being sent to all peers, a request is forwarded only to specific peers that have a higher probability of being able to locate other peers or resources.

Each peer must contain or have access to information used to route or direct requests received. Although this might be appropriate for relatively small networks, in larger networks this overhead can quickly grow to levels that are unsupportable.

Table 6.4. Ping Map with Peer Identity and Metadata
IP Address:Port	Unique ID	Metadata
`172.16.1.3:`	ABCD-3456-2345-DEFA	Dial-up, # of concurrent connections
`172.16.1.4:`	DECF-5432-5643-EFDA	DSL, # of concurrent connections
`12.239.129.4:`	DCDD-1324-7654-DEAC	T1, # of concurrent connections
…	…	…
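
A selective broadcast can be sketched as ranking the entries of such a map and contacting only the best few. The heuristic below (availability multiplied by bandwidth) is an assumption chosen for illustration, as are the field names, the `fanout` parameter, and the class name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical selective broadcast: pick the most promising peers
// instead of sending a discovery request to all of them.
class SelectiveBroadcastSketch {

    static class PeerEntry {
        final String id;
        final int bandwidthKbps;   // environmental metadata (assumed field)
        final double availability; // past performance, 0.0 to 1.0 (assumed field)

        PeerEntry(String id, int bandwidthKbps, double availability) {
            this.id = id;
            this.bandwidthKbps = bandwidthKbps;
            this.availability = availability;
        }
    }

    // Returns the IDs of at most fanout peers, best score first.
    static List<String> selectTargets(List<PeerEntry> peers, int fanout) {
        List<PeerEntry> ranked = new ArrayList<>(peers);
        ranked.sort(Comparator.comparingDouble(
                (PeerEntry p) -> p.availability * p.bandwidthKbps).reversed());
        List<String> ids = new ArrayList<>();
        for (PeerEntry p : ranked.subList(0, Math.min(fanout, ranked.size()))) {
            ids.add(p.id);
        }
        return ids;
    }
}
```

Any other heuristic, such as trust or content availability, could replace the scoring function without changing the structure. The cost noted above remains: each peer must store and maintain the metadata that the ranking depends on.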

Adaptive Broadcast

As mentioned in Chapter 1, "What Is P2P?," adaptive broadcast tries to minimize network utilization while maximizing connectivity to the network. You can limit the growth of discovery and searching by predefining a resource tolerance level that, if exceeded, will begin to curtail the process. This will ensure that excessive resources are not being consumed because of a malfunctioning element, a misguided peer, or a malicious attack. Adaptive broadcast requires monitoring resources such as peer identity, queue size, port usage, and message frequency.

Rules can be used to complement metadata to build sophisticated discovery techniques (see Table 6.5).

Table 6.5. Ping Map with Peer Identity, Metadata, and Rules
IP Address:Port	Unique ID	Metadata	Rules
`172.16.1.3:`	ABCD-3456-2345-DEFA	Dial-up, # of concurrent connections	Congestion -> Throttle Connections -> Accept
`172.16.1.4:`	DECF-5432-5643-EFDA	DSL, # of concurrent connections	Congestion -> Throttle Connections -> Accept
`12.239.129.4:`	DCDD-1324-7654-DEAC	T1, # of concurrent connections	Congestion -> Throttle Connections -> Accept
…	…	…	…
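
The idea of a resource tolerance level can be illustrated with a simple throttle on discovery traffic. This Java sketch is not how ALPINE or any other system implements adaptive broadcast; it assumes a fixed limit on messages per time window, and all names and parameters are hypothetical.

```java
// Hypothetical throttle: curtails discovery once a predefined tolerance
// level (messages per time window) has been exceeded.
class AdaptiveBroadcastSketch {
    private final int maxMessagesPerWindow; // assumed resource tolerance level
    private final long windowMillis;
    private long windowStart;
    private int sentInWindow;

    AdaptiveBroadcastSketch(int maxMessagesPerWindow, long windowMillis) {
        this.maxMessagesPerWindow = maxMessagesPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if a discovery message may be sent now, or false if
    // the peer should hold back because the tolerance level is reached.
    boolean tryBroadcast(long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis; // a new window begins; reset the count
            sentInWindow = 0;
        }
        if (sentInWindow >= maxMessagesPerWindow) {
            return false; // tolerance exceeded: throttle
        }
        sentInWindow++;
        return true;
    }
}
```

A fuller version would track the other resources named above, such as queue size and port usage, and apply a separate rule for each.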

The ALPINE Network implements a form of adaptive broadcast in its adaptive social discovery protocol. It's based on the ALPINE-defined datagram protocol DTCP. See www.cubicmetercrystal.com/alpine/overview.html for more information on ALPINE networks and protocols.
