Update Sequence Numbers

A colleague and I discussed the value in using a 64 bit Update Sequence Number (USN) versus a 128 bit USN. Here are my thoughts assuming 64-bits.

To provide a bit of context, we were considering a solution space that incremented the USN value for each write operation for a specific set of protected data. The USN is 64 bits, and is advanced for each update on a given server. At 100 writes per second, the USN will roll over in 58,494,241,735 years (approximately).

On a very busy server performing 10,000 writes per second it will roll over much sooner, in 584,942,417 years. Either way, the sun will burn out or go nova first, making USN’s moot (for this astronomical neighborhood anyway).

Just using 128 bits for USNs would give us a few more millennia breathing room. Now for the good news. In point of fact, since USN’s from different servers are never compared, all that is required is detecting rollover of the USN on a given server, for which algorithms are available. See RFC 1982 for a discussion.

A Word on Elegance

What do I mean by elegance? Most software engineers equate elegance to efficiency. Still some equate elegance to robustness. Others choose correctness. All are variables in the equation of elegance. Many other variables exist and all are important.

Make no mistake, software is art. Some artists prefer oil on canvas as their medium. Others choose water color. As software design engineers (SDEs), we have chosen 1s and 0s to communicate our creative visions. Elegance must exist at each step in the engineering process and not only as an afterthought in the development phase. Elegance is the responsibility of each and every participant with accountability functioning as the enforcement point. Without accountability, chaos reigns. Accountability starts within.

Many software engineers find themselves searching for or memorizing patterns that “work” to shortly find themselves bulimically spewing the results onto their canvas. What happened to the elegance of originality? It’s far too easy to become a GP (“Google programmer”). Why? Because creativity and originality aren’t as convenient as letting others do the work, which leads to chaos and hard-to-maintain code; especially, when you’re not the original developer.

Stay with me; we’re only scratching the surface. Do you remember the last time you really evaluated the usability of your consumer API surface? Did your evaluation begin in the design phase? Start with a high-level view of the scenarios. Then oscillate between architectural design and the scenario-based consumer API surface. Do not think about technical design until you’re satisfied with the consumer API surface and the architectural design. Keep in mind; it’s certainly acceptable to refactor your architectural design and consumer API based on the technical design.

An apparent and even larger obstacle I’ve noticed lately is the inability for an SDE to make the jump from procedure (function)-oriented design concepts to that of object-oriented design while keeping an eye to efficiency in terms computational complexity and network round-trips. The devil is in the details. In my opinion, one can’t be an effective architect without being an even more effective SDE; period. The implementation details do change the architecture.

In general, people tend to migrate toward what makes them feel good. Rework doesn’t feel good; no one enjoys it. But, if that was true—if you truly don’t enjoy rework—then why would you not elegantly architect, design, and develop your code initially? Why do you make more work for yourself? Elegance as a foundational element is difficult. More often than not, SDEs take the easy way out because they equate efficiency to speed of completion. I’m not advocating “analysis paralysis” but instead a balanced design. Deadlines must be met. Deadlines should not be met at the expense of elegance.

It’s your canvas.

Fabric of a Distributed System

” It has been said that art is a collaboration between God and the artist, which works best when the artist contributes as little as possible; so too with designing distributed systems ” –Unknown 

Try to image the strategic direction of distributed computing frameworks. Just for a moment transition into a state where you think of nothing and yet listen to everything. What key technologies (disruptive or otherwise) have we experienced over recent years? 

Notion of Autonomous Computing

Arguably, one of the most well known papers on the subject of autonomic computing to be recently released is titled “Autonomic Computing: IBM’s Perspective on the State of Information Technology” [URI]; authored by Paul Horn, Senior Vice President of IBM Research. As outlined by Dr. Horn’s manifest, there are many challenges we face over the next decade within Information Technology. 

Within the autonomic landscape, each system node must satisfy the following criteria:

  • self-identification, self-knowing 
  • self-(re)configuration 
  • self-recovery (from perturbations) 
  • self-protection (security) 
  • self-learning (including from errors) 
  • self-regulating (to open standards) 
  • self-resource-allocation

Notion of Grid Computing

Wouldn’t it be nice to economically provide a highly available and fault tolerant system that can support an annual uptime guarantee of 99.999 percent—which equates to 5.256 minutes of annual unscheduled downtime—and that scales up by a factor of 106? That is, an application’s storage and processing capacity can automatically grow by a factor of a million, doing jobs faster (106x speed up) or doing 106 larger jobs in the same time (106x scale up), just by adding more resources. Stay focused.

The concept of Grid (or Utility) computing was coined in the mid 1990s and is best defined—to quote The Anatomy of the Grid—the real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. 

The Grid Global Forum (GGF) has made great headway; so has OGSA (Open Grid Services Architecture). Message Passing Interface (MPI) is widely accepted. Microsoft entered the HPC game with the Microsoft Compute Cluster Server, which I installed many months ago and is now consuming most of my free time. 

Notion of Adaptive Autonomous Agents

An agent is a system that tries to fulfill a set of goals of goals in a complex, dynamic environment. An agent it situated in the environment; it can sense the environment through its sensors and act upon the environment using its actuators. An agent’s goal can take many different forms: they can be “end goals”, or particular states the agent tries to achieve; they can be selective reinforcement or reward that the agent attempts to maximize; they can be internal needs or motivations that the agent has to keep within certain viability zones and so on. An agent is called autonomous if it operates completely autonomously, i.e. if it decides itself how to relate its sensor data to motor commands in such a way that its goals are attended to successfully. An agent it said to be adaptive, if it is able to improve over time, i.e. if the agent becomes better at achieving its goals with experience. 

The study of Adaptive Autonomous Agents is grounded in two important insights, which serve as “guiding principles” for most of the current research performed:

  • Looking at complete systems changes the problems often in a favorable way. 
  • Interaction dynamics can lead to emergent complexity.

Essentially, an agent is viewed as a set of competence modules (often called behaviors). These modules are responsible for a particular small task-oriented competence. Each of the modules is directly connected to its relevant sensors and actuators. Modules interface to one another via extremely simple messages rather than a common representation of beliefs, and so on. The communication between modules is almost never of a “broadcast” nature, but happens rather on a point-to-point (or one-to-one) bases. Typically, the messages consist of activation energy, or simple suppression and inhibition signals, or simple tokens in a restricted language. In addition to communication via simple messages, modules also communicate “via the environment”. One module may change some aspect of the environment, which will trigger another module, etc.

Notion of the Semantic Web

Simply put, the Semantic Web is the representation of data on the web in which information is given well-defined meaning—“to be a universal medium for the exchange of data”. 

The principal technologies of the Semantic Web fit into a set of layered specifications called the Resource Description Framework (RDF). The current components of that framework are the RDF Core Model, the RDF Vocabulary Description Language and the Web Ontology Language, which all build on the foundation of URIs, XML, and XML namespaces.

The most interesting of these languages is the Web Ontology Language (OWL), which is a descriptive layer built on top of RDF used to model classes, properties, and objects. 

Ontology is also a term borrowed from philosophy that refers to the science of describing the kinds of entities in the world and how they are related. Stated another way, an ontology defines the terms used to describe and represent an area of knowledge.

Notion of Service Oriented Architectures (SOA)

I think everyone has been beaten of the head with the “SOA stick”, which is why you won’t feel a thing; keep reading.

In computing, the term Service-Oriented Architecture (SOA) expresses a software architectural concept that defines the use of services to support the requirements of software users. In a SOA environment, nodes on a network make resources available to other participants in the network as independent services that the participants access in a standardized way. Most definitions of SOA identify the use of Web services (i.e. using SOAP or REST) in its implementation. Here and here are two links worth reading. However, one can implement SOA using any service-based technology.  

The WS* specifications have gained wide adoption but other advances are needed.

One last point as related to service orientation; I caution everyone to not forget about the elegance in design of class libraries as they are the underpinnings of every SOA. Design for the in-process consumer first with an eye to areas of the API surface that might benefit from hosting within an SOA.

Other advances…

There are many others such as advances in peer-to-peer algorithms, discrete-event simulation and the event horizon, social networking, recovery-oriented systems, storage and machine virtualization, bioinformatics, biotechnology, and quantum computing to name but a few. In my mind, the two most important questions to answer are which of these variables are of significant weighting in the equation of strategic direction and which are “noise”?  I certainly have my opinion…but then again, you know what those are like.

Most readers and programmers have little patience to read discussions of this length. However, the length (at least to me) seems disproportionate to the importance of the topic. There is so much more to say and even more to ponder. But, before you run off to save the world let me leave you with one additional thought:

“The map is not the territory.”
Alfred Korzybsky
Science and Sanity, 1933


Google’s BigTable Design

I periodically prune and clean several key folders on my hard drive and in doing so this morning ran into notes (I’m not sure from whom) taken from a talk given by Jeff Dean at the University of Washington on Google’s BigTable design. The following notes and the PDF published by Google on their BigTable design are very interesting.

Today Jeff Dean gave a talk at the University of Washington about BigTable – their system for storing large amounts of data in a semi-structured manner. I was unable to find much info about BitTable on the internet, so I decided to take notes and write about it myself.

First an overview. BigTable has been in development since early 2004 and has been in active use for about eight months (about February 2005). There are currently around 100 cells for services such as Print, Search History, Maps, and Orkut. Following Google’s philosophy, BigTable was an in-house development designed to run on commodity hardware. BigTable allows Google to have a very small incremental cost for new services and expanded computing power (they don’t have to buy a license for every machine, for example). BigTable is built atop their other services, specifically GFS, Scheduler, Lock Service, and MapReduce.

Each table is a multi-dimensional sparse map. The table consists of rows and columns, and each cell has a time version. There can be multiple copies of each cell with different times, so they can keep track of changes over time. In his examples, the rows were URLs and the columns had names such as “contents:” (which would store the file data) or “language:” (which would contain a string such as “EN”).

In order to make each manage the huge tables, the tables are split at row boundaries and saved as tablets. Tablets are each around 100-200 MB and each machine stores about 100 of them (they are stored in GFS). This setup allows fine grain load balancing (if one tablet is receiving lots of queries, it can shed other tablets or move the busy tablet to another machine) and fast rebuilding (when a machine goes down, other machines take one tablet from the downed machine, so 100 machines get new tablet, but the load on each machine to pick up the new tablet is fairly small).

Tablets are stored on systems as immutable SSTables and a tail of logs (one log per machine). When system memory is filled, it compacts some tablets. He went kind of fast through this, so I didn’t have time to write everything down, but here is the overview: There are minor and major compactions. Minor compactions involve only a few tablets, while major ones involve the whole system. Major compactions can reclaim hard disk space. The location of the tablets are actually
stored in special BigTable cells. The lookup is a three-level system. The clients get a pointer to the META0 tablet (there is only one). This tablet is heavily used, and so one machine usually ends up shedding all its other tablets to support the load. The META0 tablet keeps track of all the META1 tablets. These tables contain the location of the actual tablet being looked up. There is no big bottleneck in the system, because they make heavy use of pre-fetching and caching.

Back to columns. Columns are in the form of “family:optional_qualifier”. In his example, the row “www.cnn.com” might have the columns “contents:” with the HTML of the page, “anchor:cnn.com/news” with the anchor text of that link (“CNN Homepage”), and  “anchor:stanford.edu/” with that anchor text (“CNN”). Columns have type information. Columns families can have attributes/rules that apply to their cells, such as “keep n time entries” or “keep entries less than n days old”. When tablets are rebuilt, these rules are applied to get rid of any expired entries. Because of the design of the system, columns are easy to create (and are created implicitly), while column families are heavy to create (since you specify things like type and attributes). In order to optimize access, column families can be split into locality groups.

Locality groups cause the columns to be split into different SSTables (or tablets?). This increases performance because small, frequently accessed columns can be stored in a different spot than the large, infrequent columns.

All the tablets on one machine share a log; otherwise, one million tablets in a cluster would result in way too many files opened for writing (there seems to be a discrepancy here, he said 100 tablets per machine and 1000 machines, but that doesn’t equal one million tablets). New log chunks are created every so often (like 64 MB, which would correspond with the size of GFS chunks). When a machine goes down, the master redistributes its log chunks to other machines to process (and these machines store the processed results locally). The machines that pick up the tablets then query the master for the location of the processed results (to update their recently acquired tablet) and then go directly to the machine for their data.

There is a lot of redundant data in their system (especially through time), so they make heavy use of compression. He went kind of fast and I only followed part of it, so I’m just going to give an overview. Their compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s). Zippy is similar to LZW. It doesn’t compresses as highly as LZW or gzip, but it is much faster. He gave an example of a web crawl they compressed with the system. The crawl contained 2.1B pages and the rows were named in the following form: “com.cnn.www/index.html:http”. The size of the uncompressed web pages was 45.1 TB and the compressed size was 4.2 TB, yielding a compressed size of only 9.2%. The links data compressed to 13.9% and the anchors data compressed to 12.7% the original size.

They have their eye on the future with some features under consideration.

1. Expressive data manipulation, including having scripts sent to clients to modify data.
2. Multi-row transaction support.
3. General performance for larger cells.
4. BigTable as a service. It sounds like each service (such as Maps or Search History) have their own cluster running BigTable. They are considering running a Google-wide BigTable system, but that would require fairly splitting resources and compute time, etc.

The talk was very interesting and it contained lots new information I had never heard before (I haven’t seen a BigTable paper yet). This information is accurate to the best of my knowledge, but if you see a mistake, please e-mail me. I wonder what the next Google talk will be about… I can’t wait.

The use of a Bloom filter in their design is quite an interesting application in place of a more traditional Distributed Hash Table (DHT) approach used in P2P implementations.

Event Stream Processor (ESP) using WCF

As a prerequisite to XRing and ZRing P2P algorithm implementations, I needed to write an Event Stream Processor (ESP). Though the version described in this entry is not nearly as mature as what is used within my implementation of these P2P overlay algorithms, I nevertheless decided to share an implementation expressing several overall concepts.

The implementation leverages Windows Communication Foundation (WCF) client callback channels and provides support for:

  1. Topic-based publishers and subscribers.
  2. The System.Transactions namespace; specifically supporting the IPromotableSinglePhaseNotification and IEnlistmentNotification interfaces.
  3. Out-of-process and in-process subscribers and publishers.

The two primary interfaces relevant to this discussion are defined below.

///
/// This interface provides a set of event publisher and event subscriber related operations.
///
[ServiceContract(CallbackContract =typeof(IEventSubscriber), SessionMode = SessionMode.Required,
 Namespace = "http://schemas.idynamics.us.com/2007/Events")]
public interface IEventProcessor
{
    ///
    /// Subscribes to the specified .
    ///
    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void Subscribe(Topic topic);

    ///
    /// Subscribes to the specified with the specified configuration.
    ///
    [OperationContract(Name="SubscribeWithConfiguration")]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void Subscribe(Topic topic, EventSubscriberConfiguration configuration);

    ///
    /// Unsubscribes from the specified .
    ///
    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void Unsubscribe(Topic topic);

    ///
    /// Publishes an  to the specified topic.
    ///
    [OperationContract(Name="PublishOne")]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void Publish(ref Event e);

    ///
    /// Publishes an  to the specified topic.
    ///
    [OperationContract(Name="PublishAll")]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void Publish(ref IList events);
}

The IEventProcessor interface defines the basic contract for an ESP that associates the IEventSubscriber (described below) interface as a callback contract.

///
/// Represents an event callback contract, which is used in conjunction with the  interface.
///
public interface IEventSubscriber
{
    ///
    /// Called when an  is published.
    ///
    [OperationContract(IsOneWay=true)]
    void OnPublished(Event e);
}

The OnPublished method is called during event publication and notifies the subscriber as specific events are published. Examples of the consumer subscription and publication APIs follow and can be found in the accompanying unit tests.

In-Process Publication:

Topic topic = new Topic("urn:Financial:Stocks:Ticker");
StockPriceChangeEvent e = new StockPriceChangeEvent(topic, "MSFT");
e.CurrentPrice = 95.0;
e.Change = -.35;
e.Effective = DateTime.Now;
using(TransactionScope scope = new TransactionScope())
{
    EventProcessor.Instance.Publish(ref e);
    scope.Complete();
}

In-Process Subscription:

public class StockEventSubscriber : EventSubscriber
{
 public override void OnPublished(Event e)
{
if(!(e is StockPriceChangeEvent))
{             return;
}          

//
// Do something with "e".
//
}
}
StockEventSubscriber callback = new StockEventSubscriber();
Topic topic = new Topic("urn:Financial:Stocks:Ticker"); EventProcessor.Instance.Subscribe(topic, callback);

Out-Of-Process Publication:

Topic topic = new Topic("urn:Financial:Stocks:Ticker");
StockPriceChangeEvent e = new StockPriceChangeEvent(topic, "MSFT");
e.CurrentPrice = 95.0;
e.Change = -.35;
e.Effective = DateTime.Now;

using(EventProcessorClient client = new EventProcessorClient()
{
    using(TransactionScope scope = new TransactionScope())
    {
        client.Publish(ref e);
        scope.Complete();
    }
}

Out-Of-Process Subscription:

public class StockEventSubscriber : EventSubscriber
{
	public override void OnPublished(Event e)
	{
		if(!(e is StockPriceChangeEvent))
		{
			return;
		}          

		//
		// Do something with "e".
		//
	}

	StockEventSubscriber callback = new StockEventSubscriber();
	using(EventProcessorClient client = new EventProcessorClient(new InstanceContext(callback)))
	{
		Topic topic = new Topic("urn:Financial:Stocks:Ticker");
		client.Subscribe(topic);
	}
}

I’ve intentionally left specific requirements surrounding preservation of order, delivery guarantees, and durability of both the subscriber and publisher to the reader and urge you to consider the architecture surrounding these three requirements especially when the Parallel Extensions for .NET are used.

The full source code is attached to this blog entry.

Dynamics.Esp.zip

Singleton<T>

In many ways software engineering strives to strike a balance between testability, scalability (and performance), execution logic and symmetry (in the case of distributed algorithms), and semantics (easily understood and usable) while remaining efficient in the use of system resources. In my opinion, these major variables are all of equal weighting; these define the elegance of a particular design.

For this discussion, I’d like to focus on two of the variables: testability and efficient use of system resources.

Garbage Collection

As a prerequisite, I recommend you read this article on how garbage collection works at a high-level in the Microsoft .NET Framework. If you don’t have time to read the article then do take time to read the next paragraph.

The following is a good idea what sorts of things we should try to avoid to get the best performance out of the garbage collector:

  1. Too many allocations.
  2. Too-large allocation.
  3. Too many pointers.
  4. Too many roots.
  5. Too many object writes.
  6. Too many Almost-Long-Life objects.
  7. If you implement IDisposable then suppress finalization to reduce costs.

Dependency Injection (DI)

A foundational pattern in xUnit testing is Dependency Injection (DI), which is a form of Inversion of Control (IoC). Specifically, DI is a programming technique where the implementation of one class is actually performed partially by another. Inversion of Control is where a program gives up control of its own execution and simply responds to requests made of it. In the same way, a class using dependency injection gives up control over some of its implementation and lets the injected class do the work.

Three types of DI exist; constructor injection, property injection, and method call injection. Constructor injection is a pattern whereby dependencies are “injected” into an object during construction. Whereas, property injection uses set operations [on properties] to inject dependencies, which obviously occurs after construction. Method call injection simply uses method calls to inject dependencies.

Several libraries exist in each category. For example, the Castle Project and ObjectBuilder both provide supports for constructor and property injection. As a side note, the Microsoft Enterprise Library uses ObjectBuilder internally as its DI container framework. Unity is Microsoft’s latest offering from the Pattern’s & Practices team. Unity offers property, constructor, and method call injection.

Static Classes

A static class does not have any instance-level members (including constructors) and is defined by the static modifier. The compiler in turn marks a static class as sealed and automatically creates a private constructor.

The main features of a static class are:

  1. They only contain static members.
  2. They cannot be instantiated.
  3. They are sealed.
  4. They cannot contain Instance Constructors.

Inherently in the context of unit (and integration) testing, static classes are not recommended due to the difficulty in implementing a thread-safe property injector DI pattern and being unable to use constructor DI. However, the advantage of a properly written static class is there’s only one object instance per AppDomain, which can be very efficient on the GC (see the Too Many Object Writes section of the previously referenced article).

This is a very important and annoying impedance that exists between the use of a constructor DI pattern and efficient use (and preservation) of system resources (in this case CPU and memory). We want to be testable and mindful of system resources.

Singletons

For purposes of this discussion, a singleton is a class that only allows a single instance of itself to be created within an AppDomain. The following are recommended singleton patterns.

Sidebar: Understand the Impact of Low-Lock Techniques in Multithreaded Apps

Pattern 1: Niladic Constructor

public class MySingleton
{
    private static readonly MySingleton _instance = new MySingleton();

    private MySingleton()
    {
    }

    public static MySingleton Instance
    {
        get { return _instance; }
    }
}

Pattern 2: Niladic Constructor (Full Lazy Initialization)

public class MySingleton
{
    class Nested
    {
        //
        // Explicit static constructor to tell C# compiler not to mark
        // type as BeforeFieldInit
        //
        static Nested()
        {
        }

        internal static readonly MySingleton Instance = new MySingleton();
    }

    private MySingleton()
    {
    }

    public static MySingleton Instance
    {
        get { return Nested.Instance; }
    }
}

Note, the following alternate method results in identical MSIL…

public class MySingleton
{
    static class Nested
    {
        internal static readonly MySingleton Instance = new MySingleton();
    }

    private MySingleton()
    {
    }

    public static MySingleton Instance
    {
        get { return Nested.Instance; }
    }
}

A rather dated but still applicable discussion regarding the use of the double-check lock pattern written by Vance Morrison can be found here and provides insight into subtle nuances of the Microsoft .NET Framework memory model implementation. I’ve attached a PDF version of the discussion to this entry (at the bottom) in case the archive disappears.

Singleton<T>

So back to the question at hand, which is how do I ensure my production code is testable and uses system resources efficiently as possible? I offer the Singleton<T> class, which is defined as…

public sealed class Singleton where T : new()
{
    private static readonly T _instance = new T();

    private Singleton()
    {
    }

    public static T Instance
    {
        get { return _instance; }
    }
}

With the Singleton<T> class, we’re now able to use the following syntax in production code…

public class CustomerManager : BusinessLogicComponent
{
    private ICustomerRepository _repository;

    //
    // This constructor is required by the Singleton generic class.
    //
    [EditorBrowsable(EditorBrowsableState.Never)]
    public CustomerManager()
        : this(Singleton.Instance)
    {
    }

    //
    //  This constructor is to be used only by test projects.
    //
    [EditorBrowsable(EditorBrowsableState.Never)]
    public CustomerProcessor(ICustomerRepository repository)
    {
        _repository = repository;
    }

    public void Add(Customer customer)
    {
        _repository.Add(customer);
    }
}

…and the following syntax from test projects.

[Test]
public void Customer_Add_Succeeds()
{
    //
    // Create a mock customer repository and inject it into the business
    // logic component (manager).
    //
    ICustomerRepository repository = new MockCustomerRepository();
    CustomerManager manager = new CustomerManager(repository)

    //
    // Perform remaining test operations.
    //
}

Consumers of the CustomerManager business logic class in production simply use the Singleton<T> class just as the CustomerManager does with the CustomerRepository class in its niladic constructor.

A final note on thread-safety. The singleton pattern does require all instance level methods to be thread-safe.

Conclusion

The Singleton<T> class provides an implementation pattern to balance between two important variables in any design; that of testability and efficient use of system resources. I encourage you to experiment.

Garbage Collector Basics and Performance Hints.zip

XRing DHT and Windows Communication Foundation (WCF)

A few weeks ago I provided a brief history of the P2P landscape in which I mentioned several Distributed Hash Table (DHT) implementations. The latest XRing DHT publication from Microsoft Research is dated September 2004. Substantial work has been done since that publication. To my knowledge, the only XRing DHT implementation exists within the Microsoft Research group and is not publicly available.

The XRing design is quite interesting and, unlike most DHT algorithms, isn’t dominated by a minimalist approach and instead focuses on deployment situations where the churn rate (node join and leave rate) is low or the system is of moderate size (between 1 and approximately 1 million nodes). The design of XRing uses a layered routing scheme to deliver O(logN) routing at worse case with a usual 1-hop anywhere route (2 hops for a million nodes), which is facilitated by the Finger Table and Soft-State Routing Table (SSRT), respectively. The third routing table is the Leaf Set, which contains 2L+1 entries, where L is the number of nodes to probe on either side of the home node plus the home node.

Membership in an XRing DHT is guaranteed by a weak and eventual membership protocol, which employs a special anti-entropy protocol that allows the system to coalesce; equilibrium is eventually reached. The membership and routing layer is where all the fun happens. For my implementation, a node may join an existing XRing either (a) by sending a one-way multicast request or (b) by sending a point-to-point request to an existing member. For (a), one common pattern is to send the request over a multicast channel and the response back over a point-to-point binding, which map wonderfully into WCF.

The multicast channel is realized by using the NetPeerTcpBinding class and the point-to-point channel by the NetTcpBinding class.

[ServiceContract(Namespace = "http://ws.idynamicscorp.com/P2P/Dht/XRing")]
public interface IXRingDhtDuplexMembership
{
	[OperationContract]
	void Join(DhtNodeId id);

	[OperationContract]
	void Leave(DhtNodeId id);
}

[ServiceContract(Namespace = "http://ws.idynamicscorp.com/P2P/Dht/XRing", CallbackContract = typeof(IXRingDhtMulticastMembership))]
public interface IXRingDhtMulticastMembership
{
	[OperationContract(IsOneWay = true)]
	void MulticastJoin(DhtNodeId id);

	[OperationContract(IsOneWay = true)]
	void MulticastLeave(DhtNodeId id);
}

An important design goal for any DHT implementation is that of ring simulation. Therefore, it’s important to allow a single process to support multiple nodes. Unfortunately, this meant I couldn’t use WCF configuration files but instead had to programmatically create the contract, binding, and endpoint address for each node. This took a bit of work to accomplish due to two primary reasons:

  1. The current documentation is not in-sync with the bits.
  2. The WCF samples use configuration-defined bindings, contracts, and endpoints.

Incidentally, working with the WCF configuration files (due to bullet 1) reminded me of a quote from David Hansson that I read on Box’s blog but with a minor modification:

WCF configuration files feel more like the doorknob to the gates of hell. In itself, a doorknob is hardly evil. But once you turn…

Fortunately (for me at least), Lutz Roeder’s .NET Reflector filled in the gaps. After some caffeine, a few frustrating moments, and questioning why I even started down this path, I found myself with a working prototype.

If you run Windows 2003 R2 as the operating system on my primary development machine (as I do), then you currently have one more obstable to overcome. Per the WCF peer channel team, the Peer Name Resolution Protocol (PNRP) service is currently available only on Windows XP (SP1 + networking pack installed), Windows XP (SP2), Windows Vista, and Windows XP Professional (64-bit) [URI]. Fortunately, you are able to leverage the CustomPeerResolver sample, which can be found in the “PeerChannelCustomPeerResolver” folder of the WinFX February CTP.

The following code fragment illustrates how to programmatically reference the custom peer resolver:

NetPeerTcpBinding binding = new NetPeerTcpBinding();
binding.Port = 4242;
binding.Security.Mode = SecurityMode.None;

//
// If the default PNRP service is not available then
// fall back to a custom peer resolver.
//
if (!NetPeerTcpBinding.IsPnrpAvailable)
{
	binding.Resolver.Custom.Address = new EndpointAddress("net.tcp://resolverUri");
	binding.Resolver.Custom.Binding = new NetTcpBinding(SecurityMode.None, true);
	binding.Resolver.Custom.Resolver = new CustomPeerResolver();
	binding.Resolver.Mode = PeerResolverMode.Custom;
}

Overall, I’m quite impressed with the flexibility of WCF and the usability that seemingly went into the consumer API surface. Hopefully, the documentation catches up to the bits.

Peer-to-Peer (P2P)

“In a forest a tree will fade; from a forest a tree is made.”

–Unknown

Over the past several years I’ve focused on application of P2P algorithms for the commercial space. Anyway, I thought I’d share some of my thoughts and tertiary research.

First, what is Peer-To-Peer (P2P)?  P2P [URI] is a decentralized, fault tolerant, self-organizing system architecture comprised of many unreliable and heterogeneous nodes operating in a functionally symmetric manner—frequent joins and leaves are the norm. P2P is not a client / server architecture. P2P is a paradigm shift from coordination to cooperation, from centralization to decentralization, and from control to incentives.

Algorithms

To my knowledge, the Plaxton mesh (1997) was the earliest known proposal for a scalable P2P network. One major problem with a Plaxton mesh is its static—there is no concept of node arrival, departure, or failure. Essentially, nodes act as routers, clients, and servers simultaneously. A Plaxton mesh routes a message to the nodes whose name is numerically closest to the destination.

From 2001 through 2003, P2P algorithms such as Chord, CAN, Pastry, Tapestry, Viceroy, P-Grid, Kademlia, Koorde, SkipGraph, and SkipNet appeared.

Applications

Napster (www.napster.com) was one of the first P2P applications that appeared in 1999—though it’s not really a P2P network. With Napster, a single central server kept the directory of nodes and objects, which made is essentially a large file catalog—the discovery and routing algorithms are not P2P.

Gnutella first appeared in 2000 and then again in 2002. Though a pure P2P application, Gnutella generated far too much search traffic with a [broadcast-based] O(n) complexity—not very scalable.

eDonkey appeared in 2001 and presented a slight architectural improvement over Napster in that no single centralized server held the file catalog. Nevertheless, eDonkey was not a pure P2P application since it relied on status node structures.

Kazaa also appeared in late 2001 but I haven’t studied its architecture at all. I listed it here for completeness.

Gnutella-2 and BitTorrent were the P2P applications of 2002.  In 2003, Skype, Steam, and PS3 appeared.

Architecture

Common characteristics of P2P architectures are:

  • No central servers.
  • High level of scalability. Discovery and routing is O(log(n)). Routing table size is also O(log(n)).
  • Highly resilient.
  • Dynamic adaptability to node arrival, departure and failure.

Because of these characteristics, P2P architectures offer challenges in many areas. The most visible of the challenges facing designers of P2P routing algorithms are:

  • Scalability. Scalability is a measure of how a system performs when the number of nodes and/or number of messages on the network grows.
  • Computational Complexity. Computational complexity is the measure of the order of steps required for a packet to travel from one host to another in a worst case scenario.
  • Anonymity. Anonymity is not a requirement of most P2P networks, however if a network is to be designed to provide anonymity then this is a problem that must be solved at the routing level.

Topology

A P2P infrastructure is best described as an overlay network—usually on top of an IP network.  In many ways, you can think of P2P as an application-level router providing services that underlying layers do not (i.e. IP multicast).  Generally speaking, P2P architectures employ one of the following topological structural designs:

  1. Centralized (Napster)
  2. Decentralized
    1. Unstructured (Gnutella)
    2. Structured (Chord)
  3. Hierarchical (Multicast Backbone (MBone))
  4. Hybrid (eDonkey)

Different P2P algorithms employ different topological structures. For example, Chord employs a ring topology whereas Kademlia employs a tree and CAN a hypercube.

Routing

Different topologies require different algorithm tactics and strategies. Routing within P2P architectures are very problematic.

Within P2P architectures, algorithms such as flooding, replication and caching, random walkers and probabilistic algorithms, super-peers, Time to Live (TTL), epidemic and gossip protocols, propagation, and dampening algorithms are widely used. These algorithms work well depending on the size of the P2P network. For example, flooding [http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf] is more suited for a small to medium size networks but it has been shown that the cost of searching on a Gnutella style network increases super-linearly as the number of nodes increases.

Distributed Hash Tables (DHT)

DHTs are a class of decentralized distributed systems that partition ownership of a set of keys among participating nodes, and can efficiently route messages to the unique owner of any given key. A hash function (such as SHA-1) accepts a variable length string of bytes and returns a one-way hash. The physical nodes are the hash buckets. Chord is a good example of a DHT algorithm.

Incidentally, in general DHT algorithms will always beat flooding algorithms in the area of routing.

One advantage of the Chord DHT algorithm is that it guarantees receipt of a reply with log(n) time, which is a far better guarantee in terms of computational and time complexity than that offered by flooding techniques.  That said, the Kademlia DHT algorithm does offer advantages over Chord depending on application requirements.

However, DHTs are not without problem:

  1. Routing state maintenance algorithms generate overhead.
  2. They do not work well when lots of nodes join and leave frequently (churn).
  3. Their structured topology and routing algorithms make them frail and vulnerable to security attacks.
  4. Network locality problem: Nodes numerically-close are not topologically-close.

Strategies do exist to address these problems. Regardless, structured DHT algorithms provide the most interesting P2P infrastructures.

Semantic Routing

This is a relatively new area of routing as related to P2P infrastructures. Semantic Routing is a method of routing which is more focused on the nature of the query to be routed than the network topology. Essentially, semantic routing improves on traditional routing by prioritizing nodes which have been previously good at providing information about the types of content referred to by the query.

Semantic routing differs fundamentally from other routing techniques because prospective nodes are selected because of another node’s confidence in their ability to respond correctly to a given query irrespective of their position within the network. Neurogrid and Remindin are of the most interesting semantic routing systems with the Remindin system incorporating a semantic routing algorithm developed with the intention of mimicking social networks.

Summary

Overall, the most interesting of the DHT algorithms are Chord, Kademlia, and XRing.  I hope to release .NET implementations of all three of these algorithms (and a few variations) in the near future. You’ll find them here.