Site and Rack safety in Coherence 12.1.2

Introduction

This is a follow-up to my earlier post, “Making your clusters site or rack safe in 3.7.1”.  Now that Coherence 12.1.2 has been released, there are further improvements in this area that I’d like to quickly highlight.

  • SimpleAssignmentStrategy is now the default partition assignment strategy.
  • There is a new PartitionAssignment MBean exposing some very useful information.
  • This MBean reports each service’s current safety level and its target level, up to and including RACK-SAFE and SITE-SAFE.

The Detail

In Coherence 3.7.1 you had to manually configure a service to use the SimpleAssignmentStrategy, but in 12.1.2 it is the default.  A new MBean, available under Coherence:type=PartitionAssignment,service=serviceName,responsibility=DistributionCoordinator, exposes some really useful information about services and what’s happening with partition distribution.

The new MBean has one entry per service and is maintained by the distribution co-ordinator for that service.  Below you can see the structure of this new MBean in JConsole.

New Partition Assignment MBean
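
Outside JConsole, the same MBean can be addressed from any JMX client by its object name. The snippet below is a minimal sketch that only builds that name from the pattern given above; the class and method names are my own, and it assumes the service name needs no JMX quoting.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

final class PartitionAssignmentNames {
    private PartitionAssignmentNames() {
    }

    /**
     * Builds the object name of the PartitionAssignment MBean for one service.
     * Assumption: serviceName contains no characters that need JMX quoting.
     */
    static ObjectName forService(String serviceName) {
        try {
            // Key properties are separated by commas only, with no spaces.
            return new ObjectName("Coherence:type=PartitionAssignment"
                    + ",service=" + serviceName
                    + ",responsibility=DistributionCoordinator");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Bad service name: " + serviceName, e);
        }
    }
}
```

For the service used later in this post, PartitionAssignmentNames.forService("DistributedCache") gives the name you would pass to an MBeanServerConnection.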

The javadoc available here has the detail about the above MBean, but I’ve included some highlights below.

  • As well as HAStatus (StatusHA in the old world!), we also have HATarget. The HATarget is the status that the assignment strategy is striving to achieve. For example, the HATarget may be SITE-SAFE, but that may not have been achieved yet because partition transfers have not completed.
  • RemainingDistributionCount shows the number of outstanding partition transfers left to achieve the HATarget.
  • FairShareBackup and FairSharePrimary show the number of partitions that each member will attempt to maintain.
  • You can run the reportPendingDistributions operation which allows you to see if there are any pending or outstanding partition distributions.
  • You are able to subscribe to the partition.lost JMX notification.
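
The attributes listed above can also be read with the standard JMX API rather than through JConsole. The following is a rough sketch, not a definitive implementation: the class name and the atTarget helper are hypothetical, the values are kept as plain Objects because I am not assuming their exact types, and you need to supply an MBeanServerConnection to a JVM that exposes the Coherence MBeans.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.Attribute;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

final class PartitionAssignmentStatus {
    /** Attribute names taken from the highlights above. */
    static final String[] ATTRIBUTES = {
        "HAStatus", "HATarget", "RemainingDistributionCount",
        "FairSharePrimary", "FairShareBackup"
    };

    private PartitionAssignmentStatus() {
    }

    /** Reads the attributes above from the PartitionAssignment MBean of one service. */
    static Map<String, Object> read(MBeanServerConnection conn, String serviceName)
            throws IOException, JMException {
        ObjectName name = new ObjectName("Coherence:type=PartitionAssignment"
                + ",service=" + serviceName
                + ",responsibility=DistributionCoordinator");
        Map<String, Object> values = new LinkedHashMap<>();
        for (Attribute attribute : conn.getAttributes(name, ATTRIBUTES).asList()) {
            values.put(attribute.getName(), attribute.getValue());
        }
        return values;
    }

    /**
     * Hypothetical helper: true when the reported HAStatus equals the HATarget.
     * Assumption: both attributes use the same vocabulary (e.g. "SITE-SAFE").
     */
    static boolean atTarget(Map<String, Object> values) {
        Object status = values.get("HAStatus");
        return status != null && status.equals(values.get("HATarget"));
    }
}
```

A monitoring script could call read(...) periodically and raise an alert whenever atTarget(...) returns false for longer than expected.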

Example

Using my command line utility from the previous post, I’ve just updated the COHERENCE_HOME to point to my 12.1.2.0.0 install. I’ve also opened up JConsole, so you can see what the PartitionAssignment MBean is reporting.

In this example I have two sites with the following configuration:

Site Details

I first start cache servers on machine1 and machine2. We can see we now have a MACHINE-SAFE configuration.

Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver1         94991 machine1        rack1           SiteA                  525
cacheserver2         94992 machine2        rack1           SiteA                  524

StatusHA is MACHINE-SAFE

Next I will start cache servers on machine3 and machine4. We will see that this eventually becomes RACK-SAFE. In the previous example I checked this programmatically, but you can now see that the PartitionAssignment MBean in JConsole correctly shows it.

Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3         95039 machine3        rack2           SiteA                  262
cacheserver1         94991 machine1        rack1           SiteA                  263
cacheserver2         94992 machine2        rack1           SiteA                  262
cacheserver4         95043 machine4        rack2           SiteA                  262

StatusHA is RACK-SAFE

We can see that the cluster is RACK-SAFE, which means we could lose any single rack and not lose data.

Rack Safe

Next, we will start cache servers on machine5 and machine6 on the second site.

Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3         95039 machine3        rack2           SiteA                  175
cacheserver6         95080 machine6        rack3           SiteB                  175
cacheserver1         94991 machine1        rack1           SiteA                  175
cacheserver5         95054 machine5        rack3           SiteB                  175
cacheserver2         94992 machine2        rack1           SiteA                  175
cacheserver4         95043 machine4        rack2           SiteA                  174

StatusHA is RACK-SAFE

You will notice that the cluster is still RACK-SAFE until we start up the cache servers on the remaining rack. One of the key things to note is that to achieve MACHINE/RACK/SITE safety, no one entity (machine/rack/site) can hold more than 50% of the data. E.g. at the moment, SiteA would hold 66% of the data, so the data could not be balanced effectively across sites.
Once we start up cache servers on the remaining 2 machines we will achieve the SITE-SAFE state.
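
The 50% rule of thumb is easy to check by hand. Here is a small sketch that applies it to the number of cache servers per site, assuming every cache server holds an equal share of the data. The class and method names are made up for illustration, and this is only the rule as stated above, not the logic SimpleAssignmentStrategy itself uses.

```java
import java.util.Map;

final class SafetyRuleOfThumb {
    private SafetyRuleOfThumb() {
    }

    /** Largest fraction of cache servers found in any single entity (machine, rack or site). */
    static double largestShare(Map<String, Integer> serversPerEntity) {
        int total = 0;
        int largest = 0;
        for (int count : serversPerEntity.values()) {
            total += count;
            largest = Math.max(largest, count);
        }
        return total == 0 ? 0.0 : (double) largest / total;
    }

    /** True when no single entity holds more than half of the cache servers. */
    static boolean canBeSafe(Map<String, Integer> serversPerEntity) {
        return largestShare(serversPerEntity) <= 0.5;
    }
}
```

With six cache servers (4 in SiteA, 2 in SiteB) the largest share is 4/6, roughly 66%, so the check fails for sites. With eight cache servers (4 in each site) the largest share is exactly 50% and the check passes.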

Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3         95039 machine3        rack2           SiteA                  131
cacheserver6         95080 machine6        rack3           SiteB                  131
cacheserver8         95093 machine8        rack4           SiteB                  131
cacheserver1         94991 machine1        rack1           SiteA                  131
cacheserver5         95054 machine5        rack3           SiteB                  132
cacheserver2         94992 machine2        rack1           SiteA                  131
cacheserver7         95089 machine7        rack4           SiteB                  131
cacheserver4         95043 machine4        rack2           SiteA                  131

StatusHA is SITE-SAFE

During this last cache server startup, I ran reportPendingDistributions and some of the output is below. This shows which transfers the strategy is waiting on to achieve the desired HATarget.

Pending Partition Distributions for Service "DistributedCache"

Machine machine4
    Member 5:
        - waiting to receive 19 Backup partitions:
           -- 1 from member 3: PartitionSet{287}
           -- 5 from member 6: PartitionSet{103, 106, 177, 181, 185}
           -- 5 from member 8: PartitionSet{267, 271, 275, 279, 283}
           -- 8 from member 9: PartitionSet{28, 29, 33, 34, 51, 53, 57, 58}
Machine machine3
    Member 4:
        - waiting to receive 8 Backup partitions:
           -- 3 from member 6: PartitionSet{176, 180, 184}
           -- 5 from member 8: PartitionSet{266, 270, 274, 278, 282}
...
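
If you capture this report as text, a few lines of Java can total the partitions still in flight. This is only a sketch that assumes the report keeps the "waiting to receive N ... partitions" wording shown above; the class name is hypothetical and the format may differ between releases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PendingDistributionSummary {
    // Matches lines such as "- waiting to receive 19 Backup partitions:".
    private static final Pattern WAITING =
            Pattern.compile("waiting to receive (\\d+) \\w+ partitions");

    private PendingDistributionSummary() {
    }

    /** Sums the per-member "waiting to receive" counts found in the report text. */
    static int totalWaiting(String report) {
        Matcher matcher = WAITING.matcher(report);
        int total = 0;
        while (matcher.find()) {
            total += Integer.parseInt(matcher.group(1));
        }
        return total;
    }
}
```

For the excerpt above, the two members shown are waiting on 19 + 8 = 27 backup partitions between them.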

Now in JConsole, you can see that the cluster has reached its target state of SITE-SAFE.

Site Safe

Thanks for reading, and hope you enjoy the new 12.1.2 Rack and Site safety features.
