Making your cluster site or rack safe with Coherence 3.7.1

It’s been a while since my last post – things have been very busy! I’ve just moved across to the Coherence Development team and it’s definitely an exciting time to join. After I’ve settled in, I look forward to doing some more regular posts. On to the post…

Overview
Many customers I’ve worked with want to run a single cluster across multiple data centers to provide DR capabilities. In the case where the data centers are connected by relatively slow networks, it is much better to have two separate clusters and connect them via Coherence*Extend. There are patterns on the Incubator site such as the push-replicate pattern, which shows how to replicate data between sites in this configuration.

But many of these data centers are now connected via 10Gb or faster links, and customers are asking “Why can’t we have a single cluster across both?” Doing this is possible, but it is not recommended unless latencies are extremely low, because a slower link can affect the entire cluster. The detailed discussion on this is for another day as there are many other factors to consider, but it is possible under the right conditions.

Distributed Cache Diversion
Before we get into more detail, just a diversion to talk about how partitioning of data works in a distributed cache. (This will help us understand the end result we are trying to achieve.) For a Distributed cache, the data is evenly distributed across all the available members using a common hashing algorithm. Where possible, without affecting the overall balance of data, backup and primary copies of data reside on separate physical machines for data reliability. When the Service, which contains the cache, has data with all backups and primary copies on physically separate machines, it is known as machine-safe. In this state the Service can survive the loss of an entire machine without data loss.
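
To make the partitioning idea concrete, here is a minimal sketch in plain Java. It is not Coherence's actual hashing or assignment code: the class name PartitionSketch, the use of hashCode(), the member and machine names, and the partition count of 1049 (borrowed from the demo configuration later in this post) are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not Coherence's real hashing or assignment logic. */
final class PartitionSketch {

    /** Maps a key to a partition in [0, partitionCount) using its hash code. */
    static int partitionFor(Object key, int partitionCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /**
     * Picks a backup owner for a partition. Each member is {memberName, machineName}.
     * Prefers the first member on a different machine from the primary; falls back
     * to another member on the same machine, or null if the primary is alone.
     */
    static String backupFor(String[] primary, List<String[]> members) {
        String sameMachineFallback = null;
        for (String[] member : members) {
            if (member[0].equals(primary[0])) {
                continue; // the primary cannot back itself up
            }
            if (!member[1].equals(primary[1])) {
                return member[0]; // different machine: the machine-safe choice
            }
            if (sameMachineFallback == null) {
                sameMachineFallback = member[0];
            }
        }
        return sameMachineFallback;
    }

    public static void main(String[] args) {
        List<String[]> members = Arrays.asList(
                new String[] {"cacheserver1", "server1"},
                new String[] {"cacheserver2", "server1"},
                new String[] {"cacheserver3", "server2"});

        System.out.println("key order-42 -> partition " + partitionFor("order-42", 1049));
        // prints "backup for cacheserver1 -> cacheserver3"
        System.out.println("backup for cacheserver1 -> " + backupFor(members.get(0), members));
    }
}
```

The point of the sketch is the preference order: a backup on another machine is chosen whenever one exists, and a same-machine backup is only a fallback.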

See here for more information on distributed caches.

Taking this further, what if you have multiple racks within a data centre and a Coherence cluster spans them? What about multiple sites? How do we get rack-safe, or site-safe? Prior to 3.7.1, only the machine a cache server resided on was taken into account when making these backup decisions, not the rack or site. In 3.7.1, with the new Simple Partition Assignment Strategy, Coherence takes into consideration the entire topology of the cluster, from machine to rack to site, when backing up data.

So now clusters can be not only machine-safe, but rack-safe and site-safe as well!
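
The relationship between these safety levels can be sketched in a few lines of plain Java. This is a hypothetical model, not how Coherence computes StatusHA internally: the SafetySketch class, its Member type, and the rule "the service is as safe as its weakest partition" are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not how Coherence computes StatusHA internally. */
final class SafetySketch {

    /** Safety levels, weakest first, named after the StatusHA values shown in this post. */
    enum Level { ENDANGERED, NODE_SAFE, MACHINE_SAFE, RACK_SAFE, SITE_SAFE }

    /** Where one cache server sits in the cluster topology. */
    static final class Member {
        final String name;
        final String machine;
        final String rack;
        final String site;

        Member(String name, String machine, String rack, String site) {
            this.name = name;
            this.machine = machine;
            this.rack = rack;
            this.site = site;
        }
    }

    /** Safety of one partition, given the members holding its primary and backup. */
    static Level levelOf(Member primary, Member backup) {
        if (backup == null || primary.name.equals(backup.name)) {
            return Level.ENDANGERED; // no separate backup copy exists
        }
        if (!primary.site.equals(backup.site)) {
            return Level.SITE_SAFE;
        }
        if (!primary.rack.equals(backup.rack)) {
            return Level.RACK_SAFE;
        }
        if (!primary.machine.equals(backup.machine)) {
            return Level.MACHINE_SAFE;
        }
        return Level.NODE_SAFE; // different JVMs on the same machine
    }

    /** Assumption: the service is only as safe as its least-safe partition. */
    static Level serviceLevel(List<Member[]> partitions) {
        Level weakest = Level.SITE_SAFE;
        for (Member[] pair : partitions) {
            Level level = levelOf(pair[0], pair[1]);
            if (level.compareTo(weakest) < 0) {
                weakest = level;
            }
        }
        return weakest;
    }

    public static void main(String[] args) {
        Member cs1 = new Member("cacheserver1", "server1", "rack1", "");
        Member cs2 = new Member("cacheserver2", "server1", "rack1", "");
        Member cs3 = new Member("cacheserver3", "server2", "rack1", "");
        Member cs5 = new Member("cacheserver5", "server3", "rack2", "");

        // Every backup is in the other rack.
        System.out.println(serviceLevel(Arrays.asList(
                new Member[] {cs1, cs5}, new Member[] {cs5, cs3}))); // prints RACK_SAFE

        // One partition is backed up on the same machine, which drags the service down.
        System.out.println(serviceLevel(Arrays.asList(
                new Member[] {cs1, cs5}, new Member[] {cs1, cs2}))); // prints NODE_SAFE
    }
}
```

Note how a single badly placed backup lowers the level for the whole service, which is why the status only improves once partition transfers have finished.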

Back to the example
Before 3.7.1, the only viable way to achieve a “site-safe” cluster was to set the machine-id manually, forcing Coherence to treat the two sites as two machines so that backups are placed across these so-called “machines”. That’s a reasonable approach, but if we lose a site then the cluster becomes only node-safe, not machine-safe: Coherence thinks it has only one machine left, so the loss of a physical machine could cause data loss.
For the setup in the diagram below, this would be done by setting the following:

-Dtangosol.coherence.machine=siteA   (for all machines on Site A)
-Dtangosol.coherence.machine=siteB   (for all machines on Site B)

Traditional “Site-Safe” Method

Using the Simple Partition Assignment Strategy
With the new Simple Partition Assignment Strategy in 3.7.1, and with the above cluster setup, you can achieve a site-safe configuration as long as you set the site name, using -Dtangosol.coherence.site=siteA on Site A and -Dtangosol.coherence.site=siteB on Site B, or using the appropriate override. That is, you could lose an entire site at once and you would not lose data. Similarly, if you have multiple racks in your configuration, you can achieve a rack-safe configuration as long as you identify them via the -Dtangosol.coherence.rack setting. That is, you could lose an entire rack and you would not lose data.
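
As a rough illustration of how these settings end up on a cache server's command line, here is a small Java sketch that assembles the arguments. The LaunchSketch class and the coherence.jar classpath entry are assumptions; the three system properties and the com.tangosol.net.DefaultCacheServer main class are the standard Coherence ones. It is not the utility used in the demonstration in this post.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles a cache server command line. */
final class LaunchSketch {

    /** Builds the command; null or empty identity values are simply left out. */
    static List<String> cacheServerCommand(String site, String rack, String machine) {
        List<String> command = new ArrayList<>();
        command.add("java");
        addIfSet(command, "tangosol.coherence.site", site);
        addIfSet(command, "tangosol.coherence.rack", rack);
        addIfSet(command, "tangosol.coherence.machine", machine);
        command.add("-cp");
        command.add("coherence.jar"); // assumption: adjust to your own classpath
        command.add("com.tangosol.net.DefaultCacheServer");
        return command;
    }

    private static void addIfSet(List<String> command, String property, String value) {
        if (value != null && !value.isEmpty()) {
            command.add("-D" + property + "=" + value);
        }
    }

    public static void main(String[] args) {
        // prints: java -Dtangosol.coherence.site=siteA -Dtangosol.coherence.rack=rack1
        //         -Dtangosol.coherence.machine=server1 -cp coherence.jar
        //         com.tangosol.net.DefaultCacheServer   (all on one line)
        System.out.println(String.join(" ", cacheServerCommand("siteA", "rack1", "server1")));
    }
}
```

The resulting list could be handed to java.lang.ProcessBuilder to start the server as a separate process.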

Demonstration
To demonstrate this I’m using part of the Coherence Incubator functionality, which allows you to easily startup/shutdown multiple cache servers, either in process or as separate processes. I’ve built a wrapper around this and made a simple command line utility that allows me to dynamically specify a machine, rack and site before starting up cache servers.

I’ll provide the details of the code below, but in my cache configuration, all I have to do to enable this for a service is to set the partition-assignment-strategy.

<?xml version="1.0"?>
<cache-config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns="http://xmlns.oracle.com/coherence/coherence-cache-config"
	xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-cache-config http://xmlns.oracle.com/coherence/coherence-cache-config/1.1/coherence-cache-config.xsd">

	<caching-scheme-mapping>
		<cache-mapping>
			<cache-name>*</cache-name>
			<scheme-name>example-distributed</scheme-name>

		</cache-mapping>
	</caching-scheme-mapping>

	<caching-schemes>
		<distributed-scheme>
			<scheme-name>example-distributed</scheme-name>
			<service-name>DistributedCache</service-name>

			<thread-count>5</thread-count>
			<partition-count>1049</partition-count>
			  
			<partition-assignment-strategy>
				<instance>
					<class-name>com.tangosol.net.partition.SimpleAssignmentStrategy</class-name>
				</instance>
			</partition-assignment-strategy>
			
			<backing-map-scheme>
				<local-scheme>
					<unit-calculator>BINARY</unit-calculator>
				</local-scheme>
			</backing-map-scheme>

			<autostart>true</autostart>
       
		</distributed-scheme>


	</caching-schemes>
</cache-config>
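
If you want to double-check that a cache configuration really names the strategy, a quick sanity check can be sketched with the JDK's built-in XML parser. The ConfigCheckSketch class below is hypothetical and only reads element text; it does not validate the file against the Coherence schema or load any Coherence classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sanity check for a cache configuration file's contents. */
final class ConfigCheckSketch {

    /** Returns every class-name found inside a partition-assignment-strategy element. */
    static List<String> assignmentStrategies(String cacheConfigXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            cacheConfigXml.getBytes(StandardCharsets.UTF_8)));

            List<String> result = new ArrayList<>();
            NodeList strategies = doc.getElementsByTagName("partition-assignment-strategy");
            for (int i = 0; i < strategies.getLength(); i++) {
                Element strategy = (Element) strategies.item(i);
                NodeList classNames = strategy.getElementsByTagName("class-name");
                for (int j = 0; j < classNames.getLength(); j++) {
                    result.add(classNames.item(j).getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse cache configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<cache-config><caching-schemes><distributed-scheme>"
                + "<partition-assignment-strategy><instance>"
                + "<class-name>com.tangosol.net.partition.SimpleAssignmentStrategy</class-name>"
                + "</instance></partition-assignment-strategy>"
                + "</distributed-scheme></caching-schemes></cache-config>";
        // prints [com.tangosol.net.partition.SimpleAssignmentStrategy]
        System.out.println(assignmentStrategies(xml));
    }
}
```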

Rack-safe Configuration
Consider the example below where you have a single site with 2 racks with 2 servers each for simplicity. Each server will have 2 cache servers running.

Rack-safe Configuration

Running my utility (which I’ll show below) and passing the IP address of my machine for the WKA configuration, I can create this setup using the set rack and set machine commands, which set the tangosol.coherence.rack and tangosol.coherence.machine system properties before starting the cache server(s).

$ ./run.sh 192.168.88.1

Oracle Coherence Version 3.7.1.3 Build 31790
 Grid Edition: Development mode
Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.

Using the Incubator Extensible Environment for Coherence Cache Configuration
Copyright (c) 2011, Oracle Corporation. All Rights Reserved.

Type help for help or quit to exit.
Command{machine=,rack=,site=}: set rack rack1
Command{machine=,rack=rack1,site=}: set machine server1  
Command{machine=server1,rack=rack1,site=}: start 2
Command{machine=server1,rack=rack1,site=}: 
[cacheserver2:4252]    1: 
[cacheserver2:4252]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver2:4252]    3:  Grid Edition: Development mode
[cacheserver2:4252]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver2:4252]    5: 
[cacheserver1:4251]    1: 
[cacheserver1:4251]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver1:4251]    3:  Grid Edition: Development mode
[cacheserver1:4251]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver1:4251]    5: 
[cacheserver2:4252]    6: 
[cacheserver2:4252]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver2:4252]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver2:4252]    9: 
[cacheserver1:4251]    6: 
[cacheserver1:4251]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver1:4251]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver1:4251]    9: 

Please enter a command.
Command{machine=server1,rack=rack1,site=}: set machine server2
Command{machine=server2,rack=rack1,site=}: start 2
Command{machine=server2,rack=rack1,site=}: 
[cacheserver3:4257]    1: 
[cacheserver3:4257]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver3:4257]    3:  Grid Edition: Development mode
[cacheserver3:4257]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver3:4257]    5: 
[cacheserver4:4258]    1: 
[cacheserver4:4258]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver4:4258]    3:  Grid Edition: Development mode
[cacheserver4:4258]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver4:4258]    5: 
[cacheserver3:4257]    6: 
[cacheserver3:4257]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver3:4257]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver3:4257]    9: 
[cacheserver4:4258]    6: 
[cacheserver4:4258]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver4:4258]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver4:4258]    9: 

Please enter a command.
Command{machine=server2,rack=rack1,site=}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3          4257 server2         rack1                                 null
cacheserver4          4258 server2         rack1                                 null
cacheserver1          4251 server1         rack1                                  524
cacheserver2          4252 server1         rack1                                  525

StatusHA is NODE-SAFE

Not machine-safe yet because partitions are still being transferred.

Command{machine=server2,rack=rack1,site=}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3          4257 server2         rack1                                  262
cacheserver4          4258 server2         rack1                                  262
cacheserver1          4251 server1         rack1                                  262
cacheserver2          4252 server1         rack1                                  263

StatusHA is MACHINE-SAFE

Machine-safe now, so add new machines in a different rack.

Command{machine=server2,rack=rack1,site=}: set rack rack2
Command{machine=server2,rack=rack2,site=}: set machine server3
Command{machine=server3,rack=rack2,site=}: start 2
Command{machine=server3,rack=rack2,site=}: 
[cacheserver5:4270]    1: 
[cacheserver5:4270]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver5:4270]    3:  Grid Edition: Development mode
[cacheserver5:4270]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver5:4270]    5: 
[cacheserver6:4271]    1: 
[cacheserver6:4271]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver6:4271]    3:  Grid Edition: Development mode
[cacheserver6:4271]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver6:4271]    5: 
[cacheserver5:4270]    6: 
[cacheserver5:4270]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver5:4270]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver5:4270]    9: 
[cacheserver6:4271]    6: 
[cacheserver6:4271]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver6:4271]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver6:4271]    9: 

Please enter a command.
Command{machine=server3,rack=rack2,site=}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3          4257 server2         rack1                                  262
cacheserver5          4270 server3         rack2                                 null
cacheserver4          4258 server2         rack1                                  262
cacheserver1          4251 server1         rack1                                  262
cacheserver6          4271 server3         rack2                                 null
cacheserver2          4252 server1         rack1                                  263

StatusHA is MACHINE-SAFE

Command{machine=server3,rack=rack2,site=}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3          4257 server2         rack1                                  209
cacheserver5          4270 server3         rack2                                  210
cacheserver4          4258 server2         rack1                                  210
cacheserver1          4251 server1         rack1                                  210
cacheserver6          4271 server3         rack2                                 null
cacheserver2          4252 server1         rack1                                  210

StatusHA is MACHINE-SAFE

Command{machine=server3,rack=rack2,site=}: set rack rack2
Command{machine=server3,rack=rack2,site=}: set machine server4
Command{machine=server4,rack=rack2,site=}: start 2
Command{machine=server4,rack=rack2,site=}: 
[cacheserver8:4283]    1: 
[cacheserver8:4283]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver8:4283]    3:  Grid Edition: Development mode
[cacheserver8:4283]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver8:4283]    5: 
[cacheserver7:4282]    1: 
[cacheserver7:4282]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver7:4282]    3:  Grid Edition: Development mode
[cacheserver7:4282]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver7:4282]    5: 
[cacheserver8:4283]    6: 
[cacheserver8:4283]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver8:4283]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver8:4283]    9: 
[cacheserver7:4282]    6: 
[cacheserver7:4282]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver7:4282]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver7:4282]    9: 

Please enter a command.
Command{machine=server4,rack=rack2,site=}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3          4257 server2         rack1                                  150
cacheserver8          4283 server4         rack2                                 null
cacheserver7          4282 server4         rack2                                  127
cacheserver5          4270 server3         rack2                                  173
cacheserver4          4258 server2         rack1                                  149
cacheserver1          4251 server1         rack1                                  150
cacheserver6          4271 server3         rack2                                  150
cacheserver2          4252 server1         rack1                                  150

StatusHA is MACHINE-SAFE

Still machine-safe as not all partitions are transferred.


Command{machine=server4,rack=rack2,site=}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver3          4257 server2         rack1                                  131
cacheserver8          4283 server4         rack2                                  131
cacheserver7          4282 server4         rack2                                  131
cacheserver5          4270 server3         rack2                                  131
cacheserver4          4258 server2         rack1                                  132
cacheserver1          4251 server1         rack1                                  131
cacheserver6          4271 server3         rack2                                  131
cacheserver2          4252 server1         rack1                                  131

StatusHA is RACK-SAFE

Command{machine=server4,rack=rack2,site=}: 

Now the cluster is rack-safe!
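
The balanced partition counts in the output above follow from simple integer division: 1049 does not divide evenly, so one member always ends up with one extra partition. A small sketch of the arithmetic (the BalanceSketch class is hypothetical, and which member actually receives the extra partition is up to Coherence, not this code):

```java
import java.util.Arrays;

/** Hypothetical sketch of the even-split arithmetic behind the balanced counts. */
final class BalanceSketch {

    /** Spreads partitionCount primaries as evenly as possible across memberCount members. */
    static int[] evenSplit(int partitionCount, int memberCount) {
        int[] counts = new int[memberCount];
        int base = partitionCount / memberCount;  // every member gets at least this many
        int extra = partitionCount % memberCount; // this many members get one more
        for (int i = 0; i < memberCount; i++) {
            counts[i] = base + (i < extra ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints:
        // 2 members: [525, 524]
        // 4 members: [263, 262, 262, 262]
        // 8 members: [132, 131, 131, 131, 131, 131, 131, 131]
        for (int members : new int[] {2, 4, 8}) {
            System.out.println(members + " members: "
                    + Arrays.toString(evenSplit(1049, members)));
        }
    }
}
```

These match the 524/525, 262/263 and 131/132 splits shown by the show command once balancing has completed.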

Site-safe Configuration
Now let’s look at the case where we have 2 sites with 2 servers on each site (again for simplicity).

Site-safe Configuration

$ ./run.sh 192.168.88.1

Oracle Coherence Version 3.7.1.3 Build 31790
 Grid Edition: Development mode
Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.

Using the Incubator Extensible Environment for Coherence Cache Configuration
Copyright (c) 2011, Oracle Corporation. All Rights Reserved.

Type help for help or quit to exit.
Command{machine=,rack=,site=}: set site PrimaryDC
Command{machine=,rack=,site=PrimaryDC}: set machine server1
Command{machine=server1,rack=,site=PrimaryDC}: start 2
Command{machine=server1,rack=,site=PrimaryDC}: [cacheserver2:4381]    1: 
[cacheserver2:4381]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver2:4381]    3:  Grid Edition: Development mode
[cacheserver2:4381]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver2:4381]    5: 
[cacheserver1:4380]    1: 
[cacheserver1:4380]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver1:4380]    3:  Grid Edition: Development mode
[cacheserver1:4380]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver1:4380]    5: 
[cacheserver2:4381]    6: 
[cacheserver2:4381]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver2:4381]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver2:4381]    9: 
[cacheserver1:4380]    6: 
[cacheserver1:4380]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver1:4380]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver1:4380]    9: 

Please enter a command.
Command{machine=server1,rack=,site=PrimaryDC}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver2          4381 server1                         PrimaryDC             null
cacheserver1          4380 server1                         PrimaryDC             1049

StatusHA is ENDANGERED

Command{machine=server1,rack=,site=PrimaryDC}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver2          4381 server1                         PrimaryDC              524
cacheserver1          4380 server1                         PrimaryDC              525

StatusHA is NODE-SAFE

All partitions balanced now.


Command{machine=server1,rack=,site=PrimaryDC}: set machine server2
Command{machine=server2,rack=,site=PrimaryDC}: start 2
Command{machine=server2,rack=,site=PrimaryDC}: [cacheserver4:4389]    1: 
[cacheserver4:4389]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver4:4389]    3:  Grid Edition: Development mode
[cacheserver4:4389]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver4:4389]    5: 
[cacheserver3:4388]    1: 
[cacheserver3:4388]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver3:4388]    3:  Grid Edition: Development mode
[cacheserver3:4388]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver3:4388]    5: 
[cacheserver4:4389]    6: 
[cacheserver4:4389]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver4:4389]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver4:4389]    9: 
[cacheserver3:4388]    6: 
[cacheserver3:4388]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver3:4388]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver3:4388]    9: 

Please enter a command.
Command{machine=server2,rack=,site=PrimaryDC}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver4          4389 server2                         PrimaryDC              350
cacheserver3          4388 server2                         PrimaryDC             null
cacheserver2          4381 server1                         PrimaryDC              350
cacheserver1          4380 server1                         PrimaryDC              349

StatusHA is NODE-SAFE

Command{machine=server2,rack=,site=PrimaryDC}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver4          4389 server2                         PrimaryDC              262
cacheserver3          4388 server2                         PrimaryDC              262
cacheserver2          4381 server1                         PrimaryDC              262
cacheserver1          4380 server1                         PrimaryDC              263

StatusHA is MACHINE-SAFE

Now machine-safe in one single data centre. Let’s start up the other data centre.


Command{machine=server2,rack=,site=PrimaryDC}: set site BackupDC
Command{machine=server2,rack=,site=BackupDC}: set machine server3
Command{machine=server3,rack=,site=BackupDC}: start 2
Command{machine=server3,rack=,site=BackupDC}: [cacheserver5:4399]    1: 
[cacheserver5:4399]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver5:4399]    3:  Grid Edition: Development mode
[cacheserver5:4399]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver5:4399]    5: 
[cacheserver6:4400]    1: 
[cacheserver6:4400]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver6:4400]    3:  Grid Edition: Development mode
[cacheserver6:4400]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver6:4400]    5: 
[cacheserver5:4399]    6: 
[cacheserver5:4399]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver5:4399]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver5:4399]    9: 
[cacheserver6:4400]    6: 
[cacheserver6:4400]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver6:4400]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver6:4400]    9: 

Please enter a command.
Command{machine=server3,rack=,site=BackupDC}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver4          4389 server2                         PrimaryDC              175
cacheserver3          4388 server2                         PrimaryDC              174
cacheserver6          4400 server3                         BackupDC               175
cacheserver2          4381 server1                         PrimaryDC              175
cacheserver5          4399 server3                         BackupDC               175
cacheserver1          4380 server1                         PrimaryDC              175

StatusHA is MACHINE-SAFE

Command{machine=server3,rack=,site=BackupDC}: set machine server4
Command{machine=server4,rack=,site=BackupDC}: start 2
Command{machine=server4,rack=,site=BackupDC}: [cacheserver7:4408]    1: 
[cacheserver7:4408]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver7:4408]    3:  Grid Edition: Development mode
[cacheserver7:4408]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver7:4408]    5: 
[cacheserver7:4408]    6: 
[cacheserver7:4408]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver7:4408]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver7:4408]    9: 
[cacheserver8:4409]    1: 
[cacheserver8:4409]    2: Oracle Coherence Version 3.7.1.3 Build 31790
[cacheserver8:4409]    3:  Grid Edition: Development mode
[cacheserver8:4409]    4: Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
[cacheserver8:4409]    5: 
[cacheserver8:4409]    6: 
[cacheserver8:4409]    7: Using the Incubator Extensible Environment for Coherence Cache Configuration
[cacheserver8:4409]    8: Copyright (c) 2011, Oracle Corporation. All Rights Reserved.
[cacheserver8:4409]    9: 

Please enter a command.
Command{machine=server4,rack=,site=BackupDC}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver4          4389 server2                         PrimaryDC              150
cacheserver3          4388 server2                         PrimaryDC              150
cacheserver8          4409 server4                         BackupDC              null
cacheserver6          4400 server3                         BackupDC               150
cacheserver2          4381 server1                         PrimaryDC              150
cacheserver5          4399 server3                         BackupDC               149
cacheserver1          4380 server1                         PrimaryDC              150
cacheserver7          4408 server4                         BackupDC               150

StatusHA is MACHINE-SAFE

Not quite site-safe yet because the partitions are not all transferred and balanced.



Command{machine=server4,rack=,site=BackupDC}: show
Partition Count: 1049, Unowned: 0
Name            PID        Machine         Rack            Site            Partitions
=============   ========== ==============  ==============  ==============  ==========
cacheserver4          4389 server2                         PrimaryDC              131
cacheserver3          4388 server2                         PrimaryDC              131
cacheserver8          4409 server4                         BackupDC               131
cacheserver6          4400 server3                         BackupDC               131
cacheserver2          4381 server1                         PrimaryDC              131
cacheserver5          4399 server3                         BackupDC               132
cacheserver1          4380 server1                         PrimaryDC              131
cacheserver7          4408 server4                         BackupDC               131

StatusHA is SITE-SAFE

Now we have a site-safe configuration using the Simple Partition Assignment Strategy!
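
What SITE-SAFE buys you can be shown with a tiny simulation: a partition is lost only if both its primary and its backup live on the site that fails. The SiteLossSketch class and its Placement type below are hypothetical, and the alternating placement is just one easy way to build a site-safe layout for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of losing an entire site. */
final class SiteLossSketch {

    /** The sites holding the primary and backup copies of one partition. */
    static final class Placement {
        final String primarySite;
        final String backupSite;

        Placement(String primarySite, String backupSite) {
            this.primarySite = primarySite;
            this.backupSite = backupSite;
        }
    }

    /** Counts partitions whose primary and backup would both vanish with the failed site. */
    static int partitionsLost(List<Placement> placements, String failedSite) {
        int lost = 0;
        for (Placement p : placements) {
            if (p.primarySite.equals(failedSite) && p.backupSite.equals(failedSite)) {
                lost++;
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        List<Placement> siteSafe = new ArrayList<>();
        List<Placement> singleSite = new ArrayList<>();
        for (int partition = 0; partition < 1049; partition++) {
            // Site-safe: primary and backup always sit on opposite sites.
            if (partition % 2 == 0) {
                siteSafe.add(new Placement("PrimaryDC", "BackupDC"));
            } else {
                siteSafe.add(new Placement("BackupDC", "PrimaryDC"));
            }
            // Not site-safe: both copies sit in the same data centre.
            singleSite.add(new Placement("PrimaryDC", "PrimaryDC"));
        }

        System.out.println(partitionsLost(siteSafe, "PrimaryDC"));   // prints 0
        System.out.println(partitionsLost(singleSite, "PrimaryDC")); // prints 1049
    }
}
```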

Closing thoughts
This is a great new feature which seems to have slipped in without too much fanfare. It is definitely something that many people have been asking for.

As mentioned early on, running a Coherence cluster across multiple geographically dispersed sites is possible, but care should be taken when doing this. Speeds of 10Gb and extremely low latencies are a must. You must also test the link using a tool such as the datagram test, and assess the impact of cluster traffic on your other cross-site traffic. Other factors, outside the scope of this discussion, should also be considered.

One of my colleagues has also posted about this new partitioning strategy, and has some good advice down the bottom of his post. Worth a read too!

Source Code
You can download the source code for this small example at https://blogs.oracle.com/felcey/resource/PartitionExample.zip. You will also need to download Coherence from OTN and the Coherence Commons package from the Incubator site.
The following Java classes are part of this:

  • RunPartitionTest.java – main class with command line utility
  • ClusterBuilderHelper.java – helper class to wrap some of the incubator classes
  • PartitionHelper.java – helper class to determine statusHA without using JMX

6 Responses to Making your cluster site or rack safe with Coherence 3.7.1

  1. Slava says:

    Hi Tim,

    Do you happen to know internal details of what happens on the joining node/s when one or more of the nodes in the
    coherence cluster is/are restarted while cluster is populated?
    We are getting OOM error in jboss server when we are joining the cluster. jboss has local storage turned off.

    Thanks

  2. Hi,
    When a cache server that holds data is shut down, the primary and backup partitions it owned are redistributed amongst the remaining storage-enabled cache servers. If it is a graceful shutdown, this is done before the cache server shuts down. If it’s not graceful, e.g. a failure, the data is recovered by the normal recovery process.

    In terms of OOM, that can happen when there are not enough cache servers left to hold the data, e.g. you had 20 cache servers and then only 5 are left. If you don’t have size limitations you can get OOM.
    But if you are getting OOM in your storage-disabled clients it could be many things, and without error messages, config, etc. it is difficult to diagnose.
    It is probably worth posting a question at https://forums.oracle.com/forums/forum.jspa?forumID=480 or logging an SR with Oracle Support.

    Hope that helps.

    Regards

    Tim

  3. Luc Douwen says:

    Hi Tim,
    I see you use WKA configurations.
    We configured our Coherence cluster multicast over our 2 low latency connected datacenters. In the past Ehcache clusters performed well over multicast over our datacenters.
    Do you see advantages using WKA? I guess with 2 datacenters you create different WKAs for each to avoid a single point of failure? Is the WKA uptime included in the site-safe status?
    Thanks

    • Hi Luc

      I just use WKA in my examples so as not to inadvertently join other clusters people are using.

      If you are able to use multicast on your network, I would use it if you can as there are some operations that perform better with multicast. E.g. when multicast is enabled and a message needs to be sent to > 25% of the cluster, it will be sent via multicast.

      There is a great article from Jon Purdy here that explains why multicast is the preferred option.

      Having said that, WKA will work as well, but you need to be aware of some of the operations mentioned by Jon, especially in large clusters.
      By setting WKA, you effectively disable multicast communications in the cluster altogether.

      Regards

      Tim

  4. Michel Herrera says:

    Hi Tim

    I would be very thankful if you could provide the download link for the source code
    of the following classes:

    RunPartitionTest.java
    ClusterBuilderHelper.java
    PartitionHelper.java

    Regards

    Michel Herrera
    michelherrerasanchez@gmail.com
