Sharding:
In my career as a Java Developer, I have encountered cases where data of a particular type is split across multiple database instances having the same structure but segregated by some logical criteria. For example, if we are in a health organization, it is possible that member (you and me, patients) data for one health plan A is in one database while member data for health plan B is in another database.
There could be many reasons for the above. It is possible that there is just too much data for a single database, possible that health plan A just does not want their data mixed with health plan B or maybe the network latency for Health Plan B to store the data in the same database might be high.
Sharding can be thought of segregating and partitioning the data across multiple databases driven by functional or non-functional forces.
When a data consumer is talking to an architecture where the data is sharded, one typically experiences one of the following cases:
The Requirements: We are of to the world cup of soccer. Every country playing in the game maintains a database of its players. The application we would like to develop would like to query across these different databases. We need to develop an application for FIFA that will allow each country to create and maintain players and obtain information about other players from other countries as well. This database instance is provided for each country but the country does not have direct access to be able to insert data or query data against it but instead must use the FIFA application for all operations.
The Example: Denoting a Player in the space of the application, we define a POJO called NationalPlayer. The NationalPlayer has some interesting attributes,
For example, the hibernate configuration for the Indian database is as follows:
To test our example, we have some unit tests. The tests will ensure the following are working:
The Parting: A requirement to using Hibernate Shards is Java 1.5 or higher. I am not quite sure why this requirement exists as one does not necessarily need to use annotations or java 5 features for hibernate configuration. A todo for me. Session Factory configuration via JPA is apparently not supported as yet. Different ID generation strategies can be used. One interesting problem in sharding is sorting of the results. Hibernate Shards works around the same by insisting that all objects returned by a Criteria query with Order By clause implement the Comparable interface. Sorting will only occur after obtaining the result set from each member of the shard within Hibernate Shards. From the documentation, it appears that "Distinct" clauses are not supported as yet. HQL support is not well supported as yet. For example, I wanted to execute a "delete from NationalPlayer" across the shard, i.e., clear the table on every database. I could not execute the same with an unsupported exception being the result. The documentation recommends staying away if possible. Cross Sharding is not yet supported. What this means is that if Object A is on shard 1 and object B is on shard 2, one cannot create an associated between them. Have not encountered a need for this yet. Links,
Hibernate Shards Documentation
Nice Read on Sharding
Now only if I can get my Jersey example working...
In my career as a Java Developer, I have encountered cases where data of a particular type is split across multiple database instances having the same structure but segregated by some logical criteria. For example, if we are in a health organization, it is possible that member (you and me, patients) data for one health plan A is in one database while member data for health plan B is in another database.
There could be many reasons for the above. It is possible that there is just too much data for a single database, possible that health plan A just does not want their data mixed with health plan B or maybe the network latency for Health Plan B to store the data in the same database might be high.
Sharding can be thought of segregating and partitioning the data across multiple databases driven by functional or non-functional forces.
When a data consumer is talking to an architecture where the data is sharded, one typically experiences one of the following cases:
- The application at any time requires interaction with a single shard only.For example, in a health plan application that is supporting Health Plan A and Health Plan B, a query to find members will always be restricted to a particular plan. In such a case a simple routing logic can assist in directing the query to the particular database. I did have a similar challenge in a previous engagement, my favorite framework Spring and the Routing DataSource helped solve the problem. The Blog by Mr.Mark Fisher was the direction that I followed.
- The application needs to interact with data that is part of both databases. For example, to obtain data about members stored in Health Plan A or Health Plan B that match a particular criteria. This requirement is more complicated as one needs to obtain result sets from both databases.
The Requirements: We are of to the world cup of soccer. Every country playing in the game maintains a database of its players. The application we would like to develop would like to query across these different databases. We need to develop an application for FIFA that will allow each country to create and maintain players and obtain information about other players from other countries as well. This database instance is provided for each country but the country does not have direct access to be able to insert data or query data against it but instead must use the FIFA application for all operations.
The Example: Denoting a Player in the space of the application, we define a POJO called NationalPlayer. The NationalPlayer has some interesting attributes,
- First and Last Name of the Player
- Maximum Career Goals scored by the player
- His individual ranking in the world as a player
- The country he plays for
@Entity @Table (name="NATIONAL_PLAYER") public class NationalPlayer { @Id @GeneratedValue(generator="PlayerIdGenerator") @GenericGenerator(name="PlayerIdGenerator", strategy="org.hibernate.shards.id.ShardedUUIDGenerator") @Column(name="PLAYER_ID") private BigInteger id; @Column (name="FIRST_NAME") private String firstName; @Column (name="LAST_NAME") private String lastName; @Column (name="CAREER_GOALS") private int careerGoals; @Column (name="WORLD_RANKING") private int worldRanking; public Country getCountry() { return country; } .... public NationalPlayer withCountry(Country country) { this.country = country; return this; } ...... @Column(name= "COUNTRY", columnDefinition="integer", nullable = true) @Type( type = "com.welflex.hibernate.GenericEnumUserType", parameters = { @Parameter( name = "enumClass", value = "com.welflex.model.Country"), @Parameter( name = "identifierMethod", value = "toInt"), @Parameter( name = "valueOfMethod", value = "fromInt") } ) private Country country; ... .... }We define a Simple Java 5 enum type denoting the different countries that are participating in the World Cup. Sadly, we have only 3, India, Usa and Italy for the 2020 world cup.
public enum Country { INDIA (0), USA(1), ITALY(2); private int code; Country(int code) { this.code = code; } public int toInt() { return code; } ... }The application uses three sharded databases for the three countries participating in the database and each database is defined using hibernate configuration file, hibernate0.cfg.xml for India, hibernate1.cfg.xml for USA and hibernate2.cfg.xml for Italy. To keep the example easy, we have decided that Country code's defined in the Enum map to the hibernate configs. For example, India is Country 0 as denoted by the enum and it maps to shard database 0. One could maintain a map for the same if required.
For example, the hibernate configuration for the Indian database is as follows:
<hibernate-configuration> <session-factory name="HibernateSessionFactory0"> <property name="dialect">org.hibernate.dialect.HSQLDialect</property> <property name="connection.driver_class">org.hsqldb.jdbcDriver</property> <property name="connection.url">jdbc:hsqldb:mem:shard0</property> <property name="connection.username">sa</property> <property name="connection.password"></property> <property name="hibernate.hbm2ddl.auto">update</property> <property name="hibernate.connection.shard_id">0</property> <property name="hibernate.shard.enable_cross_shard_relationship_checks"> true </property> <property name="hibernate.show_sql">true</property> <property name="hibernate.format_sql">true</property> <property name="hibernate.jdbc.batch_size">20</property> </session-factory> </hibernate-configuration>In order to route data to the appropriate database for storage, we define a custom shard selection strategy that uses the country code of the NationalPlayer object to route persistence to a particular database as shown below:
public class ShardSelectionStrategy extends RoundRobinShardSelectionStrategy { public ShardSelectionStrategy(RoundRobinShardLoadBalancer loadBalancer) { super(loadBalancer); } @Override public ShardId selectShardIdForNewObject(Object obj) { if (obj instanceof NationalPlayer) { ShardId id = new ShardId(((NationalPlayer) obj).getCountry().toInt()); return id; } return super.selectShardIdForNewObject(obj); } }To easily work with Hibernate, we have a HibernateUtil class that factories shard sessions and also Sessions to individual databases. How Shards are selected/delegated to can be customized by providing an implementation of the interface ShardStrageyFactory. In the example, we have only chosen to customize the selection strategy.
To test our example, we have some unit tests. The tests will ensure the following are working:
- When Players are persisted, they are stored only in the appropriate database instance. In order to ensure the same, a direct connection to the target sharded database is obtained to ensure the existence of the persisted player. In addition, direct connections to the other databases in the shards are obtained to ensure that the player in question has not been inserted there.
- Queries executed against the shard will obtain data from all the sharded databases accurately. The tests will ensure that all databases in the shard are being accessed.
NationalPlayer indiaPlayer = new NationalPlayer().withCountry(Country.INDIA) .withCareerGoals(100).withFirstName("Sanjay").withLastName("Acharya").withWorldRanking(1); NationalPlayer usaPlayer = new NationalPlayer().withCountry(Country.USA) .withCareerGoals(80).withFirstName("Blake").withLastName("Acharya").withWorldRanking(10); NationalPlayer italyPlayer = new NationalPlayer().withCountry(Country.ITALY) .withCareerGoals(20).withFirstName("Valentino").withLastName("Acharya").withWorldRanking(32);The Persistence test is the following:
@Test public void testShardingPersistence() { BigInteger indiaPlayerId = null; BigInteger usaPlayerId = null; BigInteger italyPlayerId = null; // Save all three players savePlayer(indiaPlayer); savePlayer(usaPlayer); savePlayer(italyPlayer); indiaPlayerId = indiaPlayer.getId(); System.out.println("Indian Player Id:" + indiaPlayerId); usaPlayerId = usaPlayer.getId(); System.out.println("Usa Player Id:" + usaPlayerId); italyPlayerId = italyPlayer.getId(); System.out.println("Italy Player Id:" + italyPlayerId); assertNotNull("Indian Player must have been persisted", getShardPlayer(indiaPlayerId)); assertNotNull("Usa Player must have been persisted", getShardPlayer(usaPlayerId)); assertNotNull("Italy Player must have been persisted", getShardPlayer(italyPlayerId)); // Ensure that the appropriate shards contain the players assertExistsOnlyOnShard("Indian Player should have existed on only shard 0", 0, indiaPlayerId); assertExistsOnlyOnShard("Usa Player should have existed only on shard 1", 1, usaPlayerId); assertExistsOnlyOnShard("Italian Player should have existed only on shard 2", 2, italyPlayerId); }The Simple Criteria based tests are the following:
@Test public void testSimpleCriteira() throws Exception { Session session = HibernateUtil.getSession(); Transaction tx = session.beginTransaction(); try { Criteria c = session.createCriteria(NationalPlayer.class) .add(Restrictions.eq("country", Country.INDIA)); List<Nationalplayer> players = c.list(); assertTrue("Should only return the sole Indian Player", players.size() == 1); assertContainsPlayers(players, indiaPlayer); c = session.createCriteria(NationalPlayer.class).add(Restrictions.gt("careerGoals", 50)); players = c.list(); assertEquals("Should return the usa and india players", 2, players.size()); assertContainsPlayers(players, indiaPlayer, usaPlayer); c = session.createCriteria(NationalPlayer.class) .add(Restrictions.between("worldRanking", 5, 15)); players = c.list(); assertEquals("Should only have the usa player", 1, players.size()); assertContainsPlayers(players, usaPlayer); c = session.createCriteria(NationalPlayer.class) .add(Restrictions.eq("lastName", "Acharya")); players = c.list(); assertEquals("All Players should be found as they have same last name", 3, players.size()); assertContainsPlayers(players, indiaPlayer, usaPlayer, italyPlayer); tx.commit(); } catch (Exception e) { tx.rollback(); throw e; } finally { if (session != null) { session.close(); } } }Running The Example: The example is a Maven 2 project. JDK 1.6.X was used to develop/test. Database of choice for the example is HSQL, why would anyone select Oracle or anything other database :-). One can obtain the project from HERE. Simply run "mvn test" to see the tests being run and/or import the code into Eclipse using Q4E or your favorite maven eclipse plugin. As a note, the examples are simply examples.
The Parting: A requirement to using Hibernate Shards is Java 1.5 or higher. I am not quite sure why this requirement exists as one does not necessarily need to use annotations or java 5 features for hibernate configuration. A todo for me. Session Factory configuration via JPA is apparently not supported as yet. Different ID generation strategies can be used. One interesting problem in sharding is sorting of the results. Hibernate Shards works around the same by insisting that all objects returned by a Criteria query with Order By clause implement the Comparable interface. Sorting will only occur after obtaining the result set from each member of the shard within Hibernate Shards. From the documentation, it appears that "Distinct" clauses are not supported as yet. HQL support is not well supported as yet. For example, I wanted to execute a "delete from NationalPlayer" across the shard, i.e., clear the table on every database. I could not execute the same with an unsupported exception being the result. The documentation recommends staying away if possible. Cross Sharding is not yet supported. What this means is that if Object A is on shard 1 and object B is on shard 2, one cannot create an associated between them. Have not encountered a need for this yet. Links,
Hibernate Shards Documentation
Nice Read on Sharding
Now only if I can get my Jersey example working...