Introduction:
In any distributed architecture, failure of services is inevitable. Failures could be due to network issues, bugs in code, unexpected load, failing dependencies, and so on. When the failure of a single service starts affecting the availability of other services, that service becomes a single point of failure for the entire dependency graph associated with it. Hystrix is a library developed and subsequently open sourced by Netflix that is geared toward making a distributed architecture resilient while addressing some of the above concerns.
The creators describe Hystrix as follows: "Hystrix is a library designed to control the interactions between these distributed services providing greater latency and fault tolerance. Hystrix does this by isolating points of access between the services, stopping cascading failures across them, and providing fallback options, all of which improve the system's overall resiliency."
I could talk more about Hystrix, but reading their excellent wiki is definitely the more DRY route. My goal in this blog is simply to have fun while providing you with an example of Hystrix usage to play around with :-)
Fictional Case Study
Developers at company Acme had built a GreetingService and had also provided a client for consuming it. The client looked like:
    public class GreetingJerseyClient implements GreetingClient {
        private final String baseUri;
        private final Client client;

        public GreetingJerseyClient(String baseUri) {
            this.baseUri = baseUri;
            client = ClientFactory.newClient();
        }

        /**
         * @return a Greeting in the specific language, for example "Hello" or "Hola"
         */
        @Override
        public String getGreeting(String languageCode) {
            ....
        }

        /**
         * @return a Greeting for the individual in the specified language, for example "Hello Sanjay", "Hola Sanjay".
         * @throws InvalidNameException if an invalid name, for example null, was provided.
         */
        @Override
        public String greet(String name, String languageCode) throws InvalidNameException {
            ...
        }
    }
This GreetingService was a core service used by all their web applications and, in some cases, consumed by other services as well. The GreetingService itself was a beast: load tests showed that requests could take up to 1 second to respond at times (probably full GCs kicking in). Accordingly, the developers configured the read timeout for the connections to be 1 second so that any request taking longer would time out. Upon deployment all looked good for some time, but then the service started degrading, i.e., it was not the occasional request that took a second to respond but all requests in a particular time window that exhibited latency or failure. This resulted in all the application container threads blocking, leading to a denial of service.
Bosses were mad, stakeholders took to narcotics, stock values plummeted.
A postmortem of the incident was held. The cause was not a single problem but several things happening at once, a perfect storm if you will: a bug was introduced that increased response times, their load balancer ran into unexpected issues, and a data center technician spilled his drink on some servers of the cluster. Phew. Results of the postmortem were:
- Provide a default greeting for the getGreeting() call in the event the Greeting Service is experiencing degradation.
- Deploy the Greeting Service with a cloud provider and pay only for usage ($$$$$) in the event of failure of their existing measly data center. Only the greet(name, languageCode) call would access that.
- Provide a transparent means of switching to the cloud service in the event of degradation of the in-house Greeting Service, and switch back when the in-house Greeting Service is responsive again.
- Be able to track and monitor service degradation so the network operations folk could react quickly.
The following is the enhancement they made to provide a default greeting for the getGreeting() call in the event of service degradation:
    public class HystrixGreetingClient implements GreetingClient {
        private final GreetingClient client;

        public HystrixGreetingClient(GreetingClient client) {
            this.client = client;
        }

        @Override
        public String getGreeting(String languageCode) {
            return new GetGreetingHystrixCommand(languageCode).execute();
        }

        // greet(name, languageCode) is covered in the next listing

        // Default Greeting Agreed Upon
        public static final String DEFAULT_GREETING = "Hello!";

        class GetGreetingHystrixCommand extends HystrixCommand<String> {
            private final String languageCode;

            public GetGreetingHystrixCommand(String languageCode) {
                super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("GreetingClient")));
                this.languageCode = languageCode;
            }

            @Override
            protected String run() throws Exception {
                // Attempt to obtain greeting from the service
                return client.getGreeting(languageCode);
            }

            @Override
            protected String getFallback() {
                // Fallback provides a default greeting
                return DEFAULT_GREETING;
            }
        }
    }
The GetGreetingHystrixCommand is a HystrixCommand which will invoke the fallback greeting if Hystrix detects service degradation. For the greet(name, languageCode) operation, the goal was to invoke the cloud-deployed Greeting Service when the in-house version degrades, so the developers provided a fallback client that could be invoked as shown below:
    public class HystrixGreetingClient implements GreetingClient {
        private final GreetingClient primary;
        private final GreetingClient fallback;

        /**
         * @param primary The primary client which communicates with the in-house data center
         * @param fallback The fallback client which communicates with the $$$$$ cloud provider
         */
        public HystrixGreetingClient(GreetingClient primary, GreetingClient fallback) {
            this.primary = primary;
            this.fallback = fallback;
        }

        @Override
        public String getGreeting(String languageCode) {
            ....
        }

        @Override
        public String greet(String name, String languageCode) throws InvalidNameException {
            try {
                return new GreetHystrixCommand(name, languageCode).execute();
            } catch (HystrixBadRequestException e) {
                if (e.getCause() instanceof InvalidNameException) {
                    throw InvalidNameException.class.cast(e.getCause());
                }
                throw e;
            }
        }

        class GreetHystrixCommand extends HystrixCommand<String> {
            private final String name;
            private final String languageCode;

            public GreetHystrixCommand(String name, String languageCode) {
                super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("GreetingClient"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionIsolationThreadTimeoutInMilliseconds(500)));
                this.name = name;
                this.languageCode = languageCode;
            }

            @Override
            protected String run() throws Exception {
                try {
                    // Attempt to invoke the primary service
                    return primary.greet(name, languageCode);
                } catch (InvalidNameException e) {
                    // Throw a HystrixBadRequestException as this is not a network or service issue
                    throw new HystrixBadRequestException("Invalid Name", e);
                }
            }

            @Override
            protected String getFallback() {
                // If primary failed or was short circuited, invoke the secondary
                return new GreetHystrixFallbackCommand().execute();
            }

            /**
             * Fallback command designed to talk to the cloud provider service
             */
            class GreetHystrixFallbackCommand extends HystrixCommand<String> {
                public GreetHystrixFallbackCommand() {
                    super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("GreetingClient")));
                }

                /**
                 * There is no fallback for this call. Fail fast.
                 */
                @Override
                protected String run() throws Exception {
                    try {
                        return fallback.greet(name, languageCode);
                    } catch (InvalidNameException e) {
                        // This is a user or programming error, not a network or service problem
                        // so throw a HystrixBadRequestException
                        throw new HystrixBadRequestException("Invalid Name", e);
                    } catch (RuntimeException e) {
                        // Demonstrating that all other exceptions are thrown
                        throw e;
                    }
                }
            }
        }
    }

When the call to greet(name, languageCode) experiences service degradation, the equivalent call on the fallback cloud provider is invoked. The cloud provider call is also wrapped in a HystrixCommand so that it too is subject to SLA constraints. If the fallback provider cannot fulfill the request, there is no further fallback and the call fails fast. One thing to note is that if the name provided is invalid, for example null, the service throws an InvalidNameException. That should neither trigger the fallback nor count toward the short-circuiting metrics. For this reason a HystrixBadRequestException is thrown, which does not result in the fallback being invoked and does not contribute to the failure metrics.
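To make the bad-request rule concrete, here is a minimal plain-Java sketch of the idea. It does not use Hystrix at all: MiniCommand and BadRequestException are hypothetical names invented for this illustration, and the real library does far more (thread isolation, timeouts, rolling metrics). The only point is that a caller error is rethrown untouched, while any other failure is counted and answered with the fallback.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical marker for caller errors; it stands in for HystrixBadRequestException. */
class BadRequestException extends RuntimeException {
    BadRequestException(String message) {
        super(message);
    }
}

/** Hypothetical, heavily simplified command wrapper; this is not Hystrix's implementation. */
class MiniCommand {
    private final AtomicInteger failures = new AtomicInteger();

    String execute(Supplier<String> run, Supplier<String> fallback) {
        try {
            return run.get();
        } catch (BadRequestException e) {
            // Caller error: rethrow as-is, skip the fallback, leave the failure count alone
            throw e;
        } catch (RuntimeException e) {
            // Service or network problem: record the failure and degrade gracefully
            failures.incrementAndGet();
            return fallback.get();
        }
    }

    int failureCount() {
        return failures.get();
    }
}
```

Calling execute with a supplier that throws an IllegalStateException returns the fallback value and bumps failureCount() by one; calling it with a supplier that throws BadRequestException propagates the exception and leaves the count unchanged.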
Results of their efforts
Once the new client was released, every consumer that adopted it was able to make their application resilient to failure of the dreaded Greeting Service :-) Logs looked like:
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.getFallback(62) | Fallback default greeting is being provided
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.getFallback(62) | Fallback default greeting is being provided
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.getFallback(62) | Fallback default greeting is being provided
    .. More of the above
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.getFallback(62) | Fallback default greeting is being provided
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.run(56) | Obtained Greeting from Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.getFallback(62) | Fallback default greeting is being provided
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.getFallback(62) | Fallback default greeting is being provided
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.getFallback(62) | Fallback default greeting is being provided
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.run(56) | Obtained Greeting from Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GetGreetingHystrixCommand.run(56) | Obtained Greeting from Service
    ....
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand.run(82) | Obtaining greeting from Primary service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand.run(82) | Obtaining greeting from Primary service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand.run(82) | Obtaining greeting from Primary service
    .... More of the above
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(107) | Obtaining greeting from Secondary Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(107) | Obtaining greeting from Secondary Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand.run(84) | Obtained greeting from Primary service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(109) | Obtained greeting from Secondary Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(109) | Obtained greeting from Secondary Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand.run(82) | Obtaining greeting from Primary service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(109) | Obtained greeting from Secondary Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(109) | Obtained greeting from Secondary Service
    ERROR - com.netflix.hystrix.HystrixCommand.executeCommand(812) | Error executing HystrixCommand
    java.lang.RuntimeException: Unable to greet:500
        at com.welflex.example.client.GreetingJerseyClient.greet(GreetingJerseyClient.java:57)
        at com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand.run(HystrixGreetingClient.java:83)
        at com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand.run(HystrixGreetingClient.java:67)
        at com.netflix.hystrix.HystrixCommand.executeCommand(HystrixCommand.java:764)
        at com.netflix.hystrix.HystrixCommand.access$1400(HystrixCommand.java:81)
        at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:706)
        at com.netflix.hystrix.strategy.concurrency.HystrixContextCallable.call(HystrixContextCallable.java:45)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(107) | Obtaining greeting from Secondary Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(107) | Obtaining greeting from Secondary Service
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand$GreetHystrixFallbackCommand.run(107) | Obtaining greeting from Secondary Service
    ..More of the above
    INFO - com.welflex.example.client.HystrixGreetingClient$GreetHystrixCommand.run(84) | Obtained greeting from Primary service
The getGreeting() call falls back to the agreed-upon default greeting when the primary service shows degradation. On the call to greet(), the cloud service is invoked when the primary service shows degradation, and calls revert to the primary service when its health improves.
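The switch to the secondary and back again is the work of a circuit breaker. Below is a small plain-Java sketch of that behaviour, written without Hystrix so it can be read in isolation. MiniCircuitBreaker and its parameters are hypothetical names for this illustration, and tripping on consecutive failures is a simplification I chose for brevity; as I understand it, Hystrix itself decides based on the error percentage over a rolling window and lets a trial request through after a sleep window.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker; this is not Hystrix's algorithm. */
class MiniCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before the circuit opens
    private final long sleepWindowMillis; // how long to stay open before a trial call
    private final LongSupplier clock;     // injected so the sketch can be tested without sleeping

    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    MiniCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    // synchronized for brevity only; a real breaker would not hold a lock across a remote call
    synchronized String call(Supplier<String> primary, Supplier<String> secondary) {
        if (open && clock.getAsLong() - openedAt < sleepWindowMillis) {
            // Circuit is open: skip the primary entirely
            return secondary.get();
        }
        try {
            // Either a normal call, or the trial call once the sleep window has passed
            String result = primary.get();
            consecutiveFailures = 0;
            open = false; // primary is healthy again, so close the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true;
                openedAt = clock.getAsLong(); // start, or restart, the sleep window
            }
            return secondary.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

With a threshold of 2 and a 5 second sleep window, two failing primary calls open the circuit, calls during the next 5 seconds go straight to the secondary, and the first call after that is tried against the primary again; if it succeeds the circuit closes, which matches the pattern in the logs above.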
Monitoring
For monitoring, the company set up all web applications using the Greeting Service with the Hystrix Event Stream Servlet so that statistics and metrics on how the commands were performing could be obtained from each deployment of the application. They aggregated these metrics using Turbine and finally had a pretty Hystrix Dashboard at which the Network Operations folk spent hours admiring the animated graph.
An Example
As always, herewith is a Maven example of the above GreetingWebService and Hystrix for download. It is a simple Jersey 2.0 application with a test that demonstrates fallback and fail fast. It does not exercise all the configuration that Hystrix provides; that's for you to have fun with. You can tweak the REST resources to include further delays, or tweak the Hystrix commands with different SLAs. Simply run the following command and witness a simple test demonstrating the functionality:
>mvn install
Conclusion
Why would you want to adopt Hystrix? I ask, why not? Do you believe your network to be infallible, your software bug free, your availability beyond question? Yes, there is a cost: the extra code needed to wrap network calls in Hystrix commands. But compared to the cost of your leaders taking to narcotics or, more to the point, your site being non-responsive, it might be a small price to pay. Adoption is not easy IMHO. Care must be taken to understand the exact SLA parameters with which to configure a Hystrix command, as getting them wrong might hurt rather than alleviate a problem. For this reason a gradual adoption is the preferred route.
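One way to sanity check a timeout before committing to it is to replay latencies you have actually observed against the candidate value. The helper below is a hypothetical plain-Java sketch of that arithmetic; TimeoutBudget is a name made up for this post and is not part of Hystrix.

```java
/** Hypothetical helper for reasoning about a candidate timeout; not part of Hystrix. */
class TimeoutBudget {

    /** Returns the fraction of the observed calls that would have exceeded the given timeout. */
    static double fractionTimedOut(long[] latenciesMillis, long timeoutMillis) {
        if (latenciesMillis.length == 0) {
            return 0.0; // nothing observed, nothing would time out
        }
        int timedOut = 0;
        for (long latency : latenciesMillis) {
            if (latency > timeoutMillis) {
                timedOut++;
            }
        }
        return (double) timedOut / latenciesMillis.length;
    }
}
```

With an invented sample of 120, 180, 250, 400 and 950 milliseconds, a 500 millisecond timeout cuts off one call in five, while a 200 millisecond timeout cuts off three in five and would send most traffic to the fallback even when the primary is perfectly healthy.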
I have taken the liberty of creating a HystrixClient that is only partially configurable. I do not believe this to be the right approach and feel that a separate client that allows full configuration of the HystrixCommand by a service consumer is the correct approach.
As always, I might have got things wrong, in which case I would love to learn the error of my ways :-) In the end, remember it's Hystrix, not Hysterics, as I often confused it to be. Enjoy!
All images linked within this blog are NOT my own. I am simply referring to them for demonstration purposes. If owners have concerns, I will gladly remove them.
Update
The video below is me in 2015 presenting at Overstock.com on the benefits of using Hystrix for architectural resiliency. Since then, Overstock.com has embraced the concept and technology and grown to be a highly resilient web site. My content is inspired by many others on the subject, from SlideShare and elsewhere. I am sharing this, with permission from Overstock.com, in case it might benefit others.