Moving Microservices from Mesos DCOS to Kubernetes

December 22, 2020

For a few microservices, data transmission and stability were causing issues, specifically when a user tried to fetch a large volume of data over a long duration. Devices were able to push data to the database, but loading and displaying that data caused data loss or service failures. Because of the high I/O of these microservices, higher CPU and memory usage was triggering the load balancer and ultimately causing higher billing.

We decided to redesign our orchestration and find an alternative to Apache Mesos. Docker Swarm and Kubernetes are the leading, widely used container orchestration tools for DevOps infrastructure management.

Before we explore Docker Swarm and Kubernetes, we describe how we were using Mesos.

Apache Mesos

Apache Mesos provides the ability to run both containerized and non-containerized services in a distributed manner. Mesos is designed as a distributed systems kernel, so applications can be programmed directly against the datacenter through its APIs. In our case, Mesos DCOS was configured in a master/slave arrangement, with services managed based on database requests. On service failure, the Mesos master in our setup never restarted services automatically, which increased application downtime.

Challenges with Mesos

The existing infrastructure had frequent service failures that made it unavailable to end users, caused data loss, and increased AWS billing.

Existing Infrastructure and Orchestration

Cloud: AWS
CI/CD: Jenkins
Programming Languages: Python, Java, C, C++, etc.
Source Code: GitHub
Deployment Strategy: Automation + Manual
Infrastructure Monitoring: Automation + Manual (validation steps executed at regular intervals)

Current Strategy and Tools

EC2 Auto Scaling Groups
Scaling based on CPU usage
DCOS Microservices on EC2
Notifications on Slack and via call/email
Other tools/services: Splunk, Looker, HAProxy, S3, Graphite, Grafana

Challenges

CPU usage fluctuates based on customer and product usage
Frequent failure of services even after auto scaling
Frequent Downtime
Frequent patches
End customers concerned about data loss because of stability and availability issues
High AWS Billing due to multiple EC2 Instances

Docker Swarm

Docker Swarm uses the Docker API and Docker's networking concepts, so it is easy to configure and use. Its architecture handles failures robustly. In Docker Swarm, new nodes can join an existing cluster as workers or managers. Docker Swarm does not offer the same level of built-in integration with third-party logging tools, and compared to Kubernetes it lacks easy integration with cloud service providers such as AWS, Azure, and Google Cloud.

Kubernetes

Kubernetes is straightforward to configure and lightweight. In case of service failure, Kubernetes restarts or reschedules the failed containers and, with autoscaling configured, keeps the service available. Kubernetes is versatile and widely used. Major cloud providers offer a managed Kubernetes master (control plane).
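To make the autoscaling behavior more concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler (HPA). It assumes CPU-based scaling through the HPA, which this article does not confirm for our setup; the function name and the sample numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the documented HPA scaling rule.

    desired = ceil(current_replicas * current_metric / target_metric)
    No change is made while the ratio stays within the tolerance band.
    """
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: leave as-is
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Hypothetical numbers: 3 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(3, 90.0, 60.0))
```

The tolerance default of 0.1 mirrors the documented default of the controller; a real cluster also applies minimum and maximum replica bounds, which this sketch leaves out.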

As AWS provides a managed Kubernetes master through Amazon EKS, we decided to go with EKS.

The Amazon EKS pricing model charged an additional $0.20 per hour for each EKS cluster (AWS later reduced this to $0.10 per hour). This gave us pause, but weighed against the benefits it is not as bad as it sounds: we designed and deployed multiple applications with different namespaces and VPC ranges on a single cluster, so one cluster fee covers all of them.
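As a rough worked example of that per-cluster fee, the sketch below turns the $0.20 hourly rate quoted above into a monthly figure and shows how it is shared when several applications run on one cluster. The 730-hour month and the count of four applications are assumptions for illustration, not figures from our deployment.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def monthly_cluster_fee(rate_per_hour: float, clusters: int = 1,
                        hours: int = HOURS_PER_MONTH) -> float:
    """Control-plane fee per month for the given number of EKS clusters."""
    return round(rate_per_hour * hours * clusters, 2)


def fee_per_application(rate_per_hour: float, applications: int,
                        hours: int = HOURS_PER_MONTH) -> float:
    """Share of one cluster's monthly fee when applications share it."""
    if applications < 1:
        raise ValueError("at least one application is required")
    return round(rate_per_hour * hours / applications, 2)


if __name__ == "__main__":
    print(monthly_cluster_fee(0.20))      # fee for a single cluster
    print(fee_per_application(0.20, 4))   # hypothetical: 4 apps, 1 cluster
```

The fee covers only the managed control plane; worker node (EC2) costs come on top and are not modeled here.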

We initiated the process for one cluster, migrated one service, and validated stability on both Docker Swarm and Amazon EKS. The rest of our infrastructure was already on AWS, and we found that the Docker Swarm configuration would be time-consuming and would take considerable effort to monitor and manage.

With EKS, we received support and guidance from Amazon on designing and deploying services and on reducing costs. Hence, we decided to go with EKS.

Migrating to Kubernetes from Mesos

For environment creation, mapping, and deployment on EKS we used CloudFormation (YAML) templates.

CloudFormation

AWS CloudFormation provides a graphical designer and a YAML-based template interface to create, manage, and modify a large number of AWS resources, in addition to mapping their dependencies. As CloudFormation is a service from AWS, new AWS services generally become available in it.

Options such as Terraform, which is open source and supports the major cloud platforms for infrastructure as code, are available, but we used CloudFormation as we have everything on AWS.
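A minimal sketch of the kind of template described above, built as a Python dictionary and printed as JSON, which CloudFormation accepts alongside YAML. The resource types AWS::EKS::Cluster and AWS::EKS::Nodegroup are real CloudFormation types, but the function, logical IDs, parameters, cluster name, and node counts are hypothetical and are not our actual templates.

```python
import json


def eks_template(cluster_name: str, min_nodes: int, max_nodes: int) -> dict:
    """Build a small CloudFormation template: one EKS cluster, one node group."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative EKS cluster with one managed node group",
        "Parameters": {
            # Supplied at stack creation; the IAM roles and subnets must exist.
            "ClusterRoleArn": {"Type": "String"},
            "NodeRoleArn": {"Type": "String"},
            "SubnetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
        },
        "Resources": {
            "Cluster": {
                "Type": "AWS::EKS::Cluster",
                "Properties": {
                    "Name": cluster_name,
                    "RoleArn": {"Ref": "ClusterRoleArn"},
                    "ResourcesVpcConfig": {"SubnetIds": {"Ref": "SubnetIds"}},
                },
            },
            "NodeGroup": {
                "Type": "AWS::EKS::Nodegroup",
                "Properties": {
                    # Ref to the cluster also records the dependency, so the
                    # node group is created only after the cluster exists.
                    "ClusterName": {"Ref": "Cluster"},
                    "NodeRole": {"Ref": "NodeRoleArn"},
                    "Subnets": {"Ref": "SubnetIds"},
                    "ScalingConfig": {
                        "MinSize": min_nodes,
                        "DesiredSize": min_nodes,
                        "MaxSize": max_nodes,
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(eks_template("demo-cluster", 3, 6), indent=2))
```

Generating the template from code keeps environment differences (names, node counts) in one place; the printed JSON can be saved to a file and deployed like any hand-written template.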

How EKS Helped

AWS billing can be reduced by using EKS
Fewer EC2 instances
Auto scaling using EKS
EKS monitoring and alerting services

New Infrastructure

Reduced EC2 instances from 15 medium to 3 large
Removed Graphite
Autoscaling using EKS
Reduced Datadog and PagerDuty alert configuration cost and complexity
Prometheus + Grafana based alert configuration

Datadog

We configured Datadog as an extension of CloudWatch for monitoring EC2 instances and connected AWS services. We installed the Datadog Agent on the instances to collect system-level metrics (memory, CPU, storage, disk I/O, network, etc.) at 15-second intervals.

For additional alerting and monitoring of the Kubernetes cluster, we configured Prometheus + Grafana.

Prometheus captures and retains metrics for pods, containers, systemd services, etc. We can use this data to analyze the stability and behavior of the application and environment.

Grafana uses the data stored by Prometheus to present statistics graphically and to configure alerts, which makes assessment easier.
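As an illustration of how the data stored by Prometheus can be read back for stability analysis, the sketch below builds an instant-query URL for the Prometheus HTTP API (/api/v1/query) and summarizes a response in that API's JSON format. The Prometheus address, the pod names, and the use of the kube-state-metrics metric kube_pod_container_status_restarts_total are assumptions; the article does not state which metrics we queried.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def restarts_by_pod(response_body: str) -> dict:
    """Map pod name -> restart count from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    counts = {}
    for sample in payload["data"]["result"]:
        pod = sample["metric"].get("pod", "<unknown>")
        _timestamp, value = sample["value"]  # the value arrives as a string
        counts[pod] = counts.get(pod, 0.0) + float(value)
    return counts


# Hypothetical response body, shaped like a real instant-query result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "api-0"}, "value": [1608595200, "2"]},
            {"metric": {"pod": "worker-0"}, "value": [1608595200, "0"]},
        ],
    },
})

if __name__ == "__main__":
    query = "sum by (pod) (kube_pod_container_status_restarts_total)"
    print(instant_query_url("http://prometheus:9090", query))
    print(restarts_by_pod(SAMPLE_RESPONSE))
```

A pod whose restart count keeps climbing is a natural candidate for a Grafana alert; the same PromQL expression can be reused in a dashboard panel.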

Post Migration Best Practices

Maintain MTTR (Mean Time to Respond/Resolve)
List critical conditions and report them
Immediate actions
Incident reporting
Root cause analysis
Continuous improvement of defined processes
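To track the MTTR item in the list above, a team needs only the detection and resolution time of each incident. The following is a minimal sketch of that calculation; the function name and the incident timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def mean_time_to_resolve(
        incidents: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all recorded incidents."""
    if not incidents:
        raise ValueError("no incidents recorded")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Hypothetical incident log: (detected, resolved) pairs.
    log = [
        (datetime(2020, 11, 2, 9, 0), datetime(2020, 11, 2, 9, 30)),
        (datetime(2020, 11, 9, 14, 0), datetime(2020, 11, 9, 15, 30)),
    ]
    print(mean_time_to_resolve(log))
```

Recomputing this figure after each incident review shows whether runbook updates are actually shortening recovery.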

Strategy to Achieve

Manually

Perform validation steps at regular intervals
Debug when unexpected behavior is observed
Follow the defined runbook steps
Call or email the Dev support team if not resolved in the stipulated time
Restart services if needed, after capturing logs of the existing failure

Automation Utilities

Continuous execution of defined validation tools using Jenkins + Selenium/Dynatrace
Expanding the validation coverage of Python scripts
Notifications on a Slack channel
PagerDuty
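The Slack notification step above can be sketched with the standard library alone. Slack incoming webhooks accept an HTTP POST with a JSON body containing a "text" field; the webhook URL, service name, and message format below are placeholders, not our real configuration.

```python
import json
import urllib.request


def slack_request(webhook_url: str, service: str,
                  detail: str) -> urllib.request.Request:
    """Prepare (but do not send) a Slack incoming-webhook notification."""
    body = json.dumps({"text": f"[ALERT] {service}: {detail}"}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL: replace with the webhook created for your channel.
    req = slack_request("https://hooks.slack.com/services/T000/B000/XXXX",
                        "data-loader", "validation step failed")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it lets a Jenkins validation job log exactly what would be posted before the webhook is wired in.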

Actions

Email if not resolved within 15 min
Escalate to Level 4 if not resolved within one hour
Escalate to Level 5 if not resolved
Get the environment up and running
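The escalation rules above can be expressed as a small function that an alerting script might call. The 15-minute and one-hour thresholds come from the list; the article gives no time limit for Level 5, so that threshold is left as a required, hypothetical parameter rather than guessed.

```python
def escalation_steps(minutes_unresolved: float,
                     level5_after_minutes: float) -> list:
    """Actions due for an incident that has been open this many minutes.

    level5_after_minutes is an assumption supplied by the caller; the
    source only says to escalate to Level 5 "if not resolved".
    """
    if level5_after_minutes <= 60:
        raise ValueError("Level 5 must come after the one-hour Level 4 step")
    steps = []
    if minutes_unresolved >= 15:
        steps.append("email")
    if minutes_unresolved >= 60:
        steps.append("escalate to Level 4")
    if minutes_unresolved >= level5_after_minutes:
        steps.append("escalate to Level 5")
    return steps


if __name__ == "__main__":
    # Hypothetical: Level 5 after two hours without resolution.
    for minutes in (10, 20, 75, 130):
        print(minutes, escalation_steps(minutes, level5_after_minutes=120))
```

Keeping the thresholds in one function makes it easy to keep the runbook and the automation in agreement when the policy changes.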

Best Practices

Observe the environment for a few hours
Create a root cause analysis document
Get the Dev team's approval of the root cause analysis
Gather resolution information from the Dev team
Document immediate actions to take if the same root cause recurs, to minimize downtime
Update runbook for future reference

Benefits and Applications

AWS billing dropped by ~40% in our case as the EC2 instance count fell from 15 to 3
Automatic service restarts based on the scaling configuration improved application availability
Data loss and end-customer escalations were reduced
More advanced monitoring helped DevOps engineers identify root causes quickly

Conclusion

In our case, EKS proved more helpful: our application became more stable after the orchestration change. With EKS, we observed service stability, auto scaling, and load balancing, which helped us maintain product availability. That said, both Kubernetes and Mesos can deploy applications as containers on the cloud, and the right solution depends on the application's needs.

About the Author

Maulik N Shah is a Technical Leader at eInfochips, an Arrow company. He is a Certified Kubernetes Administrator, AWS Certified Solutions Architect - Associate, and Certified Professional Scrum Master. He enjoys listening to music, reading technical content, and learning new tools.

Embedded Computing Design
