Moving Microservices from Mesos DCOS to Kubernetes
December 22, 2020
For a few microservices, data transmission and stability were causing issues, specifically when a user tried to fetch a large amount of data over a long duration. Devices were able to push data to the database, but loading and displaying that data caused data loss or service failures. Because of the higher I/O in these microservices, higher CPU and memory usage kept engaging the load balancer, which ultimately led to higher billing.
We decided to redesign our orchestration and find an alternative to Apache Mesos. Docker Swarm and Kubernetes are the leading, most widely used container orchestration tools for DevOps infrastructure management.
Before we explore Docker Swarm and Kubernetes, here is how we were using Mesos.
Mesos provides the ability to run both containerized and non-containerized services in a distributed manner. It is designed as a distributed kernel, so applications can be programmed directly against the datacenter through its APIs. In our case, Mesos DCOS was configured as master/slave, and services were managed based on database requests. On service failure, the Mesos master never restarted services automatically, which increased application downtime.
Challenges with Mesos
The existing infrastructure had frequent service failures, which caused infrastructure unavailability for end users, data loss, and higher AWS billing.
Existing Infrastructure and Orchestration
- Cloud: AWS
- CI/CD: Jenkins
- Programming Language: Python, Java, C, C++, etc.
- Source Code: GitHub
- Deployment strategy: Automation + Manual
- Infrastructure Monitoring: Automation + Manual (validation steps executed at regular intervals)
Current Strategy and Tools
- EC2 Auto Scaling Groups
- Scaling based on CPU usage
- DCOS Microservices on EC2
- Notification on Slack and via call/email
- Other Tools/services: Splunk, Looker, HAProxy, S3, Graphite, Grafana
- CPU usage fluctuates based on customer and product usage
- Frequent failure of services even after auto scaling
- Frequent Downtime
- Frequent patches
- End customers concerned about data loss because of stability and availability issues
- High AWS Billing due to multiple EC2 Instances
Docker Swarm uses the Docker API and networking concepts, so it is easy to configure and use. Its architecture handles failures robustly. New nodes can join an existing cluster as workers or managers. However, Docker Swarm doesn’t allow for integration of third-party logging tools, and easy integration with cloud service providers such as AWS, Azure, and Google Cloud is not available to the extent that it is for Kubernetes.
Kubernetes is easy to configure and lightweight. In case of service failure, Kubernetes performs autoscaling and keeps the service available. Kubernetes is versatile and widely used, and the major cloud providers offer managed master support for it.
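To make the autoscaling behavior concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas * current metric / target metric). The function name, the replica bounds, and the sample numbers are illustrative assumptions, not values from our cluster, and the sketch ignores the autoscaler's tolerance and stabilization settings.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu,
                     min_replicas=1, max_replicas=10):
    """Return the replica count suggested by the HPA scaling formula.

    current_cpu and target_cpu are average CPU utilization percentages.
    min_replicas and max_replicas are hypothetical bounds for illustration.
    """
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    # Documented formula: ceil(currentReplicas * currentMetric / desiredMetric)
    raw = math.ceil(current_replicas * current_cpu / target_cpu)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 3 pods averaging 90% CPU against a 60% target: scale out.
    print(desired_replicas(3, 90, 60))  # prints 5
    # 4 pods averaging 30% CPU against a 60% target: scale in.
    print(desired_replicas(4, 30, 60))  # prints 2
```

With CPU as the metric, sustained load above the target adds pods and load below it removes them, always within the configured bounds.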
As AWS provides a managed Kubernetes master, we decided to go with EKS.
The Amazon EKS pricing model adds a cost of $0.20/hour for each EKS cluster. This made us think, but compared with the benefits, it is not as bad as it sounds: we designed and deployed multiple applications with different namespaces and VPC ranges on a single cluster.
We initiated the process for one cluster, migrated one service, and validated its stability on both Docker Swarm and Amazon EKS. The rest of our infrastructure was already on AWS, and we found that the Docker Swarm configuration would be time-consuming and would require considerable effort to monitor and manage.
With EKS, we received support and guidance from Amazon on designing and deploying services and on reducing costs, so we decided to go with EKS.
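As a rough worked example of that per-cluster fee, the sketch below computes the monthly charge at $0.20/hour and compares one shared cluster with one cluster per application. The 30-day month (720 hours) and the count of four applications are assumptions chosen for illustration; the fee covers the EKS cluster itself and excludes EC2 and other charges.

```python
from decimal import Decimal

HOURLY_FEE = Decimal("0.20")   # per-cluster fee quoted above, in USD
HOURS_PER_MONTH = 24 * 30      # assumption: a 30-day month


def monthly_cluster_fee(clusters, hourly_fee=HOURLY_FEE, hours=HOURS_PER_MONTH):
    """Return the monthly EKS cluster fee in USD for a number of clusters."""
    return hourly_fee * hours * clusters


if __name__ == "__main__":
    apps = 4  # hypothetical number of applications
    print(monthly_cluster_fee(1))     # shared cluster, namespaces per app: 144.00
    print(monthly_cluster_fee(apps))  # one cluster per app: 576.00
```

Separating applications by namespace on a single cluster keeps the fee at one cluster's worth regardless of how many applications are deployed.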
Migrating to Kubernetes from Mesos
For environment creation, mapping, and deployment on EKS we used CloudFormation (YAML) templates.
AWS CloudFormation provides graphical and YAML-based interfaces to create, manage, and modify a large number of AWS resources and to map their dependencies. As CloudFormation is itself an AWS service, new AWS services generally become available to use through it.
Alternatives such as Terraform, which is open source and supports the major cloud platforms for infrastructure as code, are available, but we used CloudFormation because everything we run is on AWS.
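For readers who have not seen one, the sketch below builds a minimal CloudFormation template for an EKS cluster as a Python dictionary and prints it as JSON, which CloudFormation accepts alongside YAML. It is not our production template: the cluster name, role ARN, and subnet IDs are placeholders, and a real deployment would also need the VPC, IAM role, and worker node resources.

```python
import json


def eks_cluster_template(cluster_name, role_arn, subnet_ids):
    """Return a minimal CloudFormation template describing one EKS cluster."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal EKS cluster (illustrative sketch)",
        "Resources": {
            # "EksCluster" is an arbitrary logical ID chosen for this example.
            "EksCluster": {
                "Type": "AWS::EKS::Cluster",
                "Properties": {
                    "Name": cluster_name,
                    "RoleArn": role_arn,
                    "ResourcesVpcConfig": {"SubnetIds": list(subnet_ids)},
                },
            }
        },
    }


if __name__ == "__main__":
    template = eks_cluster_template(
        cluster_name="example-cluster",  # placeholder
        role_arn="arn:aws:iam::123456789012:role/example-eks-role",  # placeholder
        subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholders
    )
    print(json.dumps(template, indent=2))
```

Keeping the template in version control alongside the service code lets the same environment be recreated or modified through a reviewed change.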
How EKS Helped
- AWS billing can be reduced by using EKS
- Fewer EC2 instances
- Auto scaling using EKS
- EKS monitoring services and alerts services
- Reduced EC2 instances from 15 medium to 3 large
- Removed Graphite
- Autoscaling using EKS
- Reduced Datadog and PagerDuty alert configuration cost and complexity
- Prometheus + Grafana based Alert configuration
We configured Datadog as an extension of CloudWatch for monitoring EC2 instances and connected AWS services. We installed the Datadog Agent on the instances to collect system-level metrics (memory, CPU, storage, disk I/O, network, etc.) at 15-second intervals.
For additional alerting and monitoring of the Kubernetes cluster, we configured Prometheus + Grafana.
Prometheus captures and retains data from pods, containers, systemd services, etc. We can use this data to analyze the stability and behavior of the application and environment.
Grafana uses the data stored by Prometheus to present statistics graphically and to configure alerts for easy assessment.
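As an illustration of how the data held by Prometheus can be analyzed outside Grafana, the sketch below builds a URL for the Prometheus instant-query HTTP endpoint (/api/v1/query) and extracts per-pod values from a response in the documented instant-vector format. The PromQL query, host name, pod names, threshold, and sample response are all made up for the example; they are not taken from our environment.

```python
from urllib.parse import urlencode

# Illustrative PromQL: per-pod CPU usage rate over 5 minutes (an assumption).
EXAMPLE_QUERY = "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))"


def instant_query_url(base_url, promql):
    """Build the URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def values_by_label(payload, label):
    """Map a label value (e.g. a pod name) to its sample value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [timestamp, "value"]}.
    return {s["metric"].get(label, ""): float(s["value"][1]) for s in data["result"]}


def over_threshold(values, threshold):
    """Return the sorted names whose value exceeds the threshold."""
    return sorted(name for name, value in values.items() if value > threshold)


# Made-up response in the instant-vector shape, used instead of a live server.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "api-7d9f"}, "value": [1608624000, "0.25"]},
            {"metric": {"pod": "worker-5c2a"}, "value": [1608624000, "1.5"]},
        ],
    },
}

if __name__ == "__main__":
    print(instant_query_url("http://prometheus.example:9090", EXAMPLE_QUERY))
    usage = values_by_label(SAMPLE_RESPONSE, "pod")
    print(over_threshold(usage, threshold=1.0))  # prints ['worker-5c2a']
```

The same parsing could feed a validation script or a report; in practice Grafana alert rules cover the routine cases.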
Post Migration Best Practices
- Maintain MTTR (mean time to respond/resolve)
- List critical conditions and report them
- Immediate actions
- Incident Reporting
- Root cause analysis
- Continuous improvement of defined processes
Strategy to Achieve
- Perform validation steps at regular intervals
- Debug when unexpected behavior is observed
- Follow the defined runbook steps
- Call or email the Dev support team if not resolved in the stipulated time
- Restart services if needed after capturing logs of the existing failure
- Continuous execution of defined validation tools using Jenkins + Selenium/Dynatrace
- Enhance validation step coverage of the Python scripts
- Notification on Slack channel
- Email if not resolved within 15 min
- Escalate to Level 4 if not resolved within one hour
- Escalate to Level 5 if not resolved
- Get Environment up and Running
- Observe the environment for a few hours
- Create a root cause analysis document
- Get the Dev team's approval of the identified root cause
- Gather resolution information from the Dev team
- Gather immediate actions to take if the same root cause is observed in the future, to minimize downtime
- Update runbook for future reference
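The notification and escalation steps listed above can be sketched as a small lookup on how long an incident has been open. The 15-minute and one-hour thresholds come from the list; the list gives no time limit for Level 5, so the caller must supply one, and the function and parameter names are hypothetical.

```python
def escalation_actions(minutes_unresolved, level5_after_minutes=None):
    """Return the actions that are due for an incident open this long.

    level5_after_minutes has no default because the runbook above states no
    time limit for Level 5; pass one to include that step.
    """
    if minutes_unresolved < 0:
        raise ValueError("minutes_unresolved must not be negative")
    actions = ["notify Slack channel"]           # always the first step
    if minutes_unresolved >= 15:
        actions.append("send email")             # not resolved within 15 min
    if minutes_unresolved >= 60:
        actions.append("escalate to Level 4")    # not resolved within one hour
    if level5_after_minutes is not None and minutes_unresolved >= level5_after_minutes:
        actions.append("escalate to Level 5")    # caller-supplied threshold
    return actions


if __name__ == "__main__":
    for minutes in (5, 20, 75):
        print(minutes, escalation_actions(minutes))
```

Encoding the thresholds this way lets a scheduled job, for example in Jenkins, apply them consistently instead of relying on someone watching the clock.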
Benefits and Applications
- AWS billing reduced by ~40% in our case as the EC2 count dropped from 15 to 3
- Automatic service restarts based on the scaling configuration improved application availability
- Data loss and end-customer escalations reduced
- More advanced monitoring, which helped DevOps engineers identify root causes quickly
In conclusion, EKS proved more helpful in our case: our application became more stable after the orchestration change. With EKS, we observed service stability, auto scaling, and load balancing, which helped us retain product availability. It is also true that both Kubernetes and Mesos provide facilities for deploying applications as containers on the cloud. Depending on application needs, the right solution may vary.
About the Author
Maulik N Shah works as a Technical Leader at eInfochips, An Arrow Company. He is a Certified Kubernetes Administrator, AWS Certified Solutions Architect - Associate, and Certified Professional Scrum Master. He enjoys listening to music, reading technical content, and learning new tools.