If you automate a mess, you get an automated mess
Hadrian B.S
Story of Happy5
Happy5 is a tiny (small is too big a word) and lean team of 8 engineers, based entirely out of our Jakarta office. I am the DevOps and Infrastructure team, the one and only member. My goal (I can’t say our goal when the team itself is just me) is to architect our infrastructure for robustness and, more importantly, to create a set of best practices for engineers that leads to increased developer productivity.
Happy5 came into existence in early 2013. People have come and gone over time, and the standards have evolved along the way. Our first system, called One on One, was built on top of Parse; I don’t know where the idea came from or why it was named One on One. Then came a surprise: Parse was acquired by Facebook and the service has since been discontinued. Luckily, we had already moved to Ruby on Rails by then.
As the business grew, I was recruited in mid-2017 to fill the SysOps position, because my predecessor was moving on to a bigger company. From my first day at Happy5 there have been many things to maintain, and we did not have good standards. I would like to highlight some important points about having a DevOps culture and give DevOps a formal definition. Hopefully this is applicable to other companies that are roughly the size of Happy5 in terms of scale or engineering resources.
How it should be
DevOps basically comes from the concept of a sysadmin who can code. Infrastructure technology is evolving so fast that everything now runs on top of code, even network routing. So, based on my experience,
DevOps includes —
- Ability to handle and debug any service level issue.
- Architecting services, in line with standards, best practices for reliability and security, scaling and cost optimisations.
- Automation for the infrastructure, from scratch to service (CI/CD tooling).
- Automation for developer productivity.
- Maintaining infrastructure components such as load balancing, DNS, databases, queues etc.
- As a B2B company, we also deal with ISO standards and operational standards, especially in security and data privacy.
I often get asked what I look for in a good DevOps engineer when I interview candidates.
A good DevOps engineer must have the following skills –
- Good problem solving capabilities with a knack for not giving up on tough problems.
- Good understanding of the underlying operating system, network and dependencies.
- Interest in and love for reading man pages and documentation for the open source systems that are used everywhere.
These ideas stem from the SRE (Site Reliability Engineer) role that is popular in larger companies. SRE roles tend to be more embedded with the services as each core service in a large company usually has a dedicated SRE. Given that we have 25–30 microservices, having an SRE for each is not possible. But having a team that can understand common patterns and look at the architecture from an efficiency, performance and cost standpoint is really valuable.
In my opinion, an operations team that is disconnected from the developers is a bad idea. Teams that only do deploys, handle machine-level issues and have no idea what the services running on the machine are doing are not doing what they are capable of.
One of the important things that we focus on is standardisation and best practices. These two terms are commonly used along with DevOps, but it is important to make sure that these ideas, just like other ideas, are only executed to a certain limit. They may have side effects. Before we talk about the benefits (the good parts), let’s talk about the side effects.
The biggest side effect is that enforcing standards too strictly can lead to a lack of innovation. It is not wrong to use standards as a base for reasoning, but they should not limit the scope of thinking. For example, we use Docker with certain standards around how developers should write their Dockerfile, but those standards do not restrict what developers can and cannot run in a container.
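To make the Dockerfile example concrete, here is a minimal sketch of a check that constrains how a Dockerfile is written without saying anything about what runs inside the container. The two rules (an approved base image and a HEALTHCHECK instruction) and the registry prefix `registry.example.internal/base/` are assumptions for illustration, not our actual rule set.

```python
# Hypothetical rule: base images must come from this internal registry path.
APPROVED_BASE_PREFIXES = ("registry.example.internal/base/",)


def check_dockerfile(text):
    """Return a list of violations of the (assumed) Dockerfile standard."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines()]
    # Ignore blank lines and comments.
    instructions = [ln for ln in lines if ln and not ln.startswith("#")]

    from_lines = [ln for ln in instructions if ln.upper().startswith("FROM ")]
    if not from_lines:
        problems.append("missing FROM instruction")
    for ln in from_lines:
        # Skip flags such as --platform=...; the first remaining token is the image.
        tokens = [t for t in ln.split()[1:] if not t.startswith("--")]
        image = tokens[0] if tokens else ""
        if not image.startswith(APPROVED_BASE_PREFIXES):
            problems.append(f"unapproved base image: {image}")

    if not any(ln.upper().startswith("HEALTHCHECK ") for ln in instructions):
        problems.append("missing HEALTHCHECK instruction")
    return problems
```

Note that the check never looks at RUN, CMD or ENTRYPOINT: the standard shapes the file, not the workload. This sketch does not handle multi-stage builds that reference an earlier stage by name.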
Benefits of Standards
- No reinventing the wheel. Standard common libraries.
- Cost savings across the board.
- Defined security plan with regular security checks.
- Element of least surprise while navigating the system.
- Ease of development for developers, common language.
My Future Standards for Happy5
There are some very basic standards pertaining to code at Happy5 that lead to happier results for engineers. We run an engineering onboarding session for new engineers where we explain these standards.
The easiest aspect to standardise is naming (which is said to be one of the hardest problems in computer science), and by standardising it we have taken that pain away from developers.
This is a non-exhaustive list –
- Consistent naming across the board for each service (e.g. service name = repo name = Docker image name = monitoring name).
- No funny names for services or servers. Servers are enumerated along with their function. Service names succinctly describe what the service does.
- Common terminology for all developers which helps in communication.
- Load balancers as the source of all truth for service success rate and health monitoring.
- Only load balancers should have public IP addresses; the rest are interconnected over a private network.
- StatsD protocol for metrics, ELK stack for logging, Amazon SNS for notifications etc.
- Regularly upgrading the Docker base image for security updates.
- Standard provisioning through Ansible — leading to standard kernel versions, packages etc.
- Consul for storing all service configuration.
- Docker container startup script with health checks.
- Services integrated into CI pipeline with ease of writing and running tests.
- Everybody should log to /var/log which has standard logrotate policies.
- Dual-custody passwords, monthly security audits, etc.
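The naming rules in this list are easy to check mechanically. The sketch below assumes each service is described by a small dictionary with `name`, `repo`, `docker_image` and `monitoring` keys, and that servers are named as a function plus a two-digit index (for example `web-01`). Both the dictionary layout and the hostname pattern are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical hostname pattern: function name plus a two-digit index, e.g. "web-01".
SERVER_NAME = re.compile(r"^[a-z]+(-[a-z]+)*-\d{2}$")


def naming_problems(service):
    """Check that repo, Docker image and monitoring names equal the service name."""
    expected = service["name"]
    problems = []
    for key in ("repo", "docker_image", "monitoring"):
        if service[key] != expected:
            problems.append(f"{key} is {service[key]!r}, expected {expected!r}")
    return problems


def is_valid_server_name(hostname):
    """Return True if the hostname is a function name followed by an index."""
    return SERVER_NAME.match(hostname) is not None
```

A check like this could run in the CI pipeline so that a mismatched name is caught before a service is deployed.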
As we continue our DevOps journey, I hope these ideas prevail and we never end up with an automated mess, even as the world keeps evolving.
Thanks for reading!