In the last 15 years of designing highly scalable and distributed systems for big companies like Naukri.com, Yahoo, and Booking.com, among others, I have seen technologies and architectures evolve. From setting up dedicated servers and managing them ourselves to deploying services in the cloud, backed by Docker and Kubernetes, in a single click, we have come a long way.
Although a lot has changed in the tech landscape over the last 15 years, the basics of building a highly scalable system remain the same. And that is what this post is about.
Below are my own learnings, as well as those I have picked up from talking to my peers at other large companies like Facebook, Uber, eBay, and others. These learnings hold as true today as they did 15 years ago, and will continue to hold true in the future.
- Start Small. Redesign and Rewrite for Scale when required. Experience teaches you that a software architecture rarely survives the challenge of scale. The best design for a system is one that can handle the expected load of this year and the next, not of the next 3 or 5 years. With each order-of-magnitude increase in scale, you will inevitably run into new business and technical challenges that force the system design to be redone.
- Use caches as much as possible. When programming your application, always go to memory first and avoid disk seeks as much as possible. For speed and availability, it is a good mantra to cache everything and show information on the front end from the cache itself, even if it is a few seconds old. Reading from disk incurs a huge delay and should be avoided wherever possible. Have a separate system that updates your caches every few seconds or minutes, depending on the use case, and you will be much better off in terms of resource efficiency and application performance (a minimal sketch of such a background-refreshed cache follows this list).
- Keep it simple. Limit yourself to a few technologies to begin with, or you will spend the coming years and decades busy with learning and maintenance alone. Add each new technology to your stack only after serious consideration, as it also adds documentation, learning, maintenance and tooling costs for the whole organisation.
- Focus on shipping products fast instead of being a perfectionist. Once you get traction, iterate and add new features incrementally. Every few years you might want to rewrite the system for new needs instead of patching on top of the existing one; requirements often change enough to warrant a complete rework.
- Focus on doing a few things well, instead of doing many things with diluted attention. When you have limited people, it is easy to take on too much or to aim for perfection. That is often a recipe for future failure: focus gets diluted, people get distracted and take shortcuts, which leads to bugs and outages. Do a few things well, then move on to the next item.
- Build resilience into your systems as a feature.
- Eliminate single points of failure at every level. Build for redundancy and automate around it.
- Build circuit breakers into your services to prevent cascading failures. This ensures that faults stay localised and an outage in one subsystem doesn't take the whole system down (see the circuit-breaker sketch after this list).
- Design your services and systems to auto-repair after a failure. The more your services can auto-heal and send proper alerts for any manual intervention needed, the more resilient your entire system will be.
- Log and monitor everything so that efficient, relevant alarms can be set up. Monitor at the level of infrastructure, services and product to capture a holistic view of how well the system is performing.
- Build a chaos monkey into your systems. A chaos monkey randomly kills off parts of your system, in production, to see how the rest of it reacts (a sketch follows this list). Failure will happen; plan for it instead of hoping it doesn't. It's better to be proactive and ready than to be surprised.
- Do failover drills at the level of a service, a cluster, a data centre and so on, so that you know the gaps in your system and can fix them before a real failure happens.
- Have a well-documented Post Mortem or Incident Management Process to make every outage a learning opportunity.
- Have runbooks ready that list the mitigation steps for common failure scenarios. They will be your first line of defence.
- Once the mitigation is done, a proper investigation or post mortem needs to happen to prevent the issue from happening again.
- The action points that come out of it should be communicated to the entire company. To be clear, the investigation doesn't need to happen immediately after mitigation; it is usually a good idea to wait until the next business day to start the investigation process.
- A well-done investigation will list the root causes of the incident and come up with ways to prevent it from happening again.
- It will also look into what new alerts and monitoring can be put in place to warn the relevant people earlier if it were to happen again.
- Once a culture is established where people learn from outages, the organisation as a whole becomes robust and resilient.
- Keep your culture free of blame and shame. If people are free to speak up about problems without fear of reprimand, you will discover problems early. To make it clear, the person whose code caused an outage and business impact should not be blamed or reprimanded in any way (unless it was an intentional or repeated mistake).
- Don’t re-invent the wheel. Use open source tech and learn from others. Do not spend time building your own search engine, your own bug-tracking system, your own internal wiki, or your own monitoring and dashboards. There are plenty of open source tools and plenty of shared knowledge to capitalise on. This lets you focus on your main product, the USP that will make or break your company.
- If something doesn’t exist, don’t hesitate to be the first to build it. There are times when starting from scratch makes perfect sense. If you have a unique solution to a problem, it is worthwhile to build something new yourself. And if the problem is likely to be shared by others, think about how you can make your solution open source and available to them. We all use open source software to make our jobs easier, and this might be your chance to pay it forward.
- Measure what you want to improve. Collect data on all important metrics. Then work on optimising the most important ones.
- Monitoring and alerting are crucial. Have all important metrics displayed on a dashboard and set up proper alerts for them. Have op docs ready so people know how to respond when a metric goes above or below a predefined threshold. At the same time, make sure alerts don’t get noisy: if people start getting alerts for minor problems, they will start ignoring them, and a major alert can get buried in a pile of minor ones (a sketch of a debounced alert check follows this list).
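To make a few of the points above concrete, here are some minimal sketches. First, the caching point: a background-refreshed in-memory cache, written as a Python sketch; the `loader` callable is an illustrative stand-in for whatever slow read (database, disk, remote API) populates the cache, and in a real system the cache might live in Redis or Memcached with a separate refresher worker.

```python
import threading
import time

class RefreshingCache:
    """In-memory cache refreshed in the background every few seconds,
    so the request path never waits on a slow read."""

    def __init__(self, loader, refresh_seconds=30):
        self._loader = loader              # slow call: database, disk, remote API
        self._refresh_seconds = refresh_seconds
        self._lock = threading.Lock()
        self._value = loader()             # warm the cache on startup
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            time.sleep(self._refresh_seconds)
            fresh = self._loader()         # slow read happens off the request path
            with self._lock:
                self._value = fresh

    def get(self):
        # Serve from memory only; the data may be a few seconds stale,
        # which is acceptable for this kind of use case.
        with self._lock:
            return self._value
```

A request handler would then call `cache.get()` instead of hitting the database directly.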
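Next, a minimal sketch of the circuit-breaker idea. The failure threshold and reset window are illustrative numbers; in production you would typically reach for an existing library rather than rolling your own.

```python
import time

class CircuitBreaker:
    """Fails fast after `max_failures` consecutive errors, then lets a
    single trial call through once `reset_seconds` have passed."""

    def __init__(self, max_failures=5, reset_seconds=30):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_seconds:
                # Circuit is open: fail fast instead of hammering the dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()   # open (or re-open) the circuit
            raise
        else:
            self.failures = 0              # a success closes the circuit again
            return result
```

Wrapping every call to a downstream dependency in `breaker.call(...)`, with one breaker per dependency, gives a failing service room to recover instead of being hammered while it is down.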
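Then the chaos-monkey idea in sketch form. `list_instances` and `terminate` are hypothetical hooks standing in for your cloud provider or orchestrator API, and the probability and interval are placeholders to tune.

```python
import random
import time

def list_instances():
    """Hypothetical: return the IDs of currently running instances."""
    ...

def terminate(instance_id):
    """Hypothetical: ask the cloud provider / orchestrator to kill one instance."""
    ...

def chaos_monkey(kill_probability=0.1, interval_seconds=3600):
    """Once an hour, maybe terminate one randomly chosen instance, so the
    team learns during working hours whether the rest of the system copes."""
    while True:
        time.sleep(interval_seconds)
        if random.random() < kill_probability:
            victims = list_instances()
            if victims:
                terminate(random.choice(victims))
```

Netflix's Chaos Monkey, open sourced as part of its Simian Army, is the well-known production implementation of this idea.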
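Finally, a sketch of the alerting point: page only when a metric has breached its threshold for several consecutive checks, so a single noisy blip doesn't wake anyone. `read_error_rate` and `page_on_call` are hypothetical hooks; a real deployment would express the same rule in its monitoring system (Prometheus, Nagios, and the like).

```python
import time

def read_error_rate():
    """Hypothetical: fetch the current error rate from your metrics store."""
    ...

def page_on_call(message):
    """Hypothetical: notify the on-call engineer (pager, chat, email...)."""
    ...

def alert_loop(threshold=0.05, consecutive_breaches=3, check_seconds=60):
    """Alert only after the threshold has been breached several checks in a
    row, so one noisy spike doesn't page anyone."""
    breaches = 0
    while True:
        value = read_error_rate()
        if value is not None and value > threshold:
            breaches += 1
            if breaches == consecutive_breaches:
                page_on_call(f"error rate {value:.2%} above {threshold:.2%} "
                             f"for {breaches} consecutive checks")
        else:
            breaches = 0                   # reset once the metric recovers
        time.sleep(check_seconds)
```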
The larger the system, the more surprises there are that can pull it down. Not only should the system be resilient to surprises, it should also be secure, performant and reliable. At the same time, efficiency should be high and costs low.
This becomes a fun challenge to solve, especially when working in large organisations with hundreds of developers, thousands of commits a day, and frequent deploys – often across data centres in different continents.