50% savings on energy and maintenance costs while working with a scalable and reliable data platform
Organic growth at a high scale can lead to technical debt and instability
Site Reliability Engineering Principles
A new data platform based on SRE principles
The client’s big data platform is one of the biggest Hadoop clusters in the world. It stores and processes around 600 PB of client data, booking data, and data on client behavior. The company is a truly data-driven organization, and nearly every team has at least one data scientist on board. The data platform is not only critical for core business functionality but also the most important source of information for business decisions, such as personalized marketing efforts and website design.
It caters to a huge user group of more than 3000 internal users, such as data scientists and reporting analysts.
The company has gone through a period of enormous growth in recent years. The data platform and its functionality have grown and been developed and adapted accordingly. But this tremendous expansion has been accompanied by challenges of scale for the system and its architecture.
Having all data located on one on-premises platform resulted in stability issues and high costs for maintenance and energy, and it limited flexibility for updates and for releasing new features.
Existing stability issues could not be resolved at short notice, yet they had to be addressed urgently because many teams were affected by them. Small changes could have big effects on overall business processes. Therefore, the organization chose a Site Reliability Engineering (SRE) approach to ensure consistent data flow and platform stability.
"The choice for single purpose clusters also healed the growth pain of trying to balance out technical debt. We wanted our new architecture to be cutting edge"
In the meantime, two data clusters are up and running, and more and more datasets are being migrated to the new platform. To keep the transition sustainable after the assignment is finalized, all teams have been continuously challenged and closely coached towards an SRE mindset.
The Product Manager: “The team really appreciated the added role of SRE and the positive effects it had on the quality of our services.”
One example is observability. Within SRE, the distinction between monitoring and observing is crucial. When you simply monitor a system, you see an issue occurring but cannot see the underlying cause. The new data platform is observed through several dashboards that show not only current performance and possible issues but also the causes behind them.
Service level objectives (SLOs) are linked to business objectives. A clear dashboard ensures transparency for the team and a high level of commitment and ownership.
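To make the idea of an SLO on a dashboard concrete, here is a minimal sketch of how an error budget could be derived from one. The SLO name, the 99% target, and the event counts are hypothetical and are not taken from the client's platform; the case study does not describe how the team actually computed or displayed its SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service level objective expressed as a target ratio of good events."""
    name: str
    target: float  # e.g. 0.99 means 99% of events must be good


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0  # nothing measured yet, so nothing spent
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical SLO and counts, for illustration only.
job_success = Slo(name="scheduled-job-success", target=0.99)
remaining = error_budget_remaining(job_success, good_events=9_950, total_events=10_000)
print(f"{job_success.name}: {remaining:.0%} of error budget remaining")
```

A single number like this is easy to show on a team dashboard: while budget remains there is room to ship changes, and once it is spent the focus shifts to stability.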
Erasure coding within the Hadoop 3 clusters ensures a high level of availability and data protection at lower cost. Operating costs went down by 45%, and there is no longer a need to constantly invest in new hardware expansion.
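The storage effect of erasure coding can be illustrated with some simple arithmetic. The sketch below compares the raw disk capacity needed for classic 3x replication with a Reed-Solomon layout of 6 data units and 3 parity units, which is a built-in policy in Hadoop 3. Which policy the client actually uses is not stated, and the 100 PB of logical data is a hypothetical figure, so this is an illustration of the mechanism rather than a reconstruction of the reported savings.

```python
def raw_storage_replicated(logical_pb: float, replicas: int = 3) -> float:
    """Raw capacity needed when every block is stored `replicas` times."""
    return logical_pb * replicas


def raw_storage_erasure_coded(
    logical_pb: float, data_units: int = 6, parity_units: int = 3
) -> float:
    """Raw capacity needed when each stripe holds data_units plus parity_units cells."""
    return logical_pb * (data_units + parity_units) / data_units


logical_pb = 100.0  # hypothetical amount of logical data, in PB

replicated = raw_storage_replicated(logical_pb)        # 3x replication
erasure_coded = raw_storage_erasure_coded(logical_pb)  # RS(6, 3), assumed policy
saving = 1 - erasure_coded / replicated

print(f"3x replication: {replicated:.0f} PB raw")
print(f"RS(6,3) coding: {erasure_coded:.0f} PB raw")
print(f"Raw storage saved: {saving:.0%}")
```

Under these assumptions the erasure-coded layout needs half the raw capacity of 3x replication while still tolerating the loss of up to three units per stripe, which is why it reduces the need for hardware expansion.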
Cutting back energy consumption by half has significantly reduced the environmental footprint of the data platform and is therefore perfectly aligned with the company’s sustainability efforts. Xebia is proud to not only contribute to technical improvements but also make an impact beyond technology.
The Product Manager about working together on this project: “Zakaria excels at what he does, he has a wealth of experience and knowledge and is vocal about it. His right set of suggestions, feedback and opinions brought a very positive vibe into the team. This stimulated beyond the assignment for a new data platform, he helped our team to grow.”
Do you want to achieve more speed, flexibility, and simplicity with more effective and efficient IT? DevOps helps create a high-performance IT capability that accelerates your business.
Xebia DevOps focuses on two areas: organization and technology. We help you speed up your software release cycle and enhance autonomy within your teams, following a proven method that is brought into practice by experienced consultants.