23 May Transforming Data Engineering: Enhancing Workflows, Minimizing Costs, and Empowering Teams
This case study outlines a data engineering project for a company that specializes in online games. The project lasted 1 year and 9 months, and its main goal was to improve the tools and systems that enable other company departments to do their jobs independently and without delays. The project resulted in faster and easier access to all the data, lower monthly costs, better workflow automation, and greater control over data pipelines.
|Project duration:|1 year 9 months|
|Industry:|Gaming, Healthcare, Technology|
|Project goal:|Make data-driven decisions to stay ahead of the curve in the industry.|
|Challenges:|Delivering improvements seamlessly, without disrupting users.|
|Technologies and tools used on the project:|Apache Airflow, Fivetran, Snowflake, Okta, DataDog, PagerDuty, Apache Kafka, Apache Spark, Scala, AWS Glue, Docker, Kubernetes, Braze|
About the Client
Our client is a company that specializes in online games. Their mobile application has over 10M downloads and a large user base. Their analytics, marketing, and customer departments are among the best in the industry, all powered by a properly set-up data platform.
The goal of this project was to improve tools and systems that allow other company departments to do their jobs independently and without delays. Some of their tools were deprecated, while others just needed to be migrated, and they also wanted better automation and security.
For a data-driven company, the top concern is always to collect, distribute, and analyze data effectively at scale. This ensures sound decision making as well as a better understanding of the user base. The project goals focused on automation, tool migration, and maintenance tasks for the data engineering system.
By using Apache Airflow, we were able to achieve better workflow automation, greater visibility and control over our data pipelines, and improved collaboration between different teams and stakeholders involved in the data processing and analysis tasks.
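The core idea behind this orchestration can be sketched in plain Python: tasks form a directed acyclic graph, and each task runs only after its upstream dependencies succeed. The pipeline and task names below are hypothetical; in Airflow itself this wiring is expressed with operators and the `>>` dependency syntax.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on,
# mirroring how an Airflow DAG wires operators together.
pipeline = {
    "extract_events": set(),
    "replicate_mysql": set(),
    "transform_sessions": {"extract_events", "replicate_mysql"},
    "refresh_dashboards": {"transform_sessions"},
}

def execution_order(dag: dict) -> list:
    """Return one valid run order that respects all dependencies."""
    return list(TopologicalSorter(dag).static_order())

if __name__ == "__main__":
    print(execution_order(pipeline))
```

The value of the DAG model is exactly this property: the orchestrator, not the engineer, guarantees that `transform_sessions` never runs before both of its inputs have landed.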
We used Fivetran to set up table replication from MySQL to Snowflake, allowing much faster and easier access to all the data in one place. The replication process ensured that our data was always up to date and consistent across all systems, reducing errors and enabling more accurate analysis.
Role Based Access Control (RBAC)
By integrating Okta with Snowflake, we automated account creation for every new employee. Setting up proper RBAC allowed us to expose data granularly to certain groups – e.g. displaying PII (Personally Identifiable Information) only to employees who absolutely need to see it, thus additionally protecting our user data.
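The effect of column-level RBAC can be illustrated with a small sketch. The role and column names here are hypothetical, and in the real project this was enforced by Snowflake roles and grants rather than application code; the sketch only shows the access rule itself.

```python
# Illustrative column-level RBAC rule: PII columns are returned only to
# roles explicitly authorized to see them (names are hypothetical).
PII_COLUMNS = {"email", "full_name", "ip_address"}
PII_ROLES = {"SUPPORT_PII", "SECURITY"}

def visible_row(row: dict, role: str) -> dict:
    """Drop PII columns unless the role is authorized to see them."""
    if role in PII_ROLES:
        return dict(row)
    return {k: v for k, v in row.items() if k not in PII_COLUMNS}

if __name__ == "__main__":
    row = {"user_id": 42, "email": "player@example.com", "country": "DE"}
    print(visible_row(row, "ANALYST"))   # PII stripped
    print(visible_row(row, "SECURITY"))  # full row
```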
To help the team quickly identify and fix issues, we implemented custom alerts using DataDog, PagerDuty, and Snowflake. Alerts covered everything from resource availability to data freshness to vendor integrations. This made bug squashing easier for our team and lowered our response times.
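A data freshness check of the kind described above boils down to comparing each table's last successful load against an allowed lag. The following is a minimal sketch with hypothetical table names; the real alerts were wired through DataDog monitors and PagerDuty escalation, not a standalone script.

```python
from datetime import datetime, timedelta, timezone

def stale_tables(last_loaded: dict, max_lag: timedelta, now=None) -> list:
    """Return tables whose last load is older than the allowed lag."""
    now = now or datetime.now(timezone.utc)
    return sorted(t for t, ts in last_loaded.items() if now - ts > max_lag)

if __name__ == "__main__":
    now = datetime(2023, 5, 23, 12, 0, tzinfo=timezone.utc)
    loads = {
        "events": now - timedelta(minutes=10),   # fresh
        "purchases": now - timedelta(hours=5),   # stale
    }
    print(stale_tables(loads, timedelta(hours=1), now))  # ['purchases']
```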
Migrating Chartio to Snowflake
Soon after Chartio joined Atlassian, the product's shutdown was announced. The company had many meaningful dashboards that would soon become inaccessible, so a smooth transition off Chartio was needed. Our team created dashboards in Snowflake that let business analysts investigate near-real-time data and gain insights.
Moving AWS Glue to Snowflake orchestrated by Airflow
The company wanted to move old ETL jobs from AWS Glue to Snowflake in order to consolidate as many jobs as possible in Snowflake and minimize the number of tools used. Additionally, Snowflake's flexibility in working with structured and semi-structured data, together with its SQL-based transformations, made it easier to handle data in various formats.
Scala UDFs to be used in Snowflake
To simplify the data processing pipeline and retire an outdated component, we migrated a complex Spark ETL job from EMR to Snowflake, leveraging UDFs to reuse shared code and minimize the number of used tools.
Integration with Braze
The integration with Braze was carried out to assist the marketing team in optimizing their campaigns and enhancing customer engagement. This integration helped them to make data-driven decisions and ultimately drive more conversions.
To ensure the smooth functioning of the data engineering system, we had an on-call schedule that required team members to be familiar with all components.
“With a team working across different time zones, we could provide round-the-clock coverage and ensure timely resolution of any issues.”
Kafka producers and consumers
To successfully develop and maintain Kafka producers and consumers in Scala, it is essential to have a comprehensive understanding of Kafka’s architecture and the roles that producers and consumers play within it. Our team paid close attention to ensuring that the consumers and producers we developed were scalable and fault-tolerant. This involved monitoring performance, troubleshooting issues, and optimizing code.
To achieve this, we collaborated closely with other team members, such as data scientists and business analysts, to ensure that the Kafka consumers and producers were aligned with the organization’s data processing needs.
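One fault-tolerance pattern central to consumers like these is at-least-once processing: commit offsets only after a batch has been handled, so a crash mid-batch replays messages rather than losing them. The sketch below illustrates that loop with an in-memory stand-in for the Kafka client; the project's real consumers were written in Scala against the Kafka client API, and `FakeConsumer` is purely hypothetical.

```python
class FakeConsumer:
    """In-memory stand-in for a Kafka consumer, for illustration only."""
    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to read

    def poll(self, max_records=10):
        return self.messages[self.committed:self.committed + max_records]

    def commit(self, count):
        self.committed += count

def consume(consumer, handle):
    """Drain the consumer, committing only after each batch is processed."""
    processed = []
    while batch := consumer.poll():
        for msg in batch:
            handle(msg)
            processed.append(msg)
        consumer.commit(len(batch))  # commit after processing, not before
    return processed

if __name__ == "__main__":
    c = FakeConsumer(["login", "purchase", "logout"])
    print(consume(c, handle=lambda m: None))  # ['login', 'purchase', 'logout']
```

Committing before processing would give at-most-once semantics instead, risking silent data loss on failure.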
Our team maintained Spark jobs written in Scala, which involved monitoring the jobs’ performance, optimizing their resource usage, and ensuring that they processed data accurately and efficiently.
The company has its own DevOps team, but it sometimes gets overwhelmed with tasks and is left with no time for new features. We created CI/CD pipelines on multiple repos and took new data engineering components from scratch to production by creating deployments.
We created Dockerfiles for multiple Docker images to enable deployment in a Kubernetes cluster. This approach provides better resource utilization and improved consistency and portability of the infrastructure across different environments.
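A minimal Dockerfile of the kind used for such components might look as follows; the base image, entrypoint, and file names are illustrative assumptions, not the project's actual files.

```dockerfile
# Illustrative Dockerfile for a data engineering component
# deployed to Kubernetes; names and versions are hypothetical.
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Run as a non-root user inside the pod.
RUN useradd --create-home runner
USER runner

ENTRYPOINT ["python", "-m", "pipeline"]
```

Ordering the dependency install before the source copy keeps rebuilds fast, which matters once the image is built on every CI run.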
Our team carried out Kubernetes deployments and made minor changes to existing ones, allowing the team to be more independent.
Overall, this tool migration and workflow automation project allowed the company to improve its systems and enabled its departments to do their jobs independently and without delays.
“We minimized the number of tools used, resulting in lower monthly costs, faster job execution, and simplified maintenance due to centralized resources.”
- Improved automation and orchestration of data workflows.
- Minimized the number of tools used.
- Smooth functioning of the data engineering system.
- New data engineering components and an improved deployment process: components built and deployed from the ground up, including CI/CD.