23 May Transforming Data Engineering: Enhancing Workflows, Minimizing Costs, and Empowering Teams
This case study outlines a data engineering project for a company that specializes in online games. The project lasted 1 year and 9 months, and its main goal was to improve the tools and systems that enable other company departments to do their jobs independently and without delays. The project resulted in faster and easier access to all the data, lower monthly costs, better workflow automation, and greater control over data pipelines.
|Project duration:|1 year 9 months|
|Industry:|Gaming, Healthcare, Technology|
|Project goal:|Make data-driven decisions to stay ahead of the curve in the industry.|
|Challenges:|Delivering improvements seamlessly, without disrupting users.|
|Technologies and tools used on the project:|Apache Airflow, Fivetran, Snowflake, Okta, DataDog, PagerDuty, Apache Kafka, Apache Spark, Scala, AWS Glue, Docker, Kubernetes, Braze|
About the Client
Our client is a company that specializes in online games. Their mobile application has over 10M downloads and a large user base. Their analytics, marketing, and customer departments are among the best in the industry, all powered by a properly set-up data platform.
The goal of this project was to improve tools and systems that allow other company departments to do their jobs independently and without delays. Some of their tools were deprecated, while others just needed to be migrated, and they also wanted better automation and security.
For a data-driven company, the top concern is always to collect, distribute, and analyze data effectively at scale. This ensures sound decision making as well as a better understanding of the user base. The project goals focused on automation, tool migration, and maintenance tasks for the data engineering system.
By using Apache Airflow, we were able to achieve better workflow automation, greater visibility and control over our data pipelines, and improved collaboration between different teams and stakeholders involved in the data processing and analysis tasks.
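The core idea behind this orchestration can be sketched in plain Python: tasks form a directed acyclic graph, and each task runs only after its upstream dependencies succeed. The pipeline and task names below are hypothetical; in Airflow itself this wiring is expressed with operators and the `>>` dependency syntax.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on,
# mirroring how an Airflow DAG wires operators together.
pipeline = {
    "extract_events": set(),
    "replicate_mysql": set(),
    "transform_sessions": {"extract_events", "replicate_mysql"},
    "refresh_dashboards": {"transform_sessions"},
}

def execution_order(dag: dict) -> list:
    """Return one valid run order that respects all dependencies."""
    return list(TopologicalSorter(dag).static_order())

if __name__ == "__main__":
    print(execution_order(pipeline))
```

The value of the DAG model is exactly this property: the orchestrator, not the engineer, guarantees that `transform_sessions` never runs before both of its inputs have landed.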
We used Fivetran to set up table replication from MySQL to Snowflake, allowing much faster and easier access to all the data in one place. The replication process ensured that our data was always up to date and consistent across all systems, reducing errors and enabling more accurate analysis.
Role Based Access Control (RBAC)
By integrating Okta with Snowflake, we automated account creation for every new employee. Setting up proper RBAC allowed us to expose data granularly to certain groups – e.g. displaying PII (Personally Identifiable Information) only to employees who absolutely need to see it, thus additionally protecting our user data.
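The effect of column-level RBAC can be illustrated with a small sketch. The role and column names here are hypothetical, and in the real project this was enforced by Snowflake roles and grants rather than application code; the sketch only shows the access rule itself.

```python
# Illustrative column-level RBAC rule: PII columns are returned only to
# roles explicitly authorized to see them (names are hypothetical).
PII_COLUMNS = {"email", "full_name", "ip_address"}
PII_ROLES = {"SUPPORT_PII", "SECURITY"}

def visible_row(row: dict, role: str) -> dict:
    """Drop PII columns unless the role is authorized to see them."""
    if role in PII_ROLES:
        return dict(row)
    return {k: v for k, v in row.items() if k not in PII_COLUMNS}

if __name__ == "__main__":
    row = {"user_id": 42, "email": "player@example.com", "country": "DE"}
    print(visible_row(row, "ANALYST"))   # PII stripped
    print(visible_row(row, "SECURITY"))  # full row
```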
To help the team quickly identify and fix issues, we implemented custom alerts using DataDog, PagerDuty, and Snowflake. Alerts covered everything from resource availability to data freshness to vendor integrations. This made bug squashing easier for our team and lowered our response times.
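A data freshness check of the kind described above boils down to comparing each table's last successful load against an allowed lag. The following is a minimal sketch with hypothetical table names; the real alerts were wired through DataDog monitors and PagerDuty escalation, not a standalone script.

```python
from datetime import datetime, timedelta, timezone

def stale_tables(last_loaded: dict, max_lag: timedelta, now=None) -> list:
    """Return tables whose last load is older than the allowed lag."""
    now = now or datetime.now(timezone.utc)
    return sorted(t for t, ts in last_loaded.items() if now - ts > max_lag)

if __name__ == "__main__":
    now = datetime(2023, 5, 23, 12, 0, tzinfo=timezone.utc)
    loads = {
        "events": now - timedelta(minutes=10),   # fresh
        "purchases": now - timedelta(hours=5),   # stale
    }
    print(stale_tables(loads, timedelta(hours=1), now))  # ['purchases']
```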
Migrating Chartio to Snowflake
Soon after Chartio joined Atlassian, the product's shutdown was announced. The company had many meaningful dashboards that would soon become inaccessible, so a smooth transition off Chartio was needed. Our team created dashboards in Snowflake that let business analysts investigate near-real-time data and gain insights.
Moving AWS Glue to Snowflake orchestrated by Airflow
The company wanted to move old ETL jobs from AWS Glue to Snowflake in order to consolidate as many jobs as possible in Snowflake and minimize the number of tools used. Additionally, Snowflake's flexibility in working with structured and semi-structured data, together with its SQL-based transformations, made it easier to handle data in various formats.
Scala UDFs to be used in Snowflake
To simplify the data processing pipeline and retire an outdated component, we migrated a complex Spark ETL job from EMR to Snowflake, leveraging UDFs to reuse shared code and minimize the number of used tools.
Integration with Braze
The integration with Braze was carried out to assist the marketing team in optimizing their campaigns and enhancing customer engagement. This integration helped them to make data-driven decisions and ultimately drive more conversions.
To ensure the smooth functioning of the data engineering system, we had an on-call schedule that required team members to be familiar with all components.
“With a team working across different time zones, we could provide round-the-clock coverage and ensure timely resolution of any issues.”
Kafka producers and consumers
To successfully develop and maintain Kafka producers and consumers in Scala, it is essential to have a comprehensive understanding of Kafka’s architecture and the roles that producers and consumers play within it. Our team paid close attention to ensuring that the consumers and producers we developed were scalable and fault-tolerant. This involved monitoring performance, troubleshooting issues, and optimizing code.
To achieve this, we collaborated closely with other team members, such as data scientists and business analysts, to ensure that the Kafka consumers and producers were aligned with the organization’s data processing needs.
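One fault-tolerance pattern central to consumers like these is at-least-once processing: commit offsets only after a batch has been handled, so a crash mid-batch replays messages rather than losing them. The sketch below illustrates that loop with an in-memory stand-in for the Kafka client; the project's real consumers were written in Scala against the Kafka client API, and `FakeConsumer` is purely hypothetical.

```python
class FakeConsumer:
    """In-memory stand-in for a Kafka consumer, for illustration only."""
    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to read

    def poll(self, max_records=10):
        return self.messages[self.committed:self.committed + max_records]

    def commit(self, count):
        self.committed += count

def consume(consumer, handle):
    """Drain the consumer, committing only after each batch is processed."""
    processed = []
    while batch := consumer.poll():
        for msg in batch:
            handle(msg)
            processed.append(msg)
        consumer.commit(len(batch))  # commit after processing, not before
    return processed

if __name__ == "__main__":
    c = FakeConsumer(["login", "purchase", "logout"])
    print(consume(c, handle=lambda m: None))  # ['login', 'purchase', 'logout']
```

Committing before processing would give at-most-once semantics instead, risking silent data loss on failure.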
Our team maintained Spark jobs written in Scala, which involved monitoring the jobs’ performance, optimizing their resource usage, and ensuring that they processed data accurately and efficiently.
The company has its own DevOps team, but it sometimes gets overwhelmed with tasks and is left with no time for new features. We created CI/CD pipelines on multiple repos and took new data engineering components from scratch to production by creating deployments.
We created Dockerfiles for multiple Docker images to enable deployment in a Kubernetes cluster. This approach provides better resource utilization and improved consistency and portability of the infrastructure across different environments.
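A minimal Dockerfile of the kind used for such components might look as follows; the base image, entrypoint, and file names are illustrative assumptions, not the project's actual files.

```dockerfile
# Illustrative Dockerfile for a data engineering component
# deployed to Kubernetes; names and versions are hypothetical.
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Run as a non-root user inside the pod.
RUN useradd --create-home runner
USER runner

ENTRYPOINT ["python", "-m", "pipeline"]
```

Ordering the dependency install before the source copy keeps rebuilds fast, which matters once the image is built on every CI run.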
Our team carried out Kubernetes deployments and made minor changes to existing ones, allowing the team to be more independent.
Overall, this tool migration and workflow automation project allowed the company to improve its systems and enabled its departments to do their jobs independently and without delays.
“We minimized the number of tools used, resulting in lower monthly costs, faster job execution, and simplified maintenance due to centralized resources.”
- Improved automation and orchestration of data workflows.
- Minimized the number of tools used.
- Smooth functioning of the data engineering system.
- New data engineering components and an improved deployment process: components built and deployed from the ground up, including CI/CD.