Business Objective
Our client is a leading US-based logistics service provider offering logistics management, trucking, warehousing, freight forwarding, brokerage, supply chain management, and distribution services.
The client had multiple data sources feeding into their data lake, but there was no data dictionary and little understanding of data lineage. The lake also lacked a silver layer of clean, standardized data between raw ingestion and downstream consumers, and both Databricks and Power BI workloads performed poorly. Together, these gaps left the road map for data lake adoption incomplete. The client wanted our help cleaning data from the raw/bronze layer and moving it into a silver layer from which the data science and analytics teams could pull it.
Challenges
- Gathering information from multiple teams and stakeholders
- Building governance processes, vetting data, and ensuring legal compliance
- Handling data inconsistencies across multiple sources
- Building a flexible, configurable framework that allows easy addition of data sources in the future (sketched below)
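To make the last challenge concrete, here is a minimal sketch of what a configuration-driven ingestion framework can look like on Databricks. The source names, paths, key columns, and the promote_to_silver helper are all hypothetical illustrations, not the client's actual implementation.

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` is preconfigured; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical source registry: onboarding a new data source becomes a
# configuration entry rather than a new pipeline. All names/paths are examples.
SOURCES = [
    {"name": "tms_shipments", "format": "parquet",
     "bronze_path": "/mnt/lake/bronze/tms_shipments",
     "silver_path": "/mnt/lake/silver/tms_shipments",
     "keys": ["shipment_id"]},
    {"name": "wms_inventory", "format": "json",
     "bronze_path": "/mnt/lake/bronze/wms_inventory",
     "silver_path": "/mnt/lake/silver/wms_inventory",
     "keys": ["sku", "warehouse_id"]},
]

def promote_to_silver(source):
    """Generic bronze-to-silver promotion driven entirely by config."""
    df = spark.read.format(source["format"]).load(source["bronze_path"])
    # Basic standardization: deduplicate on the business key and drop unkeyed rows.
    cleaned = df.dropDuplicates(source["keys"]).dropna(subset=source["keys"])
    # Delta Lake is preinstalled on Databricks clusters.
    cleaned.write.format("delta").mode("overwrite").save(source["silver_path"])

for src in SOURCES:
    promote_to_silver(src)
```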
Solution Methodology
- Started with a series of discovery workshops (30+ meetings and 25+ participants over 4 weeks)
- Identified and prioritized use cases, and developed a road map across technology and business initiatives for the next year, with defined milestones and deliverables
- Diagnosed the roadblocks to data lake adoption and proposed an architecture for the silver layer
- Created a target schema for the MVP that would support the client's data science team
- Developed Databricks/ADF pipelines to load data from the bronze layer into the silver layer target schema (see the sketch after this list)
- Designed a data quality framework for automated monitoring using the Great Expectations Python library
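As a rough illustration of the bronze-to-silver pipelines mentioned above, below is a minimal Databricks/PySpark sketch. The shipments table, its columns, and the storage paths are assumptions made for the example; the client's actual target schema differed.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks, `spark` is preconfigured; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Illustrative bronze source path and columns.
bronze = spark.read.format("delta").load("/mnt/lake/bronze/shipments")

silver = (
    bronze
    # Deduplicate on the business key before standardizing.
    .dropDuplicates(["shipment_id"])
    # Enforce the target schema's types.
    .withColumn("ship_date", F.to_date(F.col("ship_date"), "yyyy-MM-dd"))
    .withColumn("weight_kg", F.col("weight_kg").cast("double"))
    # Drop rows that cannot be keyed.
    .filter(F.col("shipment_id").isNotNull())
    # Project onto the (illustrative) silver target schema.
    .select("shipment_id", "ship_date", "origin", "destination", "weight_kg")
)

# Write to the silver layer as a Delta table for downstream consumers.
silver.write.format("delta").mode("overwrite").save("/mnt/lake/silver/shipments")
```

In practice, an ADF pipeline would trigger a notebook or job like this on a schedule.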
Business Impact
- Resolved multiple data architecture and integration challenges, including data quality issues, the missing data catalog, and automated code deployment/management
- Designed a “Data Quality Framework” for automated data quality monitoring, including alerts when quality thresholds are breached (see the sketch below)
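To show the kind of check the framework automates, here is a minimal sketch using the classic Great Expectations Pandas API (pre-1.0 releases; newer versions use a context-based API instead). The table, columns, thresholds, and the alert stub are illustrative assumptions, not the client's actual configuration.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical slice of a silver-layer table.
shipments = pd.DataFrame({
    "shipment_id": [1001, 1002, 1003, None],
    "weight_kg": [120.5, 87.0, -4.0, 310.2],
})

gdf = ge.from_pandas(shipments)

# Expectations encode the quality thresholds; `mostly` tolerates a small
# fraction of bad rows before the expectation fails.
gdf.expect_column_values_to_not_be_null("shipment_id", mostly=0.99)
gdf.expect_column_values_to_be_between(
    "weight_kg", min_value=0, max_value=30000, mostly=0.95
)

results = gdf.validate()
if not results["success"]:
    # Stand-in for the real alerting hook (e-mail, Teams, PagerDuty, ...).
    print("Data quality thresholds breached:", results["statistics"])
```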