Exploring Tiger’s Data Fabric on GCP
Author: Deepak Kurumpan
Consolidating data in one place helps organizations work with large datasets and surface insights earlier in the data modeling process. A cloud-native, unified data lake helps data engineers and data analysts accelerate the product life cycle. The primary advantage of a data lake is that it brings data from multiple silos onto one platform, making it easier to derive insights that create real business impact. Analytical capabilities can be built on top of transactional data, so repeated data modeling work is avoided and data flow and quality are streamlined.
GCP Data Fabric, a component of Tiger's Data Fabric, is a GCP-based engineering solution with reusable components that helps organizations onboard data assets from disparate sources into their Data Lake.
At Tiger Analytics, we implemented a Data Lake solution for a leading US-based mobile computing company. The goal was to build a data store that could power intelligence reporting for the marketing function, with analysis of the different campaigns and their associated performance measurement. Data from multiple sources was ingested and processed using Tiger's Data Fabric on GCP, which follows a self-service, low-code approach. Overall, Tiger's framework improved the organization's time to market and helped it measure performance across different marketing initiatives.
How it works
• Loosely coupled, configurable application model
• Event-based, metadata-driven approach
• Ingests data from RDBMS, file systems, and streaming sources
• Ingests data into GCP (GCS, or BigQuery with or without schema)
• Partitioning and clustering on ingested data
• Supports incremental ingestion
• Orchestrates ingestion and schedules it on a cadence of your choice
• Ability to choose the data processing service (Dataflow/Dataproc/Data Fusion)
• Option to ingest data into GCS in CSV or Parquet format
• Data quality and data security enablement
• Logging, auditing, and monitoring
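The metadata-driven approach above can be illustrated with a per-asset configuration record. This is a minimal sketch; the field names and values below are assumptions for illustration, not the framework's actual schema.

```python
# Hypothetical metadata record driving ingestion of one data asset.
# Field names are illustrative only, not Tiger Data Fabric's real schema.
asset_metadata = {
    "source_type": "rdbms",              # rdbms | file | streaming
    "source": "postgres",
    "table": "marketing.campaigns",
    "target": {"gcs_format": "parquet", "bq_dataset": "lake_curated"},
    "partition_field": "event_date",     # BigQuery partitioning column
    "clustering_fields": ["campaign_id"],
    "incremental": {"timestamp_column": "updated_at"},
    "engine": "dataflow",                # dataflow | dataproc | datafusion
    "schedule": "@daily",
}
print(asset_metadata["table"])
```

A record like this would let the framework generate the entire pipeline for an asset without custom code.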
Batch Data Transfer
The Data Lake supports the ingestion of batch data into GCS and BigQuery. GCS is blob storage offering high availability and lifecycle management; BigQuery is a fast, serverless data warehouse for running very large-scale SQL queries. Batch ingestion is supported from PostgreSQL and SQL Server, with incremental ingestion driven by a timestamp column. The Data Lake lets the user decide the processing engine required for ingestion (Dataflow or Dataproc). It also supports fully secured, authenticated file transfer over SSH File Transfer Protocol (SFTP) and ingests those files into GCP.
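Timestamp-based incremental ingestion boils down to filtering source rows against the last ingested watermark. Here is a minimal sketch of that idea; the table and column names are hypothetical, and the real framework derives them from metadata rather than hard-coding them.

```python
from datetime import datetime

def incremental_query(table: str, timestamp_col: str, last_watermark: datetime) -> str:
    """Build a SQL filter that pulls only rows newer than the last
    ingested watermark (timestamp-column-based incremental ingestion)."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {timestamp_col} > TIMESTAMP '{last_watermark.isoformat(sep=' ')}'"
    )

# Example: only rows modified after the previous run are fetched.
q = incremental_query("sales.orders", "updated_at", datetime(2023, 5, 1, 0, 0))
print(q)
```

After each run, the maximum timestamp seen becomes the watermark for the next run, so every batch picks up exactly the new and changed rows.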
Streaming Data Transfer
The Data Lake enables the user to ingest streaming data directly into GCP (GCS or BigQuery) through Pub/Sub. Pub/Sub's fully managed messaging service integrates easily with independent applications to send and receive messages.
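When streaming messages land in GCS, they are typically routed to date/hour-partitioned object paths derived from each event's timestamp. The sketch below shows that routing logic only; the bucket, dataset, and payload fields (`event_ts`, `id`) are hypothetical, and the actual Pub/Sub subscription and write are omitted.

```python
import json
from datetime import datetime, timezone

def gcs_object_path(bucket: str, dataset: str, message: bytes) -> str:
    """Route a streaming message to an hourly-partitioned GCS path,
    using a hypothetical 'event_ts' field in the JSON payload."""
    payload = json.loads(message)
    ts = datetime.fromisoformat(payload["event_ts"]).astimezone(timezone.utc)
    return f"gs://{bucket}/{dataset}/dt={ts:%Y-%m-%d}/hr={ts:%H}/{payload['id']}.json"

msg = json.dumps({"id": "evt-001", "event_ts": "2023-05-01T14:30:00+00:00"}).encode()
print(gcs_object_path("lake-raw", "clicks", msg))
```

Partitioning paths by event time this way keeps downstream batch loads and BigQuery external tables cheap to scan.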
The entire data ingestion pipeline is orchestrated through Cloud Composer, which integrates with Cloud Logging and Cloud Monitoring for viewing workflow logs. A DAG-template approach enables low-code data lake development.
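The DAG-template idea can be sketched as expanding one metadata record into an ordered pipeline specification. This is an illustrative simplification, not Tiger's actual template: real Composer DAGs would use Airflow operators, and all names below are assumptions.

```python
def build_dag_config(asset: dict) -> dict:
    """Expand one asset's metadata into a pipeline spec, mirroring a
    template-driven (low-code) DAG: extract -> quality check -> load."""
    engine = asset.get("engine", "dataflow")  # user-selected processing service
    return {
        "dag_id": f"ingest_{asset['source']}_{asset['table']}",
        "schedule": asset.get("schedule", "@daily"),
        "tasks": [
            {"task_id": "extract", "operator": engine},
            {"task_id": "data_quality", "operator": "great_expectations"},
            {"task_id": "load_bigquery", "operator": "bigquery"},
        ],
    }

cfg = build_dag_config({"source": "postgres", "table": "orders", "schedule": "0 2 * * *"})
print(cfg["dag_id"], cfg["schedule"])
```

Because every asset flows through the same template, onboarding a new source is a configuration change rather than new pipeline code.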
The Data Lake uses the Great Expectations framework to ensure the quality of each dataset being ingested.
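The core idea behind such expectation-style checks is validating a batch of rows against declared rules and reporting the failures. The stdlib sketch below illustrates that pattern; it is not the Great Expectations API itself, and the rule and field names are hypothetical.

```python
def expect_not_null(rows, column):
    """Flag rows where the column is missing (null-check expectation)."""
    failures = [r for r in rows if r.get(column) is None]
    return {"success": not failures, "unexpected_count": len(failures)}

def expect_between(rows, column, low, high):
    """Flag rows whose value falls outside [low, high], skipping nulls."""
    failures = [r for r in rows if r.get(column) is not None
                and not (low <= r[column] <= high)]
    return {"success": not failures, "unexpected_count": len(failures)}

batch = [{"spend": 120.0}, {"spend": None}, {"spend": -5.0}]
null_check = expect_not_null(batch, "spend")
range_check = expect_between(batch, "spend", 0, 1_000_000)
print(null_check, range_check)
```

A failed check can then quarantine the batch or alert the data owner before bad data reaches BigQuery.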
KMS encryption can be enabled on ingested data, securing data both at rest and in transit.
The Data Lake also has auditing capabilities, providing insight into user activity: who did what, when, and where.
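A "who did what, when, and where" audit trail amounts to emitting a structured record per user action. The record shape below is purely illustrative, not the framework's actual audit schema.

```python
import json
from datetime import datetime, timezone

def audit_event(user: str, action: str, resource: str) -> dict:
    """Build a structured audit record: who did what, when, and where.
    (Illustrative schema; field names are assumptions.)"""
    return {
        "who": user,
        "what": action,
        "where": resource,
        "when": datetime.now(timezone.utc).isoformat(),
    }

evt = audit_event("analyst@example.com", "TABLE_READ", "bq://project.marketing.campaigns")
print(json.dumps(evt))
```

Writing these records to a dedicated log sink makes user activity queryable alongside the data itself.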
Monitoring and Logging
Logging in the Data Lake is complemented by data lineage, which traces data's journey from source to destination and visualizes it as a lineage graph.
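Under the hood, a lineage view is a walk over a graph of source-to-destination edges. Here is a minimal sketch of that traversal; the edges shown are made-up examples, not real pipeline assets.

```python
def lineage_paths(edges: dict, source: str) -> list:
    """Walk a lineage graph (node -> downstream nodes) and return every
    source-to-destination path, as a lineage graph would visualize it."""
    children = edges.get(source, [])
    if not children:
        return [[source]]
    return [[source] + path for c in children for path in lineage_paths(edges, c)]

# Hypothetical lineage: SFTP file -> raw GCS zone -> curated BigQuery table.
edges = {
    "sftp://vendor/ads.csv": ["gs://lake-raw/ads"],
    "gs://lake-raw/ads": ["bq://project.mart.campaign_perf"],
}
for path in lineage_paths(edges, "sftp://vendor/ads.csv"):
    print(" -> ".join(path))
```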
To summarize, Tiger Analytics' Data Fabric on GCP takes the complexity out of building pipelines to ingest, cleanse, and quality-check data in GCP. Its intuitive self-service portal abstracts the backend processes, letting anyone configure sources and data assets in a few clicks, making the process simpler while helping organizations manage their data better.