Our client is one of the world’s largest online payment processing companies, with annual revenue of USD 10 billion.
The client wanted to build a robust, highly scalable application to process and store incoming click events in real time.
- Architectural challenges in combining stream and batch events to ensure the resiliency of the input data
- Extracting, transforming, and merging unstructured events from multiple sources
- High volume of data (~15 TB daily)
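The second challenge above — merging unstructured events from multiple sources — comes down to mapping each source’s schema onto one common event shape before merging. A minimal stdlib sketch, with hypothetical field names and sources (the real pipeline did this at scale in Spark):

```python
import json

# Hypothetical raw click events from two sources with differing schemas.
WEB_EVENT = '{"ts": 1700000000, "uid": "u1", "target": "/checkout"}'
MOBILE_EVENT = '{"timestamp": 1700000005, "user": "u2", "screen": "/cart"}'

def normalize(raw: str, source: str) -> dict:
    """Map source-specific fields onto one common click schema."""
    e = json.loads(raw)
    if source == "web":
        return {"event_time": e["ts"], "user_id": e["uid"], "page": e["target"]}
    if source == "mobile":
        return {"event_time": e["timestamp"], "user_id": e["user"], "page": e["screen"]}
    raise ValueError(f"unknown source: {source}")

# Merge the normalized events into a single, time-ordered stream.
merged = sorted(
    [normalize(WEB_EVENT, "web"), normalize(MOBILE_EVENT, "mobile")],
    key=lambda e: e["event_time"],
)
```

Once every source is normalized to the same schema, downstream transforms and the warehouse tables only ever see one event shape, which is what makes a single source of truth possible.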
- Used open-source technologies such as Apache Kafka and Hadoop HDFS for data ingestion from various sources
- Processed the incoming big data using the Apache Spark computing framework
- Stored the processed data in an Apache Hive warehouse for querying and analysis
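The three bullets above describe a classic ingest → transform → store pipeline. The toy sketch below mimics that flow with plain Python stand-ins (the names `ingest`, `transform`, and `warehouse` are illustrative, not the client’s code; in production these roles are played by Kafka/HDFS, Spark jobs, and Hive tables respectively):

```python
from typing import Iterable, Iterator

def ingest(batches: Iterable[list]) -> Iterator[list]:
    """Stand-in for the Kafka/HDFS ingestion layer: yields micro-batches."""
    for batch in batches:
        yield batch

def transform(batch: list) -> list:
    """Stand-in for the Spark stage: drop malformed events, tag the rest."""
    return [dict(e, valid=True) for e in batch if "user_id" in e]

warehouse: list = []  # stand-in for the Hive warehouse table

def run_pipeline(batches: Iterable[list]) -> None:
    """Drive each micro-batch through transform and append to the warehouse."""
    for batch in ingest(batches):
        warehouse.extend(transform(batch))

run_pipeline([
    [{"user_id": "u1", "page": "/home"}, {"page": "/broken"}],  # 2nd is malformed
    [{"user_id": "u2", "page": "/cart"}],
])
```

The key design point is that each stage only sees micro-batches, so the same transform logic applies whether the data arrives as a live stream or as a historical backfill.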
- Processed a day’s worth of data in 15 minutes; handled 100 million real-time events per hour
- Brought together data from several sources, creating a single source of truth
- The new streaming pipeline is fault tolerant: progress is checkpointed, so processing resumes from the last committed point after a failure