Our client is a Fortune 100 life insurance provider with ten years of application data available in digitized form for mortality-risk modeling. It wanted to digitize an additional 15 years of applications (~1 million forms) to improve survival prediction. To accomplish this, the client needed a robust, automated process to assess data quality and ensure that digitized (post-OCR) applications met a defined standard for modeling purposes.
The client wanted us to:
- Create a scalable solution for automated validation of digitized applications
- Deliver data with over 90% accuracy for integration with modeling datasets
Key challenges:
- The underwriting process had evolved significantly over the past 25 years: forms with varying templates, changes in questions (and their language), incorrect applicant tagging, etc.
- Significant digitization errors, such as incorrect values, duplicated pages, and pages in the wrong order
Our approach:
- Identified variations in application questions and designed custom workflows to create a single source of truth (SSOT) by:
  - Creating a stratified (smaller) sample of policies and manually keying them into digital form, eliminating all digitization errors from this benchmark set
  - Analyzing trends in this sample to define benchmarks for missing-value percentages, outliers, and invalid entries
- Built an automated process that runs sanity checks on digitized data (missing values, outliers), profiles variable distributions, and compares them against the SSOT
- Designed reports that let underwriters track data-quality metrics at both the variable and the policy level
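The stratified-sampling step above can be sketched as follows. This is a minimal illustration assuming pandas; the column names (`form_version`, `issue_year`) and sampling fraction are hypothetical, since the client's actual stratification variables are not specified here.

```python
import pandas as pd

# Hypothetical policy metadata; column names are illustrative assumptions.
policies = pd.DataFrame({
    "policy_id": range(12),
    "form_version": ["A"] * 6 + ["B"] * 4 + ["C"] * 2,
    "issue_year": [1998, 1999, 2000, 2001, 2002, 2003,
                   2005, 2006, 2007, 2008, 2010, 2012],
})

# Draw the same fraction from every stratum so rare form versions
# are still represented in the manually keyed benchmark sample.
sample = (
    policies
    .groupby("form_version", group_keys=False)
    .sample(frac=0.5, random_state=42)
)

print(sample["form_version"].value_counts().to_dict())
```

Stratifying by form version (rather than sampling uniformly) keeps templates that appear rarely in the archive from being missed entirely in the benchmark set.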
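The variable-level sanity checks (missing percentages, out-of-range values, invalid entries) might look like this sketch. The field names, valid values, and numeric ranges are illustrative assumptions, not the client's actual rules.

```python
import numpy as np
import pandas as pd

# Hypothetical OCR output for a handful of policies; fields are illustrative.
ocr = pd.DataFrame({
    "policy_id": [1, 2, 3, 4, 5],
    "age_at_issue": [34, 47, np.nan, 212, 51],   # 212: plausible OCR misread
    "smoker": ["N", "Y", "N", "Q", "N"],         # "Q": invalid entry
})

def variable_quality(df, valid_values=None, numeric_range=None):
    """Variable-level quality metrics: % missing, % out of range,
    and % invalid categorical entries."""
    report = {}
    for col in df.columns:
        if col == "policy_id":
            continue
        s = df[col]
        metrics = {"pct_missing": s.isna().mean() * 100}
        if numeric_range and col in numeric_range:
            lo, hi = numeric_range[col]
            metrics["pct_out_of_range"] = ((s < lo) | (s > hi)).mean() * 100
        if valid_values and col in valid_values:
            metrics["pct_invalid"] = (~s.isin(valid_values[col]) & s.notna()).mean() * 100
        report[col] = metrics
    return report

report = variable_quality(
    ocr,
    valid_values={"smoker": {"Y", "N"}},
    numeric_range={"age_at_issue": (0, 120)},
)
print(report)
```

Each metric can then be compared against the benchmarks derived from the SSOT sample, with any variable exceeding its threshold flagged for review.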
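Comparing a digitized batch's variable distribution against the SSOT benchmark could be done with a simple category-share difference check like the one below; the tolerance and the data are illustrative assumptions, not the client's actual benchmarks.

```python
import pandas as pd

# Illustrative distributions; in practice these would come from the SSOT
# benchmark sample and each newly digitized batch.
ssot = pd.Series(["N"] * 80 + ["Y"] * 20, name="smoker")
batch = pd.Series(["N"] * 55 + ["Y"] * 45, name="smoker")

def distribution_drift(ssot_col, batch_col, tolerance=0.05):
    """Flag a variable whose category shares in the digitized batch
    deviate from the SSOT benchmark by more than `tolerance`."""
    p_ssot = ssot_col.value_counts(normalize=True)
    p_batch = batch_col.value_counts(normalize=True)
    diff = (p_batch.reindex(p_ssot.index, fill_value=0) - p_ssot).abs()
    return diff.max() > tolerance, diff.max()

flagged, max_diff = distribution_drift(ssot, batch)
print(flagged, round(max_diff, 2))
```

A distribution-level check like this catches systematic OCR failures (e.g., a checkbox consistently misread) that per-field validity rules would miss, since every individual value can still be a legal entry.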
Results:
- The initial prototype supported a go/no-go decision on scaling the OCR process, justifying an investment of millions of dollars
- The final solution processes and validates 50k digitized policies in under 2 hours