Implementing a data lake for a manufacturing company
We consolidated big data from supply chain management in a data lake (DL) solution to reduce manual labor, improve reporting and performance, and increase competitiveness.
Business context
Our client is a manufacturer of electronic circuits with several plants in Europe and Asia. They needed to securely store any type of information from multiple sources and easily access it for business analysis. When the company’s executives came to Yalantis, the company was experiencing the following issues:
Performance downtime
While our client could easily resolve most causes of downtime, there were still unexpected idle periods in production with hidden root causes. These root causes were difficult to identify because little data was available for analysis. Delays in dealing with downtime slowed production and decreased competitiveness.
Scattered business records
Each department in the client’s company had its own separate data management platform, and these systems weren’t properly integrated. As a result, the client struggled to track business performance across facilities.
Semi-manual data management
Employees had to manually collect and input certain types of business and production data. This process was time-consuming, error-prone, and labor-intensive.
Lack of a unified security policy
Employees’ workstations stored lots of business-specific records without any security and access management policies in place. Unauthorized users could easily compromise this information.
Overview of our data lake solution
To tackle our client’s business issues efficiently, we built a data lake architecture that provides centralized cloud-based storage and enables supply chain data analytics. The DL can store and structure data in all formats (structured, semi-structured, and unstructured) from all internal and external sources. Authorized company employees can easily access this data for analysis.
The data lake we created stores data from
Third-party software
Excel spreadsheets
IoT sensors
Internal ERP and CRM systems
Detailed descriptions of stored data
IoT data
on humidity, temperature, and heat in production facilities
Production data
including the number of products produced per day, number of errors and malfunctions, and downtime frequency
Business data
including information about vendors, suppliers, and clients, invoices, documents from email attachments, items in stock, and information about supplier production capacity
Equipment logs
including information on who used certain equipment in the facility and for how long as well as information on all equipment maintenance activities
External data
such as employees’ timesheets, work schedules, payrolls from the third-party Hubstaff logistics software, and real-time data on material tracking and production planning from Katana (another piece of external software)
Ways to use the data lake
With a DL repository, our client’s business analysts can:
Generate reports and analytics in data analysis software to conduct efficient root-cause analysis of production downtimes. Based on performance reports, production technicians can improve product quality control. And with real-time analytics of production rate, material planners can always maintain a sufficient level of raw materials in the facilities.
Adopt machine learning technology to compare the production rate with market demand and analyze how to adjust the company’s production to improve competitiveness. Machine learning also makes it possible to predict when equipment will need maintenance, based on operational activity, and so optimize the equipment lifecycle.
Technical perspective
To build a data lake solution, we used a wide range of AWS services. With their help, we ensured:
Data movement and storage
Our team set up a fully automated data flow from all sources in the company to a single source of truth to eliminate the need for manual data management.
We built our client’s solution on Amazon S3, a scalable cloud storage service. To stream data in real time into Amazon S3, and from there into the data lake, we configured the Amazon Kinesis Data Firehose service. AWS DataSync helped us transfer all records from our client’s on-premises databases to the DL. Using AWS Storage Gateway, we set up a file exchange between our client’s on-premises legacy systems and the data lake.
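As a rough illustration of how streamed records might be laid out in Amazon S3, the sketch below builds a date-partitioned object key and serializes a batch of sensor readings as newline-delimited JSON. The `raw/` prefix, the field names, and the batch name are hypothetical; the case study does not describe the actual key layout or record schema.

```python
import json
from datetime import datetime, timezone


def object_key(source: str, event_time: datetime, name: str) -> str:
    """Build a Hive-style partitioned key (assumed layout, not the client's)."""
    t = event_time.astimezone(timezone.utc)
    return (
        f"raw/{source}/year={t.year:04d}/month={t.month:02d}/"
        f"day={t.day:02d}/{name}.json"
    )


def to_json_lines(records: list[dict]) -> bytes:
    """Serialize a batch as newline-delimited JSON, one record per line."""
    lines = (json.dumps(r, sort_keys=True) + "\n" for r in records)
    return "".join(lines).encode("utf-8")


# Hypothetical IoT reading; field names are illustrative only.
reading = {"sensor_id": "press-07", "temperature_c": 41.5, "humidity_pct": 38.0}
key = object_key("iot", datetime(2023, 5, 4, 9, 30, tzinfo=timezone.utc), "batch-0001")
body = to_json_lines([reading])
print(key)
```

Partitioning keys by date is a common choice because it lets query engines skip objects outside the requested time range; whether it suits a given workload depends on how analysts filter the data.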
We also implemented AWS Lake Formation to automatically extract, transform, and load raw data. AWS Lake Formation and AWS Glue deduplicate records and match and partition data attributes from various sources.
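The idea behind attribute matching and deduplication can be sketched in plain Python. This is not how AWS Glue or AWS Lake Formation implement it internally; it is a minimal illustration that assumes a hand-written alias table (`FIELD_ALIASES`) and hypothetical invoice fields.

```python
# Hypothetical mapping from source-specific column names to shared names.
FIELD_ALIASES = {
    "supplier": "supplier_name",
    "vendor_name": "supplier_name",
    "invoice_no": "invoice_id",
    "invoice number": "invoice_id",
}


def normalize(record: dict) -> dict:
    """Rename source-specific attributes to the shared names."""
    out = {}
    for name, value in record.items():
        key = name.strip().lower()
        out[FIELD_ALIASES.get(key, key)] = value
    return out


def deduplicate(records: list[dict], key_fields: tuple[str, ...]) -> list[dict]:
    """Keep the first normalized record for each combination of key fields."""
    seen = set()
    unique = []
    for record in map(normalize, records):
        identity = tuple(record.get(f) for f in key_fields)
        if identity not in seen:
            seen.add(identity)
            unique.append(record)
    return unique


# The same invoice arriving from two sources under different column names.
erp_row = {"Invoice_No": "INV-100", "Supplier": "Acme"}
excel_row = {"invoice number": "INV-100", "vendor_name": "Acme"}
other_row = {"invoice_no": "INV-101", "supplier": "Globex"}
merged = deduplicate([erp_row, excel_row, other_row], ("invoice_id", "supplier_name"))
print(merged)
```

Exact-match deduplication like this only catches records whose key fields are identical after renaming; fuzzy matches (for example, differently spelled supplier names) would need a separate matching step.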
Cataloging and access
To help business analysts quickly find and directly access the necessary information to analyze the root causes of downtime, we ensured our solution can properly group all gathered information.
AWS Lake Formation allowed us to create catalogs with specific datasets in the DL. In addition, an AWS Glue crawler examines all data arriving in the data lake and builds queryable catalog tables. Apart from datasets, catalogs contain information about the users who can access these datasets.
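To make the cataloging idea concrete, here is a simplified sketch of what a catalog entry conceptually holds: a table name, a storage location, inferred column types, and the roles allowed to read it. The type inference is a toy stand-in and not the classifier logic a real crawler uses; the bucket, table, and role names are hypothetical.

```python
from dataclasses import dataclass, field


def infer_columns(rows: list[dict]) -> dict[str, str]:
    """Infer a column -> type-name mapping from sample rows (very simplified)."""
    type_names = {bool: "boolean", int: "bigint", float: "double", str: "string"}
    columns: dict[str, str] = {}
    for row in rows:
        for name, value in row.items():
            inferred = type_names.get(type(value), "string")
            current = columns.setdefault(name, inferred)
            if current != inferred:
                # Widen int/float mixes to double, anything else to string.
                numeric = {current, inferred} == {"bigint", "double"}
                columns[name] = "double" if numeric else "string"
    return columns


@dataclass
class CatalogTable:
    name: str
    location: str
    columns: dict[str, str]
    readers: set[str] = field(default_factory=set)  # roles allowed to query

    def can_read(self, role: str) -> bool:
        return role in self.readers


# Hypothetical downtime samples and catalog entry.
samples = [
    {"line_id": "L1", "minutes_down": 12, "cause": "jam"},
    {"line_id": "L2", "minutes_down": 7.5, "cause": "unknown"},
]
table = CatalogTable(
    name="production_downtime",
    location="s3://example-data-lake/raw/production/",
    columns=infer_columns(samples),
    readers={"business_analyst"},
)
print(table.columns)
```

In the managed services, schema and permissions live in the catalog itself rather than in application code; the sketch only shows why keeping both next to the dataset makes it easy to answer "what is in this table and who may query it".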
Data security
To ensure secure data access and retrieval, we combined server-side encryption and client-side encryption. The AWS Key Management Service helped us orchestrate the accurate exchange of encryption keys. With the help of AWS Identity and Access Management (IAM), we defined user roles and policies with different permissions for processing and accessing big data.
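As an illustration of role-based access and enforced encryption, the sketch below builds two JSON policy documents in Python: a read-only IAM policy scoped to one prefix, and a bucket policy statement that denies uploads not encrypted with SSE-KMS. The bucket name and prefix are hypothetical, and the client's real policies are not described in the case study.

```python
import json

BUCKET = "example-data-lake"  # hypothetical bucket name


def read_only_policy(prefix: str) -> dict:
    """IAM identity policy allowing read access to one prefix of the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{BUCKET}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{BUCKET}/{prefix}/*",
            },
        ],
    }


def require_kms_encryption() -> dict:
    """Bucket policy statement rejecting uploads not encrypted with SSE-KMS."""
    return {
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
        "Condition": {
            "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
        },
    }


# Hypothetical analyst role limited to curated production data.
analyst_policy = read_only_policy("curated/production")
print(json.dumps(analyst_policy, indent=2))
```

Scoping each role to a prefix keeps permissions aligned with how the data is organized, so an analyst role can read curated datasets without gaining access to raw or unrelated records.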
Project results
Before
Difficulty with promptly resolving unexpected performance downtime
Decreased competitiveness due to production downtime
Complex data analysis due to lack of a single data storage
Insecure storage of some business and production data
Manual handling of certain data assets
Scarce analytical data to thoroughly evaluate business performance
After
Improved competitiveness with more analytical data to proactively tackle business challenges
Security and access management policies for all types of data
Better root cause analysis of performance downtimes
Single source of truth for all data gathered in the client’s facilities
Simplified data search and access with accurate data cataloging
Potential to analyze business from new perspectives due to storing a large variety of data in one place