
Implementing a data lake for a manufacturing company

We consolidated a manufacturer’s supply chain big data in a data lake (DL) solution to reduce manual labor, improve reporting, boost performance, and increase competitiveness.

Business context

Our client is a manufacturer of electronic circuits with several plants in Europe and Asia. They needed to securely store any type of information from multiple sources and easily access it for business analysis. When the company’s executives came to Yalantis, the company was experiencing the following issues:

Performance downtime

While our client could easily resolve most causes of downtime, there were still unexpected idle periods in production with hidden root causes. It was difficult to define these root causes due to the limited amount of data available for analysis. Delays in dealing with downtime slowed production and decreased competitiveness.

Scattered business records

Each department in the client’s company had its own data management platform, and these systems weren’t properly integrated. As a result, the client struggled to track business performance across facilities.

Semi-manual data management

Employees had to manually collect and input certain types of business and production data. This process was time-consuming, error-prone, and labor-intensive.

Lack of a unified security policy

Employees’ workstations stored lots of business-specific records without any security and access management policies in place. Unauthorized users could easily compromise this information.

Overview of our data lake solutions

To tackle our client’s business issues, we built a data lake architecture that provides centralized cloud-based storage and enables supply chain data analytics. The DL can store and structure data in any format (structured, semi-structured, and unstructured) from any internal or external source. Authorized company employees can easily access this data for analysis.

The data lake we created stores data from

  • Third-party software

  • Excel spreadsheets

  • IoT sensors

  • Internal ERP and CRM systems

Detailed descriptions of stored data

  • IoT data

    on humidity, temperature, and heat in production facilities

  • Production data

    including the number of products produced per day, number of errors and malfunctions, and downtime frequency

  • Business data

    including information about vendors, suppliers, and clients, invoices, documents from email attachments, items in stock, and information about supplier production capacity

  • Equipment logs

    including information on who used certain equipment in the facility and for how long as well as information on all equipment maintenance activities

  • External data

    such as employees’ timesheets, work schedules, and payroll from the third-party Hubstaff time-tracking software, and real-time data on material tracking and production planning from Katana (another piece of external software)

Ways to use the data lake

With a DL repository, our client’s business analysts can:

Generate reports and analytics in data analysis software to conduct efficient root-cause analysis of production downtimes. Based on performance reports, production technicians can improve product quality control. And with real-time analytics of production rate, material planners can always maintain a sufficient level of raw materials in the facilities.

Adopt machine learning technology to compare the production rate with market demand and analyze how to adjust the company’s production to improve competitiveness. Machine learning also allows for predicting timely equipment maintenance to optimize the equipment lifecycle based on operational activity.
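As a minimal sketch of the kind of report that supports root-cause analysis of downtime, the function below ranks equipment by total downtime. The event fields (equipment_id, minutes) are hypothetical and stand in for whatever schema the production data actually uses.

```python
from collections import defaultdict


def downtime_by_equipment(events: list) -> list:
    """Sum downtime minutes per equipment and sort from most to least affected."""
    totals = defaultdict(int)
    for event in events:
        # equipment_id and minutes are assumed field names, not the client's schema.
        totals[event["equipment_id"]] += event["minutes"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

An analyst could start from the top of this ranking and then compare the affected equipment against maintenance logs and sensor readings for the same periods.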

Technical perspective

To build a data lake solution, we used a wide range of AWS services. With their help, we ensured:

Data movement and storage

Our team set up a fully automated data flow from all sources in the company to a single source of truth to eliminate the need for manual data management.

We built our client’s solution on Amazon S3, a scalable cloud storage service. To stream data in real time into Amazon S3, where the data lake resides, we configured Amazon Kinesis Data Firehose. AWS DataSync helped us transfer all records from our client’s on-premises databases to the DL. Using AWS Storage Gateway, we set up a file exchange between our client’s on-premises legacy systems and the data lake.
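To illustrate the real-time path, here is a minimal sketch of how a single sensor reading could be handed to a Firehose delivery stream. The field names, stream name, and helper functions are assumptions made for illustration. The client is passed in as a parameter, so the same code works with a boto3 Firehose client or with a test double.

```python
import json
from datetime import datetime, timezone


def encode_reading(plant: str, sensor_id: str, metric: str, value: float) -> bytes:
    """Serialize one sensor reading as a newline-terminated JSON record."""
    reading = {
        "plant": plant,          # hypothetical field names throughout
        "sensor_id": sensor_id,
        "metric": metric,
        "value": value,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    # The trailing newline keeps records separable once they are batched
    # into a single S3 object.
    return (json.dumps(reading) + "\n").encode("utf-8")


def send_reading(firehose_client, stream_name: str, payload: bytes) -> None:
    """Hand one encoded record to a Firehose delivery stream."""
    firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )
```

With boto3 installed and credentials configured, boto3.client("firehose") could be passed as firehose_client. Firehose then buffers incoming records and writes them to the configured S3 destination.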

We also implemented AWS Lake Formation to automatically extract, transform, and load raw data. AWS Lake Formation and AWS Glue are responsible for deduplicating records and for matching and partitioning data attributes from various sources.
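The deduplication and partitioning steps can be sketched in plain Python. This is a conceptual illustration only, not the Glue job itself: the record fields (source, date, invoice_id) and the partition layout are assumptions.

```python
from typing import Iterable


def deduplicate(records: Iterable[dict], key_fields: tuple) -> list:
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def partition_prefix(record: dict) -> str:
    """Build a Hive-style partition prefix from the record's source and date."""
    # Assumed layout: source=<system>/date=<YYYY-MM-DD>/
    return f"source={record['source']}/date={record['date']}/"
```

Partitioning by source and date is one common choice because it lets query engines skip objects that cannot match a filter on those attributes.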

Cataloging and access

To help business analysts quickly find and directly access the necessary information to analyze the root causes of downtime, we ensured our solution can properly group all gathered information.

AWS Lake Formation allowed us to create catalogs with specific datasets in the DL. In addition, an AWS Glue crawler examines all data arriving in the data lake and populates the catalogs with queryable tables. Apart from datasets, catalogs contain information about the users who can access these datasets.
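Conceptually, each catalog entry pairs a dataset description with the users allowed to read it. The sketch below models that idea in plain Python. It is not the Lake Formation or Glue API, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One dataset in the catalog plus the users allowed to read it."""
    table_name: str
    location: str   # hypothetical S3 prefix holding the data
    columns: dict   # column name -> type, as a crawler might infer it
    allowed_readers: set = field(default_factory=set)


def readable_tables(catalog: list, user: str) -> list:
    """Return the names of the tables a given user may query."""
    return [entry.table_name for entry in catalog if user in entry.allowed_readers]
```

Keeping access information next to the dataset description means an analyst searching the catalog only sees the tables they are actually permitted to query.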

Data security

To ensure secure data access and retrieval, we combined server-side encryption and client-side encryption. The AWS Key Management Service helped us manage the exchange of encryption keys. With the help of AWS Identity and Access Management (IAM), we defined user roles and policies that grant different permissions for processing and accessing data.
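As an example of what a role-specific permission can look like, the sketch below builds an IAM policy document that grants read-only access to a single prefix in an S3 bucket. The bucket and prefix are placeholders, and the function is an illustration under those assumptions rather than the policy used in the project.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the keys under the given prefix.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow downloading objects under that prefix, nothing else.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


def policy_as_json(bucket: str, prefix: str) -> str:
    """Render the policy as the JSON text that IAM expects."""
    return json.dumps(read_only_policy(bucket, prefix), indent=2)
```

If the objects are encrypted with a customer-managed KMS key, a role using this policy would also need kms:Decrypt permission on that key before it could read them.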

Project results

Before

  • Difficulty with promptly resolving unexpected performance downtime

  • Decreased competitiveness due to production downtime

  • Complex data analysis due to lack of a single data store

  • Insecure storage of some business and production data

  • Manual handling of certain data assets

  • Scarce analytical data to thoroughly evaluate business performance

After

  • Improved competitiveness with more analytical data to proactively tackle business challenges

  • Security and access management policies for all types of data

  • Better root cause analysis of performance downtimes

  • Single source of truth for all data gathered in the client’s facilities

  • Simplified data search and access with accurate data cataloging

  • Potential to analyze business from new perspectives due to storing a large variety of data in one place