Implementing a data lake for a manufacturing company

    We combined big data in supply chain management with a data lake (DL) solution to reduce manual labor, improve reporting, boost performance, and increase competitiveness.

    Business context

    Our client is a manufacturer of electronic circuits with several plants in Europe and Asia. They needed to securely store any type of information from multiple sources and easily access it for business analysis. When the company’s executives came to Yalantis, the company was experiencing the following issues:

    Performance downtime

    While our client could easily resolve most causes of downtime, there were still unexpected idle periods in production with hidden root causes. It was difficult to define these root causes due to the limited amount of data available for analysis. Delays in dealing with downtime slowed production and decreased competitiveness.

    Scattered business records

    Each department in the client’s company had its own separate data management platform, and these systems weren’t properly integrated. As a result, the client struggled to properly track business performance across facilities.

    Semi-manual data management

    Employees had to manually collect and input certain types of business and production data. This process was time-consuming, error-prone, and labor-intensive.

    Lack of a unified security policy

    Employees’ workstations stored lots of business-specific records without any security and access management policies in place. Unauthorized users could easily compromise this information.

    Overview of our data lake solutions

    To efficiently tackle our client’s business issues, we built a data lake architecture to provide centralized cloud-based storage and enable supply chain data analytics. The DL can store and structure the data in all formats (structured, semi-structured, and unstructured) and from all internal and external sources. Authorized company employees can easily access this data for analysis.

    The data lake we created stores data from:

    • Third-party software

    • Excel spreadsheets

    • IoT sensors

    • Internal ERP and CRM systems

    Detailed descriptions of stored data

    • IoT data

      on humidity, temperature, and heat in production facilities

    • Production data

      including the number of products produced per day, number of errors and malfunctions, and downtime frequency

    • Business data

      including information about vendors, suppliers, and clients, invoices, documents from email attachments, items in stock, and information about supplier production capacity

    • Equipment logs

      including information on who used certain equipment in the facility and for how long as well as information on all equipment maintenance activities

    • External data

      such as employees’ timesheets, work schedules, and payroll data from Hubstaff (a third-party time-tracking tool), as well as real-time data on material tracking and production planning from Katana (a third-party manufacturing ERP platform)

    Ways to use the data lake

    With a DL repository, our client’s business analysts can:

    Generate reports and analytics

    Generate reports and analytics in data analysis software to conduct efficient root-cause analysis of production downtimes. Based on performance reports, production technicians can improve product quality control. And with real-time analytics of production rate, material planners can always maintain a sufficient level of raw materials in the facilities.

    Adopt machine learning technology

    Adopt machine learning technology to compare the production rate with market demand and analyze how to adjust production to improve competitiveness. Machine learning also makes it possible to predict when equipment needs maintenance, optimizing the equipment lifecycle based on operational activity.

    Technical perspective

    To build a data lake solution, we used a wide range of AWS services. With their help, we ensured:

    Data movement and storage

    Our team set up a fully automated data flow from all sources in the company to a single source of truth to eliminate the need for manual data management.

    We built our client’s solution on Amazon S3, a scalable cloud storage service. To transfer data to Amazon S3 and the data lake in real time, we configured Amazon Kinesis Data Firehose. AWS DataSync helped us transfer all records from our client’s on-premises databases to the DL. Using AWS Storage Gateway, we set up file exchange between our client’s on-premises legacy systems and the data lake.
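    To illustrate the real-time ingestion path, here is a minimal sketch of how one sensor reading could be framed as a Kinesis Data Firehose record. The stream name and payload fields are hypothetical, not the client's actual configuration.

```python
import json

def build_firehose_record(reading):
    """Frame one sensor reading as a Firehose record. The trailing
    newline keeps the objects that land in S3 line-delimited JSON,
    which downstream crawlers can parse row by row."""
    return {"Data": (json.dumps(reading) + "\n").encode("utf-8")}

reading = {"sensor_id": "hum-07", "metric": "humidity", "value": 43.2}
record = build_firehose_record(reading)

# With boto3 installed and AWS credentials configured, delivery would be:
#   import boto3
#   boto3.client("firehose").put_record(
#       DeliveryStreamName="plant-telemetry",  # hypothetical stream name
#       Record=record,
#   )
```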

    We also implemented AWS Lake Formation to automatically extract, transform, and load raw data. AWS Lake Formation and AWS Glue are responsible for deduplicating records and for matching and partitioning data attributes from various sources.
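    As a rough, in-memory illustration of what that ETL step does (in the real pipeline this logic runs inside Glue at scale; the field names and sample records below are hypothetical):

```python
from collections import defaultdict

def dedupe_and_partition(records, id_field="record_id", partition_field="plant"):
    """Drop duplicate records (later copies win) and bucket the survivors
    by partition value -- a stand-in for the deduplication and
    partitioning AWS Glue performs in the pipeline."""
    latest = {}
    for rec in records:
        latest[rec[id_field]] = rec  # later duplicates overwrite earlier ones
    partitions = defaultdict(list)
    for rec in latest.values():
        partitions[rec[partition_field]].append(rec)
    return dict(partitions)

raw = [
    {"record_id": 1, "plant": "EU-1", "downtime_min": 12},
    {"record_id": 1, "plant": "EU-1", "downtime_min": 12},  # duplicate
    {"record_id": 2, "plant": "ASIA-2", "downtime_min": 7},
]
curated = dedupe_and_partition(raw)
```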

    Cataloging and access

    To help business analysts quickly find and directly access the necessary information to analyze the root causes of downtime, we ensured our solution can properly group all gathered information.

    AWS Lake Formation allowed us to create catalogs with specific datasets in the DL. In addition, AWS Glue Crawler examines all data arriving in the data lake and composes queryable catalog tables. Apart from datasets, the catalogs contain information about which users can access them.
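    For instance, once the crawler has registered a table, an analyst could query it with Amazon Athena. The table, column, database, and bucket names below are assumptions for illustration, not the client's real schema.

```python
# A hypothetical Athena query against a table the Glue crawler
# registered in the catalog: which machines log the most downtime?
query = """
SELECT machine_id, COUNT(*) AS downtime_events
FROM production_logs
WHERE event_type = 'downtime'
GROUP BY machine_id
ORDER BY downtime_events DESC
"""

# With boto3 and credentials configured, it would be submitted like:
#   import boto3
#   boto3.client("athena").start_query_execution(
#       QueryString=query,
#       QueryExecutionContext={"Database": "data_lake_catalog"},  # assumed
#       ResultConfiguration={"OutputLocation": "s3://query-results/"},
#   )
```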

    Data security

    To ensure secure data access and retrieval, we combined server-side and client-side encryption. AWS Key Management Service (KMS) helped us manage and exchange encryption keys. With AWS Identity and Access Management (IAM), we created user roles and policies with different permissions for processing and accessing big data.
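    As a sketch of what such a policy and encryption setup might look like (all names, ARNs, and the key alias below are illustrative, not the client's actual configuration):

```python
# Hypothetical IAM policy granting a read-only "analyst" role access
# to one curated prefix of the data-lake bucket.
analyst_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::company-data-lake",
                "arn:aws:s3:::company-data-lake/curated/*",
            ],
        }
    ],
}

# Server-side encryption parameters passed with boto3 S3 uploads so
# objects are encrypted with a customer-managed KMS key:
sse_kms_args = {
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "alias/data-lake-key",  # hypothetical key alias
}
```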


    Project results

    Before

    • Difficulty with promptly resolving unexpected performance downtime

    • Decreased competitiveness due to production downtime

    • Complex data analysis due to lack of a single data store

    • Insecure storage of some business and production data

    • Manual handling of certain data assets

    • Scarce analytical data to thoroughly evaluate business performance

    After

    • Improved competitiveness with more analytical data to proactively tackle business challenges

    • Security and access management policies for all types of data

    • Better root cause analysis of performance downtimes

    • Single source of truth for all data gathered in the client’s facilities

    • Simplified data search and access with accurate data cataloging

    • Potential to analyze business from new perspectives due to storing a large variety of data in one place