The status quo of big data engines

In the field of big data computing and storage, many kinds of engines have been created to serve different business scenarios and data scales. On the computing side there are the data analysis engine Hive, the interactive analysis engine Presto, the iterative computing engine Spark, and the stream processing engine Flink; on the storage side there are the log storage system SLS, the distributed file system HDFS, and others. Each of these engines and systems serves the needs of its own domain well, but together they create a serious data silo problem: using several of them on the same data inevitably requires a large amount of ETL work, and chaining systems this way across a company's business pipelines is very common. The resulting data processing and dumping greatly increase cost and end-to-end latency, and business decisions take longer. The key to solving this problem is interoperability of engine metadata: only by building a unified data lake metadata view that serves the needs of every engine can data be shared, the extra ETL cost be avoided, and pipeline latency be reduced.

Design of the data lake metadata service

The design goal of the data lake metadata service is to build, in an environment of diverse big data engines and storage systems, a unified metadata view across different storage systems, data formats, and computing engines, with unified permission and metadata management. It must be compatible with, and extend, the metadata services of the open source big data ecosystem, support automatic metadata collection, and let metadata be managed once and used many times, which keeps it compatible with the open source ecosystem while remaining easy to use. In addition, metadata operations should be traceable and auditable. This requires the unified data lake metadata service to provide the following capabilities and values:

Unified permission and metadata management: a unified permission/metadata management module is the foundation of interoperability between engines and storage systems. The permission and metadata models must not only meet the business need for permission isolation, but also reasonably accommodate the various permission models of today's engines.

Large-scale metadata storage and service capability: raise the capacity limits of the metadata service to meet large data volumes and large-scale scenarios.

A unified metadata view over storage: structuring the data held in the various storage systems (object storage, file systems, log systems) makes it easy to manage, and the unified metadata is what enables the next step of analysis and processing.

Support for a rich set of computing engines: every engine accesses and computes on data through the unified metadata service view, meeting the needs of different scenarios; for example, PAI, MaxCompute, Hive, and others can analyze the same shared data on OSS (see the sketch after this list). With diversified engine support, business scenarios become easier to adapt and reuse.

Traceability and auditing of metadata operations.

Automatic metadata discovery and collection: by automatically sensing directories, files, and file formats in file storage, metadata is created and kept consistent automatically, which makes stored data easy to maintain and manage.
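As a concrete illustration of the shared-catalog idea in the list above, the sketch below shows a Spark job reading a table that is assumed to already be registered in the unified metadata service, with its files stored on OSS. The table name sales.orders is an assumption made for illustration, and the catalog endpoint and credential configuration are deployment specific and omitted here.

    # A minimal sketch, assuming a table "sales.orders" is already registered
    # in the unified metadata service and its files live on OSS. Catalog
    # endpoint and credential settings are deployment specific and omitted.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("shared-catalog-read")
        .enableHiveSupport()  # resolve tables through the external catalog
        .getOrCreate()
    )

    # The OSS location, file format and partition list all come from the
    # metadata service; the job only refers to the table by name.
    orders = spark.table("sales.orders")
    orders.groupBy("region").count().show()

Any other engine that speaks the same catalog (for example Hive or Presto) can then query the same table by name, without copying the data.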

The framework of the data lake metadata service

The upper layer of the metadata service is the engine access layer

By providing SDKs and plug-ins for multiple protocols, the access layer can flexibly support integration with various engines, meeting their need to access the metadata service. Through the view provided by the metadata service, the engines then analyze and process data in the underlying file systems.

Through the plug-in system, the EMR engines are supported seamlessly, so the whole EMR family works out of the box. Users experience the unified metadata service without noticing any change, and the scalability problems of the original MySQL-based metastore are avoided.

The metadata service provides storage views

By abstracting over different storage formats and over the directories and files in storage, the service provides a unified metadata view to the engines and avoids the inconsistencies that arise when multiple engines each use their own metadata service independently.
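As a rough illustration of what such a storage view might hold, the following sketch models a table entry whose location can point at different storage systems. The field names are assumptions made for illustration, not the actual schema of the service.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class PartitionEntry:
        # Partition key values, e.g. {"dt": "2021-06-01"}, and the partition URI.
        values: Dict[str, str]
        location: str

    @dataclass
    class TableEntry:
        # One logical table in the unified view; location may point at object
        # storage (oss://...), a file system (hdfs://...) or a log store.
        database: str
        name: str
        location: str
        data_format: str                                       # e.g. "parquet"
        columns: Dict[str, str] = field(default_factory=dict)  # name -> type
        partitions: List[PartitionEntry] = field(default_factory=list)

    # Engines only see this abstraction; where the bytes actually live is an
    # implementation detail of the location URI.
    orders = TableEntry(
        database="sales",
        name="orders",
        location="oss://my-bucket/warehouse/sales/orders/",
        data_format="parquet",
        columns={"order_id": "bigint", "region": "string"},
    )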

Management and automatic discovery of metadata

Metadata can be managed flexibly in a variety of ways and across engines. This makes it easy to integrate the metadata service and to extend its capabilities, while also reducing management costs.

Web console, SDKs, and the various engine clients and interfaces

1. DDL operations on databases, tables, and partitions, compatible with the open source ecosystem engines (a sketch follows this list).

2. Multi-version metadata management and traceability.

3. By opening up the metadata capabilities, open source tools in areas such as ETL can also be connected through various plug-ins in the future, further improving the overall ecosystem.
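The following sketch shows what the DDL and multi-version surface could look like from an SDK, using an in-memory stand-in for the service. The class MetaCatalogClient and its method names are hypothetical and used only for illustration; they are not the actual SDK API.

    # Hypothetical SDK client; class and method names are illustrative only.
    class MetaCatalogClient:
        def __init__(self, endpoint):
            self.endpoint = endpoint
            self._store = {}      # in-memory stand-in for the service
            self._versions = {}   # (database, table) -> historical schemas

        def create_database(self, name):
            self._store.setdefault(name, {})

        def create_table(self, database, table, schema, location):
            self._store[database][table] = {"schema": schema,
                                            "location": location}
            # every DDL change is recorded, enabling version traceability
            self._versions.setdefault((database, table), []).append(schema)

        def alter_table(self, database, table, schema):
            self._store[database][table]["schema"] = schema
            self._versions[(database, table)].append(schema)

        def list_versions(self, database, table):
            return self._versions[(database, table)]

    client = MetaCatalogClient(endpoint="https://metadata.example.com")
    client.create_database("sales")
    client.create_table("sales", "orders",
                        schema={"order_id": "bigint", "region": "string"},
                        location="oss://my-bucket/warehouse/sales/orders/")
    client.alter_table("sales", "orders",
                       schema={"order_id": "bigint", "region": "string",
                               "amount": "double"})
    print(len(client.list_versions("sales", "orders")))  # 2 schema versions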

Automatic discovery of metadata

Automatic metadata discovery is another core part of the metadata management capability. It can automatically collect data that lies scattered across file systems, which greatly broadens the scenarios covered by the unified metadata service and saves management cost and complexity. The capabilities include the following (a sketch follows the list):

1. Automatic analysis of the directory hierarchy, dynamically and incrementally creating database/table/partition metadata.

2. Automatic analysis of file formats, covering common formats such as plain text and the open source big data formats Parquet, ORC, and so on.
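The sketch below illustrates the directory-based part of such discovery under simple assumptions: partition directories are named key=value in the Hive style, and the file format is inferred from file extensions. It is an illustration only, not the actual discovery implementation.

    import os
    from collections import defaultdict

    # Map common file extensions to a format name; real discovery would also
    # sniff file headers, but extensions are enough for this sketch.
    EXT_TO_FORMAT = {".parquet": "parquet", ".orc": "orc",
                     ".csv": "text", ".txt": "text", ".json": "json"}

    def discover_table(table_root):
        """Walk a table directory and infer partitions and file format.

        Assumes a Hive-style layout: <table_root>/key=value/.../data files.
        """
        partitions = defaultdict(list)
        formats = set()
        for dirpath, _dirnames, filenames in os.walk(table_root):
            rel = os.path.relpath(dirpath, table_root)
            # Partition values come from key=value path components.
            spec = tuple(tuple(part.split("=", 1))
                         for part in rel.split(os.sep) if "=" in part)
            for name in filenames:
                ext = os.path.splitext(name)[1].lower()
                if ext in EXT_TO_FORMAT:
                    formats.add(EXT_TO_FORMAT[ext])
                    partitions[spec].append(os.path.join(dirpath, name))
        fmt = next(iter(formats)) if len(formats) == 1 else "unknown"
        return {"format": fmt, "partitions": dict(partitions)}

    # Example: discover_table("/data/warehouse/sales/orders") would return
    # partition specs such as (("dt", "2021-06-01"),) and the detected format.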

The future of metadata services

The data lake metadata service was born for big data and exists for ecosystem interoperability. It is expected to keep improving its service capabilities and to support more big data engines in the future. By opening up its service capabilities, storage capabilities, and unified permission and metadata management, it saves customers management, labor, and storage costs and helps them realize their own business value.
