Tool comparison:

Kettle (a conventional ETL tool)

Characteristics: written in pure Java.

Advantages: runs on Windows, Linux, and Unix; data extraction is efficient and stable; the Spoon subcomponent offers a large number of steps, so complex business logic can be developed, and both full and incremental synchronization are easy to implement.

Disadvantages: jobs run on a schedule, so real-time performance is poor.

Components:

Spoon: a graphical interface for designing ETL data transformations.

Pan: runs the transformations designed in Spoon in batch mode.

Chef: jobs (stateful; you can monitor whether a job has executed, how fast it runs, and so on).

Kitchen: runs the jobs designed in Chef in batch mode.

 

 

Sqoop (less commonly used)

Characteristics: mainly used to transfer data between HDFS and relational databases.

DataX (an offline data synchronization tool used by Alibaba; open source):

Characteristics: synchronizes data between different types of data sources (including relational databases, distributed file systems, and so on).

Advantages: simple to operate, with only two steps: first, create the job's configuration file; second, launch the job from that configuration file.

Disadvantages: lacks built-in support for incremental updates, although you can implement incremental synchronization yourself, for example with a shell script.
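
One way to work around the missing incremental support is to regenerate the job configuration before each run, with a filter that selects only rows changed since the last run. The following is a minimal sketch, not an official recipe: it assumes the source is MySQL read through the `mysqlreader` plug-in with its `where` parameter, and the table `orders`, the column `updated_at`, the credentials, and the JDBC URL are all hypothetical placeholders.

```python
import json
from datetime import datetime


def build_incremental_job(last_sync: datetime, now: datetime) -> dict:
    """Build a DataX-style job config that reads only rows changed in [last_sync, now)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    # Assumption: the source table records its modification time in `updated_at`.
    where = (
        f"updated_at >= '{last_sync.strftime(fmt)}' "
        f"AND updated_at < '{now.strftime(fmt)}'"
    )
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "user",    # placeholder
            "password": "secret",  # placeholder
            "column": ["id", "name", "updated_at"],
            "where": where,        # the incremental filter
            "connection": [
                {
                    "table": ["orders"],  # hypothetical table
                    "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/demo"],
                }
            ],
        },
    }
    # streamwriter just prints records; swap in a real writer plug-in as needed.
    writer = {"name": "streamwriter", "parameter": {"print": True}}
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


def write_job_file(path: str, last_sync: datetime, now: datetime) -> None:
    """Step one of the two-step workflow: create the job's configuration file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_incremental_job(last_sync, now), f, indent=2)
```

A scheduler (cron, for example) would call `write_job_file` with the timestamp of the previous successful run and then start DataX on the generated file. Where the last-sync timestamp is stored is left open here.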

Job: a data synchronization job.

Splitter: the job-splitting module, which breaks a large task into multiple small tasks that run concurrently.

Sub-job: one of the small synchronization tasks produced by the split.

Reader (Loader): the data input module, responsible for running the small tasks after the split and loading data from the source into DataX.

Storage: the Reader and Writer exchange data through Storage.

Writer (Dumper): the data writing module, responsible for importing data from DataX into the destination.

Inside the DataX framework, techniques such as double-buffered queues and thread-pool encapsulation handle high-speed data exchange. The framework exposes a simple interface for interacting with plug-ins, which come in two types, Reader and Writer. Building on the plug-in interface the framework provides, the plug-ins you need can be developed very conveniently.
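
The division of labor between Splitter, Reader, Storage, and Writer can be illustrated with a small sketch. This is not DataX's code (DataX is a Java framework and uses double-buffered queues); it is a simplified Python analogy that assumes a single bounded queue as the Storage, and every function name in it is hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of one reader's output


def split(rows, parts):
    """Splitter: cut one large job into `parts` smaller sub-jobs."""
    return [rows[i::parts] for i in range(parts)]


def reader(sub_job, storage):
    """Reader: load records from the source and hand them to Storage."""
    for record in sub_job:
        storage.put(record)  # blocks while the buffer is full
    storage.put(SENTINEL)


def writer(storage, n_readers, destination):
    """Writer: drain Storage into the destination until every reader is done."""
    finished = 0
    while finished < n_readers:
        item = storage.get()
        if item is SENTINEL:
            finished += 1
        else:
            destination.append(item)


def run_job(rows, parts=3, buffer_size=4):
    """Job: split the work, run readers concurrently, and collect the output."""
    storage = queue.Queue(maxsize=buffer_size)  # bounded buffer between both sides
    destination = []
    sub_jobs = split(rows, parts)
    w = threading.Thread(target=writer, args=(storage, len(sub_jobs), destination))
    readers = [threading.Thread(target=reader, args=(s, storage)) for s in sub_jobs]
    w.start()
    for t in readers:
        t.start()
    for t in readers:
        t.join()
    w.join()
    return destination
```

Because the readers run concurrently, records arrive at the destination in no guaranteed order. The bounded queue keeps a fast reader from getting arbitrarily far ahead of a slow writer.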

 

 

StreamSets (currently widely used)

Characteristics: lightweight, with a powerful engine; supports real-time streaming data extraction; developers can easily build batch and streaming dataflows with little code.

Components:

Data Collector: routes and processes data.

Pipeline:

Technology