<> preface

Through the previous chapter , We learned that clickhouse It is a high-performance OLAP data storage , Data analysis database engine , It is mainly used for online analysis and query processing (OLAP), Can use SQL
Query and generate analysis data report in real time , Based on this , It has also become a new dark horse in the field of big data .

Since it can be used as data analysis , Based on its own characteristics , There is such a scenario in production , When relational database , such as mysql The amount of data is huge , But these data are more for analysis , Statistics , When summarizing other businesses , Migrate this data to mysql Is a very good choice , Let's talk about how to clickhouse Docking mysq Data, so as to realize data synchronization

<>MaterializeMySQL engine

ClickHouse New version added MaterializeMySQL of database engine , Through this engine , Be able to database
Map to MySQL One of database , and since move stay ClickHouse Create corresponding in ReplacingMergeTree.ClickHouse
Service as MySQL copy , read Binlog And execute DDL and DML request , Based on MySQL Binlog Real time synchronization function of business database based on mechanism .

characteristic :

* MaterializeMySQL Support full and incremental synchronization at the same time , stay database It will be fully synchronized at the beginning of creation
MySQL Tables and data in , Then it will pass binlog Perform incremental synchronization
* MaterializeMySQL database Each created for it ReplacingMergeTree Automatically added
_sign and _version field .
* among ,_version be used as ReplacingMergeTree of ver Version parameters , Whenever you listen insert,update and delete
Event time , stay databse Internal global self increment . and _sign Used to mark whether it has been deleted , Value 1 or person -1
Usage rules :

DDL query

DDL query ,MySQL DDL The query is converted to the corresponding ClickHouse DDL query (ALTER, CREATE, DROP, RENAME). If
ClickHouse Some cannot be resolved DDL query , The query will be ignored

Data replication

MaterializeMySQL Direct insertion is not supported , Delete and update queries , But will DDL Statement :
MySQL INSERT The query is converted to INSERT with _sign=1;
MySQL DELETE The query is converted to INSERT with _sign=-1;
MySQL UPDATE The query is converted to INSERT with _sign=1 and INSERT with _sign=-1

SELECT query

If in SELECT Not specified in query _version, Then use FINAL Modifier , return _version Data corresponding to the maximum value of , That is, the latest version of data ;

If in SELECT Not specified in query _sign, It is used by default WHERE _sign=1, Return to undeleted status (_sign=1) Data ;

Index conversion

ClickHouse The database table will automatically MySQL Convert primary keys and index clauses to ORDER BY tuple

ClickHouse There is only one physical order , from ORDER BY Clause decision . If necessary, create a new physical order , Please use materialized view

<> Synchronization principle

Core same mysql The principle of master-slave replication is basically the same ,clickhouse The same is true , Bottom layer through listening binlog The change of log data realizes the synchronous writing of data , Change, etc

<> Environmental preparation

* clickhouse service ( Install and start )
* mysql service ( Install and start )
<> Construction steps

1, open mysql of binlog to configure
[mysqld] server-id=1 log-bin=mysql-bin binlog_format=row binlog-do-db=bank1
# Database name of specific synchronization # Sync to clickhouse Configuration of gtid-mode=on enforce-gtid-consistency=1 ##
Set master-slave strong consistency log-slave-updates=1
explain :

If it is mysql reach mysql Data synchronization for ,“# Sync to clickhouse Configuration of ”
Several configurations in this note are not required , however clickhouse Words , Configuration must be done here , because GTID yes MySQL Copy enhanced , from MySQL 5.6
Version start support , It's already MySQL Mainstream replication mode . It's for each event Assign a globally unique ID And serial number , We don't have to care MySQL
Cluster master-slave topology , Direct notification MySQL this GTID that will do

2, restart mysql service
sudo systemctl restart mysqld
3,mysql of bank1 Create two tables under the database , For testing
CREATE TABLE `t_organization` ( `id` int(11) NOT NULL AUTO_INCREMENT, `code`
int NOT NULL,`name` text DEFAULT NULL, `updatetime` datetime DEFAULT NULL,
testck.t_organization(code, name,updatetime) VALUES(1000,'Realinsight',NOW());
INSERT INTO testck.t_organization(code, name,updatetime) VALUES(1001,
'Realindex',NOW()); INSERT INTO testck.t_organization (code, name,updatetime)
VALUES(1002,'EDT',NOW()); CREATE TABLE `t_user` ( `id` int(11) NOT NULL
testck.t_user(code) VALUES(1);
INSERT INTO t_user (code) VALUES(1);

be careful : Test table created here , You must have a primary key field

4, open ClickHouse Materialized engine

stay clickhouse of shell Client window , Execute the following command
set allow_experimental_database_materialize_mysql=1;

5, Create replication pipeline

ClickHouse Create in MaterializeMySQL database , stay clickhouse of shell Client window , Execute the following command

among 4 The two parameters are MySQL address ,databse,username and password

After the above command is executed , You can check ClickHouse Does the database have the above database name

Then switch to mysql_sync Under database , Can see , Without doing anything ,mysql The two tables and data created in have been synchronized , It shows that full data synchronization has been done for the first time

5,mysql Modify data in

stay mysql Execute the following modification statement in
update t_organization set name = CONCAT(name,'-v1') where id = 1
see clickhouse Log can see binlog Listening events , At this time, query again clickhouse Table data above in , You can see that the data is synchronized almost in real time

MySQL Delete data
DELETE FROM t_organization where id = 2;
ClicKHouse, Log yes DeleteRows of binlog Listening events , View data ,id by 2 The data was deleted

Add in the query just now _sign and _version Virtual field , Look at the query results again
select *,_sign,_version from t_organization order by _sign desc,_version desc;

on top , We deleted id by 2 Data , therefore _ sign Field is marked as -1 , It shows that this data is not really deleted , But only in normal use select
The result of the statement is not displayed , meanwhile , Through the back version Field can be found , Operate the data once at a time ,version Fields are incremented