<> Why Kafka is so fast, explained in one article

<>1. First, why use a messaging system?

Before messaging systems were common, many traditional services handled the steps of a request either serially or in parallel.

For example, consider registering an account on a website. The serial and parallel approaches look like this.

Serial mode:

User registration example: the user submits the form, the registration information is written to the database, a registration email is sent, and then a registration SMS is sent for verification. Each step takes 50 milliseconds, 150 milliseconds in all.

Parallel mode:

Unlike serial mode, once the registration information has been written to the database, the email and the SMS are sent at the same time. With the same 50 milliseconds per step, the user waits about 100 milliseconds instead of 150.

Message system:

The messaging system is responsible for transferring data from one application to another, so applications can focus on the data itself and not worry about how to share it. A distributed messaging system is built on the concept of a reliable message queue: messages are queued asynchronously between client applications and the messaging system.
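As a rough illustration of asynchronous queuing in the registration example, here is a minimal in-process sketch. It uses Python's standard `queue` and `threading` modules as a stand-in for a real messaging system; `register`, `worker`, `send_email`, and `send_sms` are hypothetical names, not part of any Kafka API.

```python
import queue
import threading


def send_email(user):
    # Stand-in for a real email call.
    return f"email sent to {user}"


def send_sms(user):
    # Stand-in for a real SMS call.
    return f"sms sent to {user}"


def worker(tasks, results):
    """Consume notification tasks until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:
            break
        handler, user = task
        results.append(handler(user))


def register(user, database, tasks):
    """Write to the database, enqueue the notifications, return at once."""
    database.append(user)
    tasks.put((send_email, user))
    tasks.put((send_sms, user))
    return "registered"


def demo():
    database, results = [], []
    tasks = queue.Queue()
    consumer = threading.Thread(target=worker, args=(tasks, results))
    consumer.start()
    status = register("alice", database, tasks)  # does not wait for email/SMS
    tasks.put(None)  # tell the worker to stop after draining the queue
    consumer.join()
    return status, database, results
```

The point of the design is that `register` only pays for the database write and two cheap enqueue operations; the email and SMS are handled later by whoever consumes the queue.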

<>2. Types of messaging systems:

Point to point:

Messages are transferred through a queue, as in A->B: A produces and B consumes. Once B consumes a message from the queue, that message is removed from the queue.

Publish-subscribe:

There are three main components:

Topic: a category of messages. For example, one kind of message is all about orders and another is all about users. You create a separate topic for each kind and store the corresponding messages in it.

Publisher: actively pushes messages to the messaging system.

Subscriber: gets data from the messaging system by either pull or push.
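The difference between the two models can be sketched in a few lines. This is a toy in-memory illustration only; `PointToPointQueue` and `PubSubBroker` are hypothetical classes, and the subscriber side here uses push (callbacks) for brevity.

```python
from collections import defaultdict, deque


class PointToPointQueue:
    """Each message is delivered once and removed when consumed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Consuming a message deletes it from the queue.
        return self._messages.popleft() if self._messages else None


class PubSubBroker:
    """Every subscriber of a topic receives every message on that topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)
```

For example, two subscribers registered on an `"orders"` topic would both see a published order, while a subscriber on `"users"` would see nothing.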

<>3. Kafka application scenarios and architecture

Kafka is a distributed publish-subscribe messaging system and a powerful message queue. Written in Scala and Java, it is a distributed, partitioned, multi-replica, multi-subscriber log system that can be used for search logs, monitoring logs, access logs, and so on.

Kafka architecture diagram:

Producers: produce data and save it to the Kafka cluster.

Consumers: consume from the cluster the data that producers wrote.

Connectors: let you build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to each table.

Stream processors: let an application act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more topics, effectively transforming input streams into output streams.

Related terms:

Broker: a Kafka cluster contains one or more server instances; each such instance is called a broker.

Topic: every message published to a Kafka cluster has a category, which is called a topic.

Partition: a partition is a physical concept; each topic contains one or more partitions.

Producer: publishes messages to Kafka brokers.

Consumer: the message consumer, a client that reads messages from Kafka brokers.

Consumer Group: every consumer belongs to a specific consumer group (a group name can be specified for each consumer).

Architecture diagram:

Process introduction: ZooKeeper is an open-source distributed coordination service. Producers push data to the cluster and consumers pull data from it, and in the architecture described here both depend on ZooKeeper. To push data to the Kafka cluster, a producer first has to find out where the cluster's nodes are, and that lookup goes through ZooKeeper. Consumers also rely on ZooKeeper to know which data to consume: they get an offset from it, and the offset records how far the last consumption got, so the next batch of data can be consumed from there. (Newer Kafka clients discover the cluster through a list of bootstrap brokers and store consumer offsets in the internal `__consumer_offsets` topic instead of in ZooKeeper.)

<>Kafka partition offset

An offset is a long-typed number that uniquely identifies a message within a partition. Consumers track their position with the triple (offset, partition, topic).

Any message published to a partition is appended directly to the end of the log file, and the position of each message in the file is called its offset. The consumer records the position of its last consumption and resumes from that position next time. As long as the position is recorded correctly, consumption always continues from the next message, with no repeated consumption and no lost messages.
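The append-and-track behavior can be sketched with a toy in-memory model. `PartitionLog` and `Consumer` are hypothetical classes written for this illustration; they are not the Kafka client API, and a real consumer commits its position durably instead of keeping it in a dict.

```python
class PartitionLog:
    """Append-only list standing in for one partition's log file."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # The offset is simply the position of the message in the log.
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks the next offset to read per (topic, partition)."""

    def __init__(self, logs):
        self._logs = logs          # {(topic, partition): PartitionLog}
        self._positions = {}       # {(topic, partition): next offset}

    def poll(self, topic, partition, max_messages=10):
        key = (topic, partition)
        start = self._positions.get(key, 0)
        batch = self._logs[key].read(start, max_messages)
        # Remember where we stopped so the next poll continues from there.
        self._positions[key] = start + len(batch)
        return batch
```

Because reading never removes anything from the log, several consumers can each keep their own position over the same partition.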

Why Kafka is so fast can be understood mainly from the following 4 aspects:

1. Kafka's storage design

In Kafka's file storage, a topic has multiple partitions and each partition is a directory. A partition is named with the topic name plus a sequence number; the first partition's number starts at 0 and the largest is the number of partitions minus 1. Each partition (directory) is divided into multiple segment data files of equal size, although the number of messages in each segment file is not necessarily equal. Splitting a partition into many small segment files makes it easy to periodically clean up or delete old files and reduce disk usage. Each segment consists of an index file and a log file, so data can be located quickly; the index metadata is mapped into memory, which avoids disk I/O on the segment file during lookups.
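One way to picture the lookup is as two binary searches: first find the segment, then find the nearest index entry inside it. The sketch below assumes each segment is identified by the offset of its first message and that the index is sparse (only some offsets have entries); `find_segment` and `find_position` are hypothetical helper names, not Kafka functions.

```python
from bisect import bisect_right


def find_segment(base_offsets, target_offset):
    """Return the base offset of the segment that holds target_offset.

    base_offsets is a sorted list of the first offset in each segment.
    """
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise ValueError("offset is below the first segment")
    return base_offsets[i]


def find_position(index_entries, base_offset, target_offset):
    """Return the byte position to start scanning the log file from.

    index_entries is a sorted list of (relative_offset, byte_position)
    pairs; the nearest entry at or before the target is used.
    """
    relative = target_offset - base_offset
    keys = [entry[0] for entry in index_entries]
    i = bisect_right(keys, relative) - 1
    return index_entries[i][1] if i >= 0 else 0
```

With segments starting at offsets 0, 1000, and 2000, a request for offset 1150 would land in the segment based at 1000, and the index would then give a byte position from which a short sequential scan reaches the message.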

2. Using page cache + mmap

The page cache caches file data in pages. A page is a logical concept, so the page cache sits at the same level as the file system. Its role is to speed up data I/O: on a write, data goes to the cache first, the written pages are marked dirty, and they are later flushed to external storage; on a read, the cache is checked first, and external storage is read only on a miss. In the page cache each file is a radix tree whose nodes are pages, so a page can be located quickly from the offset in the file.

Why does Kafka use the page cache for storage management?

1. In the JVM everything is an object, and storing data as objects wastes space.

2. Having the JVM manage the data reduces throughput.

3. If the application crashes, data managed inside the process is lost, which has serious consequences.

mmap stands for memory-mapped files. It maps a physical disk file to the page cache so that a process can read and write the file as if it were memory, which streamlines the interaction between data reads and writes and the disk.
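A small sketch of the idea, using Python's standard `mmap` module on a temporary file. This only demonstrates memory-mapped writes in general; it is not how a Kafka broker is implemented, and `write_through_mmap` is a hypothetical helper name.

```python
import mmap
import os
import tempfile


def write_through_mmap(path, payload):
    """Write payload at the start of an existing file via a memory mapping."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[:len(payload)] = payload  # looks like a memory write
            mm.flush()                   # ask the OS to write dirty pages out


def demo():
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\x00" * 4096)  # pre-size the file so it can be mapped
        write_through_mmap(path, b"hello kafka")
        with open(path, "rb") as f:
            return f.read(11)
    finally:
        os.remove(path)
```

Without the explicit `flush`, the operating system would still write the dirty pages back on its own schedule, which is the asynchronous behavior the page cache provides.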

3. Kafka's batch compression design

In large enterprises data flows extremely fast. For a message queue under that kind of load, the bottleneck is not only disk I/O but, more often, network I/O. Message compression is therefore especially important for Kafka's performance.

In Kafka, compression can happen in two places: on the producer side and on the broker side. Kafka compresses messages in batches instead of one message at a time, because compressing every message individually greatly reduces compression efficiency. Kafka supports several compression codecs (such as gzip, snappy, lz4, and zstd) and allows recursive message sets.
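The effect of batching on compression can be checked with a quick experiment. The sketch below uses `zlib` from the Python standard library and made-up JSON messages as an assumption; Kafka's own codecs and batch format differ, but the principle (shared redundancy across messages compresses better) is the same.

```python
import json
import zlib


def make_messages(n):
    """Build n similar JSON messages, as is typical for event streams."""
    return [
        json.dumps({"user_id": i, "event": "register", "status": "ok"}).encode()
        for i in range(n)
    ]


def compress_individually(messages):
    # Each message pays the codec overhead and cannot share redundancy.
    return [zlib.compress(m) for m in messages]


def compress_batch(messages):
    # One compressed blob for the whole batch, newline-delimited.
    return zlib.compress(b"\n".join(messages))


def decompress_batch(blob):
    return zlib.decompress(blob).split(b"\n")
```

Comparing `sum(len(c) for c in compress_individually(msgs))` with `len(compress_batch(msgs))` for a few hundred such messages shows the batch form is considerably smaller, which is what saves network I/O.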

4. How Kafka reads and writes messages

1. The producer connects to ZooKeeper or to a broker and finds the leader of the target partition.

2. The producer sends the message to the leader. The leader writes the message to its log; the followers pull the message from the leader, write it to their local logs, and send an ACK to the leader. After receiving the ACKs, the leader sends an ACK back to the producer.
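The two steps above can be sketched as a toy, single-process simulation. `Replica` and `Leader` are hypothetical classes for illustration; real replication is asynchronous and networked, and whether the producer waits for all followers depends on its acknowledgement settings.

```python
class Replica:
    """A broker's local copy of one partition log."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the appended message

    def fetch_from(self, leader, offset):
        """Pull one message from the leader, store it, and return an ACK."""
        self.append(leader.log[offset])
        return True


class Leader(Replica):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def handle_produce(self, message):
        """Write locally, wait for follower ACKs, then ACK the producer."""
        offset = self.append(message)
        acks = [f.fetch_from(self, offset) for f in self.followers]
        return {"offset": offset, "acked": all(acks)}
```

After `handle_produce` returns, the leader and every follower hold the same log, which is the condition under which the producer receives its ACK in this simplified model.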

After the producer sends batch-compressed data to the broker, the broker uses a memory-mapping function to map the file address into memory and writes through that mapping. Writes go directly into the page cache, and OS threads then flush them to disk asynchronously, so the write path is optimized in one step.

When Kafka reads data, it checks whether the data is already in the page cache; if it is, the data is served directly from the page cache, so consuming real-time data is much faster.


Linux has two contexts, kernel mode and user mode. Reading a file and sending it over the network normally takes 4 copies:

1. read is called and the file is copied into a kernel-mode buffer.

2. The CPU copies the data from kernel mode to user mode.

3. When write is called, the user-mode content is copied into the kernel-mode socket buffer.

4. Finally, the data in the kernel-mode socket buffer is copied to the network card.

The drawback is the extra context switches and 2 wasted copies.

Zero-copy technology:

The application asks the kernel to transfer the data on disk directly to the socket, without passing through the application. Zero copy greatly improves application performance: it removes the unnecessary copies between kernel buffers and user buffers, which lowers CPU cost, and it reduces context switches between kernel mode and user mode.

The corresponding zero-copy techniques are mmap and sendfile:

1. mmap: fast for transferring small files

2. sendfile: faster than mmap for transferring large files
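To make the contrast with the four-copy path concrete, here is a hedged sketch that sends the same file twice over a local socket pair: once with a plain read-then-send loop, and once with Python's `socket.sendfile`, which uses the operating system's `sendfile` call where available and falls back to ordinary sends elsewhere. The helper names are hypothetical, and this is not Kafka's code (Kafka runs on the JVM).

```python
import os
import socket
import tempfile


def recv_exactly(sock, n):
    """Read n bytes from sock (fewer only if the peer closes early)."""
    chunks, remaining = [], n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_with_read_write(path, sock):
    """Traditional path: data is copied into user space and back."""
    with open(path, "rb") as f:
        data = f.read()      # kernel buffer -> user-space buffer
        sock.sendall(data)   # user-space buffer -> kernel socket buffer
    return len(data)


def send_with_sendfile(path, sock):
    """Zero-copy path: the kernel moves file data to the socket itself."""
    with open(path, "rb") as f:
        return sock.sendfile(f)  # returns the number of bytes sent


def demo(payload=b"kafka" * 200):
    fd, path = tempfile.mkstemp()
    sender, receiver = socket.socketpair()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        sent = send_with_read_write(path, sender)
        first = recv_exactly(receiver, sent)
        sent = send_with_sendfile(path, sender)
        second = recv_exactly(receiver, sent)
        return first, second
    finally:
        sender.close()
        receiver.close()
        os.remove(path)
```

Both calls deliver identical bytes; the difference is only in how many times the data is copied and how many user/kernel transitions occur along the way. The payload is kept small here so that a single blocking send fits in the socket buffer.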

Applications: Kafka, Netty, and RocketMQ all use zero-copy technology.

By now, when asked why Kafka is so fast, I believe you can answer the interviewer fluently!