Java Interview Questions: Search (Solr & Elasticsearch)

Advantages and disadvantages of Elasticsearch:

Advantages:

1. Elasticsearch is distributed out of the box. No other components are required, and replication happens in real time, an approach called "push replication".

2. Elasticsearch fully supports Apache Lucene's near-real-time search.

3. Multi-tenancy requires no special configuration, whereas Solr needs more advanced settings.

4. Elasticsearch uses the concept of a gateway, which makes backups easier.

5. The nodes form a peer-to-peer network. When some nodes fail, their work is reassigned to other nodes.

Disadvantages:

1. It had only one developer (this is no longer the case: the Elasticsearch GitHub organization now has more than one, including quite active maintainers).

2. It is not automated enough (it did not fit well with the then-new Index Warmup API).

Advantages and disadvantages of Solr:

Advantages:

1. Solr has a larger and more mature community of users, developers, and contributors.

2. It supports indexing many formats, such as HTML, PDF, and Microsoft Office formats, as well as plain-text formats such as JSON, XML, and CSV.

3. Solr is relatively mature and stable.

4. When no indexing is happening at the same time, searching is faster.

Disadvantages:

1. While indexing is in progress, search efficiency drops, so real-time indexing and search is not efficient.

Elasticsearch and Solr compared:

1. Both are easy to install.

2. Solr uses ZooKeeper for distributed management, while Elasticsearch has distributed coordination built in.

3. Solr supports more data formats, while Elasticsearch supports only JSON.

4. Solr offers more features officially, while Elasticsearch focuses on core features and leaves advanced features to third-party plugins.

5. Solr performs better than Elasticsearch in traditional search applications, but is less efficient than Elasticsearch in real-time search applications.

6. Solr is a powerful solution for traditional search applications, but Elasticsearch is better suited to emerging real-time search applications.

How does Solr implement search?

With an inverted index. First the words in each document are extracted and a mapping from each word to document IDs is built. At query time, the document IDs are looked up by word, and the documents are then retrieved by ID.
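
The word-to-document-ID mapping described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Solr or Lucene code; the class name, the whitespace/punctuation splitting rule, and the sample documents are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Minimal inverted index: maps each word to the IDs of the documents containing it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Extract the words of a document and record word -> document ID.
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up the IDs of the documents that contain the word.
    SortedSet<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Solr is built on Lucene");
        index.add(2, "Elasticsearch is built on Lucene too");
        index.add(3, "Solr uses ZooKeeper");
        System.out.println(index.search("lucene")); // prints [1, 2]
        System.out.println(index.search("solr"));   // prints [1, 3]
    }
}
```

A real index would then fetch the stored documents for the returned IDs, which is the last step the answer mentions.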

Solr filters

A Solr filter does extra processing on the token stream (TokenStream) it receives.

A filter query (the fq parameter) is set at query time.

How Solr works

Solr is a full-text search server built on Lucene, and Lucene is an API for full-text search; at its core is the full-text retrieval process. Full-text retrieval splits the original documents into keywords according to certain rules and builds an index from those keywords. A query first looks up the matching keywords in the index, then finds the corresponding documents from those keywords (these are the query results), and finally the results are shown to the user.

What is Solr based on?

It is a search engine framework based on the Lucene search library. Lucene is an open-source full-text search engine toolkit.

How to make a result rank at the top of Solr search results

Set a boost value on the field in the document. The higher the value, the higher the relevance score and the higher the result ranks.

How the IK analyzer works

In essence it is dictionary-based segmentation. A dictionary is loaded into memory; during segmentation the characters are read one by one and matched against the dictionary, until the whole document has been split into words.
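
To make dictionary-based segmentation concrete, here is a small sketch using forward maximum matching, one common dictionary strategy. It is a simplified stand-in and not IK's actual implementation; the class name and the tiny dictionary are hypothetical. An unspaced Latin string is used so the example stays ASCII, but the same logic applies to Chinese text, which has no spaces between words.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Dictionary-based segmentation using forward maximum matching.
// Simplified stand-in; not IK Analyzer's actual algorithm.
class DictionarySegmenterSketch {
    private final Set<String> dictionary;
    private final int maxWordLength;

    // Load the dictionary into memory and remember its longest entry.
    DictionarySegmenterSketch(Collection<String> words) {
        this.dictionary = new HashSet<>(words);
        int max = 1;
        for (String w : words) {
            max = Math.max(max, w.length());
        }
        this.maxWordLength = max;
    }

    // At each position take the longest dictionary word starting there;
    // fall back to a single character if nothing matches.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        DictionarySegmenterSketch segmenter = new DictionarySegmenterSketch(
                List.of("full", "text", "fulltext", "search"));
        System.out.println(segmenter.segment("fulltextsearch")); // prints [fulltext, search]
    }
}
```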

Why is a Solr index query faster than a database query?

Solr implements full-text retrieval on top of the Lucene API. Full-text retrieval is essentially a query against an index. In a database, not every field is indexed, and a LIKE query is very likely not to use an index at all, so querying with Solr is faster than querying the database.

How to deal with individual records missing from the Solr index

First of all, Solr does not lose individual records. If data is missing from the index, add it to the index again.

Lucene index optimization

Implementing full-text retrieval directly with Lucene is outdated. Solr is recommended instead, since it already provides a complete full-text retrieval solution.

Importing data from multiple tables into Solr (resolving ID conflicts)

Add a uuid field in schema.xml, then change the update section of solrconfig.xml so that the ID is generated as a UUID.

How does Solr segment words, and how are new words and stop words handled?

Configure the IK tokenizer in schema.xml, then specify IK as the analyzer for the field.

Add new words to the extension dictionary file ext.dic, and add stop words to the stop-word dictionary file stopword.dic. Then configure the stop-word dictionary in schema.xml: <filter class="solr.StopFilterFactory" ignoreCase="true" words="/path to the stop-word file"/>

Combining multiple conditions in a Solr query

Create multiple query objects and specify how they combine: Occur.MUST (must match, AND), Occur.SHOULD (should match, OR), and Occur.MUST_NOT (must not match, NOT).
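
The following sketch models how the three Occur values select documents, using plain sets of document IDs. It does not use the Lucene API (where these values belong to BooleanClause.Occur); the class, the Clause helper, and the sample ID sets are hypothetical, and scoring is ignored entirely.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Set-based model of how MUST / SHOULD / MUST_NOT clauses select documents.
// It mirrors matching only; relevance scoring is left out.
class BooleanCombinationSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // One clause: the document IDs matched by a sub-query, plus how it combines.
    static final class Clause {
        final Set<Integer> matches;
        final Occur occur;

        Clause(Set<Integer> matches, Occur occur) {
            this.matches = matches;
            this.occur = occur;
        }
    }

    static Set<Integer> evaluate(List<Clause> clauses) {
        Set<Integer> required = null;            // intersection of MUST clauses
        Set<Integer> optional = new TreeSet<>(); // union of SHOULD clauses
        Set<Integer> excluded = new TreeSet<>(); // union of MUST_NOT clauses
        for (Clause c : clauses) {
            switch (c.occur) {
                case MUST:
                    if (required == null) {
                        required = new TreeSet<>(c.matches);
                    } else {
                        required.retainAll(c.matches);
                    }
                    break;
                case SHOULD:
                    optional.addAll(c.matches);
                    break;
                case MUST_NOT:
                    excluded.addAll(c.matches);
                    break;
            }
        }
        // Without any MUST clause, at least one SHOULD clause has to match.
        Set<Integer> result = (required != null) ? required : optional;
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> solr = Set.of(1, 3);
        Set<Integer> lucene = Set.of(1, 2);
        Set<Integer> zookeeper = Set.of(3);
        // solr AND NOT zookeeper
        System.out.println(evaluate(List.of(
                new Clause(solr, Occur.MUST), new Clause(zookeeper, Occur.MUST_NOT)))); // prints [1]
        // solr OR lucene
        System.out.println(evaluate(List.of(
                new Clause(solr, Occur.SHOULD), new Clause(lucene, Occur.SHOULD)))); // prints [1, 2, 3]
    }
}
```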

What do you know about Elasticsearch? Describe your company's ES cluster architecture, the index data size, the number of shards, and some tuning approaches. What is Elasticsearch's inverted index?

Elasticsearch (ES for short) is a distributed, RESTful search and analytics server designed for distributed computing. It provides real-time search and is stable, reliable, and fast. Like Apache Solr, it is an index server based on Lucene. Compared with Solr, Elasticsearch has the following advantages:

Lightweight: it is easy to install and start. After downloading the files, a single command starts it.

Schema free: you can submit JSON objects of any structure to the server, whereas Solr specifies the index structure in schema.xml.

Multi-index support: using a different index parameter creates another index, whereas Solr needs each one configured separately.

Distributed: SolrCloud is complex to configure.

An inverted index is a concrete storage form of the word-document matrix. With an inverted index you can quickly get the list of documents that contain a given word. It consists of two main parts: the term dictionary and the inverted file (postings).

How does Elasticsearch cope with too much index data, and how do you tune and deploy it?

Use the bulk API.

For the initial indexing run, set the number of replicas to 0.

Increase threadpool.index.queue_size

Increase indices.memory.index_buffer_size

Increase index.translog.flush_threshold_ops

Increase index.translog.sync_interval

Increase index.engine.robin.refresh_interval
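
As an illustration of the first tip, the sketch below builds a request body for the bulk endpoint (POST /_bulk), which takes newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. The index name "articles", the document contents, and the helper name are assumptions for the example, and the action line assumes a recent Elasticsearch version where no type is needed. The documents are passed in as ready-made JSON strings, so no escaping is attempted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds the newline-delimited body of a bulk indexing request.
class BulkBodySketch {
    // jsonDocsById maps a document ID to its JSON source (already serialized).
    static String buildBulkBody(String index, Map<String, String> jsonDocsById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : jsonDocsById.entrySet()) {
            // Action line: which operation, which index, which ID.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(entry.getKey()).append("\"}}\n");
            // Source line: the document itself; every line ends with a newline.
            body.append(entry.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"title\":\"Solr basics\"}");
        docs.put("2", "{\"title\":\"Elasticsearch basics\"}");
        System.out.print(buildBulkBody("articles", docs));
    }
}
```

Sending many documents in one such request avoids paying the per-request overhead for every single document, which is why it helps with large indexing jobs.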

What is Elasticsearch?

Elasticsearch is a search engine based on Lucene. It provides a distributed, multi-tenant full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and released as open source under the terms of the Apache License.

What basic operations can you perform on a document?

You can perform the following operations on a document:

a. Index document content with Elasticsearch.

b. Fetch document content with Elasticsearch.

c. Update document content with Elasticsearch.

d. Delete document content with Elasticsearch.
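
The four operations map onto four HTTP requests. The sketch below only builds the requests with the JDK's java.net.http API (Java 11 or later) and prints them; nothing is sent. The node address localhost:9200 and the index name "articles" are hypothetical, and the paths assume the typeless document endpoints of recent Elasticsearch versions.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.util.List;

// Builds (but does not send) the four basic document requests.
class DocumentCrudSketch {
    // Hypothetical local node and index name, used only for illustration.
    static final String BASE = "http://localhost:9200/articles";

    static HttpRequest index(String id, String json) {
        return HttpRequest.newBuilder(URI.create(BASE + "/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(BodyPublishers.ofString(json))
                .build();
    }

    static HttpRequest get(String id) {
        return HttpRequest.newBuilder(URI.create(BASE + "/_doc/" + id)).GET().build();
    }

    static HttpRequest update(String id, String partialJson) {
        // Partial update: the changed fields are wrapped in a "doc" object.
        return HttpRequest.newBuilder(URI.create(BASE + "/_update/" + id))
                .header("Content-Type", "application/json")
                .POST(BodyPublishers.ofString("{\"doc\":" + partialJson + "}"))
                .build();
    }

    static HttpRequest delete(String id) {
        return HttpRequest.newBuilder(URI.create(BASE + "/_doc/" + id)).DELETE().build();
    }

    public static void main(String[] args) {
        List<HttpRequest> requests = List.of(
                index("1", "{\"title\":\"Solr basics\"}"),
                get("1"),
                update("1", "{\"title\":\"Solr in depth\"}"),
                delete("1"));
        for (HttpRequest request : requests) {
            System.out.println(request.method() + " " + request.uri());
        }
    }
}
```

To actually run them against a node, each request could be passed to HttpClient.send with a string body handler.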

What is an inverted index in Elasticsearch?

The inverted index is the core of a search engine. The main goal of a search engine is to provide fast lookups when finding documents that match search conditions. An inverted index is a hash-map-like data structure that directs users from a word to the documents or web pages that contain it. Its main goal is to make searching across millions of files fast.

A familiar example is the index at the back of a book: given a word, we can find the pages on which it appears.

What are clusters, nodes, indices, documents, and types in Elasticsearch?

A cluster is a collection of one or more nodes (servers) that together hold all of your data and provide federated indexing and search across all nodes. A cluster is identified by a unique name, which defaults to "elasticsearch". This name is important because a node can only be part of a cluster if it is set up to join the cluster by that name.

A node is a single server that is part of a cluster. It stores data and takes part in the cluster's indexing and search.

An index is like a "database" in a relational database. It has a mapping that defines multiple types. An index is a logical namespace that maps to one or more primary shards and can have zero or more replica shards. MySQL => database; Elasticsearch => index.

A document is like a row in a relational database. The difference is that each document in an index can have a different structure (fields), although fields they share should have the same data type. MySQL => Databases => Tables => Columns/Rows; Elasticsearch => Indices => Types => Documents with properties.

A type is a logical category or partition of an index, and its semantics are entirely up to the user.

Does Elasticsearch have a schema?

Elasticsearch can have a schema. A schema is a description of one or more fields that describes the document type and how the different fields of a document are handled. In Elasticsearch the schema is a mapping, which describes the fields in a JSON document, their data types, and how they should be indexed in the Lucene index. In Elasticsearch terms, we therefore usually call this schema a "mapping".

Elasticsearch has a flexible schema, which means you can index documents without explicitly providing one. If no mapping is specified, by default Elasticsearch generates a mapping dynamically when it detects new fields in a document during indexing.
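
The idea of dynamic mapping can be sketched as follows: when a document arrives, any field not yet in the mapping gets a type inferred from its value, and fields already mapped keep their type. This is a heavily simplified model, not Elasticsearch code. The class and method names are hypothetical, and details of the real behavior such as date detection and the keyword sub-field added to strings are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of dynamic mapping: new fields get a type inferred from their value.
class DynamicMappingSketch {
    private final Map<String, String> mapping = new LinkedHashMap<>();

    // Rough type inference from the Java value of a JSON field.
    static String inferType(Object value) {
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Integer || value instanceof Long) return "long";
        if (value instanceof Float || value instanceof Double) return "float";
        if (value instanceof Map) return "object";
        return "text";
    }

    // Index a document: fields seen for the first time are added to the mapping,
    // fields that are already mapped keep their existing type.
    void index(Map<String, Object> document) {
        for (Map.Entry<String, Object> field : document.entrySet()) {
            mapping.putIfAbsent(field.getKey(), inferType(field.getValue()));
        }
    }

    Map<String, String> mapping() {
        return mapping;
    }

    public static void main(String[] args) {
        DynamicMappingSketch sketch = new DynamicMappingSketch();
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("title", "Solr basics");
        first.put("views", 42);
        sketch.index(first);
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("title", "Elasticsearch basics");
        second.put("published", true);
        sketch.index(second);
        System.out.println(sketch.mapping()); // prints {title=text, views=long, published=boolean}
    }
}
```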

What is a shard in Elasticsearch?

In most environments, each node runs on a separate box or virtual machine.

Index: in Elasticsearch, an index is a collection of documents.

Shard: because Elasticsearch is a distributed search engine, an index is usually split into elements called shards, which are distributed across multiple nodes.
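
How a document ends up on a particular shard can be pictured as hashing a routing value (by default the document ID) and taking the result modulo the number of primary shards. The sketch below follows that simplified formula. String.hashCode() is a stand-in chosen for the example, since Elasticsearch uses its own hash function, and the class and method names are hypothetical.

```java
// Simplified shard routing: hash the routing value, then take it modulo the shard count.
class ShardRoutingSketch {
    // String.hashCode() is only a stand-in for the hash Elasticsearch really uses.
    static int shardFor(String routing, int numberOfPrimaryShards) {
        // floorMod keeps the result in [0, numberOfPrimaryShards) even for negative hashes.
        return Math.floorMod(routing.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        int primaryShards = 3;
        for (int i = 1; i <= 5; i++) {
            String id = "doc-" + i;
            System.out.println(id + " -> shard " + shardFor(id, primaryShards));
        }
    }
}
```

Because the shard depends on the number of primary shards, the same ID would land elsewhere if that number changed, which is one reason the primary shard count is normally fixed when the index is created.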

What is a replica in Elasticsearch?

An index is split into shards for distribution and scaling. A replica is a copy of a shard. A node is a running instance of Elasticsearch that belongs to a cluster. A cluster consists of one or more nodes sharing the same cluster name.

What is an analyzer in Elasticsearch?

When data is indexed in Elasticsearch, it is transformed internally by the analyzer defined for the index. An analyzer consists of one tokenizer and zero or more token filters. The tokenizer may be preceded by one or more char filters. The analysis module lets you register analyzers under logical names, which can then be referenced in mapping definitions or in certain APIs.

Elasticsearch ships with a number of ready-to-use prebuilt analyzers. Alternatively, you can combine the built-in character filters, tokenizers, and token filters to create custom analyzers.
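
The three stages of an analyzer can be sketched as a small pipeline in plain Java: a char filter that strips HTML-like tags, a tokenizer that splits on anything that is not a letter or digit, and two token filters (lowercasing and stop-word removal). This is an illustration of the structure only, not the Elasticsearch or Lucene API; the class name, the regular expressions, and the stop-word set are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Analyzer pipeline sketch: char filter -> tokenizer -> token filters.
class AnalyzerSketch {
    static List<String> analyze(String text, Set<String> stopWords) {
        // Char filter: replace HTML-like tags with a space before tokenizing.
        String filtered = text.replaceAll("<[^>]*>", " ");

        // Tokenizer: split on runs of characters that are neither letters nor digits.
        List<String> tokens = new ArrayList<>();
        for (String token : filtered.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }

        // Token filters: lowercase every token, then drop stop words.
        return tokens.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .filter(token -> !stopWords.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("<p>The Quick brown fox</p>", Set.of("the")));
        // prints [quick, brown, fox]
    }
}
```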

What is a tokenizer in Elasticsearch?

A tokenizer breaks a stream of characters (a string) into terms, or tokens. A simple tokenizer might split the string wherever it encounters whitespace or punctuation. Elasticsearch has many built-in tokenizers that can be used to build custom analyzers.
