MySQL fuzzy queries no longer need to like+%

MySQL fuzzy queries no longer need to like+%

We all know that InnoDB's use of '%xx' when fuzzy querying data will cause index failure, but sometimes this is the requirement, and there are many similar requirements

最后更新 4/15/2022 9:31 PM
_沸羊羊_
预计阅读 12 分钟
分类
MySQL
标签
MySQL simulation query

preface

We all know that InnoDB's use of "%xx" when fuzzy query data will cause the index to fail, but sometimes this is the requirement. There are many similar requirements. For example, search engines need to conduct full-text searches based on keywords in user data. E-commerce websites need to search in the detailed introduction of products based on the user's query conditions. These are not tasks that B+ tree indexing can easily accomplish.

Through numerical comparison, range filtering, etc., we can complete most of the queries we need. However, if you want to filter queries through keyword matching, then you need a similarity-based query rather than the original exact numerical comparison. Full-text indexing is designed for this scenario.

Full-Text Search is a technology that finds out any information in an entire book or an entire article stored in a database. It can obtain relevant chapters, sections, paragraphs, sentences, words and other information in the full text as needed, and can also conduct various statistics and analysis.

In the early days of MySQL, InnoDB did not support full-text retrieval technology. Starting from MySQL 5.6, InnoDB began to support full-text retrieval.

inverted index

Full-text retrieval is usually implemented using inverted index. Like B+Tree, inverted index is also an index structure. It stores a mapping between the word and the position of the word itself in one or more documents in an auxiliary table, which is usually achieved using associative arrays and has two manifestations:

  • inverted file index: {word, id of the document where the word is located}
  • full inverted index: {word,(id of the document in which the word is located, and position in the specific document)}

The above figure shows the inverted file index associative array. You can see that the word "code" exists in documents 1 and 4. In this way, storing and performing full-text query will be simple. You can directly obtain documents containing query keywords based on Documents; The full inverted index stores pairs, i.e.(DocumentId,Position), so the inverted index stored therein is as shown in the figure below, for example, the keyword "code" exists in the 6th word of Document 1 and the 8th word of Document 4. In contrast, the full inverted index takes up more space, but can better locate data and expand some other search features.

full-text retrieval

create a full-text index

  1. The syntax for creating a full-text index when creating a table is as follows:
CREATE TABLE table_name ( id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, author VARCHAR(200),
title VARCHAR(200), content TEXT(500), FULLTEXT full_index_name (col_name) ) ENGINE=InnoDB;

Enter the query statement:

SELECT table_id, name, space from INFORMATION_SCHEMA.INNODB_TABLES
WHERE name LIKE 'test/%';

The above six index tables form an inverted index, which is called an auxiliary index table. When incoming documents are tokenized, individual words, together with location information and associated DOC_ID, are fully sorted and partitioned in six index tables based on the character set sorting weight of the first character of the word.

  1. The syntax for creating a full-text index on the created table is as follows:
CREATE FULLTEXT INDEX full_index_name ON table_name(col_name);

Use full-text indexing

MySQL database supports full-text retrieval queries. Full-text indexes can only be used on tables in InnoDB or MyISAM, and can only be used to create columns of types char,varchar, and text.

Its syntax is as follows:

MATCH(col1,col2,...) AGAINST(expr[search_modifier])
search_modifier:
{
    IN NATURAL LANGUAGE MODE
    | IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
    | IN BOOLEAN MODE
    | WITH QUERY EXPANSION
}

全文搜索使用  MATCH() AGAINST()语法进行,其中,MATCH() 采用逗号分隔的列表,命名要搜索的列。AGAINST()接收一个要搜索的字符串,以及一个要执行的搜索类型的可选修饰符。全文检索分为三种类型:自然语言搜索、布尔搜索、查询扩展搜索,下面将对各种查询模式进行介绍。

Natural Language

自然语言搜索将搜索字符串解释为自然人类语言中的短语,MATCH()默认采用 Natural Language 模式,其表示查询带有指定关键字的文档。

Next, combine demo to better understand Natural Language

SELECT
    count(*) AS count
FROM
    `fts_articles`
WHERE
    MATCH ( title, body ) AGAINST ( 'MySQL' );

The above statement queries the number of rows containing the 'MySQL' keyword in the title and body columns. The above statement can also be written like this:

SELECT
    count(IF(MATCH ( title, body )
    against ( 'MySQL' ), 1, NULL )) AS count
FROM
    `fts_articles`;

Although the results obtained by the above two statements are the same, from the perspective of internal operation, the execution speed of the second SQL sentence is faster, because the first SQL sentence (based on the where index query method) also requires correlation sorting statistics, while the second method is not needed.

You can also query correlations through SQL statements:

SELECT
    *,
    MATCH ( title, body ) against ( 'MySQL' ) AS Relevance
FROM
    fts_articles;

Correlation is calculated based on the following four conditions:

  • Whether word appears in the document
  • The number of times word appears in the document
  • The number of words in the index column How many documents contain the word

For full-text retrieval of the InnoDB storage engine, the following factors also need to be considered:

  • The query word is in the stopword column, ignore the query for this string
  • Whether the character length of the queried word is within the range [innodb_ft_min_token_size,innodb_ft_max_token_size]

If the word is in stopword, the word is not queried. For example, the word 'for' is queried, and the result is as follows:

SELECT
    *,
    MATCH ( title, body ) against ( 'for' ) AS Relevance
FROM
    fts_articles;

You can see that although 'for' appears in documents 2 and 4, its correlation is 0 because it is a stopword.

The parameters innodb_ft_min_token_size and innodb_ft_max_token_size control the length of the InnoDB engine query character. When the length is less than innodb_ft_min_token_size or greater than innodb_ft_max_token_size, the search for the word will be ignored. In the InnoDB engine, the default value for the parameter innodb_ft_min_token_size is 3, and the default value for innodb_ft_max_token_size is 84

Boolean

Boolean search uses the rules of a special query language to interpret a search string that contains the word to search for. It can also contain operators specified required, such as a word must or may not be present in the matching line, or its weight should be higher or lower than usual. For example, the following statement requires a query for documents with the string "Pease" but no "hot", where + and-indicate that the word must exist or must not exist, respectively.

select * from fts_test where MATCH(content) AGAINST('+Pease -hot' IN BOOLEAN MODE);

The types supported by Boolean full-text search include:

  • +: indicates that the word must exist
  • : means that the word must not exist
  • (no operator) means that the word is optional, but if it appears, it will be more relevant
  • @distance 表示查询的多个单词之间的距离是否在 distance 之内,distance 的单位是字节,这种全文检索的查询也称为 Proximity Search,如 MATCH(context) AGAINST('"Pease hot"@30' IN - BOOLEAN MODE)语句表示字符串 Pease 和 hot 之间的距离需在 30 字节内
  • Represents increased relevance when the word appears

  • <: indicates that the correlation is reduced when the word appears
  • ~: means that the word is allowed, but the correlation is negative when it occurs
    • :表示以该单词开头的单词,如 lik*,表示可以是 liklikelikes
  • ": expresses a phrase

Here are some demos to see how Boolean Mode is used.

demo1:+ -

SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH ( title, body ) AGAINST ( '+MySQL -YourSQL' IN BOOLEAN MODE );

The above statement queries information that contains 'MySQL' but does not contain 'YourSQL'

demo2: no operator

SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH ( title, body ) AGAINST ( 'MySQL IBM' IN BOOLEAN MODE );

In the above statement, the query 'MySQL IBM' does not have the symbols of '+' and '-', which means that word is optional and will have higher relevance if it appears

demo3:@

SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH ( title, body ) AGAINST ( '"DB2 IBM"@3' IN BOOLEAN MODE );
复制代码
上述语句,代表 "DB2" ,"IBM"两个词之间的距离在3字节之内

demo4:> <

SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH ( title, body ) AGAINST ( '+MySQL +(>database <DBMS)' IN BOOLEAN MODE );

In the above statement, the query contains row information for both 'MySQL','database' and 'DBMS', but rows that do not contain 'DBMS' are more relevant than rows that contain 'DBMS'.

demo5: ~

SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH ( title, body ) AGAINST ( 'MySQL ~database' IN BOOLEAN MODE );

The above statement queries rows that contain 'MySQL', but if the row also contains 'database', the correlation is reduced.

demo6:*

SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH ( title, body ) AGAINST ( 'My*' IN BOOLEAN MODE );

In the above statement, the query keyword contains the line information of 'My'.

demo7:"

SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH ( title, body ) AGAINST ( '"MySQL Security"' IN BOOLEAN MODE );

In the above statement, query line information containing the exact phrase 'MySQL Security'.

Query Expansion

Query expansion search is a modification of natural language search. This kind of query is usually performed when the query keywords are too short and the user needs enforced knowledge. For example, for a query on the word database, the user may want to query not only the document containing the database, but may also refer to the words containing MySQL, Oracle, and RDBMS. At this time, the Query Expansion pattern can be used to enable enforced knowledge for full-text retrieval

Blind query expansion (also known as automatic relevance feedback) can be turned on by adding WITH QUERY EXPANSION / IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION to the query statement, which is divided into two phases.

  • Phase 1: Conduct a full-text index query based on the searched words
  • The second stage: Conduct another full-text search query based on the word segmentation generated in the first stage

Let's take a look at an example of how Query Expansion is used.

-- 创建索引
create FULLTEXT INDEX title_body_index on fts_articles(title,body);
-- 使用 Natural Language 模式查询
SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH(title,body) AGAINST('database');

The query results before using Query Expansion are as follows:

-- 当使用 Query Expansion 模式查询
SELECT
    *
FROM
    `fts_articles`
WHERE
    MATCH(title,body) AGAINST('database' WITH QUERY expansion);

After using Query Expansion, the query results are as follows:

Since Query Expansion's full-text search may bring many irrelevant queries, users may need to be very cautious when using it.

Delete full-text index

  1. The syntax for directly deleting the full-text index is as follows:
DROP INDEX full_idx_name ON db_name.table_name;
  1. Use alter table to delete full-text indexes. The syntax is as follows:
ALTER TABLE db_name.table_name DROP INDEX full_idx_name;

summary

This article introduces fulltext index from the perspective of combining theory and practice. If you are interested in MySQL, you can continue to pay attention to the MySQL column.

Keep Exploring

延伸阅读

更多文章