Using Regular Expressions in Python: A Detailed Guide

I. Importing the re library

To use regular expressions in Python, import the re library:
import re
The re library is most often used to search for text that matches a pattern (rule), and to replace it.

II. How to use regular expressions

1. Find the rule that the target text follows;

2. Express that rule with regular expression symbols;

3. Extract the information. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.

III. Common basic symbols in regular expressions

1. Dot “.”

    A dot stands for any single character except the newline character (\n), including but not limited to English letters, digits, Chinese characters, and English and Chinese punctuation.

2. Asterisk “*”

    An asterisk means that the subexpression before it (an ordinary character, another regular expression symbol, or a group of symbols) may occur anywhere from 0 times to an unlimited number of times.

3. Question mark “?”

    A question mark means that the subexpression before it occurs 0 times or 1 time. Note that this is the English (half-width) question mark, not the full-width one.

4. Backslash “\”

    A backslash cannot stand alone in a regular expression, and it cannot stand alone in an ordinary Python string either. It is combined with another character, either to turn a special symbol into an ordinary one (for example, “\.” matches a literal dot) or to turn an ordinary character into a special one (for example, “\n” is a newline). Writing patterns as raw strings (r"...") keeps Python from interpreting the backslashes before the re library sees them.

5. Digit “\d”

    In a regular expression, “\d” stands for a single digit. Although “\d” is written as a backslash followed by the letter d, it must be read as one regular expression symbol.

6. Parentheses “()”

    Parentheses capture the text matched by the part of the pattern inside them, so that it can be extracted.
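The six symbols above can be tried out in a short script. This is a minimal sketch; the sample strings and variable names are made up for illustration and do not come from the original post.

```python
import re

# "." matches any single character except a newline.
dot_match = re.findall(r"a.c", "abc a-c a\nc")

# "*" repeats the preceding subexpression zero or more times.
star_match = re.findall(r"ab*", "a ab abbb")

# "?" makes the preceding subexpression optional (zero or one time).
optional_match = re.findall(r"colou?r", "color colour")

# "\." is a literal dot and "\d" is a single digit.
digit_match = re.findall(r"\d\.\d", "pi is 3.1, e is 2.7")

# "()" captures only the part of the match inside the parentheses.
captured = re.findall(r"id=(\d*)", "id=42 id=7")

print(dot_match)       # ['abc', 'a-c']
print(star_match)      # ['a', 'ab', 'abbb']
print(optional_match)  # ['color', 'colour']
print(digit_match)     # ['3.1', '2.7']
print(captured)        # ['42', '7']
```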

IV. Common regular expression examples

1. .*? (match anything, as few characters as possible)

For example, '<title>(.*?)</title>' extracts the title of a web page.
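As a rough illustration of the title pattern, the sketch below runs it against a hard-coded HTML string (an assumption; a real crawler would download the page first). It also contrasts the non-greedy ".*?" with the greedy ".*" on a second made-up string.

```python
import re

# Hypothetical page source used only for this example.
html = "<html><head><title>Example page</title></head><body>...</body></html>"

match = re.search(r"<title>(.*?)</title>", html)
title = match.group(1) if match else None
print(title)  # Example page

# Greedy ".*" runs to the last closing tag; non-greedy ".*?" stops at the first.
snippet = "<b>one</b> and <b>two</b>"
greedy = re.findall(r"<b>(.*)</b>", snippet)
lazy = re.findall(r"<b>(.*?)</b>", snippet)
print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```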

2. \w matches a word character, [A-Za-z0-9_] (for str patterns in Python 3 it also matches other Unicode letters and digits); "+" matches the preceding character 1 or more times
For example: a person's email address looks like [email protected]. How do we extract it from a large amount of text?
pattern: \w+@\w+\.com

Think about it: if the address is [email protected], with one more dot-separated part in the domain, how do we match it?
pattern: \w+@(\w+\.)?\w+\.com

Here "?" means that the group in parentheses before it is matched 0 times or 1 time. "()" marks its contents as a group, and groups are numbered in order from the start of the pattern to its end. Because the group is matched 0 times or 1 time, the part in parentheses is optional, so this pattern matches both of the address formats above.

Extension: the pattern \w+@(\w+\.)*\w+\.com is more powerful still, because "*" matches 0 times or any number of times.
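The three email patterns can be compared side by side. The addresses below are hypothetical stand-ins (the originals are obscured in the source), and matching() is a helper defined only for this sketch.

```python
import re

simple = r"\w+@\w+\.com"
optional_sub = r"\w+@(\w+\.)?\w+\.com"
any_subs = r"\w+@(\w+\.)*\w+\.com"

# Hypothetical addresses with zero, one, and two extra domain parts.
addresses = ["alice@example.com", "alice@mail.example.com", "alice@a.b.example.com"]

def matching(pattern):
    """Return the addresses that the pattern matches in full."""
    return [a for a in addresses if re.fullmatch(pattern, a)]

simple_hits = matching(simple)
optional_hits = matching(optional_sub)
any_hits = matching(any_subs)
print(simple_hits)    # ['alice@example.com']
print(optional_hits)  # ['alice@example.com', 'alice@mail.example.com']
print(any_hits)       # all three addresses

# Pulling an address out of surrounding text: group(0) is the whole match.
found = re.search(any_subs, "contact: alice@mail.example.com today").group(0)
print(found)  # alice@mail.example.com
```

Because the last two patterns contain a capturing group, re.findall would return the group contents rather than the whole address, which is why this sketch uses search() and group(0).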

V. Core functions of the re library

1. compile() function (optional)

•     Function definition: compile(pattern, flags=0)

•     Function description: compiles the regular expression pattern and returns a regular expression object.

Why compile a pattern? The book "Core Python Programming" explains it this way:

Using precompiled code objects is faster than using strings directly, because the interpreter must compile a string into a code object before it can execute code given in string form.

The step is optional because the module-level functions compile the pattern themselves and cache recently used patterns, so the gain matters mainly when one pattern is reused many times.
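A minimal sketch of the two styles, with a made-up sample string: the compiled object exposes the same operations as methods, and the module-level call gives the same result.

```python
import re

text = "order 66, aisle 7"

# Compile once, then reuse the pattern object's methods.
digits = re.compile(r"\d+")
first = digits.search(text).group()
all_numbers = digits.findall(text)

# Equivalent module-level call without an explicit compile step.
same_numbers = re.findall(r"\d+", text)

print(first)         # 66
print(all_numbers)   # ['66', '7']
print(same_numbers)  # ['66', '7']
```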

2. match() function

•     Function definition: match(pattern, string, flags=0)

•     Function description: matches the pattern only at the beginning of the string. If the match succeeds, a match object is returned (there is only one result); otherwise None is returned.

This raises a question: why does the printed result (result1 in the original example) contain so much, something like <_sre.SRE_Match object; span=(0, 5), match='12345'>? Only the last part looks like the text we wanted to match. How do we extract it?
There is no need to worry. What we have at this point is a match object, and its content has to be extracted with the object's methods. The section "Methods of match objects" below covers this, so keep reading.
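The original example did not survive, so the following is a reconstruction under assumptions: the input string "12345+abcde" and the name result2 are invented, chosen so that the match object has the span (0, 5) quoted in this post.

```python
import re

# match() only tries to match at the very beginning of the string.
result1 = re.match(r"\d+", "12345+abcde")
print(result1)  # Python 3.7+ prints: <re.Match object; span=(0, 5), match='12345'>

# The digits are not at the beginning here, so match() returns None.
result2 = re.match(r"\d+", "abcde+12345")
print(result2)  # None
```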

3. search() function

* Function definition: search(pattern, string, flags=0)
* Function description: works like match(), except that search() does not match only at the beginning; it scans the string and returns the first match found at any position. If no position in the string matches, None is returned; otherwise a match object is returned.
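The difference between the two functions shows up as soon as the wanted text is not at the start of the string. A small sketch with an invented sample string:

```python
import re

text = "abcde+12345"

at_start = re.match(r"\d+", text)   # digits are not at position 0
anywhere = re.search(r"\d+", text)  # scans forward until a match is found

print(at_start)          # None
print(anywhere.group())  # 12345
print(anywhere.span())   # (6, 11)
```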

4. findall() function

* Function definition: findall(pattern, string[, flags])
* Function description: finds every occurrence of the regular expression pattern in the string and returns a list of the matches

Unlike match and search, findall returns a list of all non-overlapping matches rather than a match object (repeated values are kept, not removed). If no match is found, an empty list is returned.
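A short sketch of findall's return values, using invented sample text. It shows that duplicates are kept, that matches do not overlap, and that a pattern with several groups yields tuples.

```python
import re

text = "3 apples, 12 pears, 3 plums"

numbers = re.findall(r"\d+", text)
none_found = re.findall(r"\d+", "no digits here")
pairs = re.findall(r"(\d+) (\w+)", text)  # one tuple per match when there are groups
non_overlapping = re.findall(r"aa", "aaaa")

print(numbers)          # ['3', '12', '3']
print(none_found)       # []
print(pairs)            # [('3', 'apples'), ('12', 'pears'), ('3', 'plums')]
print(non_overlapping)  # ['aa', 'aa']
```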

VI. Methods of match objects (extraction)

The values returned by the re functions above fall into two kinds:

*      A match object, such as <_sre.SRE_Match object; span=(0, 5), match='12345'> (Python 3.7 and later print it as <re.Match object; span=(0, 5), match='12345'>). The functions that return match objects are match, search, and finditer; finditer yields one match object per match.
*      A list of matches, which is what findall returns.
The methods of match objects therefore apply only to the results of match, search, and finditer, not to the result of findall.

Match objects have two commonly used methods, group and groups, plus several position-related ones: start and end return the indices where the match begins and ends, and span returns both as a tuple.
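The position methods and finditer can be sketched as follows; the sample string is an assumption made for this example.

```python
import re

text = "abcde+12345"

m = re.search(r"\d+", text)
start, end, span = m.start(), m.end(), m.span()
print(start, end, span)  # 6 11 (6, 11)

# finditer yields one match object per match, so the same methods apply.
spans = [found.span() for found in re.finditer(r"[a-z]+|\d+", text)]
print(spans)  # [(0, 5), (6, 11)]
```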

1. group method

* Method definition: group(num=0)
* Method description: returns the entire match, or the subgroup with the given number

This method relies on the grouping concept introduced earlier.

The point of grouping is that we often want more than the whole matched string: we also want a specific substring inside it. Suppose the entire string is “I 12345+abcde” but we only want “abcde”. We enclose the part of the pattern that matches it in (). In this way, a pattern can be grouped however you like in order to extract exactly what you want.
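A sketch of group() on the string mentioned above. The pattern is an assumption, since the original code is missing; it puts the digits in group 1 and the letters in group 2.

```python
import re

m = re.search(r"(\d+)\+([a-z]+)", "I 12345+abcde")

whole = m.group()      # same as m.group(0): the entire match
digits = m.group(1)    # first pair of parentheses
letters = m.group(2)   # second pair of parentheses
both = m.group(1, 2)   # several groups at once, returned as a tuple

print(whole)    # 12345+abcde
print(digits)   # 12345
print(letters)  # abcde
print(both)     # ('12345', 'abcde')
```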

2. groups method

* Method definition: groups(default=None)
* Method description: returns a tuple containing all matched subgroups; if the pattern contains no groups, an empty tuple is returned. The default value is used for groups that took no part in the match.
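The behaviour of groups(), including the empty tuple and the default argument, can be checked with a few invented inputs:

```python
import re

m = re.search(r"(\d+)\+([a-z]+)", "12345+abcde")
all_groups = m.groups()

# A pattern without parentheses has no subgroups at all.
no_groups = re.search(r"\d+", "12345").groups()

# An optional group that did not take part in the match is reported as default.
opt = re.search(r"(\d+)(\+[a-z]+)?", "12345")
with_none = opt.groups()
with_default = opt.groups(default="")

print(all_groups)    # ('12345', 'abcde')
print(no_groups)     # ()
print(with_none)     # ('12345', None)
print(with_default)  # ('12345', '')
```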

VII. Flags of the re module

The commonly used flags of the re module are:

* re.I: matching ignores case; (commonly used)
* re.L: makes \w, \W, \b, \B, \s, \S depend on the current locale (in Python 3 it is only allowed with bytes patterns);
* re.M: ^ and $ match at the beginning and end of each line in the target string, instead of only at the beginning and end of the whole string;
* re.S: “.” (dot) normally matches any single character except \n (newline); with this flag, “.” matches every character, including the newline; (commonly used)
* re.X: whitespace in the pattern is ignored, and so is # together with all following text on that line, unless the character is escaped with a backslash or appears inside a character class; this allows comments in the pattern and improves readability;
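Each of the commonly used flags can be seen in action in the sketch below. All sample strings are invented for illustration; flags are passed as the last argument and can be combined with "|".

```python
import re

# re.I: ignore case.
any_case = re.findall(r"python", "Python PYTHON python", re.I)

# re.M: ^ matches at the start of every line, not just the start of the string.
lines = "first line\nsecond line"
without_m = re.findall(r"^\w+", lines)
with_m = re.findall(r"^\w+", lines, re.M)

# re.S: "." also matches the newline character.
block = "<p>a\nb</p>"
without_s = re.search(r"<p>(.*?)</p>", block)
with_s = re.search(r"<p>(.*?)</p>", block, re.S).group(1)

# re.X: whitespace and "#" comments inside the pattern are ignored.
phone = re.compile(r"""
    \d+   # leading digits
    -     # a literal hyphen
    \d+   # trailing digits
""", re.X)
number = phone.search("call 555-1234").group()

print(any_case)   # ['Python', 'PYTHON', 'python']
print(without_m)  # ['first']
print(with_m)     # ['first', 'second']
print(without_s)  # None
print(repr(with_s))  # 'a\nb'
print(number)     # 555-1234
```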

Note:

* If the pattern is compiled with compile(), the flag must be passed to compile(); passing a flag to a module-level matching function together with an already compiled pattern raises an error;
If compile() is not used, the flag can be passed directly to the matching function, such as findall.
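The note above can be verified with a small sketch (sample strings are invented). In CPython, handing a compiled pattern plus a flag to a module-level function raises ValueError.

```python
import re

text = "Hello\nWorld"

# Flags go into compile() when the pattern is compiled first.
compiled = re.compile(r"hello.world", re.I | re.S)
from_compiled = compiled.findall(text)

# A compiled pattern combined with a flag in the module-level call is rejected.
try:
    re.findall(compiled, text, re.S)
    error_raised = False
except ValueError:
    error_raised = True

# Without compile(), the flags are passed to the matching function directly.
from_string = re.findall(r"hello.world", text, re.I | re.S)

print(from_compiled)  # ['Hello\nWorld']
print(error_raised)   # True
print(from_string)    # ['Hello\nWorld']
```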

Appendix:

List of regular expression syntax
