Preface

When learning to write crawlers, efficiency quickly becomes a problem once the number of pages to crawl grows large. To fetch the content we want quickly, we can use multithreading or multiprocessing.

Multithreading and multiprocessing are not the same thing: one uses the threading library, the other the multiprocessing library. Multithreading with threading is often dismissed in the Python community as "chicken ribs" (something of little value that is still a pity to throw away). There is even a well-known saying: "Python multithreading is a chicken rib; use multiprocessing instead!"

Why is Python multithreading called a chicken rib? The reasons are as follows:

In most environments, a single-core CPU can execute only one thread at any given moment, while a multi-core CPU can execute several threads at the same time. In Python, however, no matter how many cores the CPU has, only one thread can execute Python code at a time. This is because of the GIL.

GIL stands for Global Interpreter Lock, a decision made early in Python's design for the sake of data safety. A thread in Python must acquire the GIL before it can execute. You can think of the GIL as a "pass": within one Python process there is only one GIL, and a thread that cannot get the pass is not allowed to run on the CPU. The GIL belongs to the CPython interpreter rather than to the Python language itself: CPython runs Python threads on native operating system threads created through C, cannot control how the CPU schedules them, and therefore relies on the GIL to ensure that only one thread accesses interpreter data at a time. Jython, for example, has no GIL, whereas PyPy has a GIL of its own.

Under Python multithreading, each thread executes as follows:

* Get the shared data
* Acquire the GIL
* The Python interpreter calls a native operating system thread
* The CPU performs the operation
* When the thread's time slice runs out, it releases the GIL, whether or not its task is finished
* The next thread scheduled by the CPU repeats the process above

Before any Python thread can execute, it must first acquire the GIL. In older CPython versions the interpreter then released the GIL after every 100 bytecode instructions ("ticks") so that other threads had a chance to run; since Python 3.2 the release is requested after a fixed time interval (5 ms by default) instead. The GIL effectively locks the execution of all threads' code, so threads in Python can only run alternately: even 100 threads on a 100-core CPU can use only one core.
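
The time-based release described above can be inspected from Python itself. The following is a minimal sketch, assuming a CPython 3.2 or later interpreter; the variable name `interval` is chosen here only for illustration.

```python
import sys

# Interval (in seconds) after which CPython asks the running thread
# to release the GIL so that another thread can be scheduled.
interval = sys.getswitchinterval()
print(f"GIL switch interval: {interval} s")
```

The value can be changed with `sys.setswitchinterval()`, although tuning it does not remove the one-thread-at-a-time restriction.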

So does this mean Python multithreading is useless?

In Python, the efficiency of multithreading depends on the type of task:

* For CPU-intensive tasks (loops, computation, and so on), the heavy computation means the tick count quickly reaches its threshold, which triggers the release of the GIL and a new round of competition for it; switching back and forth between threads consumes resources. Python multithreading is therefore not well suited to CPU-intensive tasks.
* For IO-intensive tasks (file processing, network communication, and other operations that read and write data), multithreading can effectively improve efficiency. With a single thread, an IO operation blocks while it waits, wasting time unnecessarily; with multiple threads, while thread A is waiting the interpreter automatically switches to thread B, so CPU resources are not wasted and the program runs more efficiently. Python multithreading is therefore well suited to IO-intensive tasks.
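
The IO-intensive case can be sketched with a simulated wait. This is only an illustration: `fake_io`, `run_sequential`, and `run_threaded` are hypothetical helper names, and `time.sleep` stands in for a real network or disk operation (like real blocking IO, it releases the GIL while waiting).

```python
import threading
import time

def fake_io(delay):
    # Stand-in for a blocking IO call; time.sleep releases the GIL.
    time.sleep(delay)

def run_sequential(n, delay):
    # Run n simulated IO calls one after another.
    start = time.perf_counter()
    for _ in range(n):
        fake_io(delay)
    return time.perf_counter() - start

def run_threaded(n, delay):
    # Run n simulated IO calls in n threads and wait for all of them.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

sequential = run_sequential(4, 0.2)
threaded = run_threaded(4, 0.2)
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

The threaded run should take roughly as long as a single wait, because the four waits overlap instead of adding up.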

Suggestions in practice

If we want Python to make full use of a multi-core CPU, we should use multiple processes. Each process has its own independent GIL, so the processes do not interfere with one another and truly parallel execution becomes possible. In Python, multiprocessing is more efficient than multithreading (on multi-core CPUs only). The usual recommendation is to use multithreading for IO-intensive tasks and multiprocessing for compute-intensive tasks.
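
A minimal sketch of the multiprocessing recommendation, assuming a compute-only workload: `cpu_task` is a hypothetical function invented for this example, and the pool size of 4 is arbitrary.

```python
from multiprocessing import Pool

def cpu_task(n):
    # Pure computation: sum of the squares of the integers below n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Each worker process has its own interpreter and its own GIL,
    # so the four tasks can run on four cores at once.
    with Pool(processes=4) as pool:
        results = pool.map(cpu_task, [100_000] * 4)
    print(results)
```

The `if __name__ == "__main__":` guard matters here: on platforms that start worker processes by importing the main module, it prevents each worker from creating a pool of its own.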

Creating threads in Python

1. Creating a thread directly from a function

Python 3 provides the threading.Thread class in the built-in threading module, which makes creating threads very convenient. threading.Thread() usually takes two parameters:

* target (the thread function): the function the thread runs in the background. We define it ourselves; be careful not to add () after the name.
* args (the thread function's arguments): the arguments the thread function needs, passed in as a tuple. If the function takes no arguments, args can be omitted.

    import time
    from threading import Thread

    # Custom thread function.
    def main(name="Python"):
        for i in range(2):
            print("hello", name)
            time.sleep(2)

    # Create thread 01 without specifying arguments
    thread_01 = Thread(target=main)
    # Start thread 01
    thread_01.start()

    # Create thread 02 with an argument; note the trailing comma
    thread_02 = Thread(target=main, args=("MING",))
    # Start thread 02
    thread_02.start()

Output:

    hello Python
    hello MING
    hello Python
    hello MING

2. Creating a thread by subclassing threading.Thread

To create a thread this way, inherit directly from threading.Thread and override its __init__ and run methods.

    import time
    from threading import Thread

    class MyThread(Thread):
        def __init__(self, name="Python"):
            # Note: super().__init__() must be called, and it must come
            # first; otherwise an error is raised.
            super().__init__()
            self.name = name

        def run(self):
            for i in range(2):
                print("hello", self.name)
                time.sleep(2)

    if __name__ == '__main__':
        # Create thread 01 without specifying arguments
        thread_01 = MyThread()
        # Create thread 02 with an argument
        thread_02 = MyThread("MING")

        thread_01.start()
        thread_02.start()

Output:

    hello Python
    hello MING
    hello MING
    hello Python

These are the most basic ways to create threads. They look simple, but real applications are considerably more complicated.

Next, I record one by one the multithreading exercises from my own study.

Multithreading exercises

Before discussing thread synchronization, let me first use an example to show what unsynchronized threads look like.

    import threading
    import time

    class myThread(threading.Thread):
        def __init__(self, threadID, name, counter):
            threading.Thread.__init__(self)
            self.threadID = threadID
            self.name = name
            self.counter = counter

        def run(self):
            print('Starting ' + self.name)
            print_time(self.name, self.counter, 3)
            print("Exiting " + self.name)

    def print_time(threadName, delay, count):
        while count:
            # Wait "delay" seconds between prints, so each thread runs at its own pace
            time.sleep(delay)
            print('{0} processing {1}'.format(threadName, time.ctime(time.time())))
            count -= 1

    if __name__ == '__main__':
        threadList = ["Thread-1", "Thread-2"]
        threads = []
        threadID = 1
        for tName in threadList:
            thread = myThread(threadID, tName, threadID)
            thread.start()
            threads.append(thread)
            threadID += 1
        # Without join(), the main thread does not wait for the workers
        # for t in threads:
        #     t.join()
        print("Exiting Main Thread")

Output:

    Starting Thread-1Starting Thread-2
    Exiting Main Thread
    Thread-1 processing Sat May 11 11:06:11 2019
    Thread-2 processing Sat May 11 11:06:12 2019
    Thread-1 processing Sat May 11 11:06:12 2019
    Thread-1 processing Sat May 11 11:06:13 2019
    Exiting Thread-1
    Thread-2 processing Sat May 11 11:06:14 2019
    Thread-2 processing Sat May 11 11:06:16 2019
    Exiting Thread-2

The output shows that the two threads' lines are interleaved: the second thread prints while the first is still in the middle of its own printing, and the main thread finishes without waiting for either of them. This indicates that the threads are not synchronized.

Thread synchronization: Lock

In a multithreaded program, all variables are shared by all threads, so any variable can be modified by any thread. The biggest danger of sharing data between threads is therefore that several threads change the same variable at the same time and corrupt its contents.

The Lock and RLock objects of the threading module provide simple thread synchronization. Both have acquire and release methods; for data that only one thread may operate on at a time, place the operation between the acquire and release calls.

    import threading
    import queue
    import time

    exitFlag = 0

    class myThread(threading.Thread):
        def __init__(self, que):
            threading.Thread.__init__(self)
            self.que = que

        def run(self):
            print("Starting " + threading.current_thread().name)
            process_data(self.que)
            print("Ending " + threading.current_thread().name)

    def process_data(que):
        while not exitFlag:
            # Only one thread at a time may check and read the queue
            queueLock.acquire()
            if not que.empty():
                data = que.get()
                queueLock.release()
                print('Current Thread Name %s, data: %s, time: %s' % (
                    threading.current_thread().name, data, time.ctime()))
            else:
                queueLock.release()
            time.sleep(0.1)

    if __name__ == '__main__':
        start_time = time.time()
        queueLock = threading.Lock()
        workqueue = queue.Queue()
        threads = []

        # Fill the queue
        queueLock.acquire()
        for i in range(1, 10):
            workqueue.put(i)
        queueLock.release()

        for i in range(4):
            thread = myThread(workqueue)
            thread.start()
            threads.append(thread)

        # Wait for the queue to empty
        while not workqueue.empty():
            pass

        # Tell the threads to exit
        exitFlag = 1

        for t in threads:
            t.join()
        print("Exiting Main Thread")
        end_time = time.time()
        print('time consuming {}s'.format(end_time - start_time))

Output:

    Starting Thread-1
    Current Thread Name Thread-1, data: 1, time: Sat May 11 10:47:25 2019
    Starting Thread-2
    Current Thread Name Thread-2, data: 2, time: Sat May 11 10:47:25 2019
    Starting Thread-3
    Current Thread Name Thread-3, data: 3, time: Sat May 11 10:47:25 2019
    Starting Thread-4
    Current Thread Name Thread-4, data: 4, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-1, data: 5, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-2, data: 6, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-3, data: 7, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-4, data: 8, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-2, data: 9, time: Sat May 11 10:47:25 2019
    Ending Thread-4
    Ending Thread-1
    Ending Thread-2
    Ending Thread-3
    Exiting Main Thread
    time consuming 0.31415677070617676s
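
As a smaller illustration of placing an operation between acquire and release, here is a sketch with a shared counter. The names `counter`, `counter_lock`, and `add_many` are made up for this example; the `with` statement is used as a shorthand that calls acquire on entry and release on exit.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(times):
    global counter
    for _ in range(times):
        # "with" acquires the lock and always releases it, even on error,
        # so only one thread at a time updates the counter.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

With the lock in place, the four threads together always produce 40000; without it, updates from different threads could overwrite one another.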

Thread synchronization: Queue

Python's queue module provides synchronized, thread-safe queue classes: the FIFO (first in, first out) queue Queue, the LIFO (last in, first out) queue LifoQueue, and the priority queue PriorityQueue. All of these queues implement the necessary locking primitives, so they can be used directly from multiple threads, and you can use them to synchronize threads.
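
The ordering of the three queue classes can be sketched in a single thread as follows. The helper `drain` is hypothetical and exists only to empty a queue into a list for comparison.

```python
import queue

def drain(q):
    # Remove every item from the queue and return them in retrieval order.
    # Checking empty() is safe here because only one thread uses the queue.
    items = []
    while not q.empty():
        items.append(q.get())
    return items

fifo = queue.Queue()
lifo = queue.LifoQueue()
prio = queue.PriorityQueue()

for item in (3, 1, 2):
    fifo.put(item)
    lifo.put(item)
    prio.put(item)

fifo_items = drain(fifo)   # insertion order
lifo_items = drain(lifo)   # reverse insertion order
prio_items = drain(prio)   # smallest item first
print(fifo_items, lifo_items, prio_items)
```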

    import threading
    import time
    import queue

    def get_num(workQueue):
        print("Starting " + threading.current_thread().name)
        while True:
            if workQueue.empty():
                break
            try:
                num = workQueue.get_nowait()
            except queue.Empty:
                # Another thread took the last item after the empty() check
                break
            print('Current Thread Name %s, data: %s, time: %s' % (
                threading.current_thread().name, num, time.ctime()))
            time.sleep(0.3)
            # try:
            #     num = workQueue.get_nowait()
            #     if not num:
            #         break
            #     print('Current Thread Name %s, Url: %s ' % (threading.current_thread().name, num))
            # except queue.Empty:
            #     break
            # time.sleep(0.3)
        print("Ending " + threading.current_thread().name)

    if __name__ == '__main__':
        start_time = time.time()
        workQueue = queue.Queue()
        for i in range(1, 10):
            workQueue.put(i)
        threads = []
        thread_num = 4  # Number of threads
        for i in range(thread_num):
            t = threading.Thread(target=get_num, args=(workQueue,))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
        end_time = time.time()
        print('time consuming {}s'.format(end_time - start_time))

Output:

    Starting Thread-1
    Current Thread Name Thread-1, data: 1, time: Sat May 11 10:48:27 2019
    Starting Thread-2
    Current Thread Name Thread-2, data: 2, time: Sat May 11 10:48:27 2019
    Starting Thread-3
    Current Thread Name Thread-3, data: 3, time: Sat May 11 10:48:27 2019
    Starting Thread-4
    Current Thread Name Thread-4, data: 4, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-2, data: 5, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-1, data: 6, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-4, data: 7, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-3, data: 8, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-2, data: 9, time: Sat May 11 10:48:28 2019
    Ending Thread-1
    Ending Thread-4
    Ending Thread-3
    Ending Thread-2
    time consuming 0.9028356075286865s

Because this code is meant for use in a crawler, the program uses Python's queue module and relies on the queue to synchronize the threads. The commented-out code in get_num() is another way to detect that the queue is empty and end the worker thread's task.
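
The alternative approach, relying only on the exception raised by get_nowait(), can be sketched on its own as below. This is an illustration rather than the crawler's actual code: `worker`, `work_queue`, and `done` are names invented for the example, and a second queue collects the processed items.

```python
import queue
import threading

def worker(work_queue, done):
    while True:
        try:
            # get_nowait() raises queue.Empty instead of blocking,
            # which serves as the signal to stop.
            item = work_queue.get_nowait()
        except queue.Empty:
            break
        done.put(item)

work_queue = queue.Queue()
for i in range(1, 10):
    work_queue.put(i)

done = queue.Queue()
threads = [threading.Thread(target=worker, args=(work_queue, done)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All workers have finished, so qsize() is exact at this point.
processed = sorted(done.get() for _ in range(done.qsize()))
print(processed)
```

Because the check and the retrieval happen in one call, no thread can be caught between seeing a non-empty queue and finding it empty.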

In practice, adjusting the number of threads shows that the execution time decreases as the number of threads increases, at least as long as there is enough queued work to keep the extra threads busy.
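
The effect of the thread count can be measured with a small sketch like the following. It assumes each task is a short simulated wait; `timed_run` and its parameters are hypothetical names, and the delay of 0.05 seconds is arbitrary.

```python
import queue
import threading
import time

def timed_run(thread_num, items=9, delay=0.05):
    # Process "items" simulated tasks with "thread_num" threads
    # and return the elapsed time in seconds.
    work_queue = queue.Queue()
    for i in range(items):
        work_queue.put(i)

    def worker():
        while True:
            try:
                work_queue.get_nowait()
            except queue.Empty:
                break
            time.sleep(delay)  # simulated download

    start = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

timings = {n: timed_run(n) for n in (1, 3, 9)}
for n, elapsed in timings.items():
    print(f"{n} thread(s): {elapsed:.2f}s")
```

Going beyond nine threads would not help in this sketch, since there are only nine tasks to share out.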
