Learning Python Multithreading - Blog

Preface

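When learning web scraping, efficiency quickly becomes a problem once the number of pages to crawl grows large. To fetch the content we want faster, we can use multithreading or multiprocessing.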

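Multithreading and multiprocessing are not the same thing! One uses the threading library, the other the multiprocessing library. Multithreading with threading is often described in Python as something of little value. There is a well-known saying about it: "Multithreading in Python is nearly useless; use multiprocessing instead!"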

Why is Python multithreading described this way? The reasons are as follows:

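In most environments, a single-core CPU can execute only one thread at any given moment, while a multi-core CPU can execute several threads at the same time. In Python (more precisely, in CPython), however, only one thread executes Python bytecode at a time, no matter how many cores the CPU has. This is caused by the GIL.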

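GIL stands for Global Interpreter Lock, a decision made early in Python's design for the sake of data safety. A thread in Python must acquire the GIL before it can execute. You can think of the GIL as a "pass" for running code, and a Python process has only one. A thread that cannot get the pass is not allowed to run on the CPU. The GIL is a property of the CPython interpreter rather than of the Python language: CPython threads are native operating-system threads created through C, and the GIL guarantees that only one of them works with the interpreter's internal data at a time. Jython and IronPython have no GIL; PyPy, like CPython, does have one.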

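Under Python multithreading, each thread executes as follows:

* Get the shared data
* Request the GIL
* The Python interpreter calls a native operating-system thread
* The CPU performs the computation
* When the thread has used up its time slice, it releases the GIL whether or not its task has finished
* The next thread scheduled by the CPU repeats the steps above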

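Before any Python thread runs, it must first acquire the GIL. In Python 2 the interpreter released the GIL automatically after every 100 bytecode instructions; since Python 3.2 it instead asks the running thread to release the GIL after a fixed switch interval (5 ms by default), which gives other threads a chance to run. The GIL in effect puts a lock around the Python code of every thread, so threads in Python can only take turns: even 100 threads running on a 100-core CPU can use only one core for Python bytecode.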
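In Python 3.2 and later, CPython decides when a running thread should give up the GIL by a time-based switch interval rather than by counting bytecodes. The following is a minimal sketch for inspecting that interval through the standard sys module; the variable name `interval` is only for illustration.

```python
import sys

# How long a thread may run before the interpreter asks it to release the GIL.
# CPython's default is 0.005 seconds (5 ms).
interval = sys.getswitchinterval()
print("switch interval: {} s".format(interval))

# The interval can be changed, although that is rarely needed in practice.
sys.setswitchinterval(0.01)
print("new switch interval: {} s".format(sys.getswitchinterval()))

# Restore the original value so the rest of the program is unaffected.
sys.setswitchinterval(interval)
```

A shorter interval makes threads switch more often at the cost of more switching overhead; it does not let two threads run Python bytecode at the same time.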

So is multithreading in Python completely useless?

No. How well multithreading performs in Python depends on the type of task:

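* For CPU-bound tasks (loops, calculations, and so on), each thread keeps computing until it is forced to give up the GIL (when the tick count reaches its threshold in Python 2, or when the switch interval expires in Python 3), and the threads then compete for the GIL again. Switching back and forth between threads costs resources, so multithreading in Python is unfriendly to CPU-bound tasks.
* For IO-bound tasks (file handling, network communication, and other operations that read or write data), multithreading can improve efficiency considerably. A single thread that performs IO has to wait for it, which wastes time; with several threads, the interpreter switches to thread B while thread A is waiting, so CPU time is not wasted and the program runs faster. Multithreading in Python is therefore well suited to IO-bound tasks.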
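To make the IO-bound case concrete, here is a minimal sketch that compares running four waits one after another with running them in four threads. It assumes that time.sleep is an acceptable stand-in for a blocking network or disk call; the function names (`fake_io`, `run_sequential`, `run_threaded`) are hypothetical.

```python
import threading
import time

def fake_io(delay):
    # time.sleep stands in for a blocking network or disk call;
    # the GIL is released while the thread is sleeping.
    time.sleep(delay)

def run_sequential(n, delay):
    # Perform n waits one after another and return the elapsed time.
    start = time.perf_counter()
    for _ in range(n):
        fake_io(delay)
    return time.perf_counter() - start

def run_threaded(n, delay):
    # Perform n waits in n threads and return the elapsed time.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

sequential = run_sequential(4, 0.2)
threaded = run_threaded(4, 0.2)
print("sequential: {:.2f}s".format(sequential))
print("threaded:   {:.2f}s".format(threaded))
```

The sequential run takes roughly the sum of the waits, while the threaded run takes roughly the length of a single wait, because the waits overlap.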
Practical advice

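To make full use of a multi-core CPU in Python, use multiple processes. Each process has its own independent GIL, so processes do not interfere with one another and can run truly in parallel. For CPU-bound work on a multi-core CPU, multiprocessing is more efficient than multithreading. In short: use multithreading for IO-bound tasks and multiprocessing for compute-bound tasks.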
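As an illustration of the compute-bound case, the sketch below spreads a deliberately slow calculation over four worker processes with multiprocessing.Pool. The function `count_primes` and the chosen limits are hypothetical examples, not part of the original article.

```python
from multiprocessing import Pool

def count_primes(limit):
    """Count the primes below limit by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    limits = [20000, 20000, 20000, 20000]
    # Each worker process has its own interpreter and its own GIL,
    # so the four calls can run on four cores at the same time.
    with Pool(processes=4) as pool:
        results = pool.map(count_primes, limits)
    print(results)
```

The `if __name__ == "__main__":` guard matters here: on platforms that start worker processes by re-importing the main module, it prevents each worker from creating a pool of its own.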

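Creating threads in Python

1. Creating a thread directly from a function

In Python 3, the built-in threading module provides the Thread class, which makes it easy to create threads.
Thread() usually takes two arguments: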

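* target (the thread function): the function to run in the background in the new thread. We define it ourselves; pass the function itself, without ().
* args (the thread function's arguments): the arguments the thread function needs, passed as a tuple. It can be omitted if the function takes no arguments.

    import time
    from threading import Thread

    # Custom thread function.
    def main(name="Python"):
        for i in range(2):
            print("hello", name)
            time.sleep(2)

    # Create thread 01 without arguments
    thread_01 = Thread(target=main)
    # Start thread 01
    thread_01.start()

    # Create thread 02 with an argument; note the trailing comma
    thread_02 = Thread(target=main, args=("MING",))
    # Start thread 02
    thread_02.start()
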
Output:

    hello Python
    hello MING
    hello Python
    hello MING

2. Creating a thread by subclassing Thread

To create a thread as a class, inherit directly from threading.Thread and override the __init__ and run methods.

    import time
    from threading import Thread

    class MyThread(Thread):
        def __init__(self, name="Python"):
            # Note: super().__init__() must be called,
            # and it must come first, otherwise an error is raised.
            super().__init__()
            self.name = name

        def run(self):
            for i in range(2):
                print("hello", self.name)
                time.sleep(2)

    if __name__ == '__main__':
        # Create thread 01 without arguments
        thread_01 = MyThread()
        # Create thread 02 with an argument
        thread_02 = MyThread("MING")

        thread_01.start()
        thread_02.start()

Output:

    hello Python
    hello MING
    hello MING
    hello Python

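These are the most basic ways to create threads. They look simple, but real applications are a good deal more complicated.

Below I record, one by one, the multithreading exercises from my own study.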

Multithreading exercises

Before moving on to thread synchronization, here is an example of threads that are not synchronized.

    import threading
    import time

    class myThread(threading.Thread):
        def __init__(self, threadID, name, counter):
            threading.Thread.__init__(self)
            self.threadID = threadID
            self.name = name
            self.counter = counter

        def run(self):
            print('Starting ' + self.name)
            print_time(self.name, self.counter, 3)
            print("Exiting " + self.name)

    def print_time(threadName, delay, count):
        while count:
            # Sleep for this thread's own delay (1 s for Thread-1, 2 s for Thread-2)
            time.sleep(delay)
            print('{0} processing {1}'.format(threadName, time.ctime(time.time())))
            count -= 1

    if __name__ == '__main__':
        threadList = ["Thread-1", "Thread-2"]
        threads = []
        threadID = 1
        for tName in threadList:
            thread = myThread(threadID, tName, threadID)
            thread.start()
            threads.append(thread)
            threadID += 1
        # for t in threads:
        #     t.join()
        print("Exiting Main Thread")

Output:

    Starting Thread-1Starting Thread-2
    Exiting Main Thread
    Thread-1 processing Sat May 11 11:06:11 2019
    Thread-2 processing Sat May 11 11:06:12 2019
    Thread-1 processing Sat May 11 11:06:12 2019
    Thread-1 processing Sat May 11 11:06:13 2019
    Exiting Thread-1
    Thread-2 processing Sat May 11 11:06:14 2019
    Thread-2 processing Sat May 11 11:06:16 2019
    Exiting Thread-2

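The output shows that the second thread's "Starting" message was printed on the same line as the first thread's, with no line break in between, which shows that the threads are not synchronized.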
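The example above leaves its join() calls commented out, so the main thread does not wait for the workers. The following minimal sketch shows what join() changes: the main thread blocks until every worker has finished. The `events` list and the `worker` function are hypothetical names used only to record the order of events.

```python
import threading
import time

events = []  # records the order in which things happen

def worker(name, delay):
    time.sleep(delay)
    events.append("Exiting " + name)

threads = [
    threading.Thread(target=worker, args=("Thread-1", 0.1)),
    threading.Thread(target=worker, args=("Thread-2", 0.2)),
]
for t in threads:
    t.start()

# join() blocks the calling thread until the joined thread has finished.
for t in threads:
    t.join()

events.append("Exiting Main Thread")
print(events)
```

With the join() calls in place, "Exiting Main Thread" is always the last entry; without them it would usually be the first.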
Thread synchronization: Lock

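Threads in one process share the same memory, so any global variable or shared object can be modified by any thread. The greatest danger of sharing data between threads is therefore that several threads change the same variable at the same time and corrupt its contents.

The Lock and RLock objects of the threading module provide simple thread synchronization. Both have an acquire method and a release method; for data that only one thread at a time may operate on, place the operation between acquire and release.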
    import threading
    import queue
    import time

    exitFlag = 0

    class myThread(threading.Thread):
        def __init__(self, que):
            threading.Thread.__init__(self)
            self.que = que

        def run(self):
            print("Starting " + threading.current_thread().name)
            process_data(self.que)
            print("Ending " + threading.current_thread().name)

    def process_data(que):
        while not exitFlag:
            # Hold the lock so that the empty() check and get() happen together
            queueLock.acquire()
            if not que.empty():
                data = que.get()
                queueLock.release()
                print('Current Thread Name %s, data: %s, time: %s' %
                      (threading.current_thread().name, data, time.ctime()))
            else:
                queueLock.release()
            time.sleep(0.1)

    if __name__ == '__main__':
        start_time = time.time()
        queueLock = threading.Lock()
        workqueue = queue.Queue()
        threads = []
        # Fill the queue
        queueLock.acquire()
        for i in range(1, 10):
            workqueue.put(i)
        queueLock.release()

        for i in range(4):
            thread = myThread(workqueue)
            thread.start()
            threads.append(thread)
        # Wait for the queue to be emptied
        while not workqueue.empty():
            pass
        # Tell the threads to exit
        exitFlag = 1
        for t in threads:
            t.join()
        print("Exiting Main Thread")
        end_time = time.time()
        print('Elapsed: {}s'.format((end_time - start_time)))

Output:

    Starting Thread-1
    Current Thread Name Thread-1, data: 1, time: Sat May 11 10:47:25 2019
    Starting Thread-2
    Current Thread Name Thread-2, data: 2, time: Sat May 11 10:47:25 2019
    Starting Thread-3
    Current Thread Name Thread-3, data: 3, time: Sat May 11 10:47:25 2019
    Starting Thread-4
    Current Thread Name Thread-4, data: 4, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-1, data: 5, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-2, data: 6, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-3, data: 7, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-4, data: 8, time: Sat May 11 10:47:25 2019
    Current Thread Name Thread-2, data: 9, time: Sat May 11 10:47:25 2019
    Ending Thread-4
    Ending Thread-1
    Ending Thread-2
    Ending Thread-3
    Exiting Main Thread
    Elapsed: 0.31415677070617676s

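The acquire/release pairing described in this section can also be written with a `with` statement, which releases the lock even if the protected code raises an exception. The sketch below uses it to protect a shared counter; the `Counter` class is a hypothetical helper, not something from the threading module.

```python
import threading

class Counter:
    """A counter whose increments are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "with" calls acquire() on entry and release() on exit.
        with self._lock:
            self.value += 1

def add_many(counter, times):
    for _ in range(times):
        counter.increment()

counter = Counter()
threads = [threading.Thread(target=add_many, args=(counter, 10000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 40000
```

Because every increment happens while the lock is held, no update is lost and the final value is exactly 4 x 10000.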
Thread synchronization: Queue

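Python's queue module (named Queue in Python 2) provides synchronized, thread-safe queue classes: the FIFO (first in, first out) queue Queue, the LIFO (last in, first out) queue LifoQueue, and the priority queue PriorityQueue. All of them implement the necessary locking internally and can be used directly from multiple threads, so a queue can be used to synchronize threads.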
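The three queue classes share the same put/get interface and differ only in the order in which items come back out. A minimal single-threaded sketch of that difference follows; the helper `drain` is a hypothetical name.

```python
import queue

def drain(q):
    """Remove and return every item currently in the queue."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items

fifo = queue.Queue()
lifo = queue.LifoQueue()
prio = queue.PriorityQueue()

for item in [3, 1, 2]:
    fifo.put(item)
    lifo.put(item)
    prio.put(item)

print(drain(fifo))  # [3, 1, 2]  insertion order
print(drain(lifo))  # [2, 1, 3]  reverse insertion order
print(drain(prio))  # [1, 2, 3]  smallest item first
```

Checking empty() before get() is safe here only because a single thread uses each queue.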
The following program uses a queue to hand out work to four threads:

    import threading
    import time
    import queue

    def get_num(workQueue):
        print("Starting " + threading.current_thread().name)
        while True:
            # Note: another thread may take the last item between empty()
            # and get_nowait(), in which case get_nowait() raises queue.Empty.
            if workQueue.empty():
                break
            num = workQueue.get_nowait()
            print('Current Thread Name %s, data: %s, time: %s' %
                  (threading.current_thread().name, num, time.ctime()))
            time.sleep(0.3)
            # try:
            #     num = workQueue.get_nowait()
            #     if not num:
            #         break
            #     print('Current Thread Name %s, data: %s, time: %s' %
            #           (threading.current_thread().name, num, time.ctime()))
            # except queue.Empty:
            #     break
            # time.sleep(0.3)
        print("Ending " + threading.current_thread().name)

    if __name__ == '__main__':
        start_time = time.time()
        workQueue = queue.Queue()
        for i in range(1, 10):
            workQueue.put(i)
        threads = []
        thread_num = 4  # number of threads
        for i in range(thread_num):
            t = threading.Thread(target=get_num, args=(workQueue,))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
        end_time = time.time()
        print('Elapsed: {}s'.format((end_time - start_time)))

Output:

    Starting Thread-1
    Current Thread Name Thread-1, data: 1, time: Sat May 11 10:48:27 2019
    Starting Thread-2
    Current Thread Name Thread-2, data: 2, time: Sat May 11 10:48:27 2019
    Starting Thread-3
    Current Thread Name Thread-3, data: 3, time: Sat May 11 10:48:27 2019
    Starting Thread-4
    Current Thread Name Thread-4, data: 4, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-2, data: 5, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-1, data: 6, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-4, data: 7, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-3, data: 8, time: Sat May 11 10:48:27 2019
    Current Thread Name Thread-2, data: 9, time: Sat May 11 10:48:28 2019
    Ending Thread-1
    Ending Thread-4
    Ending Thread-3
    Ending Thread-2
    Elapsed: 0.9028356075286865s

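With practical use in web scraping in mind, the program uses Python's queue module and relies on the queue to synchronize the threads. The commented-out code in get_num() is another way to detect that the queue is empty and end the worker thread; because it handles the exception raised by get_nowait(), it also covers the case where another thread takes the last item first.

In actual runs, adjusting the number of threads shows that the execution time gets shorter as the number of threads increases, until there are as many threads as items in the queue; beyond that, the extra threads have nothing to do.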
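The effect of the thread count can be measured with a small harness. The sketch below is a scaled-down variant of the queue program, not the original: it uses 8 items and a 0.05 s sleep as a stand-in for downloading a page, and it takes items with get_nowait() inside try/except. The names `worker` and `timed_run` are hypothetical.

```python
import queue
import threading
import time

def worker(work_queue, delay):
    while True:
        try:
            work_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        time.sleep(delay)  # stands in for downloading one page

def timed_run(thread_num, item_count=8, delay=0.05):
    """Process item_count items with thread_num threads; return elapsed seconds."""
    work_queue = queue.Queue()
    for i in range(item_count):
        work_queue.put(i)
    start = time.perf_counter()
    threads = [
        threading.Thread(target=worker, args=(work_queue, delay))
        for _ in range(thread_num)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

timings = {n: timed_run(n) for n in (1, 2, 4, 8)}
for n, elapsed in timings.items():
    print("{} thread(s): {:.2f}s".format(n, elapsed))
```

Each doubling of the thread count roughly halves the elapsed time here, because the work consists almost entirely of waiting.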
