We have one day of taxi data for a city: 33,210,000 records. How do we separate each car's data into its own file?
The idea is simple: loop over all 33,210,000 records and move each vehicle's data into its own file.
But with 30+ million records, a plain one-by-one loop is far too slow: it took me 2 hours to move 600,000 records, so at that rate the full dataset would take about 100 hours, i.e. 4-5 days, with the machine running non-stop the whole time and no interruptions allowed.
So we need a trick: the parallel for loop.
Because a CSV with 30+ million rows cannot be opened directly, I first used the split tool to cut the big CSV into 53 smaller CSVs of 600,000 rows each.
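As an illustration, such a cut might look like this with GNU coreutils `split` (a small demo file stands in for the real taxi CSV here; the `chunk_` prefix and the `-l` value are my own choices, and `-d`/`--additional-suffix` are GNU-only flags):

```shell
# Small stand-in for the big taxi CSV; for the real file you would
# run split with -l 600000 on the original CSV instead.
seq 1 10 > demo.csv

# Cut into numbered 4-line pieces with a .csv suffix:
# chunk_00.csv, chunk_01.csv, chunk_02.csv (whole lines stay intact)
split -l 4 -d --additional-suffix=.csv demo.csv chunk_
```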
My plan was to read the folder, get the list of 600,000-row CSV files, and then process each one. In total this is still 33,210,000 iterations, but a parallel for loop works on several 600,000-row CSVs at the same time, dividing the elapsed time by the number of workers.
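Putting those pieces together, here is a minimal sketch of that plan. The `chunk_*.csv` names, the `per_car` output folder, and the assumption that the car ID is the first column of every row are all mine, not from the original data; a single lock serializes the appends so two threads never interleave writes into the same per-car file:

```python
import csv
import glob
import os
from multiprocessing.dummy import Pool as ThreadPool
from threading import Lock

# One lock guards all appends so two threads never write into the
# same per-car file at the same time.
write_lock = Lock()

def process(chunk_path, out_dir='per_car'):
    """Read one 600,000-row chunk and append each row to its car's file.

    Assumes the car ID is the first column of every row. A real
    implementation would batch rows per car before writing instead of
    opening the output file once per row.
    """
    with open(chunk_path, newline='') as f:
        for row in csv.reader(f):
            car_id = row[0]
            out_path = os.path.join(out_dir, car_id + '.csv')
            with write_lock:
                with open(out_path, 'a', newline='') as out:
                    csv.writer(out).writerow(row)

if __name__ == '__main__':
    os.makedirs('per_car', exist_ok=True)
    chunks = sorted(glob.glob('chunk_*.csv'))  # the 53 split files
    pool = ThreadPool()        # defaults to the number of CPUs
    pool.map(process, chunks)  # several chunks handled at once
    pool.close()
    pool.join()
```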
The parallel for loop was inspired by the ordinary loop I had written before:
```python
words = ['apple', 'banana', 'cake', 'dumpling']
for word in words:
    print(word)
```
A parallel for loop looks like this:
```python
from multiprocessing.dummy import Pool as ThreadPool

items = list()
pool = ThreadPool()
pool.map(process, items)
pool.close()
pool.join()
```
Here, process is the function that handles each item.
The example code is as follows:
```python
# -*- coding: utf-8 -*-
import time
from multiprocessing.dummy import Pool as ThreadPool

def process(item):
    print('In the parallel for loop')
    print(item)
    time.sleep(5)

items = ['apple', 'banana', 'cake', 'dumpling']
pool = ThreadPool()
pool.map(process, items)
pool.close()
pool.join()
```
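One caveat: multiprocessing.dummy gives a pool of threads, which is fine when the work is dominated by file I/O, but for CPU-heavy processing the GIL keeps Python threads from running in parallel. Swapping in multiprocessing.Pool gives real processes with the same map interface; the square function below is just a hypothetical stand-in for real per-chunk work:

```python
from multiprocessing import Pool

def square(item):
    # CPU-bound stand-in for real per-chunk processing
    return item * item

if __name__ == '__main__':   # required when processes are started by spawning
    with Pool(4) as pool:    # 4 worker processes
        results = pool.map(square, [1, 2, 3, 4])
    print(results)           # [1, 4, 9, 16] -- map preserves input order
```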