Crawler Practice Tutorial 3 - Multiprocessing

Most Python crawlers end up running on Linux servers, where multiprocessing is usually a better fit than multithreading: the scheduling overhead is about the same, but separate processes are not throttled by the GIL and can actually use multiple cores.
A crawler is not a server handling concurrent client requests, so in most cases a speedup of a few times is all we need.
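
If you want to feel the difference yourself, a minimal sketch like the one below (not part of the tutorial code) runs the same CPU-bound function through a process pool and through the thread-backed pool in multiprocessing.dummy, which has the same interface; busy() and the numbers are just placeholders.

# coding: utf-8
# Minimal sketch: the same CPU-bound job through a process pool and a thread pool.
# multiprocessing.dummy exposes a thread-backed Pool with the same interface.
import time
import multiprocessing
import multiprocessing.dummy

def busy(n):
    # purely CPU-bound work; threads are held back by the GIL here
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    tasks = [2000000] * 8                      # placeholder workload
    for name, PoolClass in [("processes", multiprocessing.Pool),
                            ("threads", multiprocessing.dummy.Pool)]:
        pool = PoolClass(4)
        start = time.time()
        pool.map(busy, tasks)
        pool.close()
        pool.join()
        print(name, time.time() - start)

For the network-waiting part of a crawl the gap is much smaller; the advantage of processes shows up in the CPU-bound parsing work.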

Let's go straight to the code.

# coding: utf-8
import multiprocessing
from bs4 import BeautifulSoup
import requests

# build the paginated listing-page URLs for one city
def pageUrls(url):
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    # total number of listings shown on the page; 50 listings per page
    sum = int(soup.select('span.total > em:nth-of-type(1)')[0].get_text())
    pageNum = sum // 50
    return [url + '/loupan/s?p={}'.format(str(i)) for i in range(1, pageNum + 2, 1)]

# crawl every listing page of one city and print the listing titles
def detailPage(myurl):
    urls = pageUrls(myurl)
    for url in urls:
        web_data = requests.get(url)
        soup = BeautifulSoup(web_data.text, 'lxml')
        titles = soup.select('div.list-results > div.key-list > div > div.infos > div > h3 > a')
        for title in titles:
            print(url)
            print(title.get_text())
            print(title.get('href'))
            print("**********")

def main(urls):
    # one worker process per CPU core
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for url in urls:
        pool.apply_async(detailPage, (url, ))
    # pool.map(detailPage, urls)
    pool.close()
    pool.join()


if __name__ == "__main__":
    startUrl = 'http://gz.fang.anjuke.com/?from=navigation'
    web_data = requests.get(startUrl)
    soup = BeautifulSoup(web_data.text, 'lxml')
    # city links from the city switcher on the start page
    urls = [url.get('href') for url in soup.select('.city-mod > dl > dd > a')]
    main(urls)

This time we are going to crawl the Anjuke (安居客) site.
Start from the main block at the bottom: we fetch startUrl with requests and parse it with BeautifulSoup. Once the per-city URLs have been collected, they are handed to the main function.
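
Before wiring in the pool, you can sanity-check what the start page actually yields with a small standalone snippet (same URL and selector as in the code above):

# coding: utf-8
# Quick check: print the city links extracted from the start page.
import requests
from bs4 import BeautifulSoup

startUrl = 'http://gz.fang.anjuke.com/?from=navigation'
web_data = requests.get(startUrl)
soup = BeautifulSoup(web_data.text, 'lxml')
for a in soup.select('.city-mod > dl > dd > a'):
    print(a.get('href'), a.get_text())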
multiprocessing.Pool creates a process pool; here its size is set to the number of CPU cores, and in practice 4 or 6 workers is usually enough.
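
As a small aside (a sketch, the number 4 is arbitrary), you can either let Pool() default to one worker per core or pass an explicit count:

# coding: utf-8
# Sizing the pool: default to one worker per core, or pass an explicit number.
import multiprocessing

if __name__ == "__main__":
    print(multiprocessing.cpu_count())        # cores reported by this machine
    pool = multiprocessing.Pool()             # default size == cpu_count()
    pool.close()
    pool.join()
    pool = multiprocessing.Pool(processes=4)  # or fix the worker count yourself
    pool.close()
    pool.join()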
apply_async hands a task to a free worker in the pool, which executes func with args as its arguments; the loop above keeps submitting URLs so our detailPage method runs on the workers in parallel. Of course, you can also write it this way:

pool.map(detailPage, urls)

Finally, close() stops the pool from accepting new tasks, and join() waits for all the worker processes to finish.
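
To make that lifecycle explicit, here is a minimal sketch with a stand-in function fetch() instead of detailPage (the two URLs are just example inputs): close() must come before join(), and if you need return values from apply_async you keep the AsyncResult handles and call get() on them.

# coding: utf-8
# Minimal sketch of the pool lifecycle described above, with a stand-in task.
import multiprocessing

def fetch(url):
    # stand-in for detailPage: pretend to crawl and return something
    return len(url)

if __name__ == "__main__":
    urls = ['http://gz.fang.anjuke.com', 'http://sh.fang.anjuke.com']  # example inputs
    pool = multiprocessing.Pool(multiprocessing.cpu_count())

    # apply_async is non-blocking; keep the AsyncResult handles if you need results
    results = [pool.apply_async(fetch, (url,)) for url in urls]

    pool.close()   # stop accepting new tasks
    pool.join()    # wait for every worker to finish (must come after close)

    print([r.get() for r in results])
    # the blocking alternative: pool.map(fetch, urls) returns results in order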