[Python Crawler] My First Crawler (3)

Link Crawler

To traverse everything a site has to offer, we need the crawler to behave like an ordinary user: follow links and visit the content we are interested in. By following links we can easily download every page of a website, but this approach also downloads pages we don't need. For example, on the site below we only want to scrape the index pages, yet the crawler could just as easily wander into the individual country pages. What do we do? Taking http://example.webscraping.com as our example, we can see that all the index pages live under the path /places/default/index, so we can use a regular expression to decide which links should be crawled:

import re  # regular-expression module used to match links

def link_crawler(seed_url, link_regex):  # link_regex is the regular expression describing the pages we want to crawl
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url).text
        for link in get_links(html):
            # does this link match the content we want to crawl?
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    # find every link in the page:
    return re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", html)
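
To make the matching concrete, here is a quick illustration (the HTML snippet and the country link are made up, but they mimic the site's link format): get_links() extracts every href value from the page, and re.match() only lets through the links whose path starts with the pattern we pass in.

# Illustration only -- a fabricated snippet in the style of the site's pages.
sample_html = ('<a href="/places/default/index/1">Next ></a>'
               '<a href="/places/default/view/Afghanistan-1">Afghanistan</a>')

for link in get_links(sample_html):
    if re.match('/places/default/index', link):
        print('crawl:', link)  # /places/default/index/1 is kept
    else:
        print('skip:', link)   # the country page is ignored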

We run:

link_crawler('http://example.webscraping.com', '/places/default/index')

After adding this code, the run is not quite successful. The problem is relative versus absolute links: debugging shows that even when a link matches, the URL being queued is a relative link such as /places/default/index/1 rather than a full absolute URL. So how do we turn the relative link into an absolute one? Just join it onto the base of the seed URL! Python makes this very convenient with urljoin from the urllib.parse module. (Note that I'm using Python 3 here; in Python 2 the same function lives in the urlparse module instead.) I also made one small improvement in the code below: I removed the html = None line from download(), so that html.text can still be used when the page content can't be fetched correctly. Previously, if download() returned None, calling html.text would raise an error, because Python has no such thing as None.text.
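
Before the full code, a tiny illustration of what urljoin does when given our seed URL and a relative link found in the page:

from urllib.parse import urljoin

# the relative path is resolved against the seed URL:
print(urljoin('http://example.webscraping.com', '/places/default/index/1'))
# -> http://example.webscraping.com/places/default/index/1

With urljoin in place, here is the revised crawler: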

import requests
import re
from urllib.parse import urljoin

def download(url, user_agent='wswp', num_retries = 2):
    print('downloading:', url)
    headers = {'User-Agent': user_agent}
    html = requests.get(url, headers = headers)
    if html.status_code != requests.codes.ok:
        e = html.status_code
        print('Download error:', e)
        if num_retries > 0:
            if 500 <= e < 600:
                return download(url, user_agent, num_retries-1)
    return html

def link_crawler(seed_url, link_regex):  # link_regex is the regular expression describing the pages we want to crawl
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url).text
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                crawl_queue.append(link)

def get_links(html):
    return re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", html)

The results look encouraging at first, but after a while I noticed the crawler bouncing back and forth between pages 24 and 25. This happens because the pages link to one another. To avoid crawling the same link over and over, we need to record which links have already been crawled. Here is the modified function:

import requests
import re
from urllib.parse import urljoin
def download(url, user_agent='wswp', num_retries = 2):
    print('downloading:', url)
    headers = {'User-Agent': user_agent}
    html = requests.get(url, headers = headers)
    if html.status_code != requests.codes.ok:
        e = html.status_code
        print('Download error:', e)
        if num_retries > 0:
            if 500 <= e < 600:
                return download(url, user_agent, num_retries-1)
    return html

def link_crawler(seed_url, link_regex):  # link_regex is the regular expression describing the pages we want to crawl
    crawl_queue = [seed_url]
    # keep track of which links have already been seen, to avoid duplicates:
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url).text
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                # only queue the link if it hasn't been downloaded before
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

def get_links(html):
    return re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", html)

And with that we have our first usable crawler! It crawls all of the index pages and terminates as expected. Here is the output:

downloading: http://example.webscraping.com
downloading: http://example.webscraping.com/places/default/index/1
downloading: http://example.webscraping.com/places/default/index/2
downloading: http://example.webscraping.com/places/default/index/3
downloading: http://example.webscraping.com/places/default/index/4
downloading: http://example.webscraping.com/places/default/index/5
downloading: http://example.webscraping.com/places/default/index/6
downloading: http://example.webscraping.com/places/default/index/7
downloading: http://example.webscraping.com/places/default/index/8
downloading: http://example.webscraping.com/places/default/index/9
downloading: http://example.webscraping.com/places/default/index/10
downloading: http://example.webscraping.com/places/default/index/11
downloading: http://example.webscraping.com/places/default/index/12
downloading: http://example.webscraping.com/places/default/index/13
downloading: http://example.webscraping.com/places/default/index/14
downloading: http://example.webscraping.com/places/default/index/15
downloading: http://example.webscraping.com/places/default/index/16
downloading: http://example.webscraping.com/places/default/index/17
downloading: http://example.webscraping.com/places/default/index/18
downloading: http://example.webscraping.com/places/default/index/19
downloading: http://example.webscraping.com/places/default/index/20
downloading: http://example.webscraping.com/places/default/index/21
downloading: http://example.webscraping.com/places/default/index/22
downloading: http://example.webscraping.com/places/default/index/23
downloading: http://example.webscraping.com/places/default/index/24
downloading: http://example.webscraping.com/places/default/index/25
downloading: http://example.webscraping.com/places/default/index/0
downloading: http://example.webscraping.com/places/default/index

Process finished with exit code 0

But I'll leave a small cliffhanger here: what do we do if we run into a 429 error? In the next article we will also look at how to avoid crawler traps.
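
(As a very rough preview only, and not necessarily how the next article will handle it: one common way to cope with a 429 Too Many Requests response is simply to wait and retry. Below is a minimal sketch, assuming we keep the download() function above; the helper name, the default delay, and the retry count are my own choices for illustration.)

import time

def download_with_backoff(url, default_delay=5, max_attempts=3):
    # hypothetical helper, not part of the crawler above:
    # sleep and retry when the server answers 429, honoring Retry-After when it is given in seconds
    html = download(url)
    for attempt in range(max_attempts):
        if html.status_code != 429:
            break
        retry_after = html.headers.get('Retry-After')
        wait = int(retry_after) if retry_after and retry_after.isdigit() else default_delay
        print('429 received, sleeping for', wait, 'seconds')
        time.sleep(wait)
        html = download(url)
    return html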
