[python crawler] Scraping data with a callback function and downloading it!

Do you all remember the earlier code that downloaded all of the index pages?

import requests
import re
import time
from urllib.parse import urljoin
import lxml.html
import cssselect
import csv

def download(url, user_agent='wswp', num_retries=2):
    print('downloading:', url)
    time.sleep(0.5)
    headers = {'User-Agent': user_agent}
    html = requests.get(url, headers=headers)
    if html.status_code != requests.codes.ok:
        e […]
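The excerpt above is cut off, and the post's title mentions a callback. As a rough illustration of how a scrape callback typically plugs into such a crawler, here is a minimal sketch; the scrape_callback name, the CSS selectors, and the countries.csv output file are assumptions for illustration, not the post's exact code.

```python
import csv
import re
import lxml.html

FIELDS = ('area', 'population', 'iso')

def scrape_callback(url, html):
    """Hypothetical callback: pull a few fields from each country page
    and append them as one row of countries.csv."""
    if re.search('/view/', url):                     # only country detail pages
        tree = lxml.html.fromstring(html)
        row = []
        for field in FIELDS:
            selector = 'tr#places_{}__row > td.w2p_fw'.format(field)
            row.append(tree.cssselect(selector)[0].text_content())
        with open('countries.csv', 'a', newline='') as f:
            csv.writer(f).writerow(row)
```

The crawling loop would then call scrape_callback(url, html) for every page it downloads, which keeps the extraction logic separate from the link-following logic.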

[python crawler] Data extraction

Now that we have successfully downloaded the page, what good is the download by itself? We still need to extract the data from it to complete the crawler. Analysing the page: we again use http://example.webscraping.com as the example (Chrome is recommended). Say we want to scrape a country's area: we just select that row, right-click, and choose Inspect Element
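To make the step above concrete, here is a minimal sketch of extracting the area field from one country page with lxml and cssselect; the specific country URL and the places_area__row id are assumptions based on how example.webscraping.com lays out its tables.

```python
import requests
import lxml.html

url = 'http://example.webscraping.com/places/default/view/Afghanistan-1'
html = requests.get(url, headers={'User-Agent': 'wswp'})
tree = lxml.html.fromstring(html.text)

# The value sits in the <td class="w2p_fw"> cell of the row whose id is "places_area__row"
td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
print(td.text_content())        # e.g. '647,500 square kilometres'
```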

[Daily Leetcode] 463. Island Perimeter

You are given a map in form of a two-dimensional integer grid where 1 represents land and 0 represents water. Grid cells are connected horizontally/vertically (not diagonally). The grid is completely surrounded by water, and there is exactly one island (i.e., one or more connected land cells). The island doesn’t have “lakes” (water inside that […]
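The post body is truncated here; a common O(rows x cols) approach counts, for every land cell, the sides that face water or the grid border. The sketch below is that approach, not necessarily the exact code the post presents.

```python
from typing import List

def island_perimeter(grid: List[List[int]]) -> int:
    """Count, for every land cell, the sides that touch water or the border."""
    rows, cols = len(grid), len(grid[0])
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:
                # Each of the four neighbours contributes one edge if it is
                # water or lies outside the grid.
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if nr < 0 or nr >= rows or nc < 0 or nc >= cols or grid[nr][nc] == 0:
                        perimeter += 1
    return perimeter

# The grid from the problem's example has perimeter 16
print(island_perimeter([[0, 1, 0, 0],
                        [1, 1, 1, 0],
                        [0, 1, 0, 0],
                        [1, 1, 0, 0]]))   # 16
```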

[python crawler] My first crawler (3)

Link crawler: to traverse all of a site's content we need the crawler to behave like an ordinary user, following links and visiting the pages we are interested in. By following links we can easily download every page of a site, but this approach also downloads pages we do not need. For example, we want to crawl the index pages of the site below, yet the crawler could just as well wander into the country detail pages. What do we do? Taking http://example.webscraping.com as the example, we notice that all the index pages live under the /places/default/index path, so we use a regular expression to decide which links should be crawled:

def link_crawler(seed_url, link_regex):  # link_regex is the regular expression for the pages we want to crawl
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url).text
        for link in get_links(html):
            # check whether this link is content we want to crawl
            if re.match(link_regex, link):
      […]
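The excerpt breaks off inside the loop; a plausible completion, following the pattern of the earlier posts, is sketched below. The minimal download(), the regex-based get_links() helper, and the seen set used for de-duplication are my assumptions, not the post's exact code.

```python
import re
import requests
from urllib.parse import urljoin


def download(url, user_agent='wswp'):
    # Minimal downloader; the retry logic from the earlier posts is omitted here.
    return requests.get(url, headers={'User-Agent': user_agent})


def get_links(html):
    """Return every href found in the page (a simple regex-based extractor)."""
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)


def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    seen = {seed_url}                           # avoid visiting the same page twice
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url).text
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)  # relative -> absolute
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

# e.g. crawl only the index and view pages:
# link_crawler('http://example.webscraping.com', '/places/default/(index|view)')
```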

[python crawler] My first crawler program (2)

Hello everyone, long time no see. Let's continue today. In the previous chapter we mainly learned how to retry failed downloads; in this chapter we look at setting a user agent and a few other options. Setting a user agent: by default requests identifies itself with the user agent python-requests/2.18.4, but some sites have been overloaded by poorly written crawlers in the past and therefore block that default user agent. Using what we learned earlier, we modify the function to change the user agent to 'wswp':

import requests

def download(url, user_agent='wswp', num_retries=2):
    print('downloading:', url)
    headers = {'User-Agent': user_agent}
    html = requests.get(url, headers=headers)
    if html.status_code != requests.codes.ok:
        e = html.status_code
        print('Download error:', e)
        html […]
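The excerpt cuts off mid-function; based on the retry behaviour the previous post describes, the finished function plausibly looks like the sketch below (retrying only on 5xx server errors is my assumption).

```python
import requests


def download(url, user_agent='wswp', num_retries=2):
    print('downloading:', url)
    headers = {'User-Agent': user_agent}
    html = requests.get(url, headers=headers)
    if html.status_code != requests.codes.ok:
        e = html.status_code
        print('Download error:', e)
        html = None
        # Retry only on server-side (5xx) errors; client errors such as 404
        # will not improve by retrying.
        if num_retries > 0 and 500 <= e < 600:
            return download(url, user_agent, num_retries - 1)
    return html
```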

[Daily Leetcode] 566. Reshape the Matrix

Question: In MATLAB, there is a very useful function called ‘reshape’, which can reshape a matrix into a new one with different size but keep its original data. You’re given a matrix represented by a two-dimensional array, and two positive integers r and c representing the row number and column number of the wanted reshaped […]
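The rest of the question and the post's solution are truncated; a straightforward approach walks the original matrix in row-major order and fills the new shape, returning the original matrix when r * c does not match the element count. The sketch below follows that idea and is not necessarily the post's exact code.

```python
from typing import List

def matrix_reshape(nums: List[List[int]], r: int, c: int) -> List[List[int]]:
    rows, cols = len(nums), len(nums[0])
    if rows * cols != r * c:
        return nums                    # reshape impossible, keep the original
    reshaped = [[0] * c for _ in range(r)]
    for i in range(rows * cols):
        # The i-th element in row-major order maps to (i // c, i % c) in the new shape
        reshaped[i // c][i % c] = nums[i // cols][i % cols]
    return reshaped

print(matrix_reshape([[1, 2], [3, 4]], 1, 4))   # [[1, 2, 3, 4]]
```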

[python crawler] My first crawler program (1)

Downloading a page: the first thing a crawler has to do is download the page. The following shows how to do that with the requests module:

import requests

def download(url):
    print('downloading:', url)
    html = requests.get(url)
    return html

Here is a more complete version. We add one new feature to the function: when the request runs into an error, the function catches it, prints the status code, and returns None:

import requests

def download(url):
    print('downloading:', url)
    html = requests.get(url)
    if html.status_code != requests.codes.ok:
        e = html.status_code
        print('Download error:', e)
    […]
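Where the excerpt cuts off, the function presumably returns None on error and the response object otherwise; one plausible completion (my reading, not the post's exact text) is:

```python
import requests


def download(url):
    print('downloading:', url)
    html = requests.get(url)
    if html.status_code != requests.codes.ok:
        print('Download error:', html.status_code)
        return None                 # signal failure to the caller
    return html


page = download('http://example.webscraping.com')
if page is not None:
    print(len(page.text), 'characters downloaded')
```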

[python crawler] Advanced usage of requests

Usage of the Session object: a Session keeps state between requests (including cookies), so we do not have to resend the same values with every request. Let's try it!

s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

Of course we can also set our own defaults on the Session itself:

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent
s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})

Note, however, that parameters passed per request in the following way are not carried over to the next request:

s = requests.Session()
r = s.get('http://httpbin.org/cookies', cookies={'from-my': 'browser'})
print(r.text)
# '{"cookies": {"from-my": "browser"}}'
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": […]
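Pulling the truncated snippets together, a small self-contained script that shows both behaviours (server-set cookies persist in the session, per-request arguments do not) might look like this; it uses the same httpbin.org endpoints as the excerpt.

```python
import requests

s = requests.Session()

# Per-request arguments apply to that single request only.
r = s.get('http://httpbin.org/cookies', cookies={'from-my': 'browser'})
print(r.json())            # {'cookies': {'from-my': 'browser'}}
r = s.get('http://httpbin.org/cookies')
print(r.json())            # {'cookies': {}}  -- the per-request cookie did not persist

# Cookies set by the server, in contrast, are kept in the session's cookie jar.
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.json())            # {'cookies': {'sessioncookie': '123456789'}}
```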

[Daily Leetcode] 500. Keyboard Row

Question: Given a List of words, return the words that can be typed using letters of alphabet on only one row’s of American keyboard like the image below. Example 1: Input: [“Hello”, “Alaska”, “Dad”, “Peace”] Output: [“Alaska”, “Dad”] Note: You may use one character in the keyboard more than once. You may assume the input […]
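The excerpt ends mid-sentence; a typical solution checks, for each word, whether its lowercase letters form a subset of one keyboard row. The sketch below implements that idea and is not necessarily the post's own code.

```python
from typing import List

def find_words(words: List[str]) -> List[str]:
    rows = [set('qwertyuiop'), set('asdfghjkl'), set('zxcvbnm')]
    result = []
    for word in words:
        letters = set(word.lower())
        # keep the word if all of its letters fit within a single keyboard row
        if any(letters <= row for row in rows):
            result.append(word)
    return result

print(find_words(["Hello", "Alaska", "Dad", "Peace"]))   # ['Alaska', 'Dad']
```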