
Scrapy Requests and Responses

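Scrapy uses Request and Response objects to crawl websites. Request objects are typically generated in spiders and pass through the system until they reach the downloader, which executes the request and returns a Response object to the spider that issued the request.

Request object

A Request object represents an HTTP request, which generates a Response. It has the following class signature:

class scrapy.http.Request(url[, callback, method = 'GET', headers, body, cookies, meta,
   encoding = 'utf-8', priority = 0, dont_filter = False, errback])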

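The following table describes the parameters of the Request object:

No.	Parameter	Description
1	url	A string specifying the URL of the request.
2	callback	A callable that is called with the response to this request as its first argument.
3	method	A string specifying the HTTP method of the request. Defaults to 'GET'.
4	headers	A dictionary containing the request headers.
5	body	A str or bytes object containing the request body.
6	cookies	A dictionary, or a list of dictionaries, containing the request cookies.
7	meta	A dictionary of metadata values for the request.
8	encoding	A string naming the encoding of the request (default 'utf-8'), used to percent-encode the URL.
9	priority	An integer that the scheduler uses to decide the order in which requests are processed; requests with a higher value are executed earlier.
10	dont_filter	A boolean indicating that the scheduler should not filter this request out as a duplicate.
11	errback	A callable that is called if an exception is raised while the request is being processed.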

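Passing additional data to callback functions

The callback of a request is called once its response has been downloaded, with that response as its first argument.

For example:

def parse_page1(self, response):
    return scrapy.Request("http://www.something.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    self.logger.info("%s page visited", response.url)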

To pass data from one callback to the next, store it in the Request.meta attribute and read it back from response.meta in the second callback, as the following example shows:

# DemoItem is assumed to be an Item class defined elsewhere in the project
def parse_page1(self, response):
    item = DemoItem()
    item['foremost_link'] = response.url
    request = scrapy.Request("http://www.something.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_link'] = response.url
    return item

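Using errbacks to catch exceptions in request processing

The errback is a callable that is called when an exception is raised while a request is being processed. It receives a Twisted Failure instance as its first argument.

The following example demonstrates this: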

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # page not found
        "http://www.httpbin.org/status/500",    # internal server error
        "http://www.httpbin.org:12345/",        # timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Received response from {}'.format(response.url))
        # ...

    def errback_httpbin(self, failure):
        # Log every failure
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # Non-2xx responses arrive here wrapped in an HttpError
            response = failure.value.response
            self.logger.error("HttpError occurred on %s", response.url)

        elif failure.check(DNSLookupError):
            # failure.request is the original request
            request = failure.request
            self.logger.error("DNSLookupError occurred on %s", request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error("TimeoutError occurred on %s", request.url)

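Request.meta special keys

Request.meta can hold arbitrary data, but some keys have a special meaning: they are recognized by Scrapy and its built-in extensions.

The following table lists some of these keys:

No.	Key	Description
1	dont_redirect	When set to True, the request is not redirected based on the status of the response.
2	dont_retry	When set to True, a failed request is not retried; the retry middleware ignores it.
3	handle_httpstatus_list	A list of the response status codes that are allowed for this request, that is, passed on to its callback.
4	handle_httpstatus_all	When set to True, any response status code is allowed for this request.
5	dont_merge_cookies	When set to True, the request's cookies are not merged with the existing cookies.
6	cookiejar	A key used to keep multiple cookie sessions per spider.
7	dont_cache	When set to True, the HTTP request and its response are not cached, whatever the cache policy.
8	redirect_urls	Contains the URLs that the request has passed through while being redirected.
9	bindaddress	The outgoing IP address to use when performing the request.
10	dont_obey_robotstxt	When set to True, the request is not filtered out even if it is forbidden by the robots.txt exclusion standard and `ROBOTSTXT_OBEY` is enabled.
11	download_timeout	The time (in seconds) that the downloader waits for this request before timing out.
12	download_maxsize	The maximum response size (in bytes) that the downloader will download for this request.
13	proxy	Sets the HTTP proxy to use for the request.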
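
To make the effect of `handle_httpstatus_list` and `handle_httpstatus_all` more concrete, here is a small pure-Python model of the decision they influence: whether a response with a given status code is passed to the callback. It is a simplified sketch, not Scrapy's own middleware code. The function name `reaches_callback` is invented for the example, redirects and retries are not modelled, and the spider-level and settings-level allow lists that Scrapy also consults are left out.

```python
def reaches_callback(status, meta):
    """Simplified model: is a response with this status passed to the callback?

    `meta` plays the role of Request.meta. Spider attributes and project
    settings that can also allow status codes are deliberately left out.
    """
    if 200 <= status < 300:
        return True  # successful responses are always passed on
    if meta.get("handle_httpstatus_all", False):
        return True  # every status code is allowed for this request
    return status in meta.get("handle_httpstatus_list", [])


print(reaches_callback(200, {}))
print(reaches_callback(404, {}))
print(reaches_callback(404, {"handle_httpstatus_list": [404]}))
print(reaches_callback(500, {"handle_httpstatus_all": True}))
```

In a spider, a dictionary like the ones above would be passed as the `meta` argument when the Request is built.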

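Request subclasses

You can implement your own custom functionality by subclassing the Request class. The built-in Request subclasses are described below.

FormRequest class

The FormRequest class extends the base Request with functionality for dealing with HTML forms. It has the following class signature:

class scrapy.http.FormRequest(url[, formdata, callback, method = 'GET', headers, body,
   cookies, meta, encoding = 'utf-8', priority = 0, dont_filter = False, errback])

It adds the following parameter:

formdata - A dictionary containing HTML form data, which is URL-encoded and assigned to the body of the request.

Note - The remaining parameters are the same as for the Request class and are explained in the Request object section.

In addition to the standard Request methods, FormRequest objects support the following class method:

classmethod from_response(response[, formname = None, formnumber = 0, formdata = None,
   formxpath = None, formcss = None, clickdata = None, dont_click = False, ...])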

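The following table describes the parameters of this class method:

No.	Parameter	Description
1	response	The Response object containing the HTML form that is used to pre-populate the form fields.
2	formname	A string; if given, the form whose name attribute is set to this value is used.
3	formnumber	An integer giving the index of the form to use when the response contains multiple forms.
4	formdata	A dictionary of form fields whose values override those found in the form.
5	formxpath	A string; if given, the first form that matches the XPath is used.
6	formcss	A string; if given, the first form that matches the CSS selector is used.
7	clickdata	A dictionary of attributes used to look up the control that is clicked.
8	dont_click	When set to True, the form data is submitted without clicking any element.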

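The following are some examples of using requests.

Examples

Using FormRequest to send data via HTTP POST

The following code shows how to return a FormRequest object when you want to reproduce an HTML form POST in your spider:

# Assumes FormRequest was imported with: from scrapy.http import FormRequest
return [FormRequest(url="http://www.something.com/post/action",
                    formdata={'firstname': 'john', 'lastname': 'dave'},
                    callback=self.after_post)]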

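Using FormRequest.from_response() to simulate a user login

Websites commonly use <input type="hidden"> elements to provide pre-populated form fields, such as session data or authentication tokens.

When you want these fields to be filled in automatically while scraping, you can use the FormRequest.from_response() method.

The following example demonstrates this.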

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://www.something.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'admin', 'password': 'confidential'},
            callback=self.after_login
        )

    def after_login(self, response):
        # response.text is a str, whereas response.body is bytes
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return

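Response object

A Response object represents an HTTP response, which is fed to the spiders for processing. It has the following class signature:

class scrapy.http.Response(url[, status = 200, headers, body, flags])

The following table describes the parameters of the Response object:

No.	Parameter	Description
1	url	A string specifying the URL of the response.
2	status	An integer containing the HTTP status of the response. Defaults to 200.
3	headers	A dictionary containing the response headers.
4	body	A bytes object containing the response body.
5	flags	A list containing the flags of the response.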

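Response subclasses

You can implement your own custom functionality by subclassing the Response class. The built-in Response subclasses are described below.

TextResponse object

TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data such as images, sounds and other media files. It has the following class signature:

class scrapy.http.TextResponse(url[, encoding[, status = 200, headers, body, flags]])

It adds the following parameter:

encoding - A string naming the encoding to use for this response.

Note - The remaining parameters are the same as for the Response class and are explained in the Response object section.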

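The following table lists the attributes that TextResponse objects support in addition to those of Response:

No.	Attribute	Description
1	text	The response body as a str. The decoded value is cached, so `response.text` can be accessed multiple times without extra overhead.
2	encoding	A string containing the encoding of the response.
3	selector	A Selector that uses the response as its target. It is instantiated lazily, on first access.

The following table lists the methods that TextResponse objects support in addition to those of Response:

No.	Method	Description
1	xpath(query)	A shortcut to `TextResponse.selector.xpath(query)`.
2	css(query)	A shortcut to `TextResponse.selector.css(query)`.
3	body_as_unicode()	Returns the response body as a str, the same value as `response.text`, but available as a method. It is deprecated in later Scrapy releases in favor of `response.text`.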

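HtmlResponse object

The HtmlResponse class is a subclass of TextResponse that adds encoding auto-discovery by looking at the HTML meta http-equiv attribute. Its parameters are the same as those of the Response class and are explained in the Response object section. It has the following class signature:

class scrapy.http.HtmlResponse(url[, status = 200, headers, body, flags])

XmlResponse object

The XmlResponse class is a subclass of TextResponse that adds encoding auto-discovery by looking at the XML declaration line. Its parameters are the same as those of the Response class and are explained in the Response object section. It has the following class signature:

class scrapy.http.XmlResponse(url[, status = 200, headers, body, flags])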
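
The idea behind both kinds of auto-discovery is that a document can declare its own encoding near the top. The following pure-Python sketch illustrates that idea with two regular expressions. It is a deliberately simplified model, not the detection code Scrapy uses; the function name `declared_encoding` and the sample documents are invented for the example.

```python
import re

# Matches: <meta http-equiv="Content-Type" content="text/html; charset=...">
META_RE = re.compile(
    rb"""<meta[^>]+http-equiv=["']?content-type["']?[^>]*charset=([\w-]+)""",
    re.IGNORECASE,
)
# Matches: <?xml version="1.0" encoding="..."?>
XML_RE = re.compile(rb"""<\?xml[^>]*encoding=["']([\w-]+)["']""", re.IGNORECASE)


def declared_encoding(body):
    """Return the encoding declared in the first bytes of body, or None."""
    head = body[:1024]  # declarations are expected near the start
    match = XML_RE.search(head) or META_RE.search(head)
    return match.group(1).decode("ascii").lower() if match else None


html_doc = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"></head></html>'
)
xml_doc = b'<?xml version="1.0" encoding="UTF-8"?><root/>'

print(declared_encoding(html_doc))
print(declared_encoding(xml_doc))
print(declared_encoding(b"<html></html>"))
```

When no declaration is found, the model returns None; a real implementation would then fall back to other sources such as HTTP headers or content-based detection.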
