Scrapy 2.3: Extracting data

Updated 2021-05-31 16:59

The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

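Note

When running the Scrapy shell from the command line, remember to always enclose the URL in quotes; otherwise, URLs that contain arguments (i.e. the & character) will not work.

On Windows, use double quotes instead: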

scrapy shell "http://quotes.toscrape.com/page/1/"

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Using the shell, you can try selecting elements with CSS on the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList (scrapy.selector.SelectorList). It represents a list of Selector (scrapy.selector.Selector) objects that wrap XML/HTML elements and let you run further queries to refine the selection or extract the data.

To extract the text from the title above, you can do:

>>> response.css('title::text').getall()
['Quotes to Scrape']

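There are two things to note here. The first is that we added ::text to the CSS query, which means we want to select only the text nodes directly inside the <title> element. If we don't specify ::text, we get the full title element, including its tags: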

>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']

The second is that the result of calling .getall() is a list: a selector may return more than one result, so we extract them all. When you know you only want the first result, as in this case, you can do:

>>> response.css('title::text').get()
'Quotes to Scrape'

As an alternative, you could have written:

>>> response.css('title::text')[0].get()
'Quotes to Scrape'

However, calling .get() directly on a SelectorList instance avoids an IndexError and returns None when it doesn't find any element matching the selection.

There's a lesson here: most scraping code should be resilient to errors caused by things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.

Besides the getall() and get() methods, you can also use the re() method to extract data using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

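To find the right CSS selectors, you may find it useful to open the response page from the shell in your web browser using view(response). You can use your browser's developer tools to inspect the HTML and come up with a selector (see Using your browser's Developer Tools for scraping).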

Selector Gadget is also a nice tool for quickly finding the CSS selector of visually selected elements, and it works in many browsers.
