Sharing some good stuff!!!
Python Data Scraping and Analysis
Modules used: requests, lxml, pymongo, time, BeautifulSoup
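The snippets below reference a few names (db, table, url) without defining them. A minimal sketch of the shared setup they assume, with hypothetical database and collection names, might look like this:

```python
# -*- coding: utf-8 -*-
# Shared setup assumed by the snippets below (Python 2 era, like the article).
import time

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['products']       # database name is an assumption
table = 'product_urls'        # collection name is an assumption
url = 'http://example.com'    # site base URL is an assumption
```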
First, fetch the category URLs for all the products:
```python
def step():
    try:
        headers = {}  # request headers elided in the original
        r = requests.get(url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        # 'pattern' stands in for the selector/regex elided in the original
        links = soup.find_all(pattern)
        for i in links:
            url2 = i.find_all('a')
            for j in url2:
                step1url = url + j['href']
                print step1url
                step2(step1url)
    except Exception, e:
        print e
```
我們?cè)诋a(chǎn)品分類的同時(shí)需要確定我們所訪問(wèn)的地址是產(chǎn)品還是又一個(gè)分類的產(chǎn)品地址(所以需要判斷我們?cè)L問(wèn)的地址是否含有if判斷標(biāo)志):
```python
def step2(step1url):
    try:
        headers = {}  # request headers elided in the original
        r = requests.get(step1url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        a = soup.find('div', id='divTbl')
        if a:  # the page is another level of categories
            tds = soup.find_all('td', class_='S-ITabs')
            for i in tds:
                classifyurl = i.find_all('a')
                for j in classifyurl:
                    step2url = url + j['href']  # 'url' is the module-level base URL
                    # print step2url
                    step3(step2url)  # step3 handles the next category level (defined elsewhere)
        else:  # the page lists products
            postdata(step1url)
    except Exception, e:
        print e
```
When the if test is true, we collect the next level of category URLs (back to step one); otherwise the postdata function runs and the product URLs on the page are scraped:
```python
def producturl(url):
    try:
        # 'doc' is an lxml tree built elsewhere, e.g. lxml.html.document_fromstring(html);
        # 'xpath_expr' stands in for the XPath expressions elided in the original
        p1url = doc.xpath(xpath_expr)
        for i in xrange(1, len(p1url) + 1):
            p2url = doc.xpath(xpath_expr)
            if len(p2url) > 0:
                producturl = url + p2url[0].get('href')
                count = db[table].find({'url': producturl}).count()
                if count <= 0:  # only insert URLs we have not stored yet
                    sn = getNewsn()
                    db[table].insert({"sn": sn, "url": producturl})
                    print str(sn) + ' inserted successfully'
                else:
                    print 'url exist'
    except Exception, e:
        print e
```
This is where the product URLs we collect are written into MongoDB, with sn serving as a new id for each URL.
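getNewsn() is not shown in the article; a minimal sketch, assuming an auto-incrementing counter document kept in a hypothetical counters collection, could be:

```python
def getNewsn():
    # Atomically bump a counter document and return the new sequence number.
    # The 'counters' collection and '_id' value are assumptions, not from the article.
    ret = db['counters'].find_and_modify(
        query={'_id': 'product_sn'},
        update={'$inc': {'seq': 1}},
        upsert=True,
        new=True)
    return ret['seq']
```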
Next we need to pull each URL back out of MongoDB by its new id, visit it, parse and scrape the product data, and update the records in the database.
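The article does not show that lookup loop; a minimal sketch, walking the stored URLs in sn order and handing each to the parser() function below, might be:

```python
def run():
    # Iterate over the saved product URLs in sn order and parse each page.
    for row in db[table].find().sort('sn', 1):
        parser(row['sn'], row['url'])
```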
BeautifulSoup is the module used most here, but it is painful to use on valuable data that lives inside JavaScript, so for data in JS I recommend XPath; that requires parsing the page with lxml's html.document_fromstring() (note it takes the HTML string, not the URL).
Be careful when extracting valuable data with XPath! If you want to learn more about XPath, leave a comment below and I will answer as soon as I can.
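As a short illustration of that XPath approach (the expressions and the class name here are made-up examples, not the target site's real markup):

```python
from lxml import html

def xpath_example(page_html):
    # Build an element tree from the raw HTML string (not from a URL).
    doc = html.document_fromstring(page_html)
    # Text of inline <script> blocks, where JS-embedded values often hide.
    scripts = doc.xpath('//script/text()')
    # href attributes of product links; the class name is hypothetical.
    links = doc.xpath('//a[@class="product-link"]/@href')
    return scripts, links
```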
The complete product-page parser:

```python
def parser(sn, url):
    try:
        headers = {}  # request headers elided in the original
        r = requests.get(url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        dt = {}
        # part number
        a = soup.find("meta", itemprop="mpn")
        if a:
            dt['partno'] = a['content']
        # manufacturer
        b = soup.find("meta", itemprop="manufacturer")
        if b:
            dt['manufacturer'] = b['content']
        # description
        c = soup.find("span", itemprop="description")
        if c:
            dt['description'] = c.get_text().strip()
        # price table: quantity break -> price, with the euro sign stripped
        price = soup.find("table", class_="table table-condensed occalc_pa_table")
        if price:
            cost = {}
            for i in price.find_all('tr'):
                if len(i) > 1:  # skip rows without data cells
                    td = i.find_all('td')
                    key = td[0].get_text().strip().replace(',', '')
                    val = td[1].get_text().replace(u'\u20ac', '').strip()
                    if key and val:
                        cost[key] = val
            if cost:
                dt['cost'] = cost
                dt['currency'] = 'EUR'
        # quantity
        d = soup.find("input", id="ItemQuantity")
        if d:
            dt['quantity'] = d['value']
        # specs: pair up <dt> labels with <dd> values
        e = soup.find("div", class_="row parameter-container")
        if e:
            key1 = []
            val1 = []
            for k in e.find_all('dt'):
                key = k.get_text().strip().strip('.')
                if key:
                    key1.append(key)
            for i in e.find_all('dd'):
                val = i.get_text().strip()
                if val:
                    val1.append(val)
            specs = dict(zip(key1, val1))
            if specs:
                dt['specs'] = specs
        print dt
        if dt:
            db[table].update({'sn': sn}, {'$set': dt})
            print str(sn) + ' updated successfully'
            time.sleep(3)
        else:
            error(str(sn) + '\t' + url)  # error() logs failures; defined elsewhere
    except Exception, e:
        error(str(sn) + '\t' + url)
        print "No data!"
```
Finally, run the whole program: the valuable data is parsed, processed, and stored in the database.
That is everything in this article's walkthrough of data scraping with Python + MongoDB; I hope it helps.
Original post: http://www.cnblogs.com/zhuPython/p/7724242.html