Sharing some good stuff!!!
Python Data Scraping and Analysis
Modules used: requests, lxml, pymongo, time, BeautifulSoup
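The snippets below reference a few names (db, table, url) without defining them. A minimal sketch of the shared setup they assume, with hypothetical database and collection names, might look like this:

```python
# -*- coding: utf-8 -*-
# Shared setup assumed by the snippets below (Python 2 era, like the article).
import time

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['products']       # database name is an assumption
table = 'product_urls'        # collection name is an assumption
url = 'http://example.com'    # site base URL is an assumption
```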
First, fetch the category URLs for all the products:
```python
def step():
    try:
        headers = {}  # request headers elided in the original
        r = requests.get(url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        # 'pattern' stands in for the selector/regex elided in the original
        links = soup.find_all(pattern)
        for i in links:
            url2 = i.find_all('a')
            for j in url2:
                step1url = url + j['href']
                print step1url
                step2(step1url)
    except Exception, e:
        print e
```
我們?cè)诋a(chǎn)品分類的同時(shí)需要確定我們所訪問(wèn)的地址是產(chǎn)品還是又一個(gè)分類的產(chǎn)品地址(所以需要判斷我們?cè)L問(wèn)的地址是否含有if判斷標(biāo)志):
```python
def step2(step1url):
    try:
        headers = {}  # request headers elided in the original
        r = requests.get(step1url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        a = soup.find('div', id='divTbl')
        if a:  # the page is another level of categories
            tds = soup.find_all('td', class_='S-ITabs')
            for i in tds:
                classifyurl = i.find_all('a')
                for j in classifyurl:
                    step2url = url + j['href']  # 'url' is the module-level base URL
                    # print step2url
                    step3(step2url)  # step3 handles the next category level (defined elsewhere)
        else:  # the page lists products
            postdata(step1url)
    except Exception, e:
        print e
```
When the if test is true, we collect the next level of category URLs (back to step one); otherwise the postdata function runs and the product URLs on the page are scraped:
```python
def producturl(url):
    try:
        # 'doc' is an lxml tree built elsewhere, e.g. lxml.html.document_fromstring(html);
        # 'xpath_expr' stands in for the XPath expressions elided in the original
        p1url = doc.xpath(xpath_expr)
        for i in xrange(1, len(p1url) + 1):
            p2url = doc.xpath(xpath_expr)
            if len(p2url) > 0:
                producturl = url + p2url[0].get('href')
                count = db[table].find({'url': producturl}).count()
                if count <= 0:  # only insert URLs we have not stored yet
                    sn = getNewsn()
                    db[table].insert({"sn": sn, "url": producturl})
                    print str(sn) + ' inserted successfully'
                else:
                    print 'url exist'
    except Exception, e:
        print e
```
This is where the product URLs we collect are written into MongoDB, with sn serving as a new id for each URL.
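getNewsn() is not shown in the article; a minimal sketch, assuming an auto-incrementing counter document kept in a hypothetical counters collection, could be:

```python
def getNewsn():
    # Atomically bump a counter document and return the new sequence number.
    # The 'counters' collection and '_id' value are assumptions, not from the article.
    ret = db['counters'].find_and_modify(
        query={'_id': 'product_sn'},
        update={'$inc': {'seq': 1}},
        upsert=True,
        new=True)
    return ret['seq']
```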
Next we need to pull each URL back out of MongoDB by its new id, visit it, parse and scrape the product data, and update the records in the database.
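The article does not show that lookup loop; a minimal sketch, walking the stored URLs in sn order and handing each to the parser() function below, might be:

```python
def run():
    # Iterate over the saved product URLs in sn order and parse each page.
    for row in db[table].find().sort('sn', 1):
        parser(row['sn'], row['url'])
```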
BeautifulSoup is the module used most here, but it is painful to use on valuable data that lives inside JavaScript, so for data in JS I recommend XPath; that requires parsing the page with lxml's html.document_fromstring() (note it takes the HTML string, not the URL).
Be careful when extracting valuable data with XPath! If you want to learn more about XPath, leave a comment below and I will answer as soon as I can.
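As a short illustration of that XPath approach (the expressions and the class name here are made-up examples, not the target site's real markup):

```python
from lxml import html

def xpath_example(page_html):
    # Build an element tree from the raw HTML string (not from a URL).
    doc = html.document_fromstring(page_html)
    # Text of inline <script> blocks, where JS-embedded values often hide.
    scripts = doc.xpath('//script/text()')
    # href attributes of product links; the class name is hypothetical.
    links = doc.xpath('//a[@class="product-link"]/@href')
    return scripts, links
```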
The complete product-page parser:

```python
def parser(sn, url):
    try:
        headers = {}  # request headers elided in the original
        r = requests.get(url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        dt = {}
        # part number
        a = soup.find("meta", itemprop="mpn")
        if a:
            dt['partno'] = a['content']
        # manufacturer
        b = soup.find("meta", itemprop="manufacturer")
        if b:
            dt['manufacturer'] = b['content']
        # description
        c = soup.find("span", itemprop="description")
        if c:
            dt['description'] = c.get_text().strip()
        # price table: quantity break -> price, with the euro sign stripped
        price = soup.find("table", class_="table table-condensed occalc_pa_table")
        if price:
            cost = {}
            for i in price.find_all('tr'):
                if len(i) > 1:  # skip rows without data cells
                    td = i.find_all('td')
                    key = td[0].get_text().strip().replace(',', '')
                    val = td[1].get_text().replace(u'\u20ac', '').strip()
                    if key and val:
                        cost[key] = val
            if cost:
                dt['cost'] = cost
                dt['currency'] = 'EUR'
        # quantity
        d = soup.find("input", id="ItemQuantity")
        if d:
            dt['quantity'] = d['value']
        # specs: pair up <dt> labels with <dd> values
        e = soup.find("div", class_="row parameter-container")
        if e:
            key1 = []
            val1 = []
            for k in e.find_all('dt'):
                key = k.get_text().strip().strip('.')
                if key:
                    key1.append(key)
            for i in e.find_all('dd'):
                val = i.get_text().strip()
                if val:
                    val1.append(val)
            specs = dict(zip(key1, val1))
            if specs:
                dt['specs'] = specs
        print dt
        if dt:
            db[table].update({'sn': sn}, {'$set': dt})
            print str(sn) + ' updated successfully'
            time.sleep(3)
        else:
            error(str(sn) + '\t' + url)  # error() logs failures; defined elsewhere
    except Exception, e:
        error(str(sn) + '\t' + url)
        print "No data!"
```
Finally, run the whole program: the valuable data is parsed, processed, and stored in the database.
That is everything in this article's walkthrough of data scraping with Python + MongoDB; I hope it helps.
Original post: http://www.cnblogs.com/zhuPython/p/7724242.html