
Scraping JD.com laptop data with Python and Scrapy, plus simple processing and analysis

2021-10-09 00:34 · Perhaps · Python

This article presents a worked example of using Python and Scrapy to crawl JD.com laptop data and then perform some simple processing and analysis of it, to help readers better understand and learn to use Python. Interested readers can follow along.

1. Environment Setup

  • python 3.8.3
  • PyCharm
  • Third-party packages required by the project

pip install scrapy fake-useragent requests selenium virtualenv -i https://pypi.douban.com/simple

1.1 Create a virtual environment

Switch to the target directory and create it:

virtualenv .venv

Remember to activate the virtual environment after it is created; a sketch of the activation commands follows.
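
A minimal sketch of the activation step, assuming the .venv directory created above (the exact command depends on the operating system):

# Windows (cmd or PowerShell)
.venv\Scripts\activate

# Linux / macOS
source .venv/bin/activate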

1.2 Create the project

scrapy startproject <project name>
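
Judging from the import paths used later (for example from lianjia.items import jd_detailitem), the project in this article appears to be named lianjia, so the command presumably looked like:

scrapy startproject lianjia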

1.3 Open the project in PyCharm and configure the virtual environment created above as the project interpreter
1.4 Create the JD spider

scrapy genspider <spider name> <url>
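
Based on the spider shown in section 3 (name ji_computer_detail, crawling JD search pages), the command was presumably something along the lines of:

scrapy genspider ji_computer_detail search.jd.com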

1.5 Edit the allowed domains in the generated spider and remove the https:// scheme, as shown in the sketch below
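
A minimal sketch of that edit, assuming the spider file generated above; allowed_domains must contain bare domain names rather than full URLs:

class jicomputerdetailspider(scrapy.Spider):
    name = 'ji_computer_detail'
    # generated as ['https://search.jd.com']; keep only the domain names, no scheme
    allowed_domains = ['search.jd.com', 'item.jd.com']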

2. Problem Analysis

The crawling strategy is to collect the basic information from the search results page first and then fetch the detailed product information from each item's detail page. When JD is crawled directly, a search page only returns 40 records, so the author uses Selenium inside a Scrapy downloader middleware to render the page and return all of its data.
The fields scraped are:

  • Product price
  • Number of product reviews
  • Product shop (brand)
  • Product SKU (the corresponding product can be found directly by searching JD)
  • Product title
  • Product details

3. Spider

import re

import scrapy

from lianjia.items import jd_detailitem


class jicomputerdetailspider(scrapy.Spider):
    name = 'ji_computer_detail'
    allowed_domains = ['search.jd.com', 'item.jd.com']
    start_urls = [
        'https://search.jd.com/search?keyword=%e7%ac%94%e8%ae%b0%e6%9c%ac%e7%94%b5%e8%84%91&suggest=1.def.0.base&wq=%e7%ac%94%e8%ae%b0%e6%9c%ac%e7%94%b5%e8%84%91&page=1&s=1&click=0']

    def parse(self, response):
        # Each product on the search results page is an <li> in the result list
        lls = response.xpath('//ul[@class="gl-warp clearfix"]/li')
        for ll in lls:
            item = jd_detailitem()
            computer_price = ll.xpath('.//div[@class="p-price"]/strong/i/text()').extract_first()
            computer_commit = ll.xpath('.//div[@class="p-commit"]/strong/a/text()').extract_first()
            computer_p_shop = ll.xpath('.//div[@class="p-shop"]/span/a/text()').extract_first()
            item['computer_price'] = computer_price
            item['computer_commit'] = computer_commit
            item['computer_p_shop'] = computer_p_shop
            meta = {
                'item': item
            }
            shop_detail_url = ll.xpath('.//div[@class="p-img"]/a/@href').extract_first()
            shop_detail_url = 'https:' + shop_detail_url
            yield scrapy.Request(url=shop_detail_url, callback=self.detail_parse, meta=meta)
        # Follow the remaining listing pages
        for i in range(2, 200, 2):
            next_page_url = f'https://search.jd.com/search?keyword=%e7%ac%94%e8%ae%b0%e6%9c%ac%e7%94%b5%e8%84%91&suggest=1.def.0.base&wq=%e7%ac%94%e8%ae%b0%e6%9c%ac%e7%94%b5%e8%84%91&page={i}&s=116&click=0'
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def detail_parse(self, response):
        item = response.meta.get('item')
        computer_sku = response.xpath('//a[@class="notice j-notify-sale"]/@data-sku').extract_first()
        item['computer_sku'] = computer_sku
        computer_title = response.xpath('//div[@class="sku-name"]/text()').extract_first().strip()
        # Keep only the non-whitespace characters of the title
        computer_title = ''.join(re.findall(r'\S', computer_title))
        item['computer_title'] = computer_title
        computer_detail = response.xpath('string(//ul[@class="parameter2 p-parameter-list"])').extract_first().strip()
        computer_detail = ''.join(re.findall(r'\S', computer_detail))
        item['computer_detail'] = computer_detail
        yield item

4. Item

import scrapy


class jd_detailitem(scrapy.Item):
    # define the fields for your item here like:
    computer_sku = scrapy.Field()
    computer_price = scrapy.Field()
    computer_title = scrapy.Field()
    computer_commit = scrapy.Field()
    computer_p_shop = scrapy.Field()
    computer_detail = scrapy.Field()

5. Settings

import random

from fake_useragent import UserAgent

ua = UserAgent()
USER_AGENT = ua.random
ROBOTSTXT_OBEY = False
# Random download delay between 0.5 and 1 second
DOWNLOAD_DELAY = random.uniform(0.5, 1)
DOWNLOADER_MIDDLEWARES = {
    'lianjia.middlewares.jddownloadermiddleware': 543
}
ITEM_PIPELINES = {
    'lianjia.pipelines.jd_csv_pipeline': 300
}

6. Pipelines

class jd_csv_pipeline:
    # def process_item(self, item, spider):
    #     return item
    def open_spider(self, spider):
        # Despite the .xlsx extension, this is written as a tab-separated text file
        self.fp = open('./jd_computer_message.xlsx', mode='w+', encoding='utf-8')
        self.fp.write('computer_sku\tcomputer_title\tcomputer_p_shop\tcomputer_price\tcomputer_commit\tcomputer_detail\n')

    def process_item(self, item, spider):
        # Write one tab-separated line per item
        try:
            line = '\t'.join(list(item.values())) + '\n'
            self.fp.write(line)
            return item
        except:
            pass

    def close_spider(self, spider):
        # Close the file
        self.fp.close()

7. Middlewares

import re
import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver import ChromeOptions


class jddownloadermiddleware:
    def process_request(self, request, spider):
        # Only handle requests from the ji_computer_detail spider,
        # and only for listing pages (item.jd.com detail pages go through Scrapy directly)
        if spider.name == 'ji_computer_detail' and re.findall(f'.*(item.jd.com).*', request.url) == []:
            options = ChromeOptions()
            options.add_argument("--headless")
            driver = webdriver.Chrome(options=options)
            driver.get(request.url)
            # Scroll down step by step so that lazily loaded products are rendered
            for i in range(0, 15000, 5000):
                driver.execute_script(f'window.scrollTo(0, {i})')
                time.sleep(0.5)
            body = driver.page_source.encode()
            time.sleep(1)
            return HtmlResponse(url=request.url, body=body, request=request)
        return None

8. Simple Processing and Analysis with Jupyter

Additional files used: the Baidu stopword list and a simplified Chinese font file (the code below uses simhei.ttf).

Install the third-party packages:

!pip install seaborn jieba wordcloud pillow imageio python-docx -i https://pypi.douban.com/simple

8.1 Import third-party packages

import re
import os
import jieba
import wordcloud
import pandas as pd
import numpy as np
from PIL import Image
import seaborn as sns
from docx import Document
from docx.shared import Inches
import matplotlib.pyplot as plt
from pandas import DataFrame, Series

8.2 Set the default plotting font and the seaborn style

sns.set_style('darkgrid')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

8.3 Read the data

df_jp = pd.read_excel('./jd_shop.xlsx')

8.4 Extract the Intel i5/i7/i9 processor type

def convert_one(s):
    if re.findall(f'.*?(i5).*', str(s)) != []:
        return re.findall(f'.*?(i5).*', str(s))[0]
    elif re.findall(f'.*?(i7).*', str(s)) != []:
        return re.findall(f'.*?(i7).*', str(s))[0]
    elif re.findall(f'.*?(i9).*', str(s)) != []:
        return re.findall(f'.*?(i9).*', str(s))[0]
df_jp['computer_intel'] = df_jp['computer_detail'].map(convert_one)

8.5 Extract the laptop screen-size range

def convert_two(s):
    if re.findall(f'.*?(\d+\.\d+英寸-\d+\.\d+英寸).*', str(s)) != []:
        return re.findall(f'.*?(\d+\.\d+英寸-\d+\.\d+英寸).*', str(s))[0]
df_jp['computer_in'] = df_jp['computer_detail'].map(convert_two)

8.6 Convert the review count to an integer

def convert_three(s):
    # e.g. "2萬+" becomes 20000
    if re.findall(f'(\d+)萬+', str(s)) != []:
        number = int(re.findall(f'(\d+)萬+', str(s))[0]) * 10000
        return number
    elif re.findall(f'(\d+)+', str(s)) != []:
        number = int(re.findall(f'(\d+)+', str(s))[0])
        return number
df_jp['computer_commit'] = df_jp['computer_commit'].map(convert_three)

8.7 Extract the brands to be analysed

def find_computer(name, s):
    sr = re.findall(f'.*({name}).*', str(s))[0]
    return sr
def convert(s):
    if re.findall(f'.*(聯想).*', str(s)) != []:
        return find_computer('聯想', s)
    elif re.findall(f'.*(惠普).*', str(s)) != []:
        return find_computer('惠普', s)
    elif re.findall(f'.*(華為).*', str(s)) != []:
        return find_computer('華為', s)
    elif re.findall(f'.*(戴爾).*', str(s)) != []:
        return find_computer('戴爾', s)
    elif re.findall(f'.*(華碩).*', str(s)) != []:
        return find_computer('華碩', s)
    elif re.findall(f'.*(小米).*', str(s)) != []:
        return find_computer('小米', s)
    elif re.findall(f'.*(榮耀).*', str(s)) != []:
        return find_computer('榮耀', s)
    elif re.findall(f'.*(神舟).*', str(s)) != []:
        return find_computer('神舟', s)
    elif re.findall(f'.*(外星人).*', str(s)) != []:
        return find_computer('外星人', s)
df_jp['computer_p_shop'] = df_jp['computer_p_shop'].map(convert)

8.8 Drop rows where the key fields are null

for n in ['computer_price', 'computer_commit', 'computer_p_shop', 'computer_sku', 'computer_detail', 'computer_intel', 'computer_in']:
    index_ls = df_jp[df_jp[[n]].isnull().any(axis=1) == True].index
    df_jp.drop(index=index_ls, inplace=True)

8.9 Average price by brand

plt.figure(figsize=(10, 8), dpi=100)
ax = sns.barplot(x='computer_p_shop', y='computer_price', data=df_jp.groupby(by='computer_p_shop')[['computer_price']].mean().reset_index())
for index,row in df_jp.groupby(by='computer_p_shop')[['computer_price']].mean().reset_index().iterrows():
    ax.text(row.name,row['computer_price'] + 2,round(row['computer_price'],2),color="black",ha="center")
ax.set_xlabel('品牌')
ax.set_ylabel('平均價格')
ax.set_title('各品牌平均價格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('各品牌平均價格.png', dpi=400)

(Figure: bar chart of the average price by brand)

8.10 Price range by brand

plt.figure(figsize=(10, 8), dpi=100)
ax = sns.boxenplot(x='computer_p_shop', y='computer_price', data=df_jp.query('computer_price>500'))
ax.set_xlabel('品牌')
ax.set_ylabel('價格區間')
ax.set_title('各品牌價格區間')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('各品牌價格區間.png', dpi=400)

(Figure: boxen plot of the price range by brand)

8.11 Relationship between price and review count

df_jp['computer_commit'] = df_jp['computer_commit'].astype('int64')
ax = sns.jointplot(x="computer_commit", y="computer_price", data=df_jp, kind="reg", truncate=False, color="m", height=10)
ax.fig.savefig('評論數與價格的關系.png')

(Figure: joint regression plot of review count versus price)

8.12 Keywords appearing in product titles

import imageio

# Convert the title column to a list
ls = df_jp['computer_title'].to_list()
# Replace everything that is not a Chinese or English character
feature_points = [re.sub(r'[^a-zA-Z\u4e00-\u9fa5]+', ' ', str(feature)) for feature in ls]
# Read the stopword list
stop_world = list(pd.read_csv('./百度停用詞表.txt', engine='python', encoding='utf-8', names=['stopwords'])['stopwords'])
feature_points2 = []
for feature in feature_points:  # iterate over every title
    words = jieba.lcut(feature)  # accurate mode, no redundancy: segment each title with jieba
    ind1 = np.array([len(word) > 1 for word in words])  # flag tokens longer than one character
    ser1 = pd.Series(words)
    ser2 = ser1[ind1]  # keep only tokens longer than one character
    ind2 = ~ser2.isin(stop_world)  # note the negation
    ser3 = ser2[ind2].unique()  # drop tokens in the stopword list and deduplicate
    if len(ser3) > 0:
        feature_points2.append(list(ser3))
# Collect all tokens into a single list
wordlist = [word for feature in feature_points2 for word in feature]
# Join all tokens into one string
feature_str = ' '.join(wordlist)
# Title analysis
font_path = r'./simhei.ttf'
shoes_box_jpg = imageio.imread('./home.jpg')
wc = wordcloud.WordCloud(
    background_color='black',
    mask=shoes_box_jpg,
    font_path=font_path,
    min_font_size=5,
    max_font_size=50,
    width=260,
    height=260,
)
wc.generate(feature_str)
plt.figure(figsize=(10, 8), dpi=100)
plt.imshow(wc)
plt.axis('off')
plt.savefig('標題提取關鍵詞')

(Figure: word cloud of keywords extracted from product titles)

8.13 Filter Lenovo laptops priced between 4000 and 5000 with an i5 processor and a screen of 15 inches or more, and look at their prices

df_jd_query = df_jp.loc[(df_jp['computer_price'] <=5000) & (df_jp['computer_price']>=4000) & (df_jp['computer_p_shop']=="聯想") & (df_jp['computer_intel']=="i5") & (df_jp['computer_in']=="15.0英寸-15.9英寸"), :].copy()
plt.figure(figsize=(20, 10), dpi=100)
ax = sns.barplot(x='computer_sku', y='computer_price', data=df_jd_query)
ax.set_xlabel('聯想品牌sku')
ax.set_ylabel('價格')
ax.set_title('酷睿i5處理器屏幕15寸以上各sku的價格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('酷睿i5處理器屏幕15寸以上各sku的價格.png', dpi=400)

(Figure: price per SKU for Lenovo i5 laptops with screens of 15 inches or more)

8.14 Filter Dell laptops priced between 4000 and 5000 with an i7 processor and a screen of 15 inches or more, and look at their prices

df_jp_daier = df_jp.loc[(df_jp['computer_price'] <=5000) & (df_jp['computer_price']>=4000) & (df_jp['computer_p_shop']=="戴爾") & (df_jp['computer_intel']=="i7") & (df_jp['computer_in']=="15.0英寸-15.9英寸"), :].copy()
plt.figure(figsize=(10, 8), dpi=100)
ax = sns.barplot(x='computer_sku', y='computer_price', data=df_jp_daier)
ax.set_xlabel('戴爾品牌sku')
ax.set_ylabel('價格')
ax.set_title('酷睿i7處理器屏幕15寸以上各sku的價格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('酷睿i7處理器屏幕15寸以上各sku的價格.png', dpi=400)

(Figure: price per SKU for Dell i7 laptops with screens of 15 inches or more)

8.15 Prices by brand for different Intel processors

plt.figure(figsize=(10, 8), dpi=100)
ax = sns.barplot(x='computer_p_shop', y='computer_price', data=df_jp, hue='computer_intel')
ax.set_xlabel('品牌')
ax.set_ylabel('價格')
ax.set_title('不同酷睿處理器品牌的價格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('不同酷睿處理器品牌的價格.png', dpi=400)

(Figure: grouped bar chart of price by brand and Intel processor)

8.16 Prices by brand for different screen sizes

plt.figure(figsize=(10, 8), dpi=100)
ax = sns.barplot(x='computer_p_shop', y='computer_price', data=df_jp, hue='computer_in')
ax.set_xlabel('品牌')
ax.set_ylabel('價格')
ax.set_title('不同尺寸品牌的價格')
boxplot_fig = ax.get_figure()
boxplot_fig.savefig('不同尺寸品牌的價格.png', dpi=400)

(Figure: grouped bar chart of price by brand and screen size)

This concludes the detailed walkthrough of scraping JD.com laptop data with Python and Scrapy and performing some simple processing and analysis. For more material on scraping JD data with Python, please follow the other related articles on 服務器之家!

Original article: https://blog.csdn.net/weixin_45920625/article/details/115673622
