
Crawling WeChat Official Account Articles and Comments with Python (Based on Fiddler Packet Capture Analysis)

2021-07-26 00:28 | happyJared | Python

This article explains how to crawl WeChat Official Account articles and comments with Python, based on Fiddler packet capture analysis. The sample code is covered in detail, so it should be a useful reference for study or work for anyone interested.

Background

WeChat Official Accounts feel like one of the harder platforms to crawl, but after some tinkering the results were worthwhile. Scrapy was not used (crawling too fast would likely trigger anti-crawling limits anyway), but more hands-on write-ups will follow. The development environment for this project:

  • python3
  • requests
  • psycopg2 (for working with a PostgreSQL database)
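
Both third-party packages install with pip. As a side note (not specific to this project), psycopg2-binary is the precompiled wheel commonly used when building psycopg2 from source is inconvenient; the import name is still psycopg2:

pip install requests psycopg2-binary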

Packet Capture Analysis

There is no restriction on which account can be crawled, but each account needs its own analysis before every crawl. Open Fiddler and point the phone's proxy at it; to cut down the noise, add a Filter rule in Fiddler so that only the WeChat domain mp.weixin.qq.com is shown:

[Figure: Fiddler Filter rule configuration]
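
As a side note, the crawler's own requests can also be routed through Fiddler, which makes it easy to compare them with what the phone sends. A minimal sketch, assuming Fiddler is running locally on its default port 8888 (verify=False lets Fiddler's root certificate intercept HTTPS and should only ever be used for local debugging):

import requests

# send traffic through the local Fiddler proxy so it shows up in the capture
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
}
resp = requests.get('https://mp.weixin.qq.com', proxies=proxies, verify=False)
print(resp.status_code)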

I follow quite a few accounts; this walkthrough uses the "36Kr" account as the example. Read on:

[Figure: the "36Kr" official account]

[Figure: account page, top-right corner -> All Messages]

On the account's profile page, tap the three solid dots in the top-right corner to enter the message view, scroll down and tap "All Messages", then pull down a few times to load more history. Back in Fiddler you should now see those requests. The responses are JSON, and the article data itself is a JSON string embedded in the general_msg_list field:

[Figure: captured request for the article list]

Analyzing the Article List API

Paste the request URL and Cookie for analysis:

https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MzI2NDk5NzA0Mw==&f=json&offset=10&count=10&is_ok=1&scene=126&uin=777&key=777&pass_ticket=QhOypNwH5dAr5w6UgMjyBrTSOdMEUT86vWc73GANoziWFl8xJd1hIMbMZ82KgCpN&wxtoken=&appmsg_token=971_LwY7Z%252BFBoaEv5z8k_dFWfJkdySbNkMR4OmFxNw~~&x5=1&f=json

Cookie: pgv_pvid=2027337976; pgv_info=ssid=s3015512850; rewardsn=; wxtokenkey=777; wxuin=2089823341; devicetype=android-26; version=26070237; lang=zh_CN; pass_ticket=NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy; wap_sid2=CO3YwOQHEogBQnN4VTNhNmxQWmc3UHI2U3kteWhUeVExZHFVMnN0QXlsbzVJRUJKc1pkdVFUU2Y5UzhSVEtOZmt1VVlYTkR4SEllQ2huejlTTThJWndMQzZfYUw2SldLVGVMQUthUjc3QWdVMUdoaGN0Nml2SU05cXR1dTN2RkhRUVd1V2Y3SFJ5d01BQUF+fjCB1pLcBTgNQJVO
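
Before walking through the parameters one by one, the query string is easier to read if it is split up with the standard library. A small sketch using the captured URL above (tokens shortened here):

from urllib.parse import parse_qs, urlparse

captured = ('https://mp.weixin.qq.com/mp/profile_ext?action=getmsg'
            '&__biz=MzI2NDk5NzA0Mw==&f=json&offset=10&count=10&is_ok=1'
            '&scene=126&uin=777&key=777&pass_ticket=QhOypNwH...&wxtoken='
            '&appmsg_token=971_LwY7Z...&x5=1&f=json')

# parse_qs maps each parameter name to a list of its values
for name, values in parse_qs(urlparse(captured).query).items():
    print(name, '=', values[0])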

The important parameters are explained below (anything not mentioned matters less); a short request sketch follows the list:

  • __biz: effectively the id of the current official account (a unique, fixed marker)
  • offset: the request offset for the article list API (starts at 0); each JSON response carries the offset for the next request. Note that it does not grow by any fixed rule.
  • count: number of items per request (10 is the maximum, as far as I tested)
  • pass_ticket: roughly a request ticket; it expires after a while (a few hours or so), which is why official accounts are hard to crawl on a fixed schedule
  • appmsg_token: likewise a non-fixed ticket with an expiry policy
  • Cookie: the whole captured string can be pasted in, but at minimum only the wap_sid2 part is required
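
A minimal sketch of issuing a single list request with these parameters through requests; the __biz value comes from the capture above, while pass_ticket, appmsg_token and the wap_sid2 cookie are placeholders that must be refreshed from your own capture:

import requests

params = {
    'action': 'getmsg', '__biz': 'MzI2NDk5NzA0Mw==', 'f': 'json',
    'offset': 0, 'count': 10, 'is_ok': 1, 'scene': 126,
    'uin': 777, 'key': 777, 'wxtoken': '', 'x5': 1,
    'pass_ticket': '<fresh pass_ticket>',    # expires after a few hours
    'appmsg_token': '<fresh appmsg_token>',  # same expiry caveat
}
headers = {'Cookie': 'wap_sid2=<fresh value>'}  # wap_sid2 alone is enough

resp = requests.get('https://mp.weixin.qq.com/mp/profile_ext',
                    params=params, headers=headers).json()
print(resp.get('ret'), resp.get('errmsg'), resp.get('next_offset'))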

Does this feel a bit tedious? Since the goal is not a large-scale professional crawler, this per-account analysis is acceptable, and we can move on. Below is an excerpt of the JSON data, used to design the article table:

{
  "ret": 0,
  "errmsg": "ok",
  "msg_count": 10,
  "can_msg_continue": 1,
  "general_msg_list": "{\"list\":[{\"comm_msg_info\":{\"id\":1000005700,\"type\":49,\"datetime\":1535100943,\"fakeid\":\"3264997043\",\"status\":2,\"content\":\"\"},\"app_msg_ext_info\":{\"title\":\"金融危機又十年:錢荒之下,二手基金迎來高光時刻\",\"digest\":\"退出永遠是基金的主旋律。\",\"content\":\"\",\"fileid\":100034824,\"content_url\":\"http:\\/\\/mp.weixin.qq.com\\/s?__biz=MzI2NDk5NzA0Mw==&mid=2247518479&idx=1&sn=124ab52f7478c1069a6b4592cdf3c5f5&chksm=eaa6d8d3ddd151c5bb95a7ae118de6d080023246aa0a419e1d53bfe48a8d9a77e52b752d9b80&scene=27#wechat_redirect\",\"source_url\":\"\",\"cover\":\"http:\\/\\/mmbiz.qpic.cn\\/mmbiz_jpg\\/QicyPhNHD5vYgdpprkibtnWCAN7l4ZaqibKvopNyCWWLQAwX7QpzWicnQSVfcBZmPrR5YuHS45JIUzVjb0dZTiaLPyA\\/0?wx_fmt=jpeg\",\"subtype\":9,\"is_multi\":0,\"multi_app_msg_item_list\":[],\"author\":\"石亞瓊\",\"copyright_stat\":11,\"duration\":0,\"del_flag\":1,\"item_show_type\":0,\"audio_fileid\":0,\"play_url\":\"\",\"malicious_title_reason_id\":0,\"malicious_content_type\":0}}]}",
  "next_offset": 20,
  "video_count": 1,
  "use_video_tab": 1,
  "real_type": 0
}
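
Because general_msg_list is itself a JSON string, it has to be decoded a second time before the article fields become accessible. A minimal sketch, assuming the response above has been loaded into resp:

import json

msg_list = json.loads(resp['general_msg_list'])['list']
for msg in msg_list:
    info = msg['comm_msg_info']              # shared by every article in one push
    ext = msg.get('app_msg_ext_info') or {}  # missing for non-article messages
    print(info['id'], info['datetime'], ext.get('title'), ext.get('content_url'))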

The wanted fields are easy to pick out of that. The article table is structured as follows; the SQL to create it is attached:

[Figure: article data table]

-- ----------------------------
-- Table structure for tb_article
-- ----------------------------
DROP TABLE IF EXISTS "public"."tb_article";
CREATE TABLE "public"."tb_article" (
    "id" serial4 PRIMARY KEY,
    "msg_id" int8 NOT NULL,
    "title" varchar(200) COLLATE "pg_catalog"."default" NOT NULL,
    "author" varchar(20) COLLATE "pg_catalog"."default",
    "cover" varchar(500) COLLATE "pg_catalog"."default",
    "digest" varchar(200) COLLATE "pg_catalog"."default",
    "source_url" varchar(800) COLLATE "pg_catalog"."default",
    "content_url" varchar(600) COLLATE "pg_catalog"."default" NOT NULL,
    "post_time" timestamp(6),
    "create_time" timestamp(6) NOT NULL
);
COMMENT ON COLUMN "public"."tb_article"."id" IS 'auto-increment primary key';
COMMENT ON COLUMN "public"."tb_article"."msg_id" IS 'message id (unique)';
COMMENT ON COLUMN "public"."tb_article"."title" IS 'title';
COMMENT ON COLUMN "public"."tb_article"."author" IS 'author';
COMMENT ON COLUMN "public"."tb_article"."cover" IS 'cover image';
COMMENT ON COLUMN "public"."tb_article"."digest" IS 'summary';
COMMENT ON COLUMN "public"."tb_article"."source_url" IS 'original source url';
COMMENT ON COLUMN "public"."tb_article"."content_url" IS 'article url';
COMMENT ON COLUMN "public"."tb_article"."post_time" IS 'publish time';
COMMENT ON COLUMN "public"."tb_article"."create_time" IS 'insert time';
COMMENT ON TABLE "public"."tb_article" IS 'official account article table';
-- ----------------------------
-- Indexes structure for table tb_article
-- ----------------------------
CREATE UNIQUE INDEX "unique_msg_id" ON "public"."tb_article" USING btree (
    "msg_id" "pg_catalog"."int8_ops" ASC NULLS LAST
);

Attached is the code that requests the article API, parses the data, and saves it to the database:

import json
import time
from datetime import datetime

import requests

from utils import pgs  # the author's own PostgreSQL helper (a sketch follows below)


class WxMps(object):
    """Crawler for WeChat Official Account articles and comments"""

    def __init__(self, _biz, _pass_ticket, _app_msg_token, _cookie, _offset=0):
        self.offset = _offset
        self.biz = _biz  # official account id
        self.msg_token = _app_msg_token  # ticket (not fixed)
        self.pass_ticket = _pass_ticket  # ticket (not fixed)
        self.headers = {
            'Cookie': _cookie,  # Cookie (not fixed)
            'User-Agent': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 '
        }
        wx_mps = 'wxmps'  # database name, user and password are all the same here (replace with real values)
        self.postgres = pgs.Pgs(host='localhost', port='5432', db_name=wx_mps, user=wx_mps, password=wx_mps)

    def start(self):
        """Request the article list API of the official account"""
        offset = self.offset
        while True:
            api = 'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={0}&f=json&offset={1}' \
                  '&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket={2}&wxtoken=&appmsg_token' \
                  '={3}&x5=1&f=json'.format(self.biz, offset, self.pass_ticket, self.msg_token)

            resp = requests.get(api, headers=self.headers).json()
            ret, status = resp.get('ret'), resp.get('errmsg')  # status info
            if ret == 0 or status == 'ok':
                print('Crawl article: ' + api)
                offset = resp['next_offset']  # offset for the next request
                general_msg_list = resp['general_msg_list']
                msg_list = json.loads(general_msg_list)['list']  # article list
                for msg in msg_list:
                    comm_msg_info = msg['comm_msg_info']  # data shared by all articles in one push
                    msg_id = comm_msg_info['id']  # article id
                    post_time = datetime.fromtimestamp(comm_msg_info['datetime'])  # publish time
                    # msg_type = comm_msg_info['type']  # article type
                    # msg_data = json.dumps(comm_msg_info, ensure_ascii=False)  # raw msg data
                    app_msg_ext_info = msg.get('app_msg_ext_info')  # raw article data
                    if app_msg_ext_info:
                        # first article of this push
                        self._parse_articles(app_msg_ext_info, msg_id, post_time)
                        # remaining articles of this push
                        multi_app_msg_item_list = app_msg_ext_info.get('multi_app_msg_item_list')
                        if multi_app_msg_item_list:
                            for item in multi_app_msg_item_list:
                                msg_id = item['fileid']  # article id
                                if msg_id == 0:
                                    # synthesize a unique id for articles with id=0 to avoid unique-index conflicts
                                    msg_id = int(time.time() * 1000)
                                self._parse_articles(item, msg_id, post_time)
                print('next offset is %d' % offset)
            else:
                print('Before break, current offset is %d' % offset)
                break

    def _parse_articles(self, info, msg_id, post_time):
        """Parse the nested article data and save it to the database"""
        title = info.get('title')  # title
        cover = info.get('cover')  # cover image
        author = info.get('author')  # author
        digest = info.get('digest')  # summary
        source_url = info.get('source_url')  # original source url
        content_url = info.get('content_url')  # WeChat article url
        # ext_data = json.dumps(info, ensure_ascii=False)  # raw data
        self.postgres.handler(self._save_article(), (msg_id, title, author, cover, digest,
                                                     source_url, content_url, post_time,
                                                     datetime.now()))

    @staticmethod
    def _save_article():
        sql = 'insert into tb_article(msg_id,title,author,cover,digest,source_url,content_url,post_time,create_time) ' \
              'values(%s,%s,%s,%s,%s,%s,%s,%s,%s)'
        return sql


if __name__ == '__main__':
    biz = 'MzI2NDk5NzA0Mw=='  # "36Kr"
    pass_ticket = 'NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy'
    app_msg_token = '971_Z0lVNQBcGsWColSubRO9H13ZjrPhjuljyxLtiQ~~'
    cookie = 'wap_sid2=CO3YwOQHEogBQnN4VTNhNmxQWmc3UHI2U3kteWhUeVExZHFVMnN0QXlsbzVJRUJKc1pkdVFUU2Y5UzhSVEtOZmt1VVlYTkR4SEllQ2huejlTTThJWndMQzZfYUw2SldLVGVMQUthUjc3QWdVMUdoaGN0Nml2SU05cXR1dTN2RkhRUVd1V2Y3SFJ5d01BQUF+fjCB1pLcBTgNQJVO'
    # these values must be refreshed with a packet capture for every account and every run
    wxMps = WxMps(biz, pass_ticket, app_msg_token, cookie)
    wxMps.start()  # start crawling articles
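
The code above imports pgs from utils, the author's own small database helper, which the article never shows. A minimal psycopg2-based sketch of what such a Pgs class could look like; the class name, the handler method and its fetch flag are assumptions inferred purely from the call sites:

import psycopg2


class Pgs:
    """Hypothetical stand-in for the author's utils.pgs helper."""

    def __init__(self, host, port, db_name, user, password):
        self.conn = psycopg2.connect(host=host, port=port, dbname=db_name,
                                     user=user, password=password)

    def handler(self, sql, params, fetch=False):
        # run one parameterized statement; with fetch=True return the first
        # column of the first row (e.g. the id from a "returning id" clause)
        with self.conn.cursor() as cur:
            cur.execute(sql, params)
            row = cur.fetchone() if fetch else None
            self.conn.commit()
            return row[0] if row else None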

Analyzing the Article Comment API

The idea for fetching comments is much the same, just a bit more involved. First open an article that has comments on the phone, then look at the request Fiddler captured:

[Figure: official account article comments]

[Figure: captured request for the comment API]

Extract the URL and Cookie and analyze them again:

https://mp.weixin.qq.com/mp/appmsg_comment?action=getcomment&scene=0&__biz=MzI2NDk5NzA0Mw==&appmsgid=2247518723&idx=1&comment_id=433253969406607362&offset=0&limit=100&uin=777&key=777&pass_ticket=NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy&wxtoken=777&devicetype=android-26&clientversion=26070237&appmsg_token=971_dLK7htA1j8LbMUk8pvJKRlC_o218HEgwDbS9uARPOyQ34_vfXv3iDstqYnq2gAyze1dBKm4ZMTlKeyfx&x5=1&f=json

Cookie: pgv_pvid=2027337976; pgv_info=ssid=s3015512850; rewardsn=; wxuin=2089823341; devicetype=android-26; version=26070237; lang=zh_CN; pass_ticket=NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy; wap_sid2=CO3YwOQHEogBdENPSVdaS3pHOWc1V2QzY1NvZG9PYk1DMndPS3NfbGlHM0Vfal8zLU9kcUdkWTQxdUYwckFBT3RZM1VYUXFaWkFad3NVaWFXZ28zbEFIQ2pTa1lqZktfb01vcGdPLTQ0aGdJQ2xOSXoxTVFvNUg3SVpBMV9GRU1lbnotci1MWWl5d01BQUF+fjCj45PcBTgNQAE=; wxtokenkey=777

Next, the parameters:

  • __biz: as above
  • pass_ticket: as above
  • Cookie: as above
  • offset and limit: offset and number of items; since an article displays at most 100 comments, these two can be left as they are
  • comment_id: the marker id for fetching an article's comments; fixed, but it must be parsed out of the article page (HTML)
  • appmsgid: a ticket id; not fixed, so it must be parsed out of the article page (HTML) every time
  • appmsg_token: a ticket token; not fixed, so it must be parsed out of the article page (HTML) every time

So the last three parameters have to be extracted from the HTML (it honestly took a long time before I thought of inspecting the page source). The article list API already supplies the article address in the content_url field mentioned above, but the URL needs some cleanup before it is requested, otherwise those three parameters will be missing and the comments cannot be fetched:

def _parse_article_detail(self, content_url, article_id):
    """Extract the parameters needed for comments from the article page; article_id is the saved article id"""
    try:
        api = content_url.replace('amp;', '').replace('#wechat_redirect', '').replace('http', 'https')
        html = requests.get(api, headers=self.headers).text
    except requests.RequestException:
        print('Failed to fetch comments: ' + content_url)
    else:
        # group(0) is the whole matched line
        str_comment = re.search(r'var comment_id = "(.*)" \|\| "(.*)" \* 1;', html)
        str_msg = re.search(r"var appmsgid = '' \|\| '(.*)'\|\|", html)
        str_token = re.search(r'window.appmsg_token = "(.*)";', html)
        if str_comment and str_msg and str_token:
            comment_id = str_comment.group(1)  # comment id (fixed)
            app_msg_id = str_msg.group(1)  # ticket id (not fixed)
            appmsg_token = str_token.group(1)  # ticket token (not fixed)
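
As a quick sanity check, the three regular expressions can be exercised against a hand-made fragment that mimics the relevant lines of the article page source (all values below are made up):

import re

html = '''var comment_id = "433253969406607362" || "0" * 1;
var appmsgid = '' || '2247518723'||
window.appmsg_token = "971_dLK7htA1j8Lb...";'''

print(re.search(r'var comment_id = "(.*)" \|\| "(.*)" \* 1;', html).group(1))
print(re.search(r"var appmsgid = '' \|\| '(.*)'\|\|", html).group(1))
print(re.search(r'window.appmsg_token = "(.*)";', html).group(1))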

Back to the JSON this API returns: after analyzing its structure, define the comment table (SQL included):

[Figure: article comment data table]

-- ----------------------------
-- Table structure for tb_article_comment
-- ----------------------------
DROP TABLE IF EXISTS "public"."tb_article_comment";
CREATE TABLE "public"."tb_article_comment" (
    "id" serial4 PRIMARY KEY,
    "article_id" int4 NOT NULL,
    "comment_id" varchar(50) COLLATE "pg_catalog"."default",
    "nick_name" varchar(50) COLLATE "pg_catalog"."default" NOT NULL,
    "logo_url" varchar(300) COLLATE "pg_catalog"."default",
    "content_id" varchar(50) COLLATE "pg_catalog"."default" NOT NULL,
    "content" varchar(3000) COLLATE "pg_catalog"."default" NOT NULL,
    "like_num" int2,
    "comment_time" timestamp(6),
    "create_time" timestamp(6) NOT NULL
);
COMMENT ON COLUMN "public"."tb_article_comment"."id" IS 'auto-increment primary key';
COMMENT ON COLUMN "public"."tb_article_comment"."article_id" IS 'article foreign key id';
COMMENT ON COLUMN "public"."tb_article_comment"."comment_id" IS 'comment API id';
COMMENT ON COLUMN "public"."tb_article_comment"."nick_name" IS 'user nickname';
COMMENT ON COLUMN "public"."tb_article_comment"."logo_url" IS 'avatar url';
COMMENT ON COLUMN "public"."tb_article_comment"."content_id" IS 'comment id (unique)';
COMMENT ON COLUMN "public"."tb_article_comment"."content" IS 'comment content';
COMMENT ON COLUMN "public"."tb_article_comment"."like_num" IS 'like count';
COMMENT ON COLUMN "public"."tb_article_comment"."comment_time" IS 'comment time';
COMMENT ON COLUMN "public"."tb_article_comment"."create_time" IS 'insert time';
COMMENT ON TABLE "public"."tb_article_comment" IS 'official account article comment table';
-- ----------------------------
-- Indexes structure for table tb_article_comment
-- ----------------------------
CREATE UNIQUE INDEX "unique_content_id" ON "public"."tb_article_comment" USING btree (
    "content_id" COLLATE "pg_catalog"."default" "pg_catalog"."text_ops" ASC NULLS LAST
);

The long march is nearly over. Here is the last part of the code; since the article address has to be obtained first, it is combined with the article-crawling code above:

import json
import re
import time
from datetime import datetime

import requests

from utils import pgs  # the author's own PostgreSQL helper


class WxMps(object):
    """Crawler for WeChat Official Account articles and comments"""

    def __init__(self, _biz, _pass_ticket, _app_msg_token, _cookie, _offset=0):
        self.offset = _offset
        self.biz = _biz  # official account id
        self.msg_token = _app_msg_token  # ticket (not fixed)
        self.pass_ticket = _pass_ticket  # ticket (not fixed)
        self.headers = {
            'Cookie': _cookie,  # Cookie (not fixed)
            'User-Agent': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 '
        }
        wx_mps = 'wxmps'  # database name, user and password are all the same here (replace with real values)
        self.postgres = pgs.Pgs(host='localhost', port='5432', db_name=wx_mps, user=wx_mps, password=wx_mps)

    def start(self):
        """Request the article list API of the official account"""
        offset = self.offset
        while True:
            api = 'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={0}&f=json&offset={1}' \
                  '&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket={2}&wxtoken=&appmsg_token' \
                  '={3}&x5=1&f=json'.format(self.biz, offset, self.pass_ticket, self.msg_token)

            resp = requests.get(api, headers=self.headers).json()
            ret, status = resp.get('ret'), resp.get('errmsg')  # status info
            if ret == 0 or status == 'ok':
                print('Crawl article: ' + api)
                offset = resp['next_offset']  # offset for the next request
                general_msg_list = resp['general_msg_list']
                msg_list = json.loads(general_msg_list)['list']  # article list
                for msg in msg_list:
                    comm_msg_info = msg['comm_msg_info']  # data shared by all articles in one push
                    msg_id = comm_msg_info['id']  # article id
                    post_time = datetime.fromtimestamp(comm_msg_info['datetime'])  # publish time
                    # msg_type = comm_msg_info['type']  # article type
                    # msg_data = json.dumps(comm_msg_info, ensure_ascii=False)  # raw msg data

                    app_msg_ext_info = msg.get('app_msg_ext_info')  # raw article data
                    if app_msg_ext_info:
                        # first article of this push
                        self._parse_articles(app_msg_ext_info, msg_id, post_time)
                        # remaining articles of this push
                        multi_app_msg_item_list = app_msg_ext_info.get('multi_app_msg_item_list')
                        if multi_app_msg_item_list:
                            for item in multi_app_msg_item_list:
                                msg_id = item['fileid']  # article id
                                if msg_id == 0:
                                    # synthesize a unique id for articles with id=0 to avoid unique-index conflicts
                                    msg_id = int(time.time() * 1000)
                                self._parse_articles(item, msg_id, post_time)
                print('next offset is %d' % offset)
            else:
                print('Before break, current offset is %d' % offset)
                break

    def _parse_articles(self, info, msg_id, post_time):
        """Parse the nested article data and save it to the database"""
        title = info.get('title')  # title
        cover = info.get('cover')  # cover image
        author = info.get('author')  # author
        digest = info.get('digest')  # summary
        source_url = info.get('source_url')  # original source url
        content_url = info.get('content_url')  # WeChat article url
        # ext_data = json.dumps(info, ensure_ascii=False)  # raw data

        content_url = content_url.replace('amp;', '').replace('#wechat_redirect', '').replace('http', 'https')
        article_id = self.postgres.handler(self._save_article(), (msg_id, title, author, cover, digest,
                                                                  source_url, content_url, post_time,
                                                                  datetime.now()), fetch=True)
        if article_id:
            self._parse_article_detail(content_url, article_id)

    def _parse_article_detail(self, content_url, article_id):
        """Extract the parameters needed for comments from the article page; article_id is the saved article id"""
        try:
            html = requests.get(content_url, headers=self.headers).text
        except requests.RequestException:
            print('Failed to fetch comments: ' + content_url)
        else:
            # group(0) is the whole matched line
            str_comment = re.search(r'var comment_id = "(.*)" \|\| "(.*)" \* 1;', html)
            str_msg = re.search(r"var appmsgid = '' \|\| '(.*)'\|\|", html)
            str_token = re.search(r'window.appmsg_token = "(.*)";', html)

            if str_comment and str_msg and str_token:
                comment_id = str_comment.group(1)  # comment id (fixed)
                app_msg_id = str_msg.group(1)  # ticket id (not fixed)
                appmsg_token = str_token.group(1)  # ticket token (not fixed)

                # all three are required
                if appmsg_token and app_msg_id and comment_id:
                    print('Crawl article comments: ' + content_url)
                    self._crawl_comments(app_msg_id, comment_id, appmsg_token, article_id)

    def _crawl_comments(self, app_msg_id, comment_id, appmsg_token, article_id):
        """Crawl the comments of one article"""
        api = 'https://mp.weixin.qq.com/mp/appmsg_comment?action=getcomment&scene=0&__biz={0}' \
              '&appmsgid={1}&idx=1&comment_id={2}&offset=0&limit=100&uin=777&key=777' \
              '&pass_ticket={3}&wxtoken=777&devicetype=android-26&clientversion=26060739' \
              '&appmsg_token={4}&x5=1&f=json'.format(self.biz, app_msg_id, comment_id,
                                                     self.pass_ticket, appmsg_token)
        resp = requests.get(api, headers=self.headers).json()
        ret, status = resp['base_resp']['ret'], resp['base_resp']['errmsg']
        if ret == 0 or status == 'ok':
            elected_comment = resp.get('elected_comment', [])  # may be absent when there are no featured comments
            for comment in elected_comment:
                nick_name = comment.get('nick_name')  # nickname
                logo_url = comment.get('logo_url')  # avatar
                comment_time = datetime.fromtimestamp(comment.get('create_time'))  # comment time
                content = comment.get('content')  # comment content
                content_id = comment.get('content_id')  # id
                like_num = comment.get('like_num')  # like count
                # reply_list = comment.get('reply')['reply_list']  # reply data

                self.postgres.handler(self._save_article_comment(), (article_id, comment_id, nick_name, logo_url,
                                                                     content_id, content, like_num, comment_time,
                                                                     datetime.now()))

    @staticmethod
    def _save_article():
        sql = 'insert into tb_article(msg_id,title,author,cover,digest,source_url,content_url,post_time,create_time) ' \
              'values(%s,%s,%s,%s,%s,%s,%s,%s,%s) returning id'
        return sql

    @staticmethod
    def _save_article_comment():
        sql = 'insert into tb_article_comment(article_id,comment_id,nick_name,logo_url,content_id,content,like_num,' \
              'comment_time,create_time) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)'
        return sql


if __name__ == '__main__':
    biz = 'MzI2NDk5NzA0Mw=='  # "36Kr"
    pass_ticket = 'NDndxxaZ7p6Z9PYulWpLqMbI0i3ULFeCPIHBFu1sf5pX2IhkGfyxZ6b9JieSYRUy'
    app_msg_token = '971_Z0lVNQBcGsWColSubRO9H13ZjrPhjuljyxLtiQ~~'
    cookie = 'wap_sid2=CO3YwOQHEogBQnN4VTNhNmxQWmc3UHI2U3kteWhUeVExZHFVMnN0QXlsbzVJRUJKc1pkdVFUU2Y5UzhSVEtOZmt1VVlYTkR4SEllQ2huejlTTThJWndMQzZfYUw2SldLVGVMQUthUjc3QWdVMUdoaGN0Nml2SU05cXR1dTN2RkhRUVd1V2Y3SFJ5d01BQUF+fjCB1pLcBTgNQJVO'
    # these values must be refreshed with a packet capture for every account and every run
    wxMps = WxMps(biz, pass_ticket, app_msg_token, cookie)
    wxMps.start()  # start crawling articles and comments

Closing Notes

To wrap up, here is what the data looks like in the database. Crawling single-threaded is slow, and I had no real need for this data, so this was just a casual trial:

[Figure: a sample of the crawled data]

Writing crawlers can be painstaking work. If it all feels like too much trouble, the WechatSogou toolkit is worth a look. Questions are welcome in the comments below.

Full code: GitHub

That is all for this article. I hope it helps with your study, and thank you for supporting the site.

Original article: https://www.jianshu.com/p/b5b01ded8f98
