百度貼吧大家都經常逛,去逛百度貼吧的時候,經常會看到樓主分享一些資源,要求留下郵箱,樓主才給發。
對于一個熱門的帖子,留下的郵箱數量是非常多的,樓主需要一個一個的去復制那些回復的郵箱,然后再粘貼發送郵件,不是被折磨死就是被累死。無聊至極寫了一個抓取百度貼吧郵箱數據的程序,需要的拿走。
程序實現了一鍵抓取帖子全部郵箱和分頁抓取郵箱兩個功能,界面懶得做了,效果如下:
老規矩,直接貼源碼
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
|
<?php $url2 = ""; $page = ""; if ($_GET[ 'url2' ] = = ""){ $url2 = "http://tieba.baidu.com/p/2314539885?pn=1" ; } else { $url2 = $_GET[ 'url2' ]; } if ($_GET[ 'page' ] = = ""){ $page = "1" ; } else { $page = $_GET[ 'page' ]; } ?> <form action = " " method=" get"> < input type = "hidden" value = "getAll" name = "type" / > <table> <tr> <td>帖子鏈接:< / td><td>< input type = "text" name = "url" value = "http://tieba.baidu.com/p/2314539885" style = "width:300px;" / >< / td> < / tr> <tr> <td>總頁數:< / td><td>< input type = "text" name = "page" style = "width:300px;" value = "<?php echo $page;?>" / >< / td> < / tr> <tr> <td colspan = 2 >< input type = "submit" value = "抓取全部郵箱數據" / >< / td> < / tr> < / table> < / form> <form action = " " method=" get"> < input type = "hidden" value = "getNow" name = "type" / > <table> <tr> <td>帖子鏈接:< / td><td>< input type = "text" name = "url2" value = "<?php echo $url2;?>" style = "width:300px;" / >< / td> < / tr> <tr> <td colspan = 2 >< input type = "submit" value = "抓取當前頁面郵箱數據" / >< / td> < / tr> < / table> < / form> <?php if ($_GET[ 'type' ]! = ""){ $counts = 0 ; if ($_GET[ 'type' ] = = "getAll" ){ $pages = $_GET[ 'page' ]; $url = $_GET[ 'url' ]; for ($i = 0 ;$i<$pages;$i + + ){ $ch2 = curl_init(); curl_setopt($ch2, CURLOPT_URL, $url); curl_setopt($ch2, CURLOPT_FOLLOWLOCATION, TRUE); curl_setopt($ch2, CURLOPT_SSL_VERIFYHOST, FALSE); curl_setopt($ch2, CURLOPT_SSL_VERIFYPEER, false); curl_setopt($ch2, CURLOPT_RETURNTRANSFER, TRUE); $texts = curl_exec($ch2); curl_close($ch2); $dat = getEmail($texts); for ($j = 0 ;$j<count($dat);$j + + ){ echo $dat[$j]. "<br />" ; $counts + + ; } } } else if ($_GET[ 'type' ] = = "getNow" ){ $url = $_GET[ 'url2' ]; $ch2 = curl_init(); curl_setopt($ch2, CURLOPT_URL, $url); curl_setopt($ch2, CURLOPT_FOLLOWLOCATION, TRUE); curl_setopt($ch2, CURLOPT_SSL_VERIFYHOST, FALSE); curl_setopt($ch2, CURLOPT_SSL_VERIFYPEER, false); curl_setopt($ch2, CURLOPT_RETURNTRANSFER, TRUE); $texts = curl_exec($ch2); curl_close($ch2); $dat = getEmail($texts); for ($i = 0 ;$i<count($dat);$i + + ){ echo $dat[$i]. "<br />" ; $counts + + ; } } echo '<h2>共采集到數據:' .$counts. '條</h2>' ; } function getEmail($ str ){ $pattern = "/([a-z0-9\-_\.]+@[a-z0-9]+\.[a-z0-9\-_\.]+)/" ; preg_match_all($pattern,$ str ,$emailArr); return $emailArr[ 0 ]; } ?> |