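I have always been curious about the web. I had long wanted to write a crawler but kept putting it off: it seemed like a lot of trouble, the kind of project where one small mistake costs hours of debugging.
Then I figured that since I had promised myself long ago, I should just build it: start simple, add features gradually, implement one whenever I have time, and keep refining the code along the way.
Below is a simple implementation that fetches a given web page and saves it to a file. There are several ways to do this, and I will add them here over time.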
Fetching with URLConnection
package html;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class Spider {

    public static void main(String[] args) {
        String filepath = "d:/124.html";
        String url_str = "http://www.hao123.com/";
        String charset = "utf-8";
        int sec_cont = 1000; // milliseconds per second

        URL url;
        try {
            url = new URL(url_str);
        } catch (MalformedURLException e) {
            // Without a valid URL there is nothing to fetch, so stop here
            // instead of hitting a NullPointerException further down.
            e.printStackTrace();
            return;
        }

        try {
            URLConnection url_con = url.openConnection();
            // A plain GET only reads the response, so setDoOutput(true) is not needed.
            url_con.setConnectTimeout(10 * sec_cont);
            url_con.setReadTimeout(10 * sec_cont);
            url_con.setRequestProperty("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
            // try-with-resources closes the response stream even if reading fails.
            try (InputStream htm_in = url_con.getInputStream()) {
                String htm_str = InputStream2String(htm_in, charset);
                saveHtml(filepath, htm_str);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method: saveHtml
     * Description: save a String to a file
     * @param filepath
     *            path of the file to write
     * @param str
     *            string to save
     */
    public static void saveHtml(String filepath, String str) {
        // Overwrite instead of append (second constructor argument is false),
        // so running the program twice does not duplicate the page in the file.
        try (OutputStreamWriter outs = new OutputStreamWriter(
                new FileOutputStream(filepath, false), "utf-8")) {
            outs.write(str);
        } catch (IOException e) {
            System.out.println("Error at save html...");
            e.printStackTrace();
        }
    }

    /**
     * Method: InputStream2String
     * Description: read an InputStream into a String
     * @param in_st
     *            input stream to convert
     * @param charset
     *            charset used to decode the bytes
     * @throws IOException
     *             if an error occurred while reading
     */
    public static String InputStream2String(InputStream in_st, String charset) throws IOException {
        BufferedReader buff = new BufferedReader(new InputStreamReader(in_st, charset));
        StringBuilder res = new StringBuilder();
        String line;
        while ((line = buff.readLine()) != null) {
            // readLine() strips the line terminator, so put one back.
            res.append(line).append('\n');
        }
        return res.toString();
    }
}
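The most troublesome part of the implementation was garbled Chinese text in the fetched page. It appears whenever the charset used to decode the response does not match the encoding the page was actually sent in.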
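One way to reduce the garbling is to read the charset the server declares in its Content-Type header instead of hard-coding "utf-8". The following is only a minimal sketch: the class name CharsetSniffer and the method charsetFromContentType are hypothetical names made up for illustration, and the code does not handle pages that declare their encoding only in an HTML meta tag.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper (not part of the original code): picks a charset
// from a Content-Type header value such as "text/html; charset=GBK".
class CharsetSniffer {

    /**
     * Returns the charset named in the header value, or fallback when the
     * header is missing, declares no charset, or names one the JVM lacks.
     */
    static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Some servers quote the value: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    if (!name.isEmpty() && Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: ignore it and fall through to the fallback.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1", "utf-8"));
        System.out.println(charsetFromContentType("text/html", "utf-8"));
    }
}
```

In the Spider class, the header value would come from url_con.getContentType(), and the result would replace the fixed charset passed to InputStream2String.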
Fetching with HttpClient
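I ran into quite a few problems when fetching pages with HttpClient. First, there are two HttpClients: one built in by Sun and one from an Apache open-source project. The Sun one does not seem to be widely used, so I did not implement it and went with the Apache project instead (from here on, HttpClient always means the Apache version). Second, the latest version differs from earlier ones: starting with HttpClient 4.x the packages to import changed (3.x classes live under org.apache.commons.httpclient, 4.x classes under org.apache.http), and much of the code found online is written for HttpClient 3.x. If you use the latest version, it is better to rely on the official documentation.
I use Eclipse, so the build path has to be configured to include the required jars.
First, download HttpClient from http://hc.apache.org/downloads.cgi. I used HttpClient 4.2.
Then unzip it and locate commons-codec-1.6.jar, commons-logging-1.1.1.jar, httpclient-4.2.5.jar and httpcore-4.2.4.jar in the /lib folder (the version numbers vary with the release you download; there are other jars as well, but I do not need them yet, so I import only the required ones).
Finally, add these jars to the classpath: right-click the project => Build Path => Configure Build Path => Add External JARs..., then select the jars listed above.
Another way is to copy the jars directly into a lib folder under the project folder (and add them to the build path from there).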
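To confirm that the jars really ended up on the classpath, a small check like the one below can be run inside the project. It is only a sketch: ClasspathCheck and isOnClasspath are hypothetical names, and the four class names are one representative class from each jar listed above.

```java
// Hypothetical helper (not part of the original code): reports whether
// the classes needed by the HttpClient example can be loaded.
class ClasspathCheck {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.http.client.HttpClient",      // httpclient jar
            "org.apache.http.HttpEntity",             // httpcore jar
            "org.apache.commons.logging.Log",         // commons-logging jar
            "org.apache.commons.codec.binary.Base64"  // commons-codec jar
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found   " : "MISSING ") + name);
        }
    }
}
```

If any line prints MISSING, the corresponding jar has not been added to the build path yet.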
Here is the implementation:
package html;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class SpiderHttpClient {

    public static void main(String[] args) throws Exception {
        String url_str = "http://www.hao123.com";
        String charset = "utf-8";
        String filepath = "d:/125.html";

        HttpClient hc = new DefaultHttpClient();
        try {
            HttpGet hg = new HttpGet(url_str);
            HttpResponse response = hc.execute(hg);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // Content length in bytes, or a negative number if unknown.
                System.out.println(entity.getContentLength());
                // Closing the content stream releases the underlying connection.
                try (InputStream htm_in = entity.getContent()) {
                    String htm_str = InputStream2String(htm_in, charset);
                    saveHtml(filepath, htm_str);
                }
            }
        } finally {
            // Shut down the connection manager so no sockets are left open (HttpClient 4.2 API).
            hc.getConnectionManager().shutdown();
        }
    }

    /**
     * Method: saveHtml
     * Description: save a String to a file
     * @param filepath
     *            path of the file to write
     * @param str
     *            string to save
     */
    public static void saveHtml(String filepath, String str) {
        // Overwrite instead of append (second constructor argument is false),
        // so running the program twice does not duplicate the page in the file.
        try (OutputStreamWriter outs = new OutputStreamWriter(
                new FileOutputStream(filepath, false), "utf-8")) {
            outs.write(str);
        } catch (IOException e) {
            System.out.println("Error at save html...");
            e.printStackTrace();
        }
    }

    /**
     * Method: InputStream2String
     * Description: read an InputStream into a String
     * @param in_st
     *            input stream to convert
     * @param charset
     *            charset used to decode the bytes
     * @throws IOException
     *             if an error occurred while reading
     */
    public static String InputStream2String(InputStream in_st, String charset) throws IOException {
        BufferedReader buff = new BufferedReader(new InputStreamReader(in_st, charset));
        StringBuilder res = new StringBuilder();
        String line;
        while ((line = buff.readLine()) != null) {
            // readLine() strips the line terminator, so put one back.
            res.append(line).append('\n');
        }
        return res.toString();
    }
}
Original link: http://www.cnblogs.com/ywl925/archive/2013/08/20/3270875.html