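I recently needed to build a crawler and wanted to start with a simple simulated login. I weighed two technologies, jxbrowser and htmlunit. jxbrowser renders a real UI, but getting at the page state after certain JS redirects is cumbersome. So I turned to htmlunit and figured I would practice on our own CSDN login. Little did I know that the JS on the CSDN login page takes extremely long to load: unless you allow a generous wait, clicking the submit button does nothing at all, because the JS has not taken effect yet. See the code comments for details. My advice to fellow crawler writers: never practice on the CSDN login, it cost me a lot of time.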
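The waiting problem described above is usually handled either with one long fixed wait (which is what the code in this post does) or by polling until the page is actually ready. Below is a minimal sketch of the polling approach in plain Java. `WaitUtil` and `waitUntil` are hypothetical names, not part of HtmlUnit. In a real crawler the condition would be something like "the login form is present on the page", and the pause between checks could be a short background-JS wait instead of `Thread.sleep`. Here a timer stands in for the page so the example runs on its own.

```java
import java.util.function.BooleanSupplier;

/** Polls a condition until it holds or a deadline passes (hypothetical helper, not an HtmlUnit class). */
final class WaitUtil {

    private WaitUtil() {}

    /**
     * Checks the condition at least once, then every pollMillis until timeoutMillis has elapsed.
     *
     * @return true if the condition became true in time, false on timeout or interruption
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the login form has been rendered": true after roughly 300 ms.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 5_000, 50);
        System.out.println("ready = " + ready);
    }
}
```

The benefit over a fixed 30-second wait is that the crawler continues as soon as the page is usable, while the timeout still bounds the worst case.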
The Maven configuration is as follows:
<dependencies>
    <!-- https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit -->
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.18</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.9.2</version>
    </dependency>
</dependencies>
The code is as follows:
package com.test;

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButtonInput;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlPasswordInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;
import com.gargoylesoftware.htmlunit.util.Cookie;

public class SimulateLogin {

    // Target URL to visit (CSDN)
    private static final String TARGET_URL = "https://passport.csdn.net/account/login?from=http://www.csdn.net";

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        // Simulate a browser
        WebClient webClient = new WebClient(BrowserVersion.CHROME);

        // Configure the WebClient
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        // Resynchronize AJAX calls
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        // Enable JS
        webClient.getOptions().setJavaScriptEnabled(true);
        // Disable CSS rendering
        webClient.getOptions().setCssEnabled(false);
        // Timeout in milliseconds
        webClient.getOptions().setTimeout(50000);
        // Do not throw exceptions on JS errors
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        // Allow redirects
        webClient.getOptions().setRedirectEnabled(true);
        // Allow cookies
        webClient.getCookieManager().setCookiesEnabled(true);

        // Open the target URL in the simulated browser
        HtmlPage page = webClient.getPage(TARGET_URL);
        // Wait for the JS to finish loading. CSDN is a real trap here: its JS takes
        // extremely long to load! To those who come after me: do not practice
        // simulated login on CSDN!
        webClient.waitForBackgroundJavaScript(10000 * 3);

        // Get the login form by its id; it can also be fetched by index: page.getForms().get(0)
        HtmlForm form = (HtmlForm) page.getElementById("fm1");
        HtmlTextInput username = (HtmlTextInput) form.getInputByName("username");
        HtmlPasswordInput password = (HtmlPasswordInput) form.getInputByName("password");
        username.setValueAttribute("********"); // user name
        password.setValueAttribute("********"); // password
        HtmlButtonInput button = (HtmlButtonInput) page.getByXPath("//input[contains(@class, 'logging')]").get(0);
        // ScriptResult result = page.executeJavaScript("javascript:document.getElementsByClassName('logging')[0].click()");
        // HtmlPage retPage = (HtmlPage) result.getNewPage();
        HtmlPage retPage = button.click();
        // Wait for the JS to finish updating the DOM so we get the final page
        webClient.waitForBackgroundJavaScript(1000);

        // Print the URL of the page we landed on
        System.out.println(retPage.getUrl().toString());
        // Print the content of the page we landed on
        System.out.println(retPage.asXml());

        // Collect the cookies
        Set<Cookie> cookies = webClient.getCookieManager().getCookies();
        Map<String, String> responseCookies = new HashMap<String, String>();
        for (Cookie c : cookies) {
            responseCookies.put(c.getName(), c.getValue());
            System.out.println(c.getName() + ":" + c.getValue());
        }

        webClient.close();
        System.out.println("success!");
    }
}
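The listing above collects the login cookies into the `responseCookies` map but never uses them. One plausible next step, shown here only as a sketch, is to turn that map into a `Cookie` request header so later requests can reuse the session. `CookieHeader` and `build` are hypothetical names, and the cookie names and values in `main` are made up for illustration. The `name=value; name2=value2` layout is the standard format of the `Cookie` header.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds a Cookie request header value from name/value pairs (hypothetical helper). */
final class CookieHeader {

    private CookieHeader() {}

    /** Joins the entries as "name=value" pairs separated by "; ", in the map's iteration order. */
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up cookies standing in for the responseCookies map from the login code.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("SESSION", "abc123");
        cookies.put("UserName", "demo");
        System.out.println("Cookie: " + build(cookies));
        // prints: Cookie: SESSION=abc123; UserName=demo
    }
}
```

The resulting string can be set as the `Cookie` header on whatever HTTP client fetches the pages after login. Note that a `HashMap`, as used in the login code, does not guarantee iteration order, which is harmless for cookies but makes the output less predictable than with the `LinkedHashMap` used here.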
Also, CSDN's JS keeps throwing a pile of inexplicable errors. If you would rather not see them, add the following code before creating the WebClient:
// Set the log level so that JS exceptions from the page are not printed.
// Requires: import org.apache.commons.logging.LogFactory; import java.util.logging.Level;
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log",
        "org.apache.commons.logging.impl.NoOpLog");

java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit")
        .setLevel(Level.OFF);

java.util.logging.Logger.getLogger("org.apache.commons.httpclient")
        .setLevel(Level.OFF);
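One caveat with the snippet above: `java.util.logging` holds its loggers weakly, so a logger that nothing else references can be garbage-collected and recreated later with its level reset. A minimal sketch of a workaround is to keep strong references to the loggers you silence. `QuietLoggers` and `silence` are hypothetical names, and this only affects output that actually goes through `java.util.logging`. Whether HtmlUnit's messages do depends on which backend Commons Logging picks up on your classpath, which is why the original also installs `NoOpLog`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Turns off java.util.logging loggers and keeps them reachable (hypothetical helper). */
final class QuietLoggers {

    // Strong references: without them a silenced logger may be garbage-collected
    // and come back later with its default level.
    private static final List<Logger> PINNED = new ArrayList<>();

    private QuietLoggers() {}

    /** Sets each named logger to Level.OFF; child loggers without their own level inherit it. */
    static synchronized void silence(String... loggerNames) {
        for (String name : loggerNames) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(Level.OFF);
            PINNED.add(logger);
        }
    }

    public static void main(String[] args) {
        silence("com.gargoylesoftware.htmlunit", "org.apache.commons.httpclient");

        // A child logger inherits OFF from its silenced parent.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.javascript");
        child.warning("this message is not printed");
        System.out.println(child.isLoggable(Level.SEVERE)); // prints: false
    }
}
```

Call `silence(...)` once at startup, before the `WebClient` is created, in the same place where the original snippet goes.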
Original article: https://blog.csdn.net/moneyshi/article/details/78799949