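Python CAPTCHA recognition for web crawlers
Preface:
Binarization, basic denoising, 8-neighborhood denoising
tesseract, tesserocr, PIL
Reference -- code repository: https://github.com/liguobao/python-verify-code-ocr
1. Batch-download the CAPTCHA images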
```python
import shutil

import requests
from loguru import logger

for i in range(100):
    url = 'http://xxxx/create/validate/image'
    # stream=True lets us copy the raw response body straight to disk
    response = requests.get(url, stream=True)
    with open(f'./imgs/{i}.png', 'wb') as out_file:
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, out_file)
        logger.info(f"download {i}.png successfully.")
    del response
```
2. Run OCR directly and see what happens
```python
from PIL import Image
import tesserocr

img = Image.open("./imgs/98.png")
img.show()
img_l = img.convert("L")  # grayscale image
img_l.show()
verify_code1 = tesserocr.image_to_text(img)
verify_code2 = tesserocr.image_to_text(img_l)
print(f"verify_code1:{verify_code1}")
print(f"verify_code2:{verify_code2}")
```
Unsurprisingly, neither the original image nor the grayscale image yields any recognized text.
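Binarization, listed in the preface, is not shown in this post, so here is a minimal sketch under stated assumptions: the grayscale image is represented as a plain nested list of 0-255 values, the function name `binarize` is hypothetical, and the threshold of 230 is simply borrowed from the denoising code in the next section rather than tuned for these images.

```python
def binarize(gray_rows, threshold=230):
    """Map each grayscale value to 0 (ink) or 255 (background)."""
    return [
        [0 if value < threshold else 255 for value in row]
        for row in gray_rows
    ]


# Tiny hand-made "image": two rows of grayscale values.
sample = [
    [255, 240, 100],
    [229, 230, 0],
]
print(binarize(sample))  # → [[255, 255, 0], [0, 255, 0]]
```

With Pillow, `gray_img.point(lambda v: 0 if v < 230 else 255)` on an `L`-mode image should give the equivalent result.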
3. Denoising and removing interference
Python CAPTCHA image denoising - 8-neighborhood denoising
```python
from PIL import Image

# https://www.cnblogs.com/jhao/p/10345853.html  Python CAPTCHA image denoising - 8-neighborhood denoising


def noise_remove_pil(image_name, k):
    """
    8-neighborhood denoising
    Args:
        image_name: image file name
        k: decision threshold

    Returns:
        the denoised grayscale image
    """

    def calculate_noise_count(img_obj, w, h):
        """
        Count the non-white pixels in the neighborhood
        Args:
            img_obj: img obj
            w: width
            h: height
        Returns:
            count (int)
        """
        count = 0
        width, height = img_obj.size
        for _w_ in [w - 1, w, w + 1]:
            for _h_ in [h - 1, h, h + 1]:
                if _w_ > width - 1:
                    continue
                if _h_ > height - 1:
                    continue
                if _w_ == w and _h_ == h:
                    continue
                if img_obj.getpixel((_w_, _h_)) < 230:  # grayscale image: values below 230 count as non-white
                    count += 1
        return count

    img = Image.open(image_name)
    # convert to grayscale
    gray_img = img.convert('L')

    w, h = gray_img.size
    for _w in range(w):
        for _h in range(h):
            if _w == 0 or _h == 0:
                gray_img.putpixel((_w, _h), 255)
                continue
            # count the non-white pixels in the neighborhood
            pixel = gray_img.getpixel((_w, _h))
            if pixel == 255:
                continue

            if calculate_noise_count(gray_img, _w, _h) < k:
                gray_img.putpixel((_w, _h), 255)
    return gray_img


if __name__ == '__main__':
    image = noise_remove_pil("./imgs/1.png", 4)
    image.show()
```
See the result in the image below:
This is close to good enough, but it can still be improved.
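To make the 8-neighborhood rule easier to check by hand, here is a small sketch on a plain nested list instead of a Pillow image. The names `count_dark_neighbors` and `denoise` are hypothetical, and it differs from the code above in two ways: it writes to a copy instead of modifying the image in place, and it does not force the first row and column to white.

```python
def count_dark_neighbors(grid, x, y, white_threshold=230):
    """Count the 8 neighbors of (x, y) that are darker than white_threshold."""
    height, width = len(grid), len(grid[0])
    count = 0
    for ny in (y - 1, y, y + 1):
        for nx in (x - 1, x, x + 1):
            if (nx, ny) == (x, y):
                continue  # skip the pixel itself
            in_bounds = 0 <= nx < width and 0 <= ny < height
            if in_bounds and grid[ny][nx] < white_threshold:
                count += 1
    return count


def denoise(grid, k):
    """Return a copy where non-white pixels with fewer than k dark neighbors turn white."""
    result = [row[:] for row in grid]
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if value == 255:
                continue
            if count_dark_neighbors(grid, x, y) < k:
                result[y][x] = 255
    return result


# A 3x2 block of ink plus one isolated speck at (x=5, y=3).
grid = [
    [255, 255, 255, 255, 255, 255],
    [255, 0, 0, 0, 255, 255],
    [255, 0, 0, 0, 255, 255],
    [255, 255, 255, 255, 255, 0],
    [255, 255, 255, 255, 255, 255],
]
cleaned = denoise(grid, 2)
print(cleaned[3][5])  # the speck has no dark neighbors, so it becomes 255
print(cleaned[1])     # the ink block survives: [255, 0, 0, 0, 255, 255]
```

A larger `k` removes more aggressively: thin strokes have few dark neighbors and start to disappear along with the noise.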
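A new idea for improvement:
The interference line here is a red line radiating from a single point.
So all I really need to do is wipe out every red pixel, and the line disappears with them.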
```python
from PIL import Image
import tesserocr

img = Image.open("./imgs/98.png")
img.show()

# try removing the red pixels
w, h = img.size
for _w in range(w):
    for _h in range(h):
        o_pixel = img.getpixel((_w, _h))
        if o_pixel == (255, 0, 0):
            img.putpixel((_w, _h), (255, 255, 255))
img.show()

img_l = img.convert("L")
# img_l.show()
verify_code1 = tesserocr.image_to_text(img)
verify_code2 = tesserocr.image_to_text(img_l)
print(f"verify_code1:{verify_code1}")
print(f"verify_code2:{verify_code2}")
```
That looks OK. A few scattered blue pixels remain, and they can be removed the same way.
OCR even produces a result directly now.
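Removing red and blue in one pass could look like the sketch below. It works on a nested list of RGB tuples so it can be run without an image file; `strip_colors` is a hypothetical helper, and the exact-match comparison assumes the noise is pure `(255, 0, 0)` and `(0, 0, 255)`, as in this CAPTCHA.

```python
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
RED = (255, 0, 0)
BLUE = (0, 0, 255)


def strip_colors(pixels, colors, fill=WHITE):
    """Return a copy of an RGB pixel grid with every pixel in `colors` replaced by `fill`."""
    targets = set(colors)
    return [[fill if pixel in targets else pixel for pixel in row] for row in pixels]


pixels = [[BLACK, RED, BLUE, WHITE]]
print(strip_colors(pixels, [RED, BLUE]))
# → [[(0, 0, 0), (255, 255, 255), (255, 255, 255), (255, 255, 255)]]
```

Two caveats if this is ported back to Pillow: an `RGBA` image returns 4-tuples from `getpixel`, so it would need `convert("RGB")` first, and anti-aliased lines would need a tolerance rather than an exact match.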
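Done, cue the confetti.
Later, however, I found that some red line segments and blue dots overlap the CAPTCHA characters.
In that case, filling them with white tends to cut the letters apart, which makes recognition worse.
So when the current pixel is red or blue, check whether at least two of the surrounding pixels are black.
If so, fill it with black.
If not, fill it with white.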
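The rule above can be sketched as a single decision function. This is an illustration on a nested list of RGB tuples, with the hypothetical name `repaired_color`; it looks at the four direct neighbors (up, down, left, right), and unlike a version that returns zero for every border pixel, it simply ignores neighbors that fall outside the image.

```python
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)
RED = (255, 0, 0)
BLUE = (0, 0, 255)


def repaired_color(pixels, x, y):
    """Decide the replacement color for a red or blue pixel at (x, y)."""
    height, width = len(pixels), len(pixels[0])
    neighbors = [(x, y - 1), (x, y + 1), (x - 1, y), (x + 1, y)]
    black_count = sum(
        1
        for nx, ny in neighbors
        if 0 <= nx < width and 0 <= ny < height and pixels[ny][nx] == BLACK
    )
    # At least two black neighbors: the pixel probably sits on a character stroke.
    return BLACK if black_count >= 2 else WHITE


pixels = [
    [WHITE, BLACK, WHITE],
    [BLACK, RED, BLACK],
    [WHITE, WHITE, BLUE],
]
print(repaired_color(pixels, 1, 1))  # red pixel inside a stroke → (0, 0, 0)
print(repaired_color(pixels, 2, 2))  # blue dot with one black neighbor → (255, 255, 255)
```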
Final complete code:
```python
from PIL import Image
import tesserocr
from loguru import logger


class VerfyCodeOCR():
    def __init__(self) -> None:
        pass

    def ocr(self, img):
        """ CAPTCHA OCR

        Args:
            img (img): image object or image path

        Returns:
            [string]: recognition result
        """
        img_obj = Image.open(img) if isinstance(img, str) else img
        self._remove_pil(img_obj)
        verify_code = tesserocr.image_to_text(img_obj)
        return verify_code.replace("\n", "").strip()

    def _get_p_black_count(self, img: Image.Image, _w: int, _h: int):
        """ Count the black pixels among the pixels around the current position

        Args:
            img (img): image object
            _w (int): w coordinate
            _h (int): h coordinate

        Returns:
            int: count
        """
        w, h = img.size
        p_round_items = []
        # on the image border, a neighbor would fall outside the image
        if _w == 0 or _w == w - 1 or 0 == _h or _h == h - 1:
            return 0
        # the four direct neighbors: up, down, left, right
        p_round_items = [img.getpixel((_w, _h - 1)), img.getpixel((_w, _h + 1)),
                         img.getpixel((_w - 1, _h)), img.getpixel((_w + 1, _h))]
        p_black_count = 0
        for p_item in p_round_items:
            if p_item == (0, 0, 0):
                p_black_count = p_black_count + 1
        return p_black_count

    def _remove_pil(self, img: Image.Image):
        """Remove the lines and noise dots that interfere with recognition

        Args:
            img (img): image object

        Returns:
            [img]: the cleaned image object
        """
        w, h = img.size
        for _w in range(w):
            for _h in range(h):
                o_pixel = img.getpixel((_w, _h))
                # the current pixel is red (line segment) or blue (noise dot)
                if o_pixel == (255, 0, 0) or o_pixel == (0, 0, 255):
                    # if at least 2 surrounding pixels are black, fill the current pixel with black;
                    # otherwise cover it with white
                    p_black_count = self._get_p_black_count(img, _w, _h)
                    if p_black_count >= 2:
                        img.putpixel((_w, _h), (0, 0, 0))
                    else:
                        img.putpixel((_w, _h), (255, 255, 255))

        logger.info("_remove_pil finish.")
        # img.show()
        return img


if __name__ == '__main__':
    verfy_code_ocr = VerfyCodeOCR()
    img_path = "./imgs/51.png"
    img = Image.open(img_path)
    img.show()
    ocr_result = verfy_code_ocr.ocr(img)
    img.show()
    logger.info(ocr_result)
```
Original article: https://www.cnblogs.com/liguobao/p/15111849.html