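This post records the installation and configuration needed to recognize text in images with Python.
Tesseract-OCR is an open-source OCR engine whose development has been sponsored by Google.
Download: https://github.com/tesseract-ocr/tesseract/wiki/Downloads
After downloading and installing it, add the Tesseract-OCR install directory to the system PATH.
During installation, make sure to tick Simplified Chinese and otherwise keep the defaults. When installation finishes, run these commands to check that it works and to see which languages are supported: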
tesseract
tesseract -v
tesseract --list-langs  # list the languages Tesseract-OCR supports
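The same check can be scripted from Python. The sketch below is only an illustration: the helper names `parse_langs` and `list_tesseract_langs` are made up for this post, and the parsing assumes that `tesseract --list-langs` prints one header line starting with "List of available languages" followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    Assumption: one header line starting with "List of available languages",
    then one language code per line.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


def list_tesseract_langs():
    """Return installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Both streams are scanned in case a release writes the list to stderr;
    # any warning lines printed there would show up as bogus entries.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    langs = list_tesseract_langs()
    if langs is None:
        print("tesseract was not found on PATH")
    else:
        print("chi_sim installed:", "chi_sim" in langs)
```

If this reports that tesseract was not found, the PATH step above most likely did not take effect yet; opening a new terminal usually picks up the change.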
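Chinese language data: chi_sim.traineddata
Download: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
Put the Chinese language data file in the \Tesseract-OCR\tessdata folder.
Edit this file:
C:\Python3\Lib\site-packages\pytesseract\pytesseract.py (adjust to your actual path), and find these two lines: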
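A quick way to confirm the file landed in the right place is to test for it from Python. This is a minimal sketch; `has_traineddata` is a hypothetical helper, and the folder in the example is an assumed install location that should be replaced with your own.

```python
from pathlib import Path


def has_traineddata(tessdata_dir, lang="chi_sim"):
    """Return True if <lang>.traineddata exists in the given tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    # Assumed install location; replace with your own Tesseract-OCR path.
    tessdata = r"C:\Program Files (x86)\Tesseract-OCR\tessdata"
    print("chi_sim present:", has_traineddata(tessdata))
```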
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'tesseract'
Change them to:
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
#tesseract_cmd = 'tesseract'
tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
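An edit inside site-packages is lost whenever pytesseract is reinstalled or upgraded. A commonly used alternative is to leave the library file alone and set `pytesseract.pytesseract.tesseract_cmd` from your own script. The sketch below assumes that approach; `resolve_tesseract_cmd` and `configure_pytesseract` are hypothetical helper names, not part of pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(fallback):
    """Prefer a tesseract found on PATH; otherwise use the fallback if it exists."""
    found = shutil.which("tesseract")
    if found:
        return found
    if os.path.isfile(fallback):
        return fallback
    return None


def configure_pytesseract(cmd):
    """Point pytesseract at the given executable for the current process only."""
    import pytesseract  # imported here so resolve_tesseract_cmd works without it

    pytesseract.pytesseract.tesseract_cmd = cmd
```

A script would call `resolve_tesseract_cmd('C:/Program Files (x86)/Tesseract-OCR/tesseract.exe')` once at startup (adjusting the path to the actual installation) and pass a non-None result to `configure_pytesseract` before running any OCR.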
Code:
(Write a few characters, take a screenshot, and save it as 1.png.)
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('1.png'), lang='chi_sim')
print(text)
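When the script is reused on other images, it can help to fail with a clear message if the image or the tesseract executable is missing. The wrapper below is one possible sketch: `ocr_image` is a hypothetical name, and it relies on pytesseract raising `TesseractNotFoundError` when the executable cannot be started.

```python
import os


def ocr_image(path, lang="chi_sim"):
    """Return the text recognized in the image at `path`."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"image not found: {path}")

    # Imported after the path check so a missing image is reported first.
    import pytesseract
    from PIL import Image

    try:
        with Image.open(path) as img:
            return pytesseract.image_to_string(img, lang=lang)
    except pytesseract.TesseractNotFoundError:
        raise RuntimeError(
            "tesseract executable not found; check PATH or tesseract_cmd"
        ) from None
```

Calling `print(ocr_image('1.png'))` then behaves like the script above for the screenshot saved earlier.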