【Python Text Recognition (OCR)】

Problem

Implement text recognition (OCR) in Python.

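Method

Use Tesseract-OCR. It does not depend on any deep learning framework, so it works the same whether your environment uses PyTorch or TensorFlow.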

Steps

1. Download the tesseract-ocr exe installer

File name: tesseract-ocr-w64-setup-v4.1.0.20190314.exe (choose the 32-bit or 64-bit installer to match your system)
Link: https://digi.bib.uni-mannheim.de/tesseract/

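2. Double-click the tesseract-ocr.exe file to install it

During installation, on the "Select components" screen, expand the add language entry (the small plus sign on the last row) and check all four Chinese language packs (the entries beginning with "Chinese"), then keep clicking through to finish the installation.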
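
Once the installer finishes, one way to confirm that the Chinese packs were really installed is to look for their .traineddata files in the tessdata folder. The following is a minimal sketch and not part of the original steps: the helper names installed_languages and missing_chinese_packs are hypothetical, and the pack names chi_sim, chi_sim_vert, chi_tra and chi_tra_vert are an assumption about what the four installer entries are called on disk.

```python
from pathlib import Path

# Assumed names of the four Chinese packs; check them against your own tessdata folder.
EXPECTED_CHINESE_PACKS = {"chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"}


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_chinese_packs(tessdata_dir):
    """Return the expected Chinese packs that are not present in tessdata_dir."""
    return sorted(EXPECTED_CHINESE_PACKS - set(installed_languages(tessdata_dir)))
```

Call missing_chinese_packs with the tessdata folder inside your Tesseract-OCR installation directory. An empty list means all four expected packs were found.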

3. Install the required packages

Install these two packages in your Python environment:
pip install pytesseract
pip install Pillow
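
To check that both packages ended up in the Python environment you intend to use, a small sketch like the one below may help. It only uses the standard library. The helper name missing_packages is hypothetical. Note that Pillow is installed under the name Pillow but imported as PIL.

```python
import importlib.util


def missing_packages(import_names):
    """Return the names from import_names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Pillow is installed as "Pillow" but imported as "PIL".
missing = missing_packages(["pytesseract", "PIL"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("pytesseract and Pillow are both importable")
```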

4. Recognition code

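import pytesseract as pt
from PIL import Image

# Recognize Chinese text
# Path to the tesseract.exe installed in step 2 (replace with the full path on your machine)
path = r'~/Tesseract-OCR/tesseract.exe'
pt.pytesseract.tesseract_cmd = path
img = Image.open('9999.png')
# lang='chi_sim' selects the Simplified Chinese language pack
text = pt.image_to_string(img, lang='chi_sim').strip()
print(text)

# ===================================================
# Recognize English text
path = r'~/Tesseract-OCR/tesseract.exe'
pt.pytesseract.tesseract_cmd = path
# Replace '.png' with the name of your image file
img = Image.open('.png')
# With no lang argument, Tesseract's default language (English) is used
text = pt.image_to_string(img)
print(text)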
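
pytesseract hands tesseract_cmd to the operating system unchanged, so a leading "~" is not expanded and a wrong path only shows up as an error on the first recognition call. The sketch below is one possible way to validate the path up front. The helper name resolve_tesseract_cmd is hypothetical, and falling back to a tesseract found on PATH is an assumption about what you would want.

```python
import os
import shutil


def resolve_tesseract_cmd(configured_path):
    """Return a usable path to the tesseract executable, or None if none is found."""
    # Expand a leading "~" to the user's home directory; other paths are left unchanged.
    expanded = os.path.expanduser(configured_path)
    if os.path.isfile(expanded):
        return expanded
    # Fall back to whatever "tesseract" is on PATH (None when it is not there).
    return shutil.which("tesseract")


cmd = resolve_tesseract_cmd(r'~/Tesseract-OCR/tesseract.exe')
if cmd is None:
    print("tesseract executable not found; check the installation path")
else:
    print("Using tesseract at:", cmd)
    # pt.pytesseract.tesseract_cmd = cmd  # then run the recognition code as before
```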

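Notes:

If Chinese recognition raises an error, try replacing chi_sim.traineddata (the Chinese language pack) in Tesseract-OCR/tessdata.
Tesseract is sensitive to image resolution: the more pixels the text occupies in the image, the higher the recognition accuracy tends to be.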
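
When only a small, low-resolution image is available, enlarging it before recognition sometimes helps. The following is a rough sketch using Pillow, not a guaranteed fix: the helper names upscaled_size and upscale_for_ocr are hypothetical, and the scale factor of 2, the grayscale conversion and the LANCZOS filter are illustrative choices to tune for your own images.

```python
def upscaled_size(size, factor=2):
    """Return (width, height) multiplied by factor and rounded to whole pixels."""
    width, height = size
    return (round(width * factor), round(height * factor))


def upscale_for_ocr(image_path, factor=2):
    """Open an image, convert it to grayscale and enlarge it before OCR."""
    # Imported here so that upscaled_size can be used without Pillow installed.
    from PIL import Image

    img = Image.open(image_path).convert("L")
    return img.resize(upscaled_size(img.size, factor), Image.LANCZOS)
```

The returned image can be passed to the recognition call in place of the original, for example pt.image_to_string(upscale_for_ocr('9999.png'), lang='chi_sim').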