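Summary: This article walks through building an invoice recognition system in Python, covering OCR engine selection, image preprocessing, text parsing, and a complete source implementation, so that developers can stand up a working recognition tool quickly.
Invoice recognition is a key step in finance, tax, and enterprise automation workflows. Manual data entry is slow and error-prone, whereas OCR (optical character recognition) driven from Python can automate the job. With its rich library ecosystem (OpenCV, Pytesseract, EasyOCR) and concise syntax, Python is a natural choice for building such a system.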
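```bash
pip install opencv-python pytesseract easyocr paddleocr numpy pillow

# Tesseract itself must be installed separately
# Windows: https://github.com/UB-Mannheim/tesseract/wiki
# Linux:   sudo apt install tesseract-ocr
# Simplified Chinese language data (needed for lang='chi_sim') on Debian/Ubuntu:
#          sudo apt install tesseract-ocr-chi-sim
```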
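```python
import cv2
import numpy as np


def preprocess_image(image_path):
    # Read the image; cv2.imread returns None instead of raising on failure
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Denoise with a Gaussian blur
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Binarize with an adaptive threshold
    binary = cv2.adaptiveThreshold(
        blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2,
    )

    # Perspective correction (example: assumes the four corners are already detected)
    # pts = np.float32([[x1, y1], [x2, y2], [x3, y3], [x4, y4]])
    # dst = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
    # M = cv2.getPerspectiveTransform(pts, dst)
    # warped = cv2.warpPerspective(binary, M, (width, height))

    return binary
```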
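The commented-out perspective correction assumes the four corner points are already known and listed in the same order as the destination rectangle. The sketch below illustrates that preparatory step only. `order_corners` and `target_size` are hypothetical helper names (not OpenCV functions), corner detection itself is out of scope, and the ordering rule assumes a roughly upright document.

```python
import math


def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left.

    Assumes image coordinates (y grows downward) and a roughly upright
    quadrilateral; a heavily rotated page needs a different rule.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four corner points")
    top_left = min(points, key=lambda p: p[0] + p[1])      # smallest x + y
    bottom_right = max(points, key=lambda p: p[0] + p[1])  # largest x + y
    top_right = min(points, key=lambda p: p[1] - p[0])     # smallest y - x
    bottom_left = max(points, key=lambda p: p[1] - p[0])   # largest y - x
    return [top_left, top_right, bottom_right, bottom_left]


def target_size(ordered):
    """Pick the output size from the longer of each pair of opposite edges."""
    tl, tr, br, bl = ordered
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return int(round(width)), int(round(height))


# Corners in arbitrary order, as a contour detector might return them
corners = [(0, 100), (200, 0), (0, 0), (200, 100)]
ordered = order_corners(corners)
width, height = target_size(ordered)

# With OpenCV and NumPy available, these values would feed the warp:
# src = np.float32(ordered)
# dst = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
# M = cv2.getPerspectiveTransform(src, dst)
# warped = cv2.warpPerspective(img, M, (width, height))
```

Ordering the points first matters because `cv2.getPerspectiveTransform` pairs source and destination points by position; a mismatched order produces a mirrored or twisted result.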
```python
import pytesseract
from PIL import Image


def ocr_with_pytesseract(image_path):
    # Call Tesseract
    text = pytesseract.image_to_string(
        Image.open(image_path),
        lang='chi_sim+eng',  # Simplified Chinese + English
        config='--psm 6',    # assume a single uniform block of text
    )
    return text
```
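Raw OCR output for Chinese text often mixes full-width and half-width characters and can contain stray spaces between characters, which makes later pattern matching brittle. One possible cleanup pass, using only the standard library, is sketched below. `clean_ocr_text` is a hypothetical helper, and the CJK range used is an assumption that covers only the basic CJK Unified Ideographs block.

```python
import re
import unicodedata

_CJK = r"\u4e00-\u9fff"  # basic CJK Unified Ideographs (assumed sufficient here)


def clean_ocr_text(text):
    # NFKC folds full-width digits, letters and punctuation into ASCII forms
    text = unicodedata.normalize("NFKC", text)
    # Remove spaces or tabs that sit between two CJK characters
    text = re.sub(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])", "", text)
    # Trim every line and drop the empty ones
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = "发 票 号 码\uff1a\uff11\uff12\uff13"  # full-width colon and digits
cleaned = clean_ocr_text(sample)
```

After this pass a field pattern only has to deal with ASCII colons and digits, which keeps the regular expressions shorter.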
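```python
import re


def extract_invoice_fields(text):
    # Invoice number (sample regex; adjust to the actual layout).
    # The character class accepts both the ASCII and the full-width colon.
    no_match = re.search(r'发票号码[::]?\s*(\w+)', text)
    invoice_no = no_match.group(1) if no_match else None

    # Issue date
    date_pattern = r'开票日期[::]?\s*(\d{4}[-/\s]?\d{1,2}[-/\s]?\d{1,2})'
    date_match = re.search(date_pattern, text)
    date = date_match.group(1) if date_match else None

    # Amount (tax included); the pattern requires a leading digit so that
    # a lone "," or "." can never reach float()
    amount_match = re.search(r'金额[::]?\s*(\d[\d,]*(?:\.\d+)?)', text)
    amount = float(amount_match.group(1).replace(',', '')) if amount_match else None

    return {
        'invoice_no': invoice_no,
        'date': date,
        'amount': amount,
    }
```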
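The date pattern above only accepts `-`, `/`, or whitespace as separators, and the amount is parsed as a `float`. As a hedged sketch of one way to harden the post-processing, the helpers below turn a captured string into a `datetime.date` (also accepting the 年/月/日 form) and an exact `Decimal`. `normalize_date` and `parse_amount` are hypothetical names, and the accepted formats are assumptions about what an invoice might print.

```python
import re
from datetime import date
from decimal import Decimal

# Year, month, day separated by up to two non-digit characters each
_DATE_RE = re.compile(r"(\d{4})\D{0,2}(\d{1,2})\D{0,2}(\d{1,2})")


def normalize_date(raw):
    """Return a date for strings like '2023年05月01日' or '2023-5-1', else None."""
    match = _DATE_RE.search(raw or "")
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 produced by an OCR misread
        return None


def parse_amount(raw):
    """Return a Decimal for strings like '¥1,234.50', else None."""
    # Strip yen/yuan signs (half- and full-width), thousands separators, spaces
    cleaned = re.sub(r"[\u00a5\uffe5,\s]", "", raw or "")
    if not re.fullmatch(r"-?\d+(?:\.\d+)?", cleaned):
        return None
    return Decimal(cleaned)


issued = normalize_date("2023年05月01日")
total = parse_amount("\u00a51,234.50")
```

`Decimal` is used here because monetary values compared against an expense claim should match exactly, and binary floats cannot represent most decimal fractions.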
```python
def process_invoice(image_path):
    # 1. Preprocess
    processed_img = preprocess_image(image_path)
    cv2.imwrite('temp_processed.png', processed_img)  # save the intermediate result

    # 2. OCR
    text = ocr_with_pytesseract('temp_processed.png')

    # 3. Field extraction
    fields = extract_invoice_fields(text)
    return fields


# Example call
if __name__ == "__main__":
    result = process_invoice('invoice_sample.jpg')
    print("Recognition result:", result)
```
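Use `concurrent.futures` to process multiple invoices in parallel.

The pipeline can also be served over HTTP with FastAPI:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class InvoiceRequest(BaseModel):
    image_path: str


@app.post("/recognize")
def recognize_invoice(request: InvoiceRequest):
    # Declared with plain `def` so FastAPI runs it in a worker thread and the
    # blocking OCR call does not stall the event loop.
    # process_invoice is the pipeline function defined earlier.
    result = process_invoice(request.image_path)
    return {"data": result}
```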
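The `concurrent.futures` suggestion for batches of invoices can be sketched as follows. `process_batch` is a hypothetical wrapper that takes the per-invoice function as a parameter. A thread pool is assumed to be adequate because pytesseract runs the Tesseract executable as a subprocess, so threads spend most of their time waiting. One caveat: `process_invoice` as written saves its intermediate image under a fixed file name, so it would need a unique temporary file per call before being run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, worker, max_workers=4):
    """Run worker(path) for every path in parallel and collect the outcomes.

    One failing invoice does not abort the batch; its error is recorded instead.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = {"ok": True, "data": future.result()}
            except Exception as exc:  # keep going; report the failure per file
                results[path] = {"ok": False, "error": str(exc)}
    return results


# Intended use with the pipeline from this article (not executed here):
# results = process_batch(["a.jpg", "b.jpg"], process_invoice)
```

Results are keyed by path because `as_completed` yields futures in completion order, not submission order.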
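One company uses the system to automatically recognize invoices submitted by employees and match them against the fields on expense claims, cutting manual review time by 80%.
After extracting the key fields from an invoice, the system automatically checks that the tax ID is valid and that the issue date falls within the tax filing period, which reduces compliance risk.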
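The two checks described here could look roughly like the sketch below. It rests on labeled assumptions: the tax ID is taken to be an 18-character unified social credit code whose last character is a mod-31 check character (the scheme commonly attributed to GB 32100-2015), older 15-digit taxpayer numbers are not handled, and the filing-period boundaries are supplied by the caller. `is_valid_uscc` and `in_filing_period` are hypothetical names.

```python
from datetime import date

# 31 allowed symbols; the letters I, O, S, V and Z are not used (assumption)
_USCC_CHARS = "0123456789ABCDEFGHJKLMNPQRTUWXY"


def is_valid_uscc(code):
    """Check length, alphabet and the mod-31 check character of an 18-char code."""
    code = code.strip().upper()
    if len(code) != 18 or any(ch not in _USCC_CHARS for ch in code):
        return False
    # Position i (0-based) of the first 17 characters has weight 3**i mod 31
    total = sum(
        _USCC_CHARS.index(ch) * pow(3, i, 31)
        for i, ch in enumerate(code[:17])
    )
    check = (31 - total % 31) % 31
    return _USCC_CHARS[check] == code[17]


def in_filing_period(invoice_date, period_start, period_end):
    """True when the invoice date lies inside the inclusive filing window."""
    return period_start <= invoice_date <= period_end


# Hypothetical filing window for illustration only
ok_date = in_filing_period(date(2023, 5, 1), date(2023, 4, 1), date(2023, 6, 30))
```

A checksum pass only shows that the code is well formed; confirming that it belongs to a registered taxpayer still requires a lookup against an authoritative source.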
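This article described a Python implementation of an invoice recognition system; the code for the whole pipeline, from preprocessing to field extraction, is open source. Deep learning models (such as Transformers) could further improve accuracy in complex scenarios. Developers can adjust the OCR engine and the post-processing rules to their own needs and quickly build a recognition tool that fits their business.
Full source code and test data: search GitHub for "Python-Invoice-OCR" to find the open-source project, which includes a Jupyter Notebook tutorial and sample invoice images.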