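Chapter 5: Multilingual Support

A closer look at Tesseract's multilingual recognition capabilities and how to tune them.

5.1 Language System Architecture

Tesseract language system
├── Language model (.traineddata)
│   ├── LSTM network weights
│   ├── Character set definition (unicharset)
│   ├── Dictionaries / word frequencies
│   └── Bigrams
├── Script system
│   ├── Latin
│   ├── Han
│   ├── Cyrillic
│   ├── Arabic
│   └── Devanagari
└── Runtime language combination
    └── eng+chi_sim+jpn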
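
The runtime combination at the bottom of the tree is just a `+`-joined string passed to `-l` (or to `lang=` in pytesseract). A minimal sketch of building that string from a list follows; the helper name `build_lang_arg` is hypothetical and not part of Tesseract or pytesseract.

```python
def build_lang_arg(languages):
    """Join language codes into the value passed to Tesseract's -l option.

    The first entry is treated as the primary language. Duplicates are
    dropped while keeping the original order.
    """
    seen = set()
    ordered = []
    for code in languages:
        code = code.strip()
        if not code:
            continue
        if code not in seen:
            seen.add(code)
            ordered.append(code)
    if not ordered:
        raise ValueError("at least one language code is required")
    return "+".join(ordered)


print(build_lang_arg(["eng", "chi_sim", "jpn"]))
```

The result can be passed directly as `lang=` to `pytesseract.image_to_string`.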

5.2 Chinese Recognition

5.2.1 Simplified Chinese

# Install the Simplified Chinese language pack
sudo apt install tesseract-ocr-chi-sim

# Basic recognition
tesseract chinese.png stdout -l chi_sim

# Mixed Chinese and English (recommended)
tesseract mixed.png stdout -l chi_sim+eng

import pytesseract
from PIL import Image

img = Image.open('chinese.png')

# Chinese only
text_zh = pytesseract.image_to_string(img, lang='chi_sim')
print(text_zh)

# Mixed Chinese and English
text_mixed = pytesseract.image_to_string(img, lang='chi_sim+eng')
print(text_mixed)
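
Beyond the raw string, it is often useful to look at per-word confidence before trusting a result. The sketch below assumes the dictionary layout returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, with parallel `text` and `conf` lists. The helper name `confident_words`, the threshold of 60, and the sample data are illustrative assumptions.

```python
def confident_words(data, min_conf=60.0):
    """Keep words whose confidence is at least min_conf.

    `data` is assumed to hold parallel 'text' and 'conf' lists, as in the
    dictionary form of pytesseract.image_to_data.
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        try:
            score = float(conf)
        except (TypeError, ValueError):
            continue  # skip entries without a usable confidence
        if text.strip() and score >= min_conf:
            words.append((text, score))
    return words


# Hand-written sample standing in for real OCR output
sample = {"text": ["", "你好", "world", "l1"], "conf": ["-1", "96.5", 88, "31.2"]}
print(confident_words(sample))
```

A suitable threshold depends on the document type, so treat 60 as a starting point to tune.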

5.2.2 Traditional Chinese

# Install Traditional Chinese
sudo apt install tesseract-ocr-chi-tra

# Traditional Chinese recognition
tesseract traditional.png stdout -l chi_tra

# Mixed Simplified and Traditional
tesseract mixed.png stdout -l chi_sim+chi_tra+eng

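5.2.3 Tuning Chinese Recognition

import pytesseract
from PIL import Image
import cv2

def ocr_chinese(image_path):
    """Chinese OCR with preprocessing tuned for dense strokes."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 1. Contrast enhancement (Chinese characters have many strokes, so contrast matters more)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # 2. Adaptive thresholding (copes with uneven lighting); text ends up black on white
    binary = cv2.adaptiveThreshold(
        enhanced, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 15, 8
    )

    # 3. Morphology to reconnect broken strokes. The text is black on a white
    #    background, so MORPH_OPEN (applied to the white foreground) is what fills
    #    small gaps in the black strokes; MORPH_CLOSE would erase thin strokes instead.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    morphed = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    # 4. OCR
    pil_img = Image.fromarray(morphed)
    text = pytesseract.image_to_string(
        pil_img,
        lang='chi_sim+eng',
        config='--psm 6 --oem 1'
    )

    return text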
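
Chinese output from Tesseract often contains stray spaces between adjacent characters, so a small post-processing step can complement the image preprocessing. The following is a minimal sketch using only the standard library. The function name `strip_cjk_spaces` and the chosen character ranges are assumptions for illustration and may need widening for rarer characters.

```python
import re

# CJK Unified Ideographs, CJK punctuation, and full-width forms
_CJK = r"\u4e00-\u9fff\u3000-\u303f\uff00-\uffef"
_CJK_GAP = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text):
    """Remove spaces that sit between two CJK characters; leave other spacing alone."""
    return _CJK_GAP.sub("", text)


print(strip_cjk_spaces("中 文 识 别 with Tesseract 5 很 好 用"))
```

Spaces next to Latin letters or digits are kept, so mixed Chinese and English lines stay readable.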

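5.2.4 Chinese Model Comparison

Model	Size (approx.)	Accuracy	Speed	Recommended use
`chi_sim` (standard, from `tessdata`)	55MB	⭐⭐⭐	⭐⭐⭐⭐	General use
`chi_sim` (from `tessdata_best`)	95MB	⭐⭐⭐⭐⭐	⭐⭐	High accuracy requirements
`chi_sim` (from `tessdata_fast`)	25MB	⭐⭐	⭐⭐⭐⭐⭐	Batch processing

All three variants ship as `chi_sim.traineddata` and are selected with `-l chi_sim`; `chi_sim_best` and `chi_sim_fast` are not separate language codes. File sizes vary between releases, so check the repositories for current figures.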

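5.3 Japanese Recognition

5.3.1 Basic Usage

# Install the Japanese language pack
sudo apt install tesseract-ocr-jpn

# Japanese recognition
tesseract japanese.png stdout -l jpn

# Mixed Japanese and English
tesseract mixed.png stdout -l jpn+eng

5.3.2 Japanese Scripts

Japanese uses three writing systems:

Script	Description	Example
Hiragana	Native Japanese phonetic syllabary	あいうえお
Katakana	Phonetic syllabary used mainly for loanwords	アイウエオ
Kanji	Characters of Chinese origin	日本語

# Vertical Japanese text (PSM 5: a single uniform block of vertically aligned text)
tesseract vertical_jp.png stdout -l jpn --psm 5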

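5.4 Arabic Recognition

5.4.1 Basic Usage

# Install the Arabic language pack
sudo apt install tesseract-ocr-ara

# Arabic recognition (RTL, right to left)
tesseract arabic.png stdout -l ara

# Arabic + English
tesseract mixed.png stdout -l ara+eng

5.4.2 Arabic-Specific Handling

Arabic is written right to left (RTL), which calls for some care:

import pytesseract
from PIL import Image

img = Image.open('arabic.png')

# PSM 6 suits a single uniform block of text
text = pytesseract.image_to_string(
    img, 
    lang='ara+eng',
    config='--psm 6'
)

# Note: the recognized text is normally stored in logical (reading) order.
# If it looks reversed, the usual cause is a terminal or viewer without
# bidirectional text support, so reversing the string itself is rarely the right fix.
lines = text.strip().split('\n')
for line in lines:
    print(line)  # display direction depends on the rendering environment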
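
To decide which output lines need RTL-aware rendering, one option is to inspect the Unicode bidirectional class of their characters. The sketch below uses only the standard library's `unicodedata` module; the function name `is_rtl_line` and the "first strong character" rule are simplifying assumptions, not a full implementation of the Unicode bidirectional algorithm.

```python
import unicodedata

# Arabic word "marhaba" (hello), written with escapes so the logical order is explicit
ARABIC_HELLO = "\u0645\u0631\u062d\u0628\u0627"


def is_rtl_line(line):
    """Return True if the first strongly directional character is right-to-left."""
    for ch in line:
        direction = unicodedata.bidirectional(ch)
        if direction in ("R", "AL"):  # Hebrew-type or Arabic-letter
            return True
        if direction == "L":          # strong left-to-right
            return False
    return False  # only digits, spaces, or punctuation


for sample in (ARABIC_HELLO, "Hello", "123 " + ARABIC_HELLO):
    print(is_rtl_line(sample))
```

Lines flagged this way can be handed to a bidi-aware renderer or wrapped in an element with `dir="rtl"` when exported to HTML.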

5.5 Other Languages

5.5.1 Language Pack Installation at a Glance

# Asian languages
sudo apt install tesseract-ocr-jpn        # Japanese
sudo apt install tesseract-ocr-kor        # Korean
sudo apt install tesseract-ocr-tha        # Thai
sudo apt install tesseract-ocr-vie        # Vietnamese
sudo apt install tesseract-ocr-hin        # Hindi

# European languages
sudo apt install tesseract-ocr-deu        # German
sudo apt install tesseract-ocr-fra        # French
sudo apt install tesseract-ocr-spa        # Spanish
sudo apt install tesseract-ocr-por        # Portuguese
sudo apt install tesseract-ocr-ita        # Italian
sudo apt install tesseract-ocr-rus        # Russian
sudo apt install tesseract-ocr-nld        # Dutch
sudo apt install tesseract-ocr-pol        # Polish
sudo apt install tesseract-ocr-tur        # Turkish

# Middle Eastern languages
sudo apt install tesseract-ocr-ara        # Arabic
sudo apt install tesseract-ocr-heb        # Hebrew
sudo apt install tesseract-ocr-fas        # Persian

5.5.2 Full List of Supported Languages

# List all installed languages
tesseract --list-langs

# List all language packages available from apt (including ones not installed)
apt list 2>/dev/null | grep tesseract-ocr-
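
Before running a batch job it helps to confirm that every required language is installed. The sketch below parses text in the shape printed by `tesseract --list-langs`, assuming a header line ending in `:` followed by one language code per line. The sample output and the helper names are illustrative; in a real script the text would come from running the command, for example with `subprocess.run`.

```python
def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumes the first line is a header ending in ':' and that each following
    non-empty line is one language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if lines and lines[0].endswith(":"):
        lines = lines[1:]
    return lines


def missing_languages(required, installed):
    """Return the required codes that are not installed, in the given order."""
    return [code for code in required if code not in installed]


# Illustrative sample, not captured from a real machine
sample_output = """List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):
chi_sim
eng
osd
"""
installed = parse_list_langs(sample_output)
print(installed)
print(missing_languages(["chi_sim", "jpn"], installed))
```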

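5.6 Mixed-Language Processing

5.6.1 Language Combination Strategies

# Method 1: specify multiple languages (recommended)
tesseract image.png stdout -l chi_sim+eng

# Method 2: use script models (experimental)
# Script models usually live in the tessdata "script" subdirectory, and Han is
# split into HanS (Simplified) and HanT (Traditional); adjust the prefix to your install
tesseract image.png stdout -l script/HanS+script/Latin

5.6.2 Comparing Language Combinations

Combination	Speed	Accuracy	Use case
`eng`	Fast	High for English	English only
`chi_sim`	Medium	High for Chinese	Chinese only
`chi_sim+eng`	Medium	High for mixed text	Mixed Chinese and English
`chi_sim+chi_tra`	Slow	Handles mixed Simplified/Traditional	Mixed Simplified and Traditional text
`jpn+eng`	Medium	Handles mixed Japanese/English	Mixed Japanese and English text

Notes:

Each additional language slows recognition (this tutorial's rough estimate is about 30% more time per extra language; measure on your own documents).
Combining languages can introduce misrecognitions.
List languages in order of importance (primary language first).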
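
Because the cost of extra languages depends on the documents and models involved, it is worth measuring rather than relying on a fixed percentage. The following is a small timing harness; `recognize` is any callable that takes a language string, such as a wrapper around `pytesseract.image_to_string` for one fixed image. The `fake_recognize` stand-in and all names here are hypothetical, used only so the sketch runs without Tesseract installed.

```python
import time


def time_language_configs(recognize, lang_configs, repeats=3):
    """Return {language string: best wall-clock seconds over `repeats` runs}."""
    results = {}
    for lang in lang_configs:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            recognize(lang)
            best = min(best, time.perf_counter() - start)
        results[lang] = best
    return results


def fake_recognize(lang):
    # Stand-in workload: cost grows with the number of languages
    time.sleep(0.01 * (lang.count("+") + 1))


timings = time_language_configs(fake_recognize, ["eng", "chi_sim+eng"])
for lang, seconds in timings.items():
    print(f"{lang}: {seconds * 1000:.1f} ms")
```

Taking the best of several runs reduces noise from disk caching and other processes.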

5.6.3 Automatic Language Detection

import pytesseract
from PIL import Image

def detect_language(image_path):
    """Detect the script with OSD and map it to a language code."""
    img = Image.open(image_path)
    
    # OSD (orientation and script detection); requires the osd language data
    osd = pytesseract.image_to_osd(img, output_type=pytesseract.Output.DICT)
    
    script = osd['script']
    confidence = osd['script_conf']
    
    # Script-to-language mapping (OSD reports scripts, not languages,
    # so each entry is only a default guess)
    script_lang_map = {
        'Han': 'chi_sim',
        'Latin': 'eng',
        'Japanese': 'jpn',
        'Korean': 'kor',
        'Arabic': 'ara',
        'Cyrillic': 'rus',
        'Devanagari': 'hin',
    }
    
    detected_lang = script_lang_map.get(script, 'eng')
    
    print(f"Detected script: {script} (confidence: {confidence:.2f})")
    print(f"Suggested language: {detected_lang}")
    
    return detected_lang
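
The function above reads the script confidence but does not act on it. One possible refinement is to fall back to a broader language combination when the confidence is low. The sketch below is an assumption-laden illustration: the name `choose_languages`, the threshold `min_conf=1.0`, and the fallback string are hypothetical and should be tuned against real OSD output.

```python
SCRIPT_LANG_MAP = {
    "Han": "chi_sim",
    "Latin": "eng",
    "Japanese": "jpn",
    "Korean": "kor",
    "Arabic": "ara",
    "Cyrillic": "rus",
    "Devanagari": "hin",
}


def choose_languages(script, confidence, min_conf=1.0, fallback="chi_sim+eng"):
    """Map an OSD script name to a language string, falling back when unsure."""
    lang = SCRIPT_LANG_MAP.get(script)
    if lang is None or confidence < min_conf:
        return fallback
    # Non-Latin pages often contain some Latin text, so keep eng as a secondary language
    if lang != "eng":
        return f"{lang}+eng"
    return lang


print(choose_languages("Japanese", 3.2))
print(choose_languages("Japanese", 0.4))
print(choose_languages("Latin", 5.0))
```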

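5.7 Character Sets and Unicode

5.7.1 Unicode Support in Tesseract

Language	Unicode range	Notes
English	U+0020 - U+007E	Basic Latin (printable ASCII)
Chinese	U+4E00 - U+9FFF	CJK Unified Ideographs
Japanese kana	U+3040 - U+30FF	Hiragana + Katakana
Korean	U+AC00 - U+D7AF	Hangul Syllables
Arabic	U+0600 - U+06FF	Arabic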
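
These ranges can be used to check which scripts actually appear in OCR output, for example to spot a page recognized with the wrong language. The sketch below encodes the table's ranges directly; the category labels and function names are illustrative choices.

```python
from collections import Counter

# (label, first code point, last code point), taken from the table above
SCRIPT_RANGES = [
    ("Basic Latin", 0x0020, 0x007E),
    ("CJK ideograph", 0x4E00, 0x9FFF),
    ("Japanese kana", 0x3040, 0x30FF),
    ("Hangul syllable", 0xAC00, 0xD7AF),
    ("Arabic", 0x0600, 0x06FF),
]


def classify_char(ch):
    """Return the label of the range containing ch, or 'Other'."""
    cp = ord(ch)
    for name, lo, hi in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Other"


def script_histogram(text):
    """Count non-whitespace characters per script label."""
    return Counter(classify_char(ch) for ch in text if not ch.isspace())


print(script_histogram("日本語のOCR"))
```

Characters outside these blocks, such as full-width punctuation or CJK extension ideographs, fall into "Other".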

5.7.2 Character Whitelists and Blacklists

import pytesseract
from PIL import Image

img = Image.open('image.png')

# Digits only
# (with the LSTM engine, whitelists are honored from Tesseract 4.1 onward)
text = pytesseract.image_to_string(
    img, 
    lang='eng',
    config='--psm 6 -c tessedit_char_whitelist=0123456789'
)

# Chinese numerals and digits only
text = pytesseract.image_to_string(
    img,
    lang='chi_sim+eng',
    config='--psm 6 -c tessedit_char_whitelist=0123456789一二三四五六七八九十百千万亿'
)

# Exclude specific characters
text = pytesseract.image_to_string(
    img,
    config='-c tessedit_char_blacklist=|[]{}'
)
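
Hand-written config strings become fragile once the character list contains quotes or spaces. The helper below is a minimal sketch that assembles the same options with shell-style quoting. It assumes the config string is split with shell-like rules (as `shlex.split` does on POSIX systems); the function name `char_filter_config` is hypothetical.

```python
import shlex


def char_filter_config(psm=6, whitelist=None, blacklist=None):
    """Build a config string with optional character whitelist and blacklist."""
    parts = [f"--psm {int(psm)}"]
    if whitelist:
        parts.append("-c " + shlex.quote(f"tessedit_char_whitelist={whitelist}"))
    if blacklist:
        parts.append("-c " + shlex.quote(f"tessedit_char_blacklist={blacklist}"))
    return " ".join(parts)


print(char_filter_config(whitelist="0123456789"))
# Round trip: splitting the string recovers one argument per option
print(shlex.split(char_filter_config(blacklist="|[]{}")))
```

The returned string is meant to be passed as `config=` to `pytesseract.image_to_string`.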

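5.8 Vertical Text

5.8.1 Vertical Chinese

# Vertical Chinese (PSM 5)
tesseract vertical_cn.png stdout -l chi_sim --psm 5

5.8.2 Vertical Japanese

# Vertical Japanese
tesseract vertical_jp.png stdout -l jpn --psm 5

# Dedicated vertical models (chi_sim_vert, jpn_vert) are also distributed and may be worth comparing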

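5.8.3 Preprocessing for Vertical Text

import cv2
import numpy as np

def detect_vertical_text(image_path):
    """Rough heuristic: guess whether text is vertical from its projection profiles."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")
    
    # Binarize (text becomes white on black)
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    
    # Projection profiles
    h_proj = np.sum(binary, axis=1)  # horizontal projection: one sum per row
    v_proj = np.sum(binary, axis=0)  # vertical projection: one sum per column
    
    # Fraction of rows / columns that stand out as peaks
    # (fractions rather than raw counts, so the aspect ratio does not bias the result)
    h_ratio = np.mean(h_proj > np.mean(h_proj) * 1.5)
    v_ratio = np.mean(v_proj > np.mean(v_proj) * 1.5)
    
    # Vertical text: columns separated by gaps give a peaky vertical projection,
    # while the horizontal projection stays comparatively flat
    if v_ratio > h_ratio * 1.5:
        return True, "vertical text"
    else:
        return False, "horizontal text"

is_vertical, desc = detect_vertical_text('text.png')
print(f"Result: {desc}")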
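
Projection heuristics are easy to sanity-check on a synthetic image before trusting them on scans. The sketch below uses a different but related measure, the coefficient of variation of the row and column sums, and tests it on a generated page with three solid "text columns". The function names and the synthetic page are assumptions for illustration, and real pages will be noisier.

```python
import numpy as np


def profile_variation(binary):
    """Return (row_cv, col_cv): coefficient of variation of row sums and column sums."""
    binary = np.asarray(binary, dtype=np.float64)

    def cv(profile):
        mean = profile.mean()
        return float(profile.std() / mean) if mean > 0 else 0.0

    return cv(binary.sum(axis=1)), cv(binary.sum(axis=0))


def looks_vertical(binary):
    """Vertical text leaves blank gutters between columns, so column sums vary more."""
    row_cv, col_cv = profile_variation(binary)
    return col_cv > row_cv


# Synthetic page: three solid "text columns" separated by blank gutters
page = np.zeros((90, 90), dtype=np.uint8)
for x in (10, 40, 70):
    page[5:85, x:x + 10] = 255

print(looks_vertical(page))    # columns of text
print(looks_vertical(page.T))  # same page transposed into horizontal lines
```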

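5.9 Special Script Handling

5.9.1 Mathematical Formulas

# Install the equation data (equ)
sudo apt install tesseract-ocr-equ

# Try recognizing a formula
# (equ is an equation-detection module, so expect plain text rather than structured math)
tesseract formula.png stdout -l equ

5.9.2 Music Notation

# Tesseract does not currently support music notation natively
# One option is to run OSD first and then pick the closest script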

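5.10 Customizing Language Packs

5.10.1 Inspecting a Language Pack

# Unpack a language pack with combine_tessdata to see its structure
combine_tessdata -u /usr/share/tesseract-ocr/5/tessdata/chi_sim.traineddata chi_sim.

# List the components (the exact set depends on the model)
ls chi_sim.*
# chi_sim.config  chi_sim.lstm  chi_sim.lstm-number-dawg  chi_sim.lstm-punc-dawg
# chi_sim.lstm-recoder  chi_sim.lstm-unicharset  chi_sim.lstm-word-dawg  chi_sim.version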

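5.10.2 Choosing a Language Pack Version

Version	Source	Size	Accuracy
`tessdata`	Official standard	Medium	Standard
`tessdata_best`	Official best	Large	Highest
`tessdata_fast`	Official fast	Small	Moderate

# Download the best version (high accuracy)
cd /usr/share/tesseract-ocr/5/tessdata/
# Keep a backup of the installed model before replacing it
sudo cp chi_sim.traineddata chi_sim.traineddata.bak
# In tessdata_best the file is named chi_sim.traineddata (there is no chi_sim_best.traineddata)
sudo wget -O chi_sim.traineddata https://github.com/tesseract-ocr/tessdata_best/raw/main/chi_sim.traineddata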
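
An alternative to overwriting the system model is to keep each variant in its own directory and point Tesseract at it with `--tessdata-dir`. The sketch below only builds the download URL and the config string. It assumes the three official repositories share the same layout on their `main` branch; the helper names and the directory `/opt/tessdata_best` are hypothetical.

```python
# Repository name for each model variant
VARIANT_REPOS = {
    "standard": "tessdata",
    "best": "tessdata_best",
    "fast": "tessdata_fast",
}


def traineddata_url(lang, variant="best"):
    """Return the raw download URL for a language file in the chosen variant."""
    repo = VARIANT_REPOS[variant]
    return f"https://github.com/tesseract-ocr/{repo}/raw/main/{lang}.traineddata"


def tessdata_dir_config(directory):
    """Return a config fragment that makes Tesseract load models from `directory`."""
    return f'--tessdata-dir "{directory}"'


print(traineddata_url("chi_sim", "best"))
print(tessdata_dir_config("/opt/tessdata_best"))
```

With the file saved as `chi_sim.traineddata` in that directory, the language code stays `chi_sim` and only the config fragment changes between variants.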

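5.11 Choosing a Configuration by Use Case

Use case	Recommended language setting	Notes
Chinese invoices	`chi_sim`	Whitelist digits for numeric fields
Chinese-English contracts	`chi_sim+eng`	Standard configuration
Digitizing historical texts	`chi_tra`	May require custom training
Japanese manga	`jpn`	Use PSM 5 for vertical text
Multilingual passports	`eng+chi_sim+jpn+kor`	Choose according to the document
Arabic documents	`ara`	Mind the RTL direction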

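5.12 Chapter Summary

Key point	Details
Chinese recognition	Install `chi_sim`; add `eng` for mixed documents
Language combination	Join with `+`, primary language first
Vertical text	Use PSM 5
Model choice	Use `best` when accuracy matters, `fast` when speed matters
Mixed languages	More languages means slower recognition and possibly lower accuracy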
