
Notes on Megvii's Survey of Text Detection and Recognition

Introduction

Basic pipeline

  • Detection+Recognition
  • End-to-end

/thumbbox2/image-20201107140006891.png
fig1

Challenges for general text detection and recognition

  • Diversity and variability
    • Different languages, colors, fonts, sizes, orientations, and shapes
  • Complexity and interference of background
    • similar patterns/occlusions
  • Imperfect image conditions
    • Low resolution, shooting angle, blur (out of focus), noise, lighting

Methods before DL

Text detection

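  • CCA (Connected Components Analysis)

    1. Extract candidate regions that contain text (color clustering / extremal region extraction)

    2. Filter the background out of the candidate regions, i.e. separate out the text class (feature extraction, then classification/segmentation)

      • Features: MSER/SWT/SIFT/SURF/LBP/gray-level co-occurrence matrix, etc.
      • Classifiers: K-means/KNN/SVM/NN/decision tree, etc.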

Huang et al., 2013; Neumann and Matas, 2010; Epshtein et al., 2010; Tu et al., 2012; Yin et al., 2014; Yi and Tian, 2011; Jain and Yu, 1998
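
The two CCA steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the input is an already-binarized grid, components are 4-connected, and `looks_like_text` is a hypothetical hand-written area/aspect-ratio rule standing in for the trained classifiers listed above. It is not the method of any of the cited papers.

```python
from collections import deque

def connected_components(mask):
    """Label 4-connected foreground components in a binary grid.

    Returns one (top, left, bottom, right, area) tuple per component,
    with inclusive bounding-box coordinates.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right, area))
    return boxes

def looks_like_text(box, min_area=3, max_aspect=4.0):
    """Hypothetical filter (assumption): drop specks and very elongated blobs."""
    top, left, bottom, right, area = box
    height, width = bottom - top + 1, right - left + 1
    aspect = max(height, width) / min(height, width)
    return area >= min_area and aspect <= max_aspect

# Toy binary image: a character-like blob, a noise speck, and a long line.
mask = [
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
candidates = connected_components(mask)                      # step 1
text_boxes = [b for b in candidates if looks_like_text(b)]   # step 2
```

On this toy grid, step 1 yields three candidates and step 2 keeps only the character-like blob. In practice the filter would be a classifier trained on features such as those listed above.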

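  • SW (Sliding Window)

    1. Slide windows of different sizes over the image and run a binary classification on each window region (contains text / does not contain text)
    2. Merge the windows using morphological operations, CRFs, graph-based methods, etc.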

Lee et al., 2011; Wang et al., 2011; Coates et al., 2011; Wang et al., 2012
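
A minimal sketch of the sliding-window idea, under clear assumptions: `contains_text` is a hypothetical stand-in classifier (fraction of foreground pixels in the window), and the merge step is a plain union of overlapping boxes rather than the morphological, CRF, or graph-based merging mentioned above.

```python
def sliding_windows(height, width, sizes, stride):
    """Yield square windows as (top, left, bottom, right), bottom/right exclusive."""
    for size in sizes:
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                yield (top, left, top + size, left + size)

def contains_text(mask, window, threshold=0.75):
    """Hypothetical binary classifier (assumption): foreground ratio >= threshold."""
    top, left, bottom, right = window
    total = (bottom - top) * (right - left)
    filled = sum(mask[y][x] for y in range(top, bottom) for x in range(left, right))
    return filled / total >= threshold

def overlaps(a, b):
    """True if two windows share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge_windows(windows):
    """Repeatedly replace any overlapping pair by its bounding box."""
    merged = list(windows)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    a, b = merged[i], merged[j]
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

# Toy binary image with two separate text-like blobs.
mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
positives = [w for w in sliding_windows(4, 8, sizes=[2], stride=1)
             if contains_text(mask, w)]
regions = merge_windows(positives)
```

Here four windows are classified as text and merge into two regions, one per blob. A real system would pass several window sizes in `sizes` and use a learned classifier in place of the foreground-ratio rule.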

Text recognition

  • Feature-based methods
  • Decomposition into sub-problems
    • Text binarization -> text line segmentation -> character segmentation -> single character recognition -> word correction

feature-based: Shi et al., 2013; Yao et al., 2014; Rodriguez-Serrano et al., 2013, 2015; Gordo, 2015; Almazan et al., 2014

text binarization: Zhiwei et al., 2010; Mishra et al., 2011; Wakahara and Kita, 2011; Lee and Kim, 2013

text line segmentation: Ye et al., 2003

character segmentation: Nomura et al., 2005; Shivakumara et al., 2011; Roy et al., 2009

single character recognition: Chen et al., 2004; Sheshadri and Divvala, 2012

word correction: Zhang and Chang, 2003; Wachenfeld et al., 2006; Mishra et al., 2012; Karatzas and Antonacopoulos, 2004; Weinman et al., 2007
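
As an illustration of the last sub-problem, word correction, the sketch below snaps a raw recognition result to the closest entry of a lexicon. The lexicon, the `correct_word` helper, and the 0.6 similarity cutoff are assumptions for this example, and the similarity measure is the one provided by Python's standard `difflib` module, not the approach of the papers cited above.

```python
import difflib

def correct_word(raw, lexicon, cutoff=0.6):
    """Return the lexicon entry most similar to `raw`, or `raw` if none is close enough."""
    matches = difflib.get_close_matches(raw.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else raw

# Hypothetical lexicon; real systems would use a task-specific word list.
lexicon = ["exit", "hotel", "coffee", "street"]

print(correct_word("c0ffee", lexicon))  # coffee
print(correct_word("xyzzy", lexicon))   # xyzzy (no lexicon entry is close enough)
```

Returning the raw string when nothing passes the cutoff avoids forcing out-of-lexicon words, such as names or numbers, onto an unrelated entry.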

End-to-end (detection+recognition)

  • Wang et al., 2011: nearest-neighbor classifier + HoG
  • Neumann and Matas, 2013: decision delay approach + dynamic programming algorithm

Wang et al., 2011; Neumann and Matas, 2013

Methods based on DL

Text detection

Text recognition

/thumbbox2/image-20201107191619556.png
fig2

End-to-end (Detection+Recognition/Text Spotting)

image-20201107222045772

Auxiliary techniques that support detection and recognition

  • Synthetic Data
  • Weakly and Semi-Supervision

More paper reference

https://github.com/Jyouhou/SceneTextPapers

Datasets

https://tva1.sinaimg.cn/large/008eGmZEgy1gnbsqxuyqnj31sc0tqqv6.jpg

| Dataset (Year) | Image Num (train/test) | Text Num (train/test) | Orientation | Language | Characteristics | Detection/Recognition Task |
|---|---|---|---|---|---|---|
| End2End | | | | | | |
| ICDAR03 (2003) | 509 (258/251) | 2276 (1110/1156) | Horizontal | EN | - | ✓/✓ |
| ICDAR13 Scene Text (2013) | 462 (229/233) | - (848/1095) | Horizontal | EN | - | ✓/✓ |
| ICDAR15 Incidental Text (2015) | 1500 (1000/500) | - (-/-) | Multi-Oriented | EN | Blur, Small, Defocused | ✓/✓ |
| ICDAR17 / RCTW (2017) | 12263 (8034/4229) | - (-/-) | Multi-Oriented | CN | - | ✓/✓ |
| Total-Text (2017) | 1555 (1255/300) | - (-/-) | Multi-Oriented, Curved | EN, CN | Irregular polygon label | ✓/✓ |
| SVT (2010) | 350 (100/250) | 904 (257/647) | Horizontal | EN | - | ✓/✓ |
| KAIST (2010) | 3000 (-/-) | 5000 (-/-) | Horizontal | EN, KO | Distorted | ✓/✓ |
| NEOCR (2011) | 659 (-/-) | 5238 (-/-) | Multi-Oriented | 8 langs | - | ✓/✓ |
| CUTE (2014) | 80 (-/80) | - (-/-) | Curved | EN | - | ✓/✓ |
| CTW (2017) | 32K (25K/6K) | 1M (812K/205K) | Multi-Oriented | CN | Fine-grained annotation | ✓/✓ |
| CASIA-10K (2018) | 10K (7K/3K) | - (-/-) | Multi-Oriented | CN | | ✓/✓ |
| Detection Only | | | | | | |
| OSTD (2011) | 89 (-/-) | 218 (-/-) | Multi-Oriented | EN | - | ✓/- |
| MSRA-TD500 (2012) | 500 (300/200) | 1719 (1068/651) | Multi-Oriented | EN, CN | Long text | ✓/- |
| HUST-TR400 (2014) | 400 (400/-) | - (-/-) | Multi-Oriented | EN, CN | Long text | ✓/- |
| ICDAR17 / RRC-MLT (2017) | 18000 (9000/9000) | - (-/-) | Multi-Oriented | 9 langs | - | ✓/- |
| CTW1500 (2017) | 1500 (1000/500) | - (-/-) | Multi-Oriented, Curved | EN | Bounding box with 14 vertexes | ✓/- |
| Recognition Only | | | | | | |
| Char74k (2009) | 74107 (-/-) | 74107 (-/-) | Horizontal | EN, Kannada | Character label | -/✓ |
| IIIT 5K-Word (2012) | 5000 (-/-) | 5000 (2000/3000) | Horizontal | - | Cropped | -/✓ |
| SVHN (2010) | - (-/-) | 600000 (-/-) | Horizontal | - | House number digits | -/✓ |
| SVTP (2013) | 639 (-/639) | - (-/-) | | EN | Distorted | -/✓ |

Evaluation

Detection Metrics

  • Precision ($P$): the proportion of predicted text instances that can be matched to ground-truth (gt) labels.
  • Recall ($R$): the proportion of gt labels that have a corresponding match in the list of predictions.
  • F1-Score

$$ F_1 = \frac{2PR}{P+R} $$

  • And others
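
To make the three detection metrics concrete, here is a small sketch. The notes do not say how a prediction is "matched" to a gt label, so the criterion below is an assumption: axis-aligned boxes, greedy one-to-one matching, and an IoU threshold of 0.5. Benchmark protocols differ in these details.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_scores(predictions, ground_truth, iou_threshold=0.5):
    """Return (precision, recall, F1) under greedy one-to-one IoU matching."""
    matched_gt = set()
    true_positives = 0
    for pred in predictions:
        # Pick the best still-unmatched gt box for this prediction.
        best_idx, best_iou = None, 0.0
        for idx, gt in enumerate(ground_truth):
            if idx in matched_gt:
                continue
            score = iou(pred, gt)
            if score > best_iou:
                best_idx, best_iou = idx, score
        if best_idx is not None and best_iou >= iou_threshold:
            matched_gt.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Two gt boxes; three predictions, one of which is a false positive.
gt_boxes = [(0, 0, 10, 10), (20, 0, 30, 10)]
pred_boxes = [(0, 0, 10, 8), (50, 50, 60, 60), (22, 0, 30, 10)]
precision, recall, f1 = detection_scores(pred_boxes, gt_boxes)
```

In this example two of the three predictions match a gt box, so $P = 2/3$, $R = 1$, and $F_1 = 0.8$.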

Recognition Metrics

Character level (how many characters are recognized correctly) / word level (whether the predicted word is exactly the same as the gt)
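
A minimal sketch of both granularities follows. Word-level accuracy is taken directly from the definition above (exact match). For the character level, the notes only say "how many characters are recognized", so the edit-distance-based definition used here is an assumption; the function also assumes a non-empty gt list aligned one-to-one with the predictions.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def recognition_scores(predictions, ground_truth):
    """Return (word_accuracy, char_accuracy) for aligned prediction/gt lists."""
    pairs = list(zip(predictions, ground_truth))
    word_hits = sum(p == g for p, g in pairs)
    char_errors = sum(edit_distance(p, g) for p, g in pairs)
    total_chars = sum(len(g) for g in ground_truth)
    word_accuracy = word_hits / len(ground_truth)
    # Assumed definition: 1 - (total edit distance / total gt characters).
    char_accuracy = max(0.0, 1.0 - char_errors / total_chars)
    return word_accuracy, char_accuracy

gt_words = ["hotel", "exit", "coffee"]
pred_words = ["hotel", "exlt", "cofee"]
word_acc, char_acc = recognition_scores(pred_words, gt_words)
```

Only one of the three words matches exactly, so the word accuracy is $1/3$, while the two single-character errors over 15 gt characters give a character accuracy of $13/15$. This shows why word-level scores are usually the stricter of the two.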

https://tva1.sinaimg.cn/large/008eGmZEgy1gnbsqyja11j31920mitf2.jpg

Applications

  • Automatic Data Entry
  • Identity Authentication
  • Augmented Computer Vision
  • Intelligent Content Analysis

Reference