Project


Our objective was to create a Korean OCR (Optical Character Recognition) system.


Hangul



The native alphabet of the Korean language is called Hangul. It is a phonemic alphabet organized into syllabic blocks. Each block consists of at least two of the 24 Hangul letters (jamo), including at least one of the 14 consonants and one of the 10 vowels. In written Korean no letter may stand alone. Instead, letters are grouped into syllabic or morphemic blocks of at least two and often three:
  • a consonant or a doubled consonant called the initial,
  • a vowel or diphthong called the medial, and, optionally,
  • a consonant or consonant cluster at the end of the syllable, called the final

When a syllable has no actual initial consonant, the null initial ieung is used as a placeholder. Thus, a block contains a minimum of two jamo, an initial and a medial.
Normally the resulting block is written within a square of the same size and shape as a hanja (Chinese character) by compressing or stretching the letters to fill the bounds of the block.
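
The initial, medial and final structure described above is also visible in how Unicode encodes precomposed Hangul syllables. The following is a minimal Python sketch, not part of the original project, that splits a syllable into its jamo by arithmetic on the code point. The tables hold 19 initials, 21 medials and 28 finals (including "no final") because Unicode counts doubled consonants, diphthongs and consonant clusters as separate entries; the function name decompose is only illustrative.

```python
# Decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into its jamo.
# The syllable index is (initial * 21 + medial) * 28 + final.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"      # 19 initial consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"    # 21 vowels and diphthongs
FINALS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
          "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
          "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, "" = none


def decompose(syllable):
    """Return (initial, medial, final) for one precomposed syllable."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


print(decompose("한"))  # three jamo: initial, medial, final
print(decompose("아"))  # null initial ieung, a vowel, and no final
```

An OCR system that recognizes individual jamo could use the inverse of this arithmetic to assemble the recognized letters back into a syllable code point.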





Our work


How it went, day by day:

Friday

  • getting familiar with the Korean language
  • obtaining input text and the alphabet for our work
  • creating a procedure for dividing text into lines using vertical projection (MATLAB)
  • encountering problems for OCR: letters within a syllable can be connected, and the same letter appears at different scales
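
The line-splitting step can be sketched as follows. This is a Python illustration rather than the original MATLAB procedure. It assumes the page is a binary image stored as a list of rows of 0/1 values (1 = ink) and that "vertical projection" means the ink count of each row, that is, the projection onto the vertical axis. The names row_profile, find_runs and split_lines are hypothetical.

```python
def row_profile(image):
    """Count ink pixels in each row of a binary image (list of 0/1 rows)."""
    return [sum(row) for row in image]


def find_runs(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, where profile > threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of inked rows begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ended on the previous row
            start = None
    if start is not None:                  # run reaches the bottom of the image
        runs.append((start, len(profile)))
    return runs


def split_lines(image):
    """Cut the image into text lines at the empty rows between them."""
    return [image[a:b] for a, b in find_runs(row_profile(image))]


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(find_runs(row_profile(page)))
```

Applying the same run detection to column sums within one line would give a comparable sketch for splitting a line into characters. A threshold above zero may help with noisy scans, at the risk of cutting off thin strokes.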

Saturday

  • creating a procedure for dividing lines of text into single characters using horizontal projection (MATLAB)
  • creating a database of letters: each character is stored as an image, xproj, yproj, and a phonetic representation
  • proposing skeletonization and the use of topological features to distinguish characters, and choosing the features needed:
    • number of components
    • number of endpoints
    • direction of endpoint
    • number of branch points
    • number of corner points
  • tested correlations between projections of letters and some examples of complex characters - without success
  • tested correlations between projections of letters and projections of some consonants - with success
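
A correlation test between projections could look like the sketch below. It is written in Python for illustration and makes several assumptions: xproj is taken to be the ink count per column and yproj the ink count per row, the similarity measure is the Pearson correlation coefficient, and both profiles already have the same length. The source does not say which correlation measure was used.

```python
import math


def projections(image):
    """Return (xproj, yproj): ink counts per column and per row."""
    xproj = [sum(column) for column in zip(*image)]
    yproj = [sum(row) for row in image]
    return xproj, yproj


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0                         # a flat profile carries no shape
    return cov / math.sqrt(var_a * var_b)


letter = [[1, 0],
          [1, 1]]
xproj, yproj = projections(letter)
print(xproj, yproj, pearson(xproj, yproj))
```

Because the coefficient ignores overall amplitude, it tolerates differences in stroke thickness, but it still needs profiles of equal length, so letters at different scales have to be resampled first.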

Monday

  • implementing division of a syllable into characters based on connectedness (each separated part forms a separate component) - partially accurate
  • testing a procedure for matching syllable components with the alphabet, based on the sum of squared distances - partially accurate
  • implementing some features extracted from the skeletonized shape (endpoints, corner points, directions of endpoints)
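
The two Monday steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the original code: components are found with an 8-connected flood fill over a binary image, and matching compares equal-length feature vectors by their sum of squared differences. The source does not specify the connectivity or which vectors were compared, and the names label_components and best_match are hypothetical.

```python
from collections import deque


def label_components(image):
    """Return the 8-connected components as lists of (row, col) pixels."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not image[r][c] or seen[r][c]:
                continue
            queue = deque([(r, c)])        # breadth-first flood fill
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def sum_squared_distance(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(vector, alphabet):
    """Return the key in alphabet whose vector is closest to the input."""
    return min(alphabet, key=lambda k: sum_squared_distance(vector, alphabet[k]))


syllable = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
print(len(label_components(syllable)))
```

Splitting by connectedness fails whenever two letters touch or one letter consists of several disjoint strokes, which is consistent with the "partially accurate" result noted above.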

Tuesday

  • skeletonization of the alphabet and of the characters separated from the whole text
  • finished feature extraction - number of endpoints in different directions, number of branch points, number of corner points, height and width of the object
  • implementing a testing script for the whole procedure
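
Two of the skeleton features listed above, endpoints and branch points, can be counted as in the following Python sketch. It assumes the skeleton is a one-pixel-wide binary image and classifies each pixel by its crossing number, the count of 0-to-1 transitions when walking once around its eight neighbours. This is one common approach and not necessarily the one used in the project. Endpoint directions and corner points are left out.

```python
# Offsets of the eight neighbours in clockwise order, starting from north.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def crossing_number(skeleton, r, c):
    """Count 0 -> 1 transitions around pixel (r, c); outside counts as 0."""
    height, width = len(skeleton), len(skeleton[0])

    def at(y, x):
        return 1 if 0 <= y < height and 0 <= x < width and skeleton[y][x] else 0

    ring = [at(r + dy, c + dx) for dy, dx in RING]
    return sum(1 for i in range(8) if ring[i] == 0 and ring[(i + 1) % 8] == 1)


def count_features(skeleton):
    """Count endpoints (crossing number 1) and branch points (3 or more)."""
    endpoints = 0
    branch_points = 0
    for r, row in enumerate(skeleton):
        for c, value in enumerate(row):
            if not value:
                continue
            cn = crossing_number(skeleton, r, c)
            if cn == 1:
                endpoints += 1
            elif cn >= 3:
                branch_points += 1
    return {"endpoints": endpoints, "branch_points": branch_points}


plus_sign = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
print(count_features(plus_sign))
```

Counting transitions rather than raw neighbours keeps the pixels next to a junction from being reported as extra branch points, although real skeletons with spurs or diagonal junctions may still need cleanup.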

Wednesday

  • preparing test data
  • testing - obtained very bad results; searching for additional features
  • adding scaled vertical and horizontal projections to the feature set
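
The source does not say how the projections were scaled. One plausible reading, sketched here in Python purely as an assumption, is to resample each projection to a fixed number of bins by linear interpolation and divide by its peak, so that the same letter drawn at different sizes gives comparable feature vectors. The names resample and scaled_projection and the default length of 16 are hypothetical.

```python
def resample(profile, length):
    """Linearly interpolate a profile to exactly `length` samples."""
    if not profile or length < 1:
        raise ValueError("need a non-empty profile and a positive length")
    if len(profile) == 1 or length == 1:
        return [float(profile[0])] * length
    step = (len(profile) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * step                     # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(profile) - 1)
        frac = pos - lo
        out.append(profile[lo] * (1 - frac) + profile[hi] * frac)
    return out


def scaled_projection(profile, length=16):
    """Resample to a fixed length, then normalize so the peak equals 1."""
    resampled = resample(profile, length)
    peak = max(resampled)
    if peak <= 0:
        return resampled                   # empty image: nothing to normalize
    return [value / peak for value in resampled]


print(resample([0, 10], 3))
print(scaled_projection([0, 2, 4], 3))
```

A fixed length also lets the projections sit in the same feature vector as the topological counts, which a classifier generally requires.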

Thursday

  • bug found; testing results are much better
  • testing different classification methods using the R statistical tool
  • finishing work


Copyright © 2009.