Laboratory of Intelligent Systems and Applications

RCATSS (Reference Corpus for Arabic Text Segmentation Systems) is a corpus for testing and comparing Arabic text segmentation systems. It offers 83 lines and a total of 4519 characters in eight grayscale documents of printed Arabic text. The documents were digitized with a high-resolution scanner (300 dpi). So that the corpus can serve as a reference for all segmentation systems, it was designed to contain many examples of each problem that may cause difficulties for a segmentation system. The texts were chosen randomly and set in 34 fonts and various sizes so as to guarantee the presence of:

  • All the letters of the alphabet.

  • Diacritic signs, numerals (Latin and Hindi digits), and punctuation marks (?!,.").

  • Letters in all four positional forms: initial, medial, final, and isolated.

  • Overlapping characters.

  • Characters in vertical ligatures.

  • A minimum of 4000 characters.

  • Average image quality (average image size, noise, etc.).

  • Slanted text.

  • Condensed spacing between characters of the same word.
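
Some of the criteria above (alphabet coverage, digits, diacritics, minimum character count) can be checked automatically when a plain-text transcript of the documents is available. The following is only a minimal sketch under several assumptions, not the procedure used to build RCATSS: the function name `coverage_report` is hypothetical, "Hindi digits" is taken to mean the Arabic-Indic digits U+0660 to U+0669, the alphabet is reduced to its 28 basic letters, and characters are counted without whitespace. Visual criteria such as overlaps, ligatures, slant, and spacing cannot be checked this way.

```python
# Assumption: the 28 basic Arabic letters only; hamza variants, teh marbuta
# (U+0629), and alef maksura (U+0649) are deliberately left out.
ARABIC_LETTERS = {
    chr(c)
    for c in [
        0x0627,                  # alef
        0x0628,                  # beh
        *range(0x062A, 0x063B),  # teh .. ghain (17 letters)
        *range(0x0641, 0x0649),  # feh .. waw (8 letters)
        0x064A,                  # yeh
    ]
}

# Assumption: "diacritics" means the marks fathatan .. sukun (U+064B-U+0652).
ARABIC_DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

# Assumption: "Hindi digits" means the Arabic-Indic digits U+0660-U+0669.
ARABIC_INDIC_DIGITS = {chr(c) for c in range(0x0660, 0x066A)}
LATIN_DIGITS = set("0123456789")


def coverage_report(text: str, min_chars: int = 4000) -> dict:
    """Report which text-level selection criteria a transcript satisfies."""
    seen = set(text)
    # Whitespace is excluded here; how the corpus authors counted is not stated.
    n_chars = sum(1 for ch in text if not ch.isspace())
    return {
        "missing_letters": sorted(ARABIC_LETTERS - seen),
        "has_diacritics": bool(ARABIC_DIACRITICS & seen),
        "has_latin_digits": bool(LATIN_DIGITS & seen),
        "has_arabic_indic_digits": bool(ARABIC_INDIC_DIGITS & seen),
        "enough_characters": n_chars >= min_chars,
    }
```

Calling `coverage_report(transcript)` on the concatenated transcripts would return an empty `missing_letters` list and `True` for the remaining keys if the text-level criteria hold.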

Characteristics of the proposed corpus:

  • Number of lines: 83

  • Number of words: 984

  • Number of characters: 4519

  • Number of fonts used: 34