Dana Vrajitoru
I310 Multimedia Arts and Technology
Text
Text in Multimedia
- Definition: the medium of delivering information via words to be
read and interpreted by the reader.
- Types of writing systems:
- Logographic - using a single grapheme (atomic unit of
written language) to represent a word or a morpheme (a meaningful
unit of language).
- Alphabetic - using a set of standardized letters
representing phonemes (smallest contrastive unit in the sound system
of a language).
- Syllabaries - using symbols to represent syllables.
- Ideographic (or pictographic) - a graphic symbol
representing an idea.
History of Text
- Cuneiform - Sumerian, around 4000-3000 BC.
- Hieroglyphs (using pictures to represent ideas and
objects). Various types existed - Egyptian (~3000 BC), Anatolian
(Luwian, ~2000 BC), Cretan (~2000 BC), Mayan (3rd - 16th century).
- Alphabets: around 1700 BC, the Semitic population in Egypt
developed an alphabet that is believed to be the ancestor of most
modern ones, although the script itself is still not fully understood.
Current Writing Types
- Alphabets: Latin, Cyrillic, Arabic, Brahmic, Hebrew, Greek,
others.
- Syllabaries: used on a small scale in Greece (Mycenaean), for
Native American languages (Cherokee, Cree), in Africa (Vai), and in
China (Yi). Used on a large scale: Japanese kana (hiragana and
katakana).
- Logographic and ideographic: Chinese, Japanese (Kanji), Korean
(Hanja).
- Spelling: the International Phonetic Alphabet uses one symbol per
sound. Most languages will instead have digraphs (2 letters),
trigraphs (3 letters), or even tetragraphs (4 letters, like the German
tsch) to represent one phoneme.
Alphabet distribution in the world
Cree Syllabary
Typeface and Font
- Typeface - family of graphic characters including many sizes and
styles. Example: Arial.
- Font - collection of characters of a single size and style
belonging to a typeface. Example: Arial 14-point bold.
- Size - expressed in points (1 point is about 0.0138 in, roughly
1/72 in). A 14-point size means 14 points from the top of a capital
letter to the bottom of a descending letter (like y).
- Case: uppercase (capitals) and lowercase (small
letters). Case-sensitive: any system that distinguishes between the
uppercase and lowercase.
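
The point arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes the desktop-publishing point of exactly 1/72 in and a 96 dpi display, and the function names are made up for this example.

```python
POINTS_PER_INCH = 72  # assumption: desktop-publishing point (1/72 in)


def points_to_inches(points):
    """Convert a size in points to inches."""
    return points / POINTS_PER_INCH


def points_to_pixels(points, dpi=96):
    """Convert a size in points to pixels for a display with the given dpi."""
    return points * dpi / POINTS_PER_INCH


print(round(points_to_inches(14), 4))  # 0.1944
print(points_to_pixels(12))            # 16.0
```

The same font size therefore covers a different number of pixels on displays with different resolutions.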
Font Terminology
- Bitmap fonts - each glyph is stored as a matrix of pixels. Harder
to scale (scaling needs anti-aliasing), condense, or expand.
- Outline fonts - the glyphs are stored as outlines using vector
graphics (PostScript, TrueType).
- PostScript fonts - Type 1 or Type 2 fonts, developed for the
PostScript (printing oriented) language.
- TrueType - developed by Apple and adopted by Microsoft, offers a
higher degree of control over how the fonts are displayed.
- Serif - small decorative line added to the ends of the strokes of
a letter. A font without serifs is called sans serif (Arial).
- Monospace font - a font for which all the characters have the same
width. Useful in situations where the alignment is important
(programming code, ASCII art).
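
To see why bitmap fonts are harder to scale, here is a minimal Python sketch. The 5x5 glyph for the letter T is a made-up example, and the scaling simply repeats each pixel, which is the simplest possible method and not what a real font renderer does.

```python
# Hypothetical 5x5 bitmap for the letter "T": 1 = ink, 0 = background.
GLYPH_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def scale(glyph, factor):
    """Enlarge a bitmap by an integer factor by repeating each pixel."""
    scaled = []
    for row in glyph:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # Repeat the widened row 'factor' times (a fresh copy each time).
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled


def render(glyph):
    """Draw the bitmap as text, '#' for ink and '.' for background."""
    return "\n".join("".join("#" if p else "." for p in row) for row in glyph)


print(render(GLYPH_T))
print()
print(render(scale(GLYPH_T, 2)))
```

Repeating pixels only works for whole-number factors and makes curved or diagonal strokes look blocky, which is where anti-aliasing comes in. An outline font avoids the problem by redrawing the shape at each size.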
Character Encoding
- ASCII - American Standard Code for Information Interchange. It is
a 7-bit code defining 128 characters, usually stored as 1 byte (8
bits) per character. Extended 8-bit character sets use the extra bit
to reach 256 characters.
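- Unicode - Universal Character Set. It was originally designed to
represent each character on 2 bytes (16 bits), giving a range of
65,536 characters. The current standard allows over a million code
points (up to U+10FFFF), stored using encodings such as UTF-8 and
UTF-16, which is enough to include many alphabets and notations.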
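
As a small illustration of character encoding, the Python sketch below prints the code point of a few characters and the number of bytes each one takes in UTF-8 and UTF-16, two common Unicode encodings. The sample characters and the helper names are arbitrary choices for this example.

```python
# Arbitrary sample characters, written as escapes:
# "A", e with acute accent, a Chinese character, an emoji.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def describe(ch):
    """Return (code point, bytes in UTF-8, bytes in UTF-16) for one character."""
    code_point = ord(ch)
    utf8_len = len(ch.encode("utf-8"))
    # "utf-16-le" avoids the 2-byte byte-order mark added by plain "utf-16".
    utf16_len = len(ch.encode("utf-16-le"))
    return code_point, utf8_len, utf16_len


def is_ascii(text):
    """True if every character is in the 7-bit ASCII range (0 to 127)."""
    return all(ord(c) < 128 for c in text)


for ch in SAMPLES:
    cp, n8, n16 = describe(ch)
    print(f"U+{cp:04X}: {n8} byte(s) in UTF-8, {n16} byte(s) in UTF-16")
```

The last sample needs 4 bytes in both encodings, which shows that 2 bytes are not enough for every Unicode character.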
Hypertext
- Hypermedia - interactive multimedia enhanced with links between
elements that allow the user to browse the project.
- Hypertext - a hypermedia project where the text component
represents an important part.
- For a collection of documents that contains text besides the
user-defined links, the textual information can be processed and
indexed.
- This makes it easier to retrieve individual parts of the
collection using a search engine based on keywords.
Indexing
- The documents to be indexed are first separated into words.
- The most common words are removed (as, of, are).
- The remaining words are processed to remove suffixes and prefixes
like s, ing, ly, etc. This turns several words from the same family
into the same form (work, works, working -> work), called a stem.
- The stems are counted in each document and each of them is
assigned an importance (weight).
- An inverted index is generated, storing for each keyword (stem)
the documents (pages) that are related to it.
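
The indexing steps above can be sketched in Python as follows. This is a toy version under several assumptions: the stop word list and the suffix list are short illustrative samples, the stemmer only strips one suffix (real stemmers use many more rules), and the raw count stands in for the importance of a stem. All names are made up for this example.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"as", "of", "are", "the", "a", "is", "in"}  # illustrative sample
SUFFIXES = ("ing", "ly", "s")  # toy list, checked in this order


def stem(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process(text):
    """Split into lowercase words, drop stop words, reduce to stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: stem -> {document id: count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for s, count in Counter(process(text)).items():
            index[s][doc_id] = count
    return index


docs = {
    "d1": "She works in the garden",
    "d2": "Working quickly, the team finished the work",
}
index = build_index(docs)
print(index["work"])  # {'d1': 1, 'd2': 2}
```

Both documents end up under the stem "work", even though neither of them shares the exact same word form with the other more than once.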
Search Engine
- The user enters one or more keywords to search for, with or
without Boolean operators (and, or).
- These keywords are processed the same way as described for the
indexing.
- All of the documents/pages related to the processed keywords in
the inverted index are retrieved and classified by relevance.
- Search engines use various techniques to rank the retrieved
pages. For example, user hits on a page can improve its relevance, and
links pointing to a page from other pages can make it more important.
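
A minimal Python sketch of the search steps above is given below. It assumes a small hand-written inverted index (stem -> document counts), a toy stemmer, and a relevance score that is simply the sum of the stem counts. Real search engines use more elaborate ranking, so this is only one possible illustration, and all names are made up.

```python
# Hypothetical pre-built inverted index: stem -> {document id: count}.
INDEX = {
    "work": {"d1": 1, "d2": 2},
    "garden": {"d1": 1},
    "team": {"d2": 1, "d3": 1},
}
STOP_WORDS = {"as", "of", "are", "the"}  # illustrative sample
SUFFIXES = ("ing", "ly", "s")  # toy list


def stem(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search(query, mode="and"):
    """Return matching document ids, most relevant first."""
    # Process the keywords the same way as the indexed text.
    stems = [stem(w) for w in query.lower().split() if w not in STOP_WORDS]
    if not stems:
        return []
    postings = [INDEX.get(s, {}) for s in stems]
    doc_sets = [set(p) for p in postings]
    if mode == "and":
        matches = set.intersection(*doc_sets)
    else:  # "or"
        matches = set.union(*doc_sets)
    # Score = sum of the counts of all query stems in the document.
    scores = {d: sum(p.get(d, 0) for p in postings) for d in matches}
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search("working teams"))            # ['d2']
print(search("working team", mode="or"))  # ['d2', 'd1', 'd3']
```

With "and", only documents containing every keyword are returned. With "or", any keyword is enough, and the score decides the order.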