Glossary


AIFF Audio Interchange File Format. An audio file format that was developed by Apple Computer and is primarily used for sharing high-quality sampled audio. As an uncompressed format, it is less frequently used for archiving because file sizes are so large. For information on the structure of AIFF format, see the Apple Developer pages on sound files. For information on AIFF specifications, see McGill University's website on audio format specifications.

Alignment A way of organizing multiple strings of data. Comparable sequences are presented together in a multilinear format, which aids in comparison of the strings by revealing similarities and differences in terms of insertions, deletions and substitutions.
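
As a rough illustration, the sketch below aligns two strings with Python's standard difflib module and labels each aligned stretch as a match, insertion, deletion or substitution. The pair of word forms and the function name align are invented for the example, and difflib is only one of many possible alignment methods.

```python
from difflib import SequenceMatcher

def align(source, target):
    """Return (operation, source_piece, target_piece) triples for two strings."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    rows = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # difflib labels substitutions 'replace'; the other labels are
        # 'equal', 'insert' and 'delete'.
        rows.append((op, source[i1:i2], target[j1:j2]))
    return rows

# Hypothetical pair of forms to compare.
alignment = align("colour", "color")
for op, src, tgt in alignment:
    print(f"{op:8} {src or '-':8} {tgt or '-'}")
```

Each printed row is one column of the multilinear display: the operation, the piece of the first string, and the corresponding piece of the second.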

Annotation Further linguistic information added to a speech signal or text; it may include part of speech labels, indications of morphological significance, and other pertinent information.

Archival Format An uncompressed, unedited rendition of data which serves as a long term storage form. Archival data storage is high quality and requires a large file size. Also called preservation format.

Archive A trusted repository created and maintained by an institution with a demonstrated commitment to permanence and the long-term preservation of archived resources. For more information see the E-MELD pages on Creating Collections and Archives.

Archival Object An archival object is not the same thing as an archived file. Most often, an archival object will consist of a set or bundle of related files, for example, a recording in several formats.

ASCII American Standard Code for Information Interchange. [æski] A code for information exchange between computers that encodes English characters as numbers (7 or 8 binary digits). The larger ASCII character sets also incorporate codes for non-English characters, graphic symbols and mathematical symbols. ASCII was developed by the American National Standards Institute.

AVI Audio-Video Interleaved is an audiovisual format developed by Microsoft. Although a proprietary technology, this is a common format for audio, video and other multimedia files used within Windows-PC environments.

Bit Binary Digit. The fundamental unit of information in the base-two numerical system used by computers; a single bit is either a '0' or a '1'. For example, the numbers 1, 2, 3, 4, 5 are represented as 1, 10, 11, 100, 101 in the binary system. More complex information is represented by groups of bits.
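
The binary forms given in this entry can be reproduced with a short sketch; the helper name to_binary is made up for the illustration.

```python
def to_binary(n):
    """Return the base-two digits of a non-negative integer as a string."""
    return format(n, "b")

binary_forms = [to_binary(n) for n in range(1, 6)]
print(binary_forms)  # ['1', '10', '11', '100', '101']

# A group of n bits can distinguish 2**n different values.
values_in_8_bits = 2 ** 8
print(values_in_8_bits)  # 256
```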

Bitonal Image An image consisting only of a foreground color and a background color. (Tonality is the pixel or bit depth of a digital image).

Bit depth Bit depth concerns the number of bits used to convey tonality for each pixel, that is, black and white, gray-scale, or color. In general, the more bits per pixel, the larger the file size. For digital audio, bit-depth is the sample size, which determines the dynamic range of the file.
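
A minimal sketch of the arithmetic behind bit depth follows. The function names are hypothetical, and the audio formula (roughly 6 dB of dynamic range per bit) is a common rule of thumb assumed here rather than something stated in this glossary.

```python
import math

def tonal_values(bits_per_pixel):
    """Number of distinct tones or colors a pixel of this depth can hold."""
    return 2 ** bits_per_pixel

def dynamic_range_db(bits_per_sample):
    """Approximate theoretical dynamic range: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bits_per_sample)

for bits in (1, 8, 24):
    print(bits, "bits per pixel:", tonal_values(bits), "values")

print(round(dynamic_range_db(16), 1), "dB for 16-bit audio")
```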

BMP Bitmap Image Format. Developed by Microsoft, it is uncompressed; therefore the file sizes are much larger than those for other formats. Can store graphics from 1-bit (2-color, i.e. black and white) up to 24-bit (16.7 million colors).

Byte A sequence of 8 bits (enough to represent one of 256 possible characters in the ASCII set of alphanumeric characters) processed as a single unit of information by a computer.

Characters A character is "the smallest component of written language that has semantic value" (The Unicode Standard 3.0: 13). Characters should be distinguished from glyphs, which are the visual representations of characters. For example, the single character "a" may be represented by different glyphs, such as its roman, italic and handwritten shapes.

Character Encoding Conversion Converting character codes from one encoding standard into their appropriate counterparts in another standard, e.g., converting characters from an idiosyncratic encoding (as used by some IPA fonts) to Unicode.
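
The sketch below shows the general shape of such a conversion. The mapping table is invented for an imaginary legacy font that reuses ASCII code points for phonetic symbols; a real conversion needs the actual table for the font in question.

```python
# Hypothetical mapping for an imaginary legacy IPA font; not a real font table.
LEGACY_TO_UNICODE = {
    "S": "\u0283",  # LATIN SMALL LETTER ESH
    "N": "\u014B",  # LATIN SMALL LETTER ENG
    "E": "\u025B",  # LATIN SMALL LETTER OPEN E
}

def convert(legacy_text):
    """Replace each legacy code with its Unicode counterpart; leave the rest alone."""
    table = str.maketrans(LEGACY_TO_UNICODE)
    return legacy_text.translate(table)

converted = convert("SEN")
print(converted)
print([f"U+{ord(ch):04X}" for ch in converted])
```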

Collection The body of documentary materials created by linguists and native speakers, that will be deposited in an archive. For more information see the E-MELD pages on Creating Collections and Archives.

Compression A technique for more efficient data storage. Lossless compression allows the file to be uncompressed and rebuilt in its original form without loss of data, but it generally reduces file size less than lossy compression does. Lossy compression can also be uncompressed, but some information (e.g., details of color in an image, or modulation in an audio file) will be lost. Lossy compression algorithms vary in the amount of data lost; higher compression results in smaller files but greater distortion.
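
The difference between the two kinds of compression can be sketched as follows. The lossless half uses Python's standard zlib module; the lossy half is only a toy (rounding sample values), chosen for illustration and not how real image or audio codecs work.

```python
import zlib

original = b"la la la la la la la la " * 50  # repetitive data compresses well

# Lossless: decompressing returns exactly the original bytes.
packed = zlib.compress(original)
restored = zlib.decompress(packed)
print(len(original), len(packed), restored == original)

# Lossy (toy illustration only): round 8-bit sample values down to multiples of 16.
samples = [3, 18, 77, 130, 255]
quantized = [s // 16 * 16 for s in samples]
print(quantized)  # [0, 16, 64, 128, 240]; the original values cannot be recovered
```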

Concordance An index of a text, based on words, lemmas, or morphemes with their immediate contexts. Concordance software can automatically generate such lists, given a set of words, lemmas, or morphemes, and a text.
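
A very small keyword-in-context concordance might look like the sketch below. The sample text, the function name concordance and the context width are all assumptions made for the example; real concordance software also handles lemmas and morphemes.

```python
import re

def concordance(text, keyword, width=2):
    """Return each occurrence of keyword with `width` words of context per side."""
    words = re.findall(r"\w+", text.lower())
    lines = []
    for i, word in enumerate(words):
        if word == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines

sample = "The dog saw the cat. The cat saw the bird."
hits = concordance(sample, "saw")
for left, word, right in hits:
    print(f"{left:>12} [{word}] {right}")
```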

Data Mining The extraction of patterns and other useful information from a corpus of data.

Database A collection of information compiled in a computer-searchable format.

DOC [dak] This is the file extension for Microsoft Word and WordPerfect documents. These are proprietary binary formats and are therefore not suitable for long-term archiving.

DPI Dots Per Inch. Number of pixels per inch stored by the digital file. Used for monitors and printers.

Dublin Core Metadata Set The Dublin Core metadata set is the result of the efforts of the Dublin Core Initiative, a movement that seeks to define a metadata standard general enough to catalog everything in the world.

File Extension The code that marks which format a particular set of data is stored in; it is represented in the filename preceded by a period. For example, in the filename "data.doc," "doc" is the file extension. It means that the file in question is a document file created by a program such as MS Word or WordPerfect.

Flat File Database An information structure that stores data in a simple two-dimensional format - for instance, an Excel spreadsheet that presents vocabulary items and their grammatical features. This human-readable format works well for simple sets of information but more complex sets are best stored in a relational database.

Feature Value Feature values define what kind of features belong to the specified grammatical category, such as gender and number. For example, in Classical Arabic, 'ʔimraʔatun', the word for 'woman', is a noun with the feature values of 'singular, feminine, nominative, and indefinite'.

Form Type The form type is the morphological structure in which a linguistic form occurs, such as prefix, suffix, or free root.

FO Formatting Object. A Formatting Object is a concept used in XSL transformations of XML files. FOs are used in the generation and manipulation of areas of text; they enable data that is stored in XML to be displayed and printed.

Format When used of a file, format refers to the file's preset organization scheme, specified by the application used to create it. MS Word, for example, creates files in a proprietary 'doc' format. A file format is often only readable by the same application that was used to create it, which causes portability problems.

GIF [gɪf] Graphics Interchange Format. A widely supported image-storage format introduced by CompuServe that gained early widespread use on online services and the Internet. It can only render up to 8-bit color (256 colors). However, the GIF format provides lossless compression of an image, making it more efficient than BMP for items such as line drawings. Visit the World Wide Web Consortium's link to GIF specifications for more information.

Gloss A brief label for a single meaning or sense of a linguistic form.

Glyph Glyphs are the visual representations of a character in a writing system. Several glyphs may represent the same character. For example, the single character "a" may be represented by different glyphs, such as its roman, italic and handwritten shapes.

Grammatical Category The part of speech category to which the linguistic form belongs, such as noun, verb, or adjective. Part of speech categorization is determined through morphological, semantic and distributional tests.

Grapheme Graphemes are the fundamental units in an orthography or writing system. The Spanish grapheme ch, for example, is treated as a single entity in sorting and often considered a separate letter. However, it is covered in Unicode by two characters, c + h, not a single ch character. (Anderson, 2003: 3)

HTML HyperText Markup Language. The markup language used to represent the format of web pages (in contrast to XML, which represents content). Tags are used to "mark up" elements of text for display in certain presentation formats. For instance, the HTML tag <b> is interpreted by web browsers to indicate that a block of text should be displayed in bold type. Visit the World Wide Web Consortium's HTML Homepage for more information.

Information Structure A scheme for organizing related data in such a way that it can be manipulated efficiently by software programs.

Interlinear Text Language data in tiered format incorporating 'text, phrase, word and morpheme levels.' (Bird, Bow, Hughes, 2003)

Interpolated Resolution The resolution at which a device can process an image. Using interpolation, the device assigns intermediate values based on known values to achieve higher resolution than the optical resolution offers.
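
The idea of assigning intermediate values from known values can be sketched with simple linear interpolation along one row of samples. This is an assumed, simplified method for illustration; actual devices may use other interpolation schemes.

```python
def interpolate_row(samples, factor):
    """Insert linearly interpolated values so each gap holds `factor` steps."""
    out = []
    for left, right in zip(samples, samples[1:]):
        for step in range(factor):
            out.append(left + (right - left) * step / factor)
    out.append(samples[-1])  # keep the final known value
    return out

optical = [0, 10, 20]            # values actually captured by the device
interpolated = interpolate_row(optical, 2)
print(interpolated)
```

The output has more values than the device captured, but the added ones are estimates, not new information.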

JPEG [dʒepɛg] Joint Photographic Experts Group. Used to refer to the standard the group developed for still-image compression, which is sanctioned by the International Organization for Standardization (ISO). The advantages of JPEG include smaller image files, achieved through higher compression, and the ability to render images in either 24-bit color (16 million colors) or 8-bit color (256 colors). However, JPEG is a lossy compression format, and so may not be preferred for archival images. Visit the World Wide Web Consortium's JPEG Homepage for more information.

Language Description "A language description aims at the record of a language, with ‘language’ being understood as a system of abstract elements, constructions, and rules that constitute the invariant underlying structure of the utterances observable in a speech community." Himmelmann (1998). Linguistics 36. pp. 161-195. Note that this is distinct from language documentation.

Language Documentation "The aim of a language documentation is to provide a comprehensive record of the linguistic practices characteristic of a given speech community. Linguistic practices and traditions are manifest in two ways: (1) the observable linguistic behavior, manifest in everyday interaction between members of the speech community, and (2) the native speakers' metalinguistic knowledge, manifest in their ability to provide interpretations and systematizations for linguistic units and events. This definition of the aim of a language documentation differs fundamentally from the aim of language descriptions." Himmelmann (1998). Linguistics 36. pp. 161-195.

Lexicon Management Software Software that provides the ability to quickly and easily enter, categorize and display large quantities of lexical data.

Line Art An image made up of distinct lines and solid areas, with no gradations of tone; compare bitonal image. The best resolution for line art is 1200 ppi. Continuous-tone grayscale or color images, by contrast, are printed by halftoning, the process of turning them into a series of dots that fool the eye.

Linguistic Corpus A collection of texts (now usually in an electronic format) that are chosen and organized in such a way as to facilitate linguistic research, e.g. to represent a certain type of discourse or to provide a data set to be searched for examples of linguistic features.

LPI Lines Per Inch. The number of lines per inch stored by the digital file. Used for printing screens.

Metadata Data about data. Metadata includes pertinent information about a collection of data, including information about the speaker, the collector and the format of the data. It is essential to accurate analysis of the data collected and increases portability.

Metadata Extension Metadata extensions in the form of XML markup are used by OLAC to "extend" the function of Dublin Core metadata to encompass linguistic data. An example of OLAC metadata extensions is "olac:role", which allows a user to further describe the creators and contributors of resources by applying roles to their positions. For instance, the creator can be further described as the "collector" or "data-provider". This allows for more accurate metadata as well as better recall and precision in searches for relevant resources.

Metadata Substitution Metadata substitutions are element refinements that can be substituted for one of the fifteen basic Dublin Core metadata elements. A refinement shares the meaning of an element but with narrower semantics. For example, the more specific "medium" or "extent" can be substituted for the general element "format."

Migration The movement of data from one format, system or domain to another without jeopardizing the integrity of the data.

Monochrome Literally "one color". Usually used for a black and white (or sometimes green or orange) monitor as distinct from a color monitor.

Morphology The branch of grammar that studies the internal structure of words. (Crystal, 1987)

MOV [mov] The extension for the QuickTime format, developed by Apple Computer. A method of storing sound, graphics, and movie files. The latest version of QuickTime can be downloaded from Apple's website.

MP3 MPEG-1 Audio Layer 3. A file format used for compressing audio data. Compression rates of up to 12 to 1 are possible, but with corresponding loss in sound quality. For more information, go to the MPEG Homepage.

MPEG [ɛmpɛg] An acronym for Moving Picture Experts Group, an industry committee that is developing a set of compression standards for moving images (such as film, video and animation) that can be downloaded and viewed on a computer. The MPEG-1 standard yields a video resolution of 352-by-240 at 30 frames/second, while MPEG-2 offers resolutions of 720x480 and 1280x720 at 60 frames/second, with full CD-quality audio. For more information, go to the MPEG Homepage.

Non-transcriptional Annotation Non-transcriptional annotation comprises ways of representing the form, structure and content of linguistic signs. It operates at a different level of interpretation than transcription, as it can be recursive or iterative (it can annotate itself or a transcription). Furthermore, it does not itself require a transcription. (One can annotate a video without transcribing it, for example with closed captions or with a translation into a major language.)

OCR Optical Character Recognition. A computer application that reads images of text from a printed page and converts them to characters that can be searched, indexed, and edited. Useful for digitizing a type-written hard copy for which no digital record exists.

OLAC Open Language Archives Community. A community founded to develop a consensus on best practices in digital language documentation and to develop a network of repositories of archival language resources.

Open Source Software that is open-source is software for which the source code is freely available. This means that another developer is free to modify the code according to his/her needs, or to reverse-engineer a product created by the software. Language documentation created using open source software is likely to last longer than that created using proprietary software because many programmers will be able to understand, and if necessary reconstruct, the software that makes it intelligible. Proprietary software, on the other hand, is impenetrable after the developer ceases to support it.

Open Standard An open standard is a specification whose description is freely available, e.g., HTML, XML. This means that developers are free to create applications which are valid according to the specification and which will therefore work with software designed for it. It is recommended best practice to use open standards and open source software, since documentation so created has the best chance of being intelligible to future generations.

Optical Resolution The resolution at which a device can capture an image; this creates a set of known values that enables interpolated resolution.

Parser An algorithm or program used to determine the structure of a sentence in some language. Syntactic parsers determine syntactic structure, taking the POS-marked tokens from a lexical analyser as their input. Morphological parsers take a word form as input and return a structure of morphemes as its analysis.

Parsing Breaking down linguistic data into working components such as lexemes or phrases.

PDF Portable Document Format. A file format developed by Adobe Systems, that is used to capture almost any kind of document with the formatting as in the original. Viewing a PDF file requires a reader such as Acrobat Reader, XPDF or Preview. Acrobat Reader is built into most browsers and can be downloaded freely from Adobe.

Phonology The study of the systematic distribution of phonemes (linguistic sounds). (Crystal, 1987)

Pinyin A romanization of Standard Mandarin script. Other systems have been used, and can be used with non-Han languages; however, Pinyin has been internationally accepted as the standard romanization script for Chinese languages. Pinyin is notably useful for entering Chinese language data into computers.

Pixel A pixel is a single dot on a computer monitor. Depending on the bit depth of a computer monitor, each pixel can be displayed as anywhere from two to millions of colors. Each pixel is assigned a tonal value (black, white, shades of gray or color), which is represented in binary code (zeros and ones).

PNG [pɪŋ] Portable Network Graphics. An open standard file format for bitmapped graphic images, designed to be a replacement for the GIF format. Visit the World Wide Web Consortium's PNG Homepage for more information.

Portability The ability of a data format to 'transcend computer environments, scholarly communities, domains of application and passage of time' (Bird & Simons, Language 79, 2003).

PPI Pixels Per Inch. A measure of the number of pixels per inch stored by a digital file. Used for monitor display resolution.

Precision The proportion of relevant results returned by a search engine, usually expressed as a percentage and calculated by dividing the number of relevant returns by the total number of returns. Higher precision means that less irrelevant material is displayed, but relevant material that uses a different vocabulary may also be missed.

Presentation Format A compressed, easily accessible rendition of data which serves as a general access working form primarily for the web. Presentation data storage is of medium quality and constitutes a reasonable file size for fast download time. Also called access format and display format.

PRJ The file extension for projects saved out of the (commercially available) SIL Linguist's Shoebox.

PSD Photoshop Document. The file extension for documents created and saved in Adobe Photoshop as layered images; the layers make it far easier to edit different parts of an image at a later date.

Real Audio Real Audio is an audio format that can only be heard by using RealPlayer, a basic version of which can be freely downloaded online. The file extension for Real Audio files is RA or RAM.

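Recall The proportion of all relevant resources that a search actually returns, calculated by dividing the number of relevant returns by the total number of relevant resources available. Aiming for high recall, so as to avoid missing any relevant returns, usually lowers precision; that is, it results in increased irrelevant returns.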
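
The trade-off between precision and recall can be made concrete with a small sketch. It takes precision as relevant returns divided by all returns, and recall in its usual information-retrieval sense of relevant returns divided by all relevant resources (an assumption where the wording of the entry is looser). The document identifiers are invented.

```python
def precision(returned, relevant):
    """Share of the returned items that are relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0

def recall(returned, relevant):
    """Share of the relevant items that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0

relevant_docs = {"d1", "d2", "d3", "d4"}          # hypothetical relevant resources
search_results = ["d1", "d2", "d7", "d8", "d9"]   # hypothetical search returns

p = precision(search_results, relevant_docs)
r = recall(search_results, relevant_docs)
print(f"precision = {p:.0%}, recall = {r:.0%}")
```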

Relational Database An information structure that stores data in tables that can be linked to each other for cross-referencing - for instance, a table that presents vocabulary items and their grammatical features that is linked to a table that presents grammatical features and their definitions. This format prevents the duplication of data and is the preferred method of storing complex sets of information. Compare to flat file database.
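
The two linked tables described in this entry can be sketched with Python's built-in sqlite3 module. The table names, column names and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.executescript("""
    CREATE TABLE feature (
        id INTEGER PRIMARY KEY,
        name TEXT,
        definition TEXT
    );
    CREATE TABLE word (
        form TEXT,
        feature_id INTEGER REFERENCES feature(id)
    );
    INSERT INTO feature VALUES (1, 'noun', 'names a person, place or thing');
    INSERT INTO feature VALUES (2, 'verb', 'names an action or state');
    INSERT INTO word VALUES ('house', 1), ('tree', 1), ('run', 2);
""")

# Each definition is stored once and cross-referenced from the word table.
rows = conn.execute("""
    SELECT word.form, feature.name, feature.definition
    FROM word JOIN feature ON word.feature_id = feature.id
    ORDER BY word.form
""").fetchall()
for row in rows:
    print(row)
conn.close()
```

In a flat file, the definition of 'noun' would have to be repeated on every noun's row; here it is stored once.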

Resolution A measure of the quality of a digital image; as the resolution increases, the quality of the image increases. Resolution is measured in terms of dots per inch (dpi) or pixels per inch (ppi). Monitors measure in ppi; flatbed scanners, drum scanners, and printers measure in dpi.

RTF Rich Text Format. A format developed by Microsoft that enables text files to be saved with formatting, font information, text color, and some page layout information intact. Designed for cross-platform interchange of text and graphics, it does a good job of storing font, color and formatting information, but it does not represent information structure.

Sampling Rate For digital audio, the number of times per second that a sound wave is measured. A higher sampling rate provides a more faithful reproduction of the sound.
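
Sampling rate, together with bit depth, determines how large an uncompressed audio file will be. The sketch below works this out for one minute of audio at 44,100 samples per second, 16 bits, two channels (the audio CD parameters, used here purely as an example); the function name is hypothetical.

```python
def uncompressed_audio_bytes(sampling_rate, bit_depth, channels, seconds):
    """Size of raw sample data, ignoring any file header."""
    bytes_per_sample = bit_depth // 8
    return sampling_rate * bytes_per_sample * channels * seconds

one_minute = uncompressed_audio_bytes(44100, 16, 2, 60)
print(one_minute, "bytes")                    # 10584000 bytes
print(round(one_minute / 1_000_000, 1), "MB")
```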

Semantic Field The meaning group to which a lexeme belongs; a group of interrelated vocabulary items. For example, the terms 'head', 'arm' and 'foot' all belong to the semantic field 'parts of the body.'

Screen Resolution Resolution measured in pixels per inch (PPI). It helps to think of the image as a grid. As the resolution increases, the size of the grid cells gets smaller, in effect increasing the number of cells (pixels) per inch.
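
The grid picture translates directly into arithmetic. The sketch below computes the pixel grid for a page of a given physical size and the resulting uncompressed size; the page dimensions, the 300 ppi setting and the function names are assumptions chosen for illustration.

```python
def pixel_dimensions(width_in, height_in, ppi):
    """Pixel grid for an image of the given size in inches."""
    return round(width_in * ppi), round(height_in * ppi)

def uncompressed_image_bytes(width_px, height_px, bit_depth):
    """Raw pixel data size, ignoring headers."""
    return width_px * height_px * bit_depth // 8

w, h = pixel_dimensions(8.5, 11, 300)   # a letter-size page at 300 ppi
size = uncompressed_image_bytes(w, h, 24)
print(w, h, size)
```

Doubling the resolution doubles the number of cells in each direction, so the uncompressed size grows fourfold.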

SPI Samples Per Inch stored by the digital file. Used for scanners, and for optical resolution vs. interpolated resolution.

Supported by Multiple Vendors Data stored in a file format supported by multiple vendors has a better chance of being accessible in the future. If your data is stored in a proprietary format that is only available from a single vendor, you are dependent on that vendor's ability to stay in business, and their willingness to continue writing applications for that format.

Syntax The way in which words are arranged to show relationships of meaning within (and sometimes between) sentences. (Crystal, 1987)

Tagger A piece of software used for tagging texts, i.e. labelling words with the part of speech appropriate to the context. A POS tagger, for instance, will usually work in tandem with a syntactic parser.

Tagging Usually Part-of-Speech (POS) tagging, or grammatical tagging, the most common form of corpus annotation. Essentially, it involves marking each word in a corpus with a grammatical label.
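
A toy example of marking each word with a grammatical label is sketched below. The mini-lexicon and tag names are invented, and a plain dictionary lookup is only a stand-in: real taggers choose tags according to context.

```python
# Hypothetical mini-lexicon and tag labels, invented for illustration.
LEXICON = {"the": "DET", "dog": "NOUN", "cat": "NOUN", "sees": "VERB"}

def tag(sentence, default="UNK"):
    """Pair each word with its tag; unknown words receive the default tag."""
    return [(word, LEXICON.get(word.lower(), default)) for word in sentence.split()]

tagged = tag("The dog sees the cat")
print(" ".join(f"{word}/{label}" for word, label in tagged))
```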

Tagging Scheme A reference of annotation tags including their names, definitions and the guidelines concerning their use in a corpus.

Tagset The set of tags used for annotation in a particular language in a particular corpus.

Thumbnail A reduced-size version of an image, used to present a faster loading, albeit lower quality, alternative to larger image formats such as a GIF or a JPEG. This format is not recommended for archival purposes, but can be useful when a large number of images must be displayed on the same webpage.

TIFF [tɪf] Tagged Image File Format. Storage format for master digital images. It is a platform-independent format and the recommended format for archiving images in electronic form. TIFF is one of the few image formats that can store an image losslessly with no compression at all. (Sometimes a lossless compression algorithm called LZW is used, but it is not universally supported.) The default file extension for TIFF files is ".TIF".

Tonality Pixel depth or bit depth.

Transcription Transcription (or transcriptional annotation) is a representation of the sign, in some modality such as speech or gesture; it includes the information necessary for a person or machine to reproduce the linguistic form. Transcription is always interpretive, since it reflects what the transcriber perceives, but its goal is faithful representation of the linguistic signal. Types of transcription include orthographic, phonetic, phonemic, and kinemic/kinetic transcription. For more information see the Annotation and Transcription Homepage.

Transparent We are using transparent to refer to file formats which can be accessed without recourse to special algorithms. For example, an uncompressed TIFF image file could be read, pixel by pixel, by any application that can handle images, but a GIF file compressed using the Lempel-Ziv algorithm could only be displayed using an application that is able to process that algorithm.

TXT [tɛkst] The basic, ASCII encoded text format. Non-proprietary, widely supported and future-proof.

Typology The classification of languages according to their linguistic features, such as word order, morphological structure, and phonological phenomena.

Unicode Unicode is an international character encoding standard designed to map each character in the world's writing systems to its own unique numerical code. The inventory of characters covered by the standard continues to grow; it has the potential to provide a unique code for approximately one million characters. Unicode is the standard upon which many current fonts, keyboards, and software are based. For more detailed information, consult the latest edition of the Unicode Standard, which is available online from the Unicode website and in print. (Anderson, 2003: 1) Also, see the E-MELD pages on Unicode.
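
The mapping from characters to numerical codes can be inspected with Python's standard unicodedata module, as in the sketch below. The three sample characters are arbitrary; the upper limit U+10FFFF is where the figure of roughly one million codes comes from.

```python
import unicodedata

# Each character has its own code point and official name.
for ch in "a\u014B\u0283":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))

# Code points run from U+0000 to U+10FFFF.
code_space = 0x10FFFF + 1
print(code_space)  # 1114112
```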

Unicode Consortium A non-profit organization founded to develop Unicode standards and promote the use of the Unicode standard. For more information, visit the Unicode Consortium's website.

Valid XML An XML document that has an associated document type declaration and complies with the constraints expressed in it. For more information visit the World Wide Web Consortium's pages on XML.

WAV [wav] The wave file format is a subset of Microsoft's RIFF specification for the storage of multimedia files. It has become a standard sound file format for use on IBM-compatible PCs. This format does not compress in any way and there is no loss of information.

Well-formed XML XML that meets the well-formedness constraints specified by the W3C.
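
Well-formedness can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the element names and the helper is_well_formed are made up for the example. Note that this parser checks well-formedness only: it does not validate a document against a document type declaration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<entry><form>casa</form><gloss>house</gloss></entry>"
bad = "<entry><form>casa</gloss></entry>"   # mismatched closing tag

print(is_well_formed(good), is_well_formed(bad))
```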

XML EXtensible Markup Language. It defines a standard way of encoding the structure of information in plain text format. It is an open standard of the World Wide Web Consortium that is based on extensible tags (extensible meaning that they are not pre-programmed, but can be defined by the creator). XML is currently considered best practice for the archival encoding of textual data, because it does not depend upon any particular software, and can be formatted through an XSL Stylesheet to be displayed in almost any format (including html, .txt, .doc). For more information see the E-MELD pages on XML.

XML Tools Software that eases and expands the use of XML.

XSD XML Schema Definition is a language for specifying the grammar of the markup allowed in an XML file. Such a specification is called a schema and typically has a file extension of XSD. It can define, for instance, the ordering of elements, or what child elements a particular element may have. It is important for XML documents to adhere to a particular schema because schemas provide uniformity in the organization of data. This is particularly important if an XSL stylesheet will later be designed to apply to the data. View a sample XSD schema designed for the EMELD project.

XSL EXtensible Stylesheet Language. A language for writing stylesheets that transform XML documents into other formats (for instance, HTML, text, or, by way of FO, PDF). Note that XSL cannot produce a PDF document directly: XSL transforms XML to FO, and an FO processor then transforms the FO to PDF. Using an XSL processor, it is possible to transform an XML document via multiple XSL stylesheets, displaying the document in multiple formats without changing the original XML document.

XLS The file extension for Microsoft Excel spreadsheets. Though spreadsheets were originally developed for manipulating accounting data, they have proven useful to linguists for managing many kinds of tabular data. Not to be confused with XSL.