The same holds true of text corpora, in the sense that the original text usually has an external source, and is considered to be an immutable artifact.
Any transformations of that artifact which involve human judgment — even something as simple as tokenization — are subject to later revision, thus it is important to retain the source material in a form that is as close to the original as possible.
We can also construct special tabulations (known as paradigms) to illustrate contrasts and systematic variation, as shown in 1.3 for three verbs. At the most abstract level, a text is a representation of a real or fictional speech event, and the time-course of that event carries over into the text itself.
A text could be a small unit, such as a word or sentence, or a complete narrative or dialogue.
Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced selection of dialects, speakers, and materials.
For each of eight dialect regions, 50 male and female speakers having a range of ages and educational backgrounds each read ten carefully chosen sentences.
It could also be a phrasal lexicon, where the key field is a phrase rather than a single word.
A thesaurus also consists of record-structured data, where we look up entries via non-key fields that correspond to topics.
As we saw in 2., most lexical resources can be represented using a record structure, i.e. A lexical resource could be a conventional dictionary or comparative wordlist, as illustrated.
Five of the sentences read by each speaker are also read by six other speakers (for comparability).
The remaining three sentences read by each speaker were unique to that speaker (for coverage). You can access its documentation in the usual way, using This gives us a sense of what a speech processing system would have to do in producing or recognizing speech in this particular dialect (New England).
TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name.
It was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems.