Managing Linguistic Data

Structured collections of annotated linguistic data are essential in most areas of NLP. However, we still face many obstacles in using them. This chapter takes up three questions. How do we design a new language resource and ensure that its coverage, balance, and documentation support a wide range of uses? When existing data is in the wrong format for some analysis tool, how can we convert it to a suitable format? What is a good way to document the existence of a resource we have created so that others can easily find it?

Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the lifecycle of a corpus. As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.

1   Corpus Structure: a Case Study

The TIMIT corpus of read speech was the first annotated speech database to be widely distributed, and it has an especially clear organization. TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name.

1   The Structure of TIMIT

Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced selection of dialects, speakers, and materials.