Common Structures for Text Corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization. Some corpora are structured into categories such as genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus).

Unfortunately, for many languages, substantial corpora are not yet available. Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use.

The Brown Corpus contains text from 500 sources, and the sources have been categorized by genre. Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in Section 2, where we also unpick the following code line by line.
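To make the idea of per-genre counts concrete, here is a minimal sketch of what a conditional frequency distribution computes. It is not the chapter's own listing: it uses only the Python standard library, and the two-genre `toy_corpus`, the helper `conditional_counts`, and the short list of modal verbs are made up for illustration. In NLTK itself the same bookkeeping is handled by `nltk.ConditionalFreqDist`, fed with (genre, word) pairs drawn from `brown.words(categories=genre)`.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus (made up for illustration): genre -> word tokens.
# With NLTK installed, the tokens would come from brown.words(categories=genre).
toy_corpus = {
    "news": ["the", "jury", "will", "meet", "and", "may", "vote", "will"],
    "romance": ["she", "could", "not", "say", "what", "she", "could", "do"],
}


def conditional_counts(corpus):
    """Keep one word counter per genre; the genre is the 'condition'."""
    counts = defaultdict(Counter)
    for genre, words in corpus.items():
        for word in words:
            counts[genre][word.lower()] += 1
    return counts


cfd = conditional_counts(toy_corpus)

# Print a small table of counts: one row per genre, one column per word.
genres = ["news", "romance"]
modals = ["can", "could", "may", "will"]
print("genre".ljust(10) + "".join(m.rjust(7) for m in modals))
for genre in genres:
    print(genre.ljust(10) + "".join(str(cfd[genre][m]).rjust(7) for m in modals))
```

Each genre gets its own counter, so a question such as "how often does will occur in news text?" becomes a two-step lookup, `cfd["news"]["will"]`. A word that never occurs in a genre simply reports a count of zero.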

This chapter continues to present programming concepts by example, in the context of a linguistic processing task.

