2. Accessing Text Corpora and Lexical Resources

Practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora. The goal of this chapter is to answer the following questions:

  1. What are some useful text corpora and lexical resources, and how can we access them with Python?
  2. Which Python constructs are most helpful for this work?
  3. How do we avoid repeating ourselves when writing Python code?

This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait until later before exploring each Python construct systematically. Don't worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and — if you're game — modify it by substituting some part of the code with a different text or word. This way you will associate a task with a programming idiom, and learn the hows and whys later.

As just mentioned, a text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres. We examined some small text collections in 1., such as the speeches known as the US Presidential Inaugural Addresses. This particular corpus actually contains dozens of individual texts — one per address — but for convenience we glued them end-to-end and treated them as a single text. 1. also used various pre-defined texts that we accessed by typing from nltk.book import *. However, since we want to be able to work with other texts, this section examines a variety of text corpora. We'll see how to select individual texts, and how to work with them.

1.1 Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Let's pick out the first of these texts — Emma by Jane Austen — and give it a short name, emma, then find out how many words it contains:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427

Note

In 1, we showed how you could carry out concordancing of a text such as text1 with the command text1.concordance(). However, this assumes that you are using one of the nine texts obtained as a result of doing from nltk.book import *. Now that you have started examining data from nltk.corpus, as in the previous example, you have to employ the following pair of statements to perform concordancing and other tasks from 1:

>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")

When we defined emma, we invoked the words() function of the gutenberg object in NLTK's corpus package. But since it is cumbersome to type such long names all the time, Python provides another version of the import statement, as follows:

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

Let's write a short program to display other information about each text, by looping over all the values of fileid corresponding to the gutenberg file identifiers listed earlier and then computing statistics for each text. For a compact output display, we will round each number to the nearest integer, using round().

>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
...
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (In fact, the average word length is really 3 not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
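We can check the effect of those space characters directly by recomputing the ratio with whitespace excluded; here is a minimal sketch for a single, arbitrarily chosen text (output not shown):

>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw('austen-emma.txt')
>>> num_chars = sum(1 for c in raw if not c.isspace())   # whitespace excluded this time
>>> num_words = len(gutenberg.words('austen-emma.txt'))
>>> round(num_chars / num_words, 2)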

The previous example also showed how we can access the "raw" text of the book, not split up into tokens. The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many letters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words:

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> macbeth_sentences
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare',
'1603', ']'], ['Actus', 'Primus', '.'], ...]
>>> macbeth_sentences[1116]
['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';',
'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble']
>>> longest_len = max(len(s) for s in macbeth_sentences)
>>> [s for s in macbeth_sentences if len(s) == longest_len]
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that',
'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The',
'mercilesse', 'Macdonwald', ...]]

Note

Most NLTK corpus readers include a variety of access methods apart from words(), raw(), and sents(). Richer linguistic content is available from some corpora, such as part-of-speech tags, dialogue tags, syntactic trees, and so forth; we will see these in later chapters.
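For instance, the Brown Corpus (introduced in 1.3 below) is distributed with part-of-speech tags, which its reader exposes through a tagged_words() method; here is a quick preview, returned to in later chapters:

>>> from nltk.corpus import brown
>>> brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]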

1.2 Web and Chat Text

Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print(fileid, webtext.raw(fileid)[:65], '...')
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop...
overheard.txt White guy: So, do you have any plans for this evening? Asian girl...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb...

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). The filename contains the date, chatroom, and number of posts; e.g., 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.

>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',',
'I', 'can', 'look', 'in', 'a', 'mirror', '.']

1.3 Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. 1.1 gives an example of each genre (for a complete list, see http://icame.uib.no/brown/bcm-los.html).

Table 1.1:

Example Document for Each Section of the Brown Corpus

ID  | File | Genre           | Description
A16 | ca16 | news            | Chicago Tribune: Society Reportage
B02 | cb02 | editorial       | Christian Science Monitor: Editorials
C17 | cc17 | reviews         | Time Magazine: Reviews
D12 | cd12 | religion        | Underwood: Probing the Ethics of Realtors
E36 | ce36 | hobbies         | Norling: Renting a Car in Europe
F25 | cf25 | lore            | Boroff: Jewish Teenage Culture
G22 | cg22 | belles_lettres  | Reiner: Coping with Runaway Technology
H15 | ch15 | government      | US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17 | cj19 | learned         | Mosteller: Probability with Statistical Applications
K04 | ck04 | fiction         | W.E.B. Du Bois: Worlds of Color
L13 | cl13 | mystery         | Hitchens: Footsteps in the Night
M01 | cm01 | science_fiction | Heinlein: Stranger in a Strange Land
N14 | cn15 | adventure       | Field: Rattlesnake Ridge
P12 | cp12 | romance         | Callaghan: A Passion in Rome
R06 | cr06 | humor           | Thurber: The Future, If Any, of Comedy

We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before doing the following:

>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Note

We need to include end=' ' in order for the print function to put its output on a single line.

Note

Your Turn: Choose a different section of the Brown Corpus, and adapt the previous example to count a selection of wh words, such as what, when, where, who, and why.
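One possible way to start, as a sketch (the humor section and the word list are arbitrary choices; output not shown):

>>> import nltk
>>> from nltk.corpus import brown
>>> humor_text = brown.words(categories='humor')
>>> fdist = nltk.FreqDist(w.lower() for w in humor_text)
>>> for m in ['what', 'when', 'where', 'who', 'why']:
...     print(m + ':', fdist[m], end=' ')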

Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. These are presented systematically in 2, where we also unpick the following code line by line. For the moment, you can ignore the details and just concentrate on the output.

>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                  can  could  may  might  must  will
           news    93     86   66     38    50   389
       religion    82     59   78     12    54    71
        hobbies   268     58  131     22    83   264
science_fiction    16     49    4     12     8    16
        romance    74    193   11     51    45    43
          humor    16     30    8      8     9    13

Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this? The idea that word counts might distinguish genres will be taken up again in chap-data-intensive.

1.4 Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test"; thus, the text with fileid 'test/14826' is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document, as we will see in chap-data-intensive.

>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa',
'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn',
'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]

Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single fileid or a list of fileids.

>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...]
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106',
'test/15287', 'test/15341', 'test/15618', 'test/15648', 'test/15649', ...]

Similarly, we can specify the words or sentences we want in terms of files or categories. The first handful of words in each of these texts are the titles, which by convention are stored as upper case.

>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS',
'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

1.5 Inaugural Address Corpus

In 1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in fig-inaugural used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]

Notice that the year of each text appears in its filename. To get the year out of the filename, we extracted the first four characters, using fileid[:4].

Let's look at how the words America and citizen are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower(), then checks if they start with either of the "targets" america or citizen using startswith(). Thus it will count words like American's and Citizens. We'll learn about conditional frequency distributions in 2; for now just consider the output, shown in 1.1.

>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()


Figure 1.1: Plot of a Conditional Frequency Distribution: all words in the Inaugural Address Corpus that begin with america or citizen are counted; separate counts are kept for each address; these are plotted so that trends in usage over time can be observed; counts are not normalized for document length.

1.6 Annotated Text Corpora

Many text corpora contain linguistic annotations, representing POS tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. 1.2 lists some of the corpora. For information about downloading them, see http://nltk.org/data. For more examples of how to access NLTK corpora, please consult the Corpus HOWTO at http://nltk.org/howto.

Table 1.2:

Some of the Corpora and Corpus Samples Distributed with NLTK: For information about downloading and using them, please consult the NLTK website.

Corpus | Compiler | Contents
Brown Corpus | Francis, Kucera | 15 genres, 1.15M words, tagged, categorized
CESS Treebanks | CLiC-UB | 1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files | Pereira & Warren | World Geographic Database
CMU Pronouncing Dictionary | CMU | 127k entries
CoNLL 2000 Chunking Data | CoNLL | 270k words, tagged and chunked
CoNLL 2002 Named Entity | CoNLL | 700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel) | CoNLL | 150k words, dependency parsed (Basque, Catalan)
Dependency Treebank | Narad | Dependency parsed version of Penn Treebank sample
FrameNet | Fillmore, Baker et al | 10k word senses, 170k manually annotated sentences
Floresta Treebank | Diana Santos et al | 9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists | Various | Lists of cities and countries
Genesis Corpus | Misc web sources | 6 texts, 200k words, 6 languages
Gutenberg (selections) | Hart, Newby, et al | 18 texts, 2M words
Inaugural Address Corpus | CSpan | US Presidential Inaugural Addresses (1789-present)
Indian POS-Tagged Corpus | Kumaran et al | 60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus | NILC, USP, Brazil | 1M words, tagged (Brazilian Portuguese)
Movie Reviews | Pang, Lee | 2k movie reviews with sentiment polarity classification
Names Corpus | Kantrowitz, Ross | 8k male and female names
NIST 1999 Info Extr (selections) | Garofolo | 63k words, newswire and named-entity SGML markup
Nombank | Meyers | 115k propositions, 1400 noun frames
NPS Chat Corpus | Forsyth, Martell | 10k IM chat posts, POS-tagged and dialogue-act tagged
Open Multilingual WordNet | Bond et al | 15 languages, aligned to English WordNet
PP Attachment Corpus | Ratnaparkhi | 28k prepositional phrases, tagged as noun or verb modifiers
Proposition Bank | Palmer | 113k propositions, 3300 verb frames
Question Classification | Li, Roth | 6k questions, categorized
Reuters Corpus | Reuters | 1.3M words, 10k news documents, categorized
Roget's Thesaurus | Project Gutenberg | 200k words, formatted text
RTE Textual Entailment | Dagan et al | 8k sentence pairs, categorized
SEMCOR | Rus, Mihalcea | 880k words, part-of-speech and sense tagged
Senseval 2 Corpus | Pedersen | 600k words, part-of-speech and sense tagged
SentiWordNet | Esuli, Sebastiani | sentiment scores for 145k WordNet synonym sets
Shakespeare texts (selections) | Bosak | 8 books in XML format
State of the Union Corpus | CSPAN | 485k words, formatted text
Stopwords Corpus | Porter et al | 2,400 stopwords for 11 languages
Swadesh Corpus | Wiktionary | comparative wordlists in 24 languages
Switchboard Corpus (selections) | LDC | 36 phonecalls, transcribed, parsed
Univ Decl of Human Rights | United Nations | 480k words, 300+ languages
Penn Treebank (selections) | LDC | 40k words, tagged and parsed
TIMIT Corpus (selections) | NIST/LDC | audio files and transcripts for 16 speakers
VerbNet 2.1 | Palmer et al | 5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus | OpenOffice.org et al | 960k words and 20k affixes for 8 languages
WordNet 3.0 (English) | Miller, Fellbaum | 145k synonym sets

1.7 Corpora in Other Languages

NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora (see 3.3).

>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]
>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]
>>> nltk.corpus.indian.words('hindi.pos')
['पूर्ण', 'प्रतिबंध', 'हटाओ', ':', 'इराक', 'संयुक्त', ...]
>>> nltk.corpus.udhr.fileids()
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1',
'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1',
'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', ...]
>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
['Saben', 'umat', 'manungsa', 'lair', 'kanthi', 'hak', ...]

The last of these corpora, udhr, contains the Universal Declaration of Human Rights in over 300 languages. The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1. Let's use a conditional frequency distribution to examine the differences in word lengths for a selection of languages included in the udhr corpus. The output is shown in 1.2 (run the program yourself to see a color plot). Note that True and False are Python's built-in boolean values.

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)


Figure 1.2: Cumulative Word Length Distributions: Six translations of the Universal Declaration of Human Rights are processed; this graph shows that words having 5 or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.

Note

Your Turn: Pick a language of interest in udhr.fileids(), and define a variable raw_text = udhr.raw(Language-Latin1). Now plot a frequency distribution of the letters of the text using nltk.FreqDist(raw_text).plot().
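For example, taking German (a sketch; any other fileid from udhr.fileids() works the same way):

>>> import nltk
>>> from nltk.corpus import udhr
>>> raw_text = udhr.raw('German_Deutsch-Latin1')
>>> nltk.FreqDist(raw_text).plot()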

Unfortunately, for many languages, substantial corpora are not yet available. Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use. Some languages have no established writing system, or are endangered. (See 7 for suggestions on how to locate language resources.)

1.8 Text Corpus Structure

We have seen a variety of corpus structures so far; these are summarized in 1.3. The simplest kind lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc. Sometimes these categories overlap, notably in the case of topical categories, as a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common example.


Figure 1.3: Common Structures for Text Corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories like genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus).

Table 1.3:

Basic Corpus Functionality defined in NLTK: more documentation can be found using help(nltk.corpus.reader) and by reading the online Corpus HOWTO at http://nltk.org/howto.

Example | Description
fileids() | the files of the corpus
fileids([categories]) | the files of the corpus corresponding to these categories
categories() | the categories of the corpus
categories([fileids]) | the categories of the corpus corresponding to these files
raw() | the raw content of the corpus
raw(fileids=[f1,f2,f3]) | the raw content of the specified files
raw(categories=[c1,c2]) | the raw content of the specified categories
words() | the words of the whole corpus
words(fileids=[f1,f2,f3]) | the words of the specified fileids
words(categories=[c1,c2]) | the words of the specified categories
sents() | the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) | the sentences of the specified fileids
sents(categories=[c1,c2]) | the sentences of the specified categories
abspath(fileid) | the location of the given file on disk
encoding(fileid) | the encoding of the file (if known)
open(fileid) | open a stream for reading the given corpus file
root | the path to the root of the locally installed corpus
readme() | the contents of the README file of the corpus

NLTK's corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora. 1.3 lists functionality provided by the corpus readers. We illustrate the difference between some of the corpus access methods below:

>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]
'The Adventures of B'
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.',
'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster',
'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as',
'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched',
'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]

1.9 Loading your own Corpus

If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /usr/share/dict. Whatever the location, set this to be the value of corpus_root. The second parameter of the PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (see 3.4 for information about regular expressions).

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

As another example, suppose you have your own local copy of Penn Treebank (release 3), in C:\corpora. We can use the BracketParseCorpusReader to access this corpus. We specify the corpus_root to be the location of the parsed Wall Street Journal component of the corpus, and give a file_pattern that matches the files contained within its subfolders (using forward slashes).

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]
['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',
'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',
'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',
'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']

2 Conditional Frequency Distributions

We introduced frequency distributions in 3. We saw that given some list mylist of words or other items, FreqDist(mylist) would compute the number of occurrences of each item in the list. Here we will generalize this idea.

When the texts of a corpus are divided into several categories, by genre, topic, author, etc, we can maintain separate frequency distributions for each category. This will allow us to study systematic differences between the categories. In the previous section we achieved this using NLTK's ConditionalFreqDist data type. A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition". The condition will often be the category of the text. 2.1 depicts a fragment of a conditional frequency distribution having just two conditions, one for news text and one for romance text.


Figure 2.1: Counting Words Appearing in a Text Collection (a conditional frequency distribution)

2.1 Conditions and Events

A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition. So instead of processing a sequence of words, we have to process a sequence of pairs:

>>> text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]

Each pair has the form (condition, event). If we were processing the entire Brown Corpus by genre there would be 15 conditions (one per genre), and 1,161,192 events (one per word).
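We can confirm both figures directly from the corpus reader (the event count is simply the total number of words):

>>> from nltk.corpus import brown
>>> len(brown.categories())
15
>>> len(brown.words())
1161192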

2.2 Counting Words by Genre

In 1 we saw a conditional frequency distribution where the condition was the section of the Brown Corpus, and for each condition we counted words. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.

>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))

Let's break this down, and look at just two genres, news and romance. For each genre, we loop over every word in the genre, producing pairs consisting of the genre and the word:

>>> genre_word = [(genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)]
>>> len(genre_word)
170576

So, as we can see below, pairs at the beginning of the list genre_word will be of the form ('news', word), while those at the end will be of the form ('romance', word).

>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]

We can now use this list of pairs to create a ConditionalFreqDist, and save it in a variable cfd. As usual, we can type the name of the variable to inspect it, and verify it has two conditions:

>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']

Let's access the two conditions, and satisfy ourselves that each is just a frequency distribution:

>>> print(cfd['news'])
<FreqDist with 14394 samples and 100554 outcomes>
>>> print(cfd['romance'])
<FreqDist with 8452 samples and 70022 outcomes>
>>> cfd['romance'].most_common(20)
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502),
('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993),
('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690),
('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
>>> cfd['romance']['could']
193

2.3 Plotting and Tabulating Distributions

Apart from combining two or more frequency distributions, and being easy to initialize,a ConditionalFreqDist provides some useful methods for tabulation and plotting.

The plot in 1.1 was based on a conditional frequency distribution reproduced in the code below. The condition is either of the words america or citizen, and the counts being plotted are the number of times the word occurred in a particular speech. It exploits the fact that the filename for each speech, e.g., 1865-Lincoln.txt, contains the year as the first four characters. This code generates the pair ('america', '1865') for every instance of a word whose lowercased form starts with america — such as Americans — in the file 1865-Lincoln.txt.

>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))

The plot in 1.2 was also based on a conditional frequency distribution, reproduced below. This time, the condition is the name of the language and the counts being plotted are derived from word lengths. It exploits the fact that the filename for each language is the language name followed by '-Latin1' (the character encoding).

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))

In the plot() and tabulate() methods, we can optionally specify which conditions to display with a conditions= parameter. When we omit it, we get all the conditions. Similarly, we can limit the samples to display with a samples= parameter. This makes it possible to load a large quantity of data into a conditional frequency distribution, and then to explore it by plotting or tabulating selected conditions and samples. It also gives us full control over the order of conditions and samples in any displays. For example, we can tabulate the cumulative frequency data just for two languages, and for words less than 10 characters long, as shown below. We interpret the last cell on the top row to mean that 1,638 words of the English text have 9 or fewer letters.

>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...     samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275

Note

Your Turn: Working with the news and romance genres from the Brown Corpus, find out which days of the week are most newsworthy, and which are most romantic. Define a variable called days containing a list of days of the week, i.e. ['Monday', ...]. Now tabulate the counts for these words using cfd.tabulate(samples=days). Now try the same thing using plot in place of tabulate. You may control the output order of days with the help of an extra parameter: samples=['Monday', ...].
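A sketch of one way to set this up (output not shown; the same cfd works with plot()):

>>> import nltk
>>> from nltk.corpus import brown
>>> days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre))
>>> cfd.tabulate(samples=days)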

You may have noticed that the multi-line expressions we have been using with conditional frequency distributions look like list comprehensions, but without the brackets. In general, when we use a list comprehension as a parameter to a function, like set([w.lower() for w in t]), we are permitted to omit the square brackets and just write: set(w.lower() for w in t). (See the discussion of "generator expressions" in 4.2 for more about this.)
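For instance, the two forms produce the same result (here t is the Genesis text, an arbitrary choice):

>>> t = nltk.corpus.genesis.words('english-kjv.txt')
>>> set([w.lower() for w in t]) == set(w.lower() for w in t)
True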

2.4 Generating Random Text with Bigrams

We can use a conditional frequency distribution to create a table of bigrams (word pairs). (We introduced bigrams in 3.) The bigrams() function takes a list of words and builds a list of consecutive word pairs. Remember that, in order to see the result and not a cryptic "generator object", we need to use the list() function:

>>> sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
...     'and', 'the', 'earth', '.']
>>> list(nltk.bigrams(sent))
[('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'),
('created', 'the'), ('the', 'heaven'), ('heaven', 'and'), ('and', 'the'),
('the', 'earth'), ('earth', '.')]

In 2.2, we treat each word as a condition, and for each one we effectively create a frequency distribution over the following words. The function generate_model() contains a simple loop to generate text. When we call the function, we choose a word (such as 'living') as our initial context, then once inside the loop, we print the current value of the variable word, and reset word to be the most likely token in that context (using max()); next time through the loop, we use that word as our new context. As you can see by inspecting the output, this simple approach to text generation tends to get stuck in loops; another method would be to randomly choose the next word from among the available words.

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

>>> cfd['living']
FreqDist({'creature': 7, 'thing': 4, 'substance': 2, ',': 1, '.': 1, 'soul': 1})
>>> generate_model(cfd, 'living')
living creature that he said , and the land of the land of the land

Example 2.2 (code_random_text.py): Figure 2.2: Generating Random Text: this program obtains all bigrams from the text of the book of Genesis, then constructs a conditional frequency distribution to record which words are most likely to follow a given word; e.g., after the word living, the most likely word is creature; the generate_model() function uses this data, and a seed word, to generate random text.
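As suggested above, one way to avoid getting stuck in loops is to pick the next word at random rather than always taking the most likely one. The sketch below assumes the cfd built in 2.2 and uses random.choice over the observed continuations, ignoring their frequencies (so it is cruder than sampling in proportion to the counts):

import random

def generate_random_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        # choose any word observed after the current one, ignoring how often it occurred
        word = random.choice(list(cfdist[word]))

>>> generate_random_model(cfd, 'living')   # output varies from run to run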

Conditional frequency distributions are a useful data structure for many NLP tasks. Their commonly-used methods are summarized in 2.1.

Table 2.1:

NLTK's Conditional Frequency Distributions: commonly-used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

Example | Description
cfdist = ConditionalFreqDist(pairs) | create a conditional frequency distribution from a list of pairs
cfdist.conditions() | the conditions
cfdist[condition] | the frequency distribution for this condition
cfdist[condition][sample] | frequency for the given sample for this condition
cfdist.tabulate() | tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions) | tabulation limited to the specified samples and conditions
cfdist.plot() | graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions) | graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2 | test if samples in cfdist1 occur less frequently than in cfdist2

3 More Python: Reusing Code

By this time you've probably typed and retyped a lot of code in the Python interactive interpreter. If you mess up when retyping a complex example you have to enter it again. Using the arrow keys to access and modify previous commands is helpful but only goes so far. In this section we see two important ways to reuse code: text editors and Python functions.

3.1 Creating Programs with a Text Editor

The Python interactive interpreter performs your instructions as soon as you type them. Often, it is better to compose a multi-line program using a text editor, then ask Python to run the whole program at once. Using IDLE, you can do this by going to the File menu and opening a new window. Try this now, and enter the following one-line program:

print('Monty Python')

Save this program in a file called monty.py, then go to the Run menu, and select the command Run Module. (We'll learn what modules are shortly.) The result in the main IDLE window should look like this:

>>> ================================ RESTART ================================
>>>
Monty Python
>>>

You can also type from monty import * and it will do the same thing.

From now on, you have a choice of using the interactive interpreter or a text editor to create your programs. It is often convenient to test your ideas using the interpreter, revising a line of code until it does what you expect. Once you're ready, you can paste the code (minus any >>> or ... prompts) into the text editor, continue to expand it, and finally save the program in a file so that you don't have to type it in again later. Give the file a short but descriptive name, using all lowercase letters and separating words with underscores, and using the .py filename extension, e.g., monty_python.py.

Note

Important: Our inline code examples include the >>> and ... prompts as if we are interacting directly with the interpreter. As they get more complicated, you should instead type them into the editor, without the prompts, and run them from the editor as shown above. When we provide longer programs in this book, we will leave out the prompts to remind you to type them into a file rather than using the interpreter. You can see this already in 2.2 above. Note that it still includes a couple of lines with the Python prompt; this is the interactive part of the task where you inspect some data and invoke a function. Remember that all code samples like 2.2 are downloadable from http://nltk.org/.

3.2 Functions

Suppose that you work on analyzing text that involves different forms of the same word, and that part of your program needs to work out the plural form of a given singular noun. Suppose it needs to do this work in two places, once when it is processing some texts, and again when it is processing user input.

Rather than repeating the same code several times over, it is more efficient and reliable to localize this work inside a function. A function is just a named block of code that performs some well-defined task, as we saw in 1. A function is usually defined to take some inputs, using special variables known as parameters, and it may produce a result, also known as a return value. We define a function using the keyword def followed by the function name and any input parameters, followed by the body of the function. Here's the function we saw in 1 (including the import statement that is needed for Python 2, in order to make division behave as expected):

>>> from __future__ import division
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))

We use the keyword return to indicate the value that is produced as output by the function. In the above example, all the work of the function is done in the return statement. Here's an equivalent definition which does the same work using multiple lines of code. We'll change the parameter name from text to my_text_data to remind you that this is an arbitrary choice:

>>> def lexical_diversity(my_text_data):
...     word_count = len(my_text_data)
...     vocab_size = len(set(my_text_data))
...     diversity_score = vocab_size / word_count
...     return diversity_score

Notice that we've created some new variables inside the body of the function. These are local variables and are not accessible outside the function. So now we have defined a function with the name lexical_diversity. But just defining it won't produce any output! Functions do nothing until they are "called" (or "invoked"):

>>> from nltk.corpus import genesis
>>> kjv = genesis.words('english-kjv.txt')
>>> lexical_diversity(kjv)
0.06230453042623537
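To confirm that the variables created inside the function really are local, we can try to access one of them after the call; this fails (a quick check):

>>> word_count
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'word_count' is not defined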

Let's return to our earlier scenario, and actually define a simple function to work out English plurals. The function plural() in 3.1 takes a singular noun and generates a plural form, though it is not always correct. (We'll discuss functions at greater length in 4.4.)

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'

Example 3.1 (code_plural.py): Figure 3.1: A Python Function: this function tries to work out the plural form of any English noun; the keyword def (define) is followed by the function name, then a parameter inside parentheses, and a colon; the body of the function is the indented block of code; it tries to recognize patterns within the word and process the word accordingly; e.g., if the word ends with y, delete the y and add ies.

The endswith() function is always associated with a string object (e.g., word in 3.1). To call such functions, we give the name of the object, a period, and then the name of the function. These functions are usually known as methods.

3.3 Modules

Over time you will find that you create a variety of useful little text processing functions, and you end up copying them from old programs to new ones. Which file contains the latest version of the function you want to use? It makes life a lot easier if you can collect your work into a single place, and access previously defined functions without making copies.

To do this, save your function(s) in a file called (say) text_proc.py. Now, you can access your work simply by importing it from the file:

>>> from text_proc import plural
>>> plural('wish')
wishes
>>> plural('fan')
fen

Our plural function obviously has an error, since the plural of fan is fans. Instead of typing in a new version of the function, we can simply edit the existing one. Thus, at every stage, there is only one version of our plural function, and no confusion about which one is being used.
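One possible edit, shown here as a sketch, is to narrow the -an rule so that it only fires for man-type nouns, letting fan fall through to the default case (this is still far from a full treatment of English plurals, e.g. it gets human wrong):

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('man'):          # narrower test than the original 'an' rule
        return word[:-2] + 'en'
    else:
        return word + 's'

>>> plural('fan')
'fans'
>>> plural('woman')
'women'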

A collection of variable and function definitions in a file is called a Python module. A collection of related modules is called a package. NLTK's code for processing the Brown Corpus is an example of a module, and its collection of code for processing all the different corpora is an example of a package. NLTK itself is a set of packages, sometimes called a library.

Caution!

If you are creating a file to contain some of your Python code, do not name your file nltk.py: it may get imported in place of the "real" NLTK package. When it imports modules, Python first looks in the current directory (folder).

4 Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, if we have defined a text my_text, then vocab = sorted(set(my_text)) builds the vocabulary of my_text, while word_freq = FreqDist(my_text) counts the frequency of each word in the text. Both vocab and word_freq are simple lexical resources. Similarly, a concordance like the one we saw in 1 gives us information about word usage that might help in the preparation of a dictionary. Standard terminology for lexicons is illustrated in 4.1. A lexical entry consists of a headword (also known as a lemma) along with additional information such as the part of speech and the sense definition. Two distinct words having the same spelling are called homonyms.


Figure 4.1: Lexicon Terminology: lexical entries for two lemmas having the same spelling (homonyms), providing part of speech and gloss information.

The simplest kind of lexicon is nothing more than a sorted list of words. Sophisticated lexicons include complex structure within and across the individual entries. In this section we'll look at some lexical resources included with NLTK.

4.1 Wordlist Corpora

NLTK includes some corpora that are nothing more than wordlists. The Words Corpus is the /usr/share/dict/words file from Unix, used by some spell checkers. We can use it to find unusual or mis-spelt words in a text corpus, as shown in 4.2.

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses',
'accents', 'accepting', 'accommodations', 'accompanied', 'accounted', 'accounts',
'accustomary', 'aches', 'acknowledging', 'acknowledgment', 'acknowledgments', ...]
>>> unusual_words(nltk.corpus.nps_chat.words())
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack',
'acros', 'actualy', 'adams', 'adds', 'adduser', 'adjusts', 'adoted', 'adreniline',
'ads', 'adults', 'afe', 'affairs', 'affari', 'affects', 'afk', 'agaibn', 'ages', ...]

Example 4.2 (code_unusual.py): Figure 4.2: Filtering a Text: this program computes the vocabulary of a text, then removes all items that occur in an existing wordlist, leaving just the uncommon or mis-spelt words.

There is also a corpus of stopwords, that is, high-frequency words like the, to and also that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

Let's define a function to compute what fraction of words in a text are not in the stopwords list:

>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
...
>>> content_fraction(nltk.corpus.reuters.words())
0.7364374824583169

Thus, with the help of stopwords we filter out over a quarter of the words of the text. Notice that we've combined two different kinds of corpus here, using a lexical resource to filter the content of a text corpus.


Figure 4.3: A Word Puzzle: a grid of randomly chosen letters with rules for creating words out of the letters; this puzzle is known as "Target."

A wordlist is useful for solving word puzzles, such as the one in 4.3. Our program iterates through every word and, for each one, checks whether it meets the conditions. It is easy to check obligatory letter and length constraints (and we'll only look for words with six or more letters here). It is trickier to check that candidate solutions only use combinations of the supplied letters, especially since some of the supplied letters appear twice (here, the letter v). The FreqDist comparison method permits us to check that the frequency of each letter in the candidate word is less than or equal to the frequency of the corresponding letter in the puzzle.

>>> puzzle_letters = nltk.FreqDist('egivrvonl')
>>> obligatory = 'r'
>>> wordlist = nltk.corpus.words.words()
>>> [w for w in wordlist if len(w) >= 6
...     and obligatory in w
...     and nltk.FreqDist(w) <= puzzle_letters]
['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor',
'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi',
'revolving', 'ringle', 'roving', 'violer', 'virole']

One more wordlist corpus is the Names corpus, containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find names which appear in both files, i.e. names that are ambiguous for gender:

>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> male_names = names.words('male.txt')
>>> female_names = names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]

It is well known that names ending in the letter a are almost always female. We can see this and some other patterns in the graph in 4.4, produced by the following code. Remember that name[-1] is the last letter of name.

>>> cfd = nltk.ConditionalFreqDist(
...     (fileid, name[-1])
...     for fileid in names.fileids()
...     for name in names.words(fileid))
>>> cfd.plot()


Figure 4.4: Conditional Frequency Distribution: this plot shows the number of female and male names ending with each letter of the alphabet; most names ending with a, e or i are female; names ending in h and l are equally likely to be male or female; names ending in k, o, r, s, and t are likely to be male.

4.2 A Pronouncing Dictionary

A slightly richer kind of lexical resource is a table (or spreadsheet), containing a word plus some properties in each row. NLTK includes the CMU Pronouncing Dictionary for US English, which was designed for use by speech synthesizers.

>>> entries = nltk.corpus.cmudict.entries()
>>> len(entries)
133737
>>> for entry in entries[42371:42379]:
...     print(entry)
...
('fir', ['F', 'ER1'])
('fire', ['F', 'AY1', 'ER0'])
('fire', ['F', 'AY1', 'R'])
('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])
('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z'])
('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z'])
('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L'])

For each word, this lexicon provides a list of phonetic codes — distinct labels for each contrastive sound — known as phones. Observe that fire has two pronunciations (in US English): the one-syllable F AY1 R, and the two-syllable F AY1 ER0. The symbols in the CMU Pronouncing Dictionary are from the Arpabet, described in more detail at http://en.wikipedia.org/wiki/Arpabet.

Each entry consists of two parts, and we can process these individually using a more complex version of the for statement. Instead of writing for entry in entries:, we replace entry with two variable names, word, pron. Now, each time through the loop, word is assigned the first part of the entry, and pron is assigned the second part of the entry:

>>> for word, pron in entries:
...     if len(pron) == 3:
...         ph1, ph2, ph3 = pron
...         if ph1 == 'P' and ph3 == 'T':
...             print(word, ph2, end=' ')
...
pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1 pert ER1
pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1 pot AA1 pote OW1
pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1

The above program scans the lexicon looking for entries whose pronunciation consists of three phones. If the condition is true, it assigns the contents of pron to three new variables ph1, ph2 and ph3. Notice the unusual form of the statement which does that work.

Here's another example of the same for statement, this time used inside a list comprehension. This program finds all words whose pronunciation ends with a syllable sounding like nicks. You could use this method to find rhyming words.

>>> syllable = ['N', 'IH0', 'K', 'S']
>>> [word for word, pron in entries if pron[-4:] == syllable]
["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics',
'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics',
'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", ...]

Notice that the one pronunciation is spelt in several ways: nics, niks, nix, even ntic's with a silent t, for the word atlantic's. Let's look for some other mismatches between pronunciation and writing. Can you summarize the purpose of the following examples and explain how they work?

>>> [w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']
['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn']
>>> sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))
['gn', 'kn', 'mn', 'pn']

The phones contain digits to represent primary stress (1), secondary stress (2) and no stress (0). As our final example, we define a function to extract the stress digits and then scan our lexicon to find words having a particular stress pattern.

>>> def stress(pron):
...     return [char for phone in pron for char in phone if char.isdigit()]
>>> [w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']]
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating',
'accelerator', 'accelerators', 'accentuated', 'accentuating', 'accommodated',
'accommodating', 'accommodative', 'accumulated', 'accumulating', 'accumulative', ...]
>>> [w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']]
['abbreviation', 'abbreviations', 'abomination', 'abortifacient', 'abortifacients',
'academicians', 'accommodation', 'accommodations', 'accreditation', 'accreditations',
'accumulation', 'accumulations', 'acetylcholine', 'acetylcholine', 'adjudication', ...]

Note

A subtlety of the above program is that our user-defined function stress() is invoked inside the condition of a list comprehension. There is also a doubly-nested for loop. There's a lot going on here and you might want to return to this once you've had more experience using list comprehensions.

We can use a conditional frequency distribution to help us find minimally-contrasting sets of words. Here we find all the p-words consisting of three sounds, and group them according to their first and last sounds.

>>> p3 = [(pron[0]+'-'+pron[2], word)
...       for (word, pron) in entries
...       if pron[0] == 'P' and len(pron) == 3]
>>> cfd = nltk.ConditionalFreqDist(p3)
>>> for template in sorted(cfd.conditions()):
...     if len(cfd[template]) > 10:
...         words = sorted(cfd[template])
...         wordstring = ' '.join(words)
...         print(template, wordstring[:70] + "...")
...
P-CH patch pautsch peach perch petsch petsche piche piech pietsch pitch pit...
P-K pac pack paek paik pak pake paque peak peake pech peck peek perc perk ...
P-L pahl pail paille pal pale pall paul paule paull peal peale pearl pearl...
P-N paign pain paine pan pane pawn payne peine pen penh penn pin pine pinn...
P-P paap paape pap pape papp paup peep pep pip pipe pipp poop pop pope pop...
P-R paar pair par pare parr pear peer pier poor poore por pore porr pour...
P-S pace pass pasts peace pearse pease perce pers perse pesce piece piss p...
P-T pait pat pate patt peart peat peet peete pert pet pete pett piet piett...
P-UW1 peru peugh pew plew plue prew pru prue prugh pshew pugh...

Rather than iterating over the whole dictionary, we can also access it by looking up particular words. We will use Python's dictionary data structure, which we will study systematically in 3. We look up a dictionary by giving its name followed by a key (such as the word 'fire') inside square brackets.

>>> prondict = nltk.corpus.cmudict.dict()
>>> prondict['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]
>>> prondict['blog']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'blog'
>>> prondict['blog'] = [['B', 'L', 'AA1', 'G']]
>>> prondict['blog']
[['B', 'L', 'AA1', 'G']]

If we try to look up a non-existent key, we get a KeyError. This is similar to what happens when we index a list with an integer that is too large, producing an IndexError. The word blog is missing from the pronouncing dictionary, so we tweak our version by assigning a value for this key (this has no effect on the NLTK corpus; next time we access it, blog will still be absent).
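
If we would rather avoid the KeyError altogether, we can test for membership first, or use the dictionary's get() method, which returns a default value instead of raising an exception. A minimal sketch, assuming prondict is loaded as above:

import nltk

prondict = nltk.corpus.cmudict.dict()    # as above
word = 'blog'
if word in prondict:                     # membership test avoids the exception
    print(prondict[word])
pron = prondict.get(word, [])            # get() supplies a default for missing keys
print(pron)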

We can use any lexical resource to process a text, e.g., to filter out words having some lexical property (like nouns), or to map every word of the text. For example, the following text-to-speech function looks up each word of the text in the pronunciation dictionary.

>>> text = ['natural', 'language', 'processing']
>>> [ph for w in text for ph in prondict[w][0]]
['N', 'AE1', 'CH', 'ER0', 'AH0', 'L', 'L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH',
'P', 'R', 'AA1', 'S', 'EH0', 'S', 'IH0', 'NG']
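
The list comprehension above will fail with a KeyError on any word that is missing from the dictionary. Here is a sketch of a more forgiving version that simply skips unknown words; the function name text2phones is our own, not part of NLTK, and we assume the dictionary is keyed on lowercase words as in the examples above.

import nltk

def text2phones(words, prondict):
    """Map words to phones, skipping words missing from the dictionary."""
    phones = []
    for w in words:
        prons = prondict.get(w.lower(), [])   # missing words yield an empty list
        if prons:
            phones.extend(prons[0])           # use the first listed pronunciation
    return phones

prondict = nltk.corpus.cmudict.dict()
print(text2phones(['natural', 'language', 'blogging'], prondict))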

4.3 Comparative Wordlists

Another example of a tabular lexicon is the comparative wordlist. NLTK includes so-called Swadesh wordlists, lists of about 200 common words in several languages. The languages are identified using an ISO 639 two-letter code.

>>> from nltk.corpus import swadesh
>>> swadesh.fileids()
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk',
'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
>>> swadesh.words('en')
['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that',
'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some',
'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', ...]

We can access cognate words from multiple languages using the entries() method, specifying a list of languages. With one further step we can convert this into a simple dictionary (we'll learn about dict() in 3).

>>> fr2en = swadesh.entries(['fr', 'en'])
>>> fr2en
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ...]
>>> translate = dict(fr2en)
>>> translate['chien']
'dog'
>>> translate['jeter']
'throw'

We can make our simple translator more useful by adding other source languages. Let's get the German-English and Spanish-English pairs, convert each to a dictionary using dict(), then update our original translate dictionary with these additional mappings:

>>> de2en = swadesh.entries(['de', 'en'])    # German-English
>>> es2en = swadesh.entries(['es', 'en'])    # Spanish-English
>>> translate.update(dict(de2en))
>>> translate.update(dict(es2en))
>>> translate['Hund']
'dog'
>>> translate['perro']
'dog'

We can compare words in various Germanic and Romance languages:

>>> languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
>>> for i in [139, 140, 141, 142]:
...     print(swadesh.entries(languages)[i])
...
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere')
('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere')
('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere')
('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare')

4.4 Shoebox and Toolbox Lexicons

Perhaps the single most popular tool used by linguists for managing data is Toolbox, previously known as Shoebox since it replaces the field linguist's traditional shoebox full of file cards. Toolbox is freely downloadable from http://www.sil.org/computing/toolbox/.

A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields. Most fields are optional or repeatable, which means that this kind of lexical resource cannot be treated as a table or spreadsheet.

Here is a dictionary for the Rotokas language. We see just the first entry, for the word kaa meaning "to gag":

>>> from nltk.corpus import toolbox
>>> toolbox.entries('rotokas.dic')
[('kaa', [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'), ('tkp', 'nek i pas'),
('dcsv', 'true'), ('vx', '1'), ('sc', '???'), ('dt', '29/Oct/2005'),
('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),
('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
('xe', 'Apoka is gagging from food while talking.')]), ...]

Entries consist of a series of attribute-value pairs, like ('ps', 'V') to indicate that the part-of-speech is 'V' (verb), and ('ge', 'gag') to indicate that the gloss-into-English is 'gag'. The last three pairs contain an example sentence in Rotokas and its translations into Tok Pisin and English.
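
Even though the file is not tabular, we can still pull out particular fields by iterating over the attribute-value pairs. The following sketch collects each headword together with its English gloss (the 'ge' field); the variable names here are our own.

from nltk.corpus import toolbox

glosses = []
for headword, fields in toolbox.entries('rotokas.dic'):
    for marker, value in fields:
        if marker == 'ge':                    # the gloss-into-English field
            glosses.append((headword, value))
print(glosses[:5])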

The loose structure of Toolbox files makes it hard for us to do much more with them at this stage. XML provides a powerful way to process this kind of corpus and we will return to this topic in 11.

Note

The Rotokas language is spoken on the island of Bougainville, Papua New Guinea. This lexicon was contributed to NLTK by Stuart Robinson. Rotokas is notable for having an inventory of just 12 phonemes (contrastive sounds); see http://en.wikipedia.org/wiki/Rotokas_language.

5 WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We'll begin by looking at synonyms and how they are accessed in WordNet.

5.1 Senses and Synonyms

Consider the sentence in (1a). If we replace the word motorcar in (1a) by automobile, to get (1b), the meaning of the sentence stays pretty much the same:

(1)
a. Benz is credited with the invention of the motorcar.
b. Benz is credited with the invention of the automobile.

Since everything else in the sentence has remained unchanged, we can conclude that the words motorcar and automobile have the same meaning, i.e. they are synonyms. We can explore these words with the help of WordNet:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]

Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. The entity car.n.01 is called a synset, or "synonym set", a collection of synonymous words (or "lemmas"):

>>> wn.synset('car.n.01').lemma_names()
['car', 'auto', 'automobile', 'machine', 'motorcar']

Each word of a synset can have several meanings, e.g., car can also signify a train carriage, a gondola, or an elevator car. However, we are only interested in the single meaning that is common to all words of the above synset. Synsets also come with a prose definition and some example sentences:

>>> wn.synset('car.n.01').definition()
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> wn.synset('car.n.01').examples()
['he needs a car to get to work']

Although definitions help humans to understand the intended meaning of a synset, the words of the synset are often more useful for our programs. To eliminate ambiguity, we will identify these words as car.n.01.automobile, car.n.01.motorcar, and so on. This pairing of a synset with a word is called a lemma. We can get all the lemmas for a given synset, look up a particular lemma, get the synset corresponding to a lemma, and get the "name" of a lemma:

>>> wn.synset('car.n.01').lemmas()
[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'),
Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]
>>> wn.lemma('car.n.01.automobile')
Lemma('car.n.01.automobile')
>>> wn.lemma('car.n.01.automobile').synset()
Synset('car.n.01')
>>> wn.lemma('car.n.01.automobile').name()
'automobile'

Unlike the word motorcar, which is unambiguous and has one synset, the word car is ambiguous, having five synsets:

>>> wn.synsets('car')
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'),
Synset('cable_car.n.01')]
>>> for synset in wn.synsets('car'):
...     print(synset.lemma_names())
...
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']

For convenience, we can access all the lemmas involving the word car as follows.

>>> wn.lemmas('car')
[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'),
Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

Note

Your Turn: Write down all the senses of the word dish that you can think of. Now, explore this word with the help of WordNet, using the same operations we used above.
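
One way to start, sketched below, is to print the name, definition and lemmas of each synset for dish (output not shown here):

from nltk.corpus import wordnet as wn

for synset in wn.synsets('dish'):
    print(synset.name(), '-', synset.definition())   # e.g. dish.n.01 and its gloss
    print('   lemmas:', synset.lemma_names())        # the words in this synset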

5.2 The WordNet Hierarchy

WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event — these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific. A small portion of a concept hierarchy is illustrated in 5.1.


Figure 5.1: Fragment of WordNet Concept Hierarchy: nodes correspond to synsets; edges indicate the hypernym/hyponym relation, i.e. the relation between superordinate and subordinate concepts.

WordNet makes it easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific; the (immediate) hyponyms.

>>> motorcar = wn.synset('car.n.01')
>>> types_of_motorcar = motorcar.hyponyms()
>>> types_of_motorcar[0]
Synset('ambulance.n.01')
>>> sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon',
'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible',
'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car',
'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap',
'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover',
'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car',
'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer',
'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan',
'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car',
'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car',
'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon',
'wagon']

We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified as both a vehicle and a container.

>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> len(paths)
2
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',
'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01',
'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',
'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01',
'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

We can get the most general hypernyms (or root hypernyms) of a synset as follows:

>>> motorcar.root_hypernyms()
[Synset('entity.n.01')]

Note

Your Turn: Try out NLTK's convenient graphical WordNet browser: nltk.app.wordnet(). Explore the WordNet hierarchy by following the hypernym and hyponym links.
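
The same kind of exploration can be done programmatically. The sketch below climbs the hierarchy from car.n.01 by repeatedly following the first hypernym link until it reaches a root synset:

from nltk.corpus import wordnet as wn

synset = wn.synset('car.n.01')
while synset.hypernyms():
    synset = synset.hypernyms()[0]    # follow the first hypernym link upwards
    print(synset.name())              # ends at a root such as entity.n.01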

5.3 More Lexical Relations

Hypernyms and hyponyms are called lexical relations because they relate one synset to another. These two relations navigate up and down the "is-a" hierarchy. Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). For example, the parts of a tree are its trunk, crown, and so on; these are the part_meronyms(). The substance a tree is made of includes heartwood and sapwood; the substance_meronyms(). A collection of trees forms a forest; the member_holonyms():

>>> wn.synset('tree.n.01').part_meronyms()
[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'),
Synset('stump.n.01'), Synset('trunk.n.01')]
>>> wn.synset('tree.n.01').substance_meronyms()
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
>>> wn.synset('tree.n.01').member_holonyms()
[Synset('forest.n.01')]

To see just how intricate things can get, consider the word mint, which has several closely-related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.

>>> for synset in wn.synsets('mint', wn.NOUN):
...     print(synset.name() + ':', synset.definition())
...
batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government
>>> wn.synset('mint.n.04').part_holonyms()
[Synset('mint.n.02')]
>>> wn.synset('mint.n.04').substance_holonyms()
[Synset('mint.n.05')]

There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:

>>> wn.synset('walk.v.01').entailments()
[Synset('step.v.01')]
>>> wn.synset('eat.v.01').entailments()
[Synset('chew.v.01'), Synset('swallow.v.01')]
>>> wn.synset('tease.v.03').entailments()
[Synset('arouse.v.07'), Synset('disappoint.v.01')]

Some lexical relationships hold between lemmas, e.g., antonymy:

>>> wn.lemma('supply.n.02.supply').antonyms()
[Lemma('demand.n.02.demand')]
>>> wn.lemma('rush.v.01.rush').antonyms()
[Lemma('linger.v.04.linger')]
>>> wn.lemma('horizontal.a.01.horizontal').antonyms()
[Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')]
>>> wn.lemma('staccato.r.01.staccato').antonyms()
[Lemma('legato.r.01.legato')]

You can see the lexical relations, and the other methods defined on a synset, using dir(), for example: dir(wn.synset('harmony.n.02')).
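
Since dir() also reports many underscore-prefixed internals, a small list comprehension makes the output easier to scan; a minimal sketch:

from nltk.corpus import wordnet as wn

# Show only the public methods and attributes of a synset,
# hiding the underscore-prefixed names that dir() also lists.
harmony = wn.synset('harmony.n.02')
print([name for name in dir(harmony) if not name.startswith('_')])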

5.4 Semantic Similarity

We have seen that synsets are linked by a complex network of lexical relations. Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term like vehicle will match documents containing specific terms like limousine.
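
As a quick illustration of the idea, we can test whether one synset lies below another in the hierarchy by collecting everything on its hypernym paths; a sketch, in which the variable names are our own:

from nltk.corpus import wordnet as wn

limo = wn.synset('limousine.n.01')
vehicle = wn.synset('vehicle.n.01')
# Gather every synset on any hypernym path from limousine up to the root.
ancestors = set(s for path in limo.hypernym_paths() for s in path)
print(vehicle in ancestors)    # True if limousine falls under vehicle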

Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common (cf. 5.1). If two synsets share a very specific hypernym — one that is low down in the hypernym hierarchy — they must be closely related.

>>> right = wn.synset('right_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]

Of course we know that whale is very specific (and baleen whale even more so), while vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:

>>> wn.synset('baleen_whale.n.01').min_depth()
14
>>> wn.synset('whale.n.02').min_depth()
13
>>> wn.synset('vertebrate.n.01').min_depth()
8
>>> wn.synset('entity.n.01').min_depth()
0

Similarity measures have been defined over the collection of WordNet synsets which incorporate the above insight. For example, path_similarity assigns a score in the range 0 to 1 based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found). Comparing a synset with itself will return 1. Consider the following similarity scores, relating right whale to minke whale, orca, tortoise, and novel. Although the numbers won't mean much, they decrease as we move away from the semantic space of sea creatures to inanimate objects.

>>> right.path_similarity(minke)
0.25
>>> right.path_similarity(orca)
0.16666666666666666
>>> right.path_similarity(tortoise)
0.07692307692307693
>>> right.path_similarity(novel)
0.043478260869565216

Note

Several other similarity measures are available; you can type help(wn) for more information. NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed with nltk.corpus.verbnet.
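
For instance, the Wu-Palmer and Leacock-Chodorow measures are also defined on synsets, although they use different scales from path_similarity; a small sketch, recreating two of the synsets used above:

from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
print(right.wup_similarity(minke))    # Wu-Palmer similarity
print(right.lch_similarity(minke))    # Leacock-Chodorow similarity (same part of speech required)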

  • A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
  • Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
  • A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
  • Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
  • Python functions permit you to associate a name with a particular block of code, and re-use that code as often as necessary.
  • Some functions, known as "methods", are associated with an object and we give the object name followed by a period followed by the function, like this: x.funct(y), e.g., word.isalpha().
  • To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
  • WordNet is a semantically-oriented dictionary of English, consisting of synonym sets — or synsets — and organized into a network.
  • Some functions are not available by default, but must be accessed using Python's import statement.

Extra materials for this chapter are posted at http://nltk.org/, including links to freely available resources on the web. The corpus methods are summarized in the Corpus HOWTO, at http://nltk.org/howto, and documented extensively in the online API documentation.

Significant sources of published corpora are the Linguistic Data Consortium (LDC) and the European Language Resources Agency (ELRA). Hundreds of annotated text and speech corpora are available in dozens of languages. Non-commercial licences permit the data to be used in teaching and research. For some corpora, commercial licenses are also available (but for a higher fee).

A good tool for creating annotated text corpora is called Brat, available from http://brat.nlplab.org/.

These and many other language resources have been documented using OLAC Metadata, and can be searched via the OLAC homepage at http://www.language-archives.org/. Corpora List is a mailing list for discussions about corpora, and you can find resources by searching the list archives or posting to the list. The most complete inventory of the world's languages is Ethnologue, http://www.ethnologue.com/. Of 7,000 languages, only a few dozen have substantial digital resources suitable for use in NLP.

This chapter has touched on the field of Corpus Linguistics. Other useful books in this area include (McEnery, 2006) and (Meyer, 2002). Further readings in quantitative data analysis in linguistics are (Baayen, 2008) and (Gries, 2009).

The original description of WordNet is (Fellbaum, 1998). Although WordNet was originally developed for research in psycholinguistics, it is now widely used in NLP and Information Retrieval. WordNets are being developed for many other languages, as documented at http://www.globalwordnet.org/.

Other topics touched on in this chapter were phonetics and lexical semantics, and we refer readers to chapters 7 and 20 of the standard reference cited there.

  1. ☼ Create a variable phrase containing a list of words. Review the operations described in the previous chapter, including addition, multiplication, indexing, slicing, and sorting.
  2. ☼ Use the corpus module to explore austen-persuasion.txt. How many word tokens does this book have? How many word types?
  3. ☼ Use the Brown corpus reader nltk.corpus.brown.words() or the Web text corpus reader nltk.corpus.webtext.words() to access some sample text in two different genres.
  4. ☼ Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?
  5. ☼ Investigate the holonym-meronym relations for some nouns. Remember that there are three kinds of holonym-meronym relation, so you need to use: member_meronyms(), part_meronyms(), substance_meronyms(), member_holonyms(), part_holonyms(), and substance_holonyms().
  6. ☼ In the discussion of comparative wordlists, we created an object called translate which you could look up using words in both German and Spanish in order to get corresponding words in English. What problem might arise with this approach? Can you suggest a way to avoid this problem?
  7. ☼ According to Strunk and White's Elements of Style, the word however, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. (http://www.bartleby.com/141/strunk3.html) Use the concordance tool to study actual usage of this word in the various texts we have been considering. See also the LanguageLog posting "Fossilized prejudices about 'however'" at http://itre.cis.upenn.edu/~myl/languagelog/archives/001913.html
  8. ◑ Define a conditional frequency distribution over the Names corpus that allows you to see which initial letters are more frequent for males vs. females (cf. 4.4).
  9. ◑ Pick a pair of texts and study the differences between them, in terms of vocabulary, vocabulary richness, genre, etc. Can you find pairs of words which have quite different meanings across the two texts, such as monstrous in Moby Dick and in Sense and Sensibility?
  10. ◑ Read the BBC News article: UK's Vicky Pollards 'left behind' http://news.bbc.co.uk/1/hi/education/6173441.stm. The article gives the following statistic about teen language: "the top 20 words used, including yeah, no, but and like, account for around a third of all words." How many word types account for a third of all word tokens, for a variety of text sources? What do you conclude about this statistic? Read more about this on LanguageLog, at http://itre.cis.upenn.edu/~myl/languagelog/archives/003993.html.
  11. ◑ Investigate the table of modal distributions and look for other patterns. Try to explain them in terms of your own impressionistic understanding of the different genres. Can you find other closed classes of words that exhibit significant differences across different genres?
  12. ◑ The CMU Pronouncing Dictionary contains multiple pronunciations for certain words. How many distinct words does it contain? What fraction of words in this dictionary have more than one possible pronunciation?
  13. ◑ What percentage of noun synsets have no hyponyms? You can get all noun synsets using wn.all_synsets('n').
  14. ◑ Define a function supergloss(s) that takes a synset s as its argument and returns a string consisting of the concatenation of the definition of s, and the definitions of all the hypernyms and hyponyms of s.
  15. ◑ Write a program to find all words that occur at least three times in the Brown Corpus.
  16. ◑ Write a program to generate a table of lexical diversity scores (i.e. token/type ratios), as we saw in 1.1. Include the full set of Brown Corpus genres (nltk.corpus.brown.categories()). Which genre has the lowest diversity (greatest number of tokens per type)? Is this what you would have expected?
  17. ◑ Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.
  18. ◑ Write a program to print the 50 most frequent bigrams (pairs of adjacent words) of a text, omitting bigrams that contain stopwords.
  19. ◑ Write a program to create a table of word frequencies by genre, like the one given in 1 for modals. Choose your own words and try to find words whose presence (or absence) is typical of a genre. Discuss your findings.
  20. ◑ Write a function word_freq() that takes a word and the name of a section of the Brown Corpus as arguments, and computes the frequency of the word in that section of the corpus.
  21. ◑ Write a program to guess the number of syllables contained in a text, making use of the CMU Pronouncing Dictionary.
  22. ◑ Define a function hedge(text) which processes a text and produces a new version with the word 'like' between every third word.
  23. Zipf's Law: Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
    1. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf's law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
    2. Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's Law in the light of this?
  24. ★ Modify the text generation program in 2.2 further, to do the following tasks:
    1. Store the n most likely words in a list words, then randomly choose a word from the list using random.choice(). (You will need to import random first.)
    2. Select a particular genre, such as a section of the Brown Corpus, or a genesis translation, one of the Gutenberg texts, or one of the Web texts. Train the model on this corpus and get it to generate random text. You may have to experiment with different start words. How intelligible is the text? Discuss the strengths and weaknesses of this method of generating random text.
    3. Now train your system using two distinct genres and experiment with generating text in the hybrid genre. Discuss your observations.
  25. ★ Define a function find_language() that takes a string as its argument, and returns a list of languages that have that string as a word. Use the udhr corpus and limit your searches to files in the Latin-1 encoding.
  26. ★ What is the branching factor of the noun hypernym hierarchy? I.e. for every noun synset that has hyponyms — or children in the hypernym hierarchy — how many do they have on average? You can get all noun synsets using wn.all_synsets('n').
  27. ★ The polysemy of a word is the number of senses it has. Using WordNet, we can determine that the noun dog has 7 senses with: len(wn.synsets('dog', 'n')). Compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet.
  28. ★ Use one of the predefined similarity measures to score the similarity of each of the following pairs of words. Rank the pairs in order of decreasing similarity. How close is your ranking to the order given here, an order that was established experimentally: car-automobile, gem-jewel, journey-voyage, boy-lad, coast-shore, asylum-madhouse, magician-wizard, midday-noon, furnace-stove, food-fruit, bird-cock, bird-crane, tool-implement, brother-monk, lad-brother, crane-implement, journey-car, monk-oracle, cemetery-woodland, food-rooster, coast-hill, forest-graveyard, shore-woodland, monk-slave, coast-forest, lad-wizard, chord-smile, glass-magician, rooster-voyage, noon-string.

About this document...

UPDATED FOR NLTK 3.0. This is a chapter from Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2019 the authors. It is distributed with the Natural Language Toolkit [http://nltk.org/], Version 3.0, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License [http://creativecommons.org/licenses/by-nc-nd/3.0/us/].

This document was built on Wed 4 Sep 2019 11:40:48 ACST
