ZipfExplorer

This tool lets you compare the frequencies of shared word types in different texts or corpora.

Select texts or corpora to explore with the drop-down menus. The plot x-axes show word frequency ranks, while the y-axes show the relative frequency per 10k words. The circles represent individual word types. Hovering over a word shows its rank, frequency, relative frequency (per 10k words), a log-likelihood measure (Dunning's G) compared to the other text, and the log-likelihood p-value. Use the plot tools (above the second plot) to drag the plots around, select specific words, zoom in and out, and reset the plots.
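As a rough illustration of the log-likelihood measure shown on hover, the sketch below computes Dunning's G for one word from its counts in two texts. It assumes the common two-cell formulation (observed versus expected counts in each text) and a chi-square distribution with one degree of freedom for the p-value; the tool's own implementation may differ. The function names `dunning_g` and `g_p_value` are hypothetical.

```python
import math


def dunning_g(count_1, count_2, total_1, total_2):
    """Two-cell log-likelihood (G) for one word in two texts.

    count_1, count_2: occurrences of the word in each text.
    total_1, total_2: total number of tokens in each text.
    Assumption: this is the two-cell variant, not necessarily the tool's.
    """
    pooled = (count_1 + count_2) / (total_1 + total_2)
    expected_1 = total_1 * pooled
    expected_2 = total_2 * pooled
    g = 0.0
    for observed, expected in ((count_1, expected_1), (count_2, expected_2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g += observed * math.log(observed / expected)
    return 2 * g


def g_p_value(g):
    """Upper-tail p-value of G under chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(g, 0.0) / 2))


# A word occurring 20 times in 1,000 tokens versus 5 times in 1,000 tokens.
g = dunning_g(20, 5, 1000, 1000)
p = g_p_value(g)
```

A word with the same relative frequency in both texts gets G = 0 and a p-value of 1; larger G values indicate a bigger frequency difference than chance alone would suggest.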

Selecting words on the plot or on the sortable tables below highlights them. The tables show word rank, frequency, difference in relative frequency per 10k words compared to the other text, and log-likelihood.

Words with positive rel_diff values are more frequent in the first text; words with negative values are more frequent in the second text.
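The relationship between relative frequency and rel_diff can be sketched as follows. This is a minimal example that assumes the texts are already tokenized into lists of words; the helper names `rel_freq_per_10k` and `rel_diff` are hypothetical and are not taken from the tool's source.

```python
from collections import Counter


def rel_freq_per_10k(tokens):
    """Relative frequency of each word type per 10,000 tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: 10_000 * count / total for word, count in counts.items()}


def rel_diff(tokens_1, tokens_2):
    """Difference in relative frequency (text 1 minus text 2) for shared types."""
    rf_1 = rel_freq_per_10k(tokens_1)
    rf_2 = rel_freq_per_10k(tokens_2)
    shared = rf_1.keys() & rf_2.keys()  # only word types found in both texts
    return {word: rf_1[word] - rf_2[word] for word in shared}


text_1 = "the cat sat on the mat".split()
text_2 = "the dog sat".split()
diffs = rel_diff(text_1, text_2)
```

In this toy example "sat" gets a negative value, because it makes up a larger share of the second text than of the first.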

With the 'remove frequent words' drop-down menu, up to 200 of the most frequent words in English can be removed from the plots/tables. This can help to highlight content differences between the texts. The frequent words are from Google's n-gram corpus of 1 trillion words from public web pages.

Several measures of lexical diversity are provided: the type-token ratio (TTR); the Gini coefficient, which ranges from 0 (all words have the same frequency) to a theoretical maximum of 1 (a single word accounts for all tokens and every other word has frequency zero, as the number of word types $n \rightarrow \infty$); the alpha parameter of the fitted power law function; and the Shannon entropy.
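The sketch below shows one way these measures could be computed from a table of word counts. It is illustrative only: the entropy is given in bits (base-2 logarithm), and alpha is estimated by a least-squares fit of log frequency against log rank, which is an assumption, since the fitting method the tool uses is not stated here. All function names are hypothetical.

```python
import math
from collections import Counter


def type_token_ratio(counts):
    """Number of distinct word types divided by the number of tokens."""
    return len(counts) / sum(counts.values())


def gini(counts):
    """Gini coefficient of the word frequencies (0 = all frequencies equal)."""
    xs = sorted(counts.values())
    n, total = len(xs), sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def shannon_entropy(counts):
    """Shannon entropy of the word distribution, in bits."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_alpha(counts):
    """Exponent of f(r) ~ r**(-alpha), from a log-log least-squares fit.

    Assumption: a simple regression stands in for the tool's fitting method.
    Needs at least two word types.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


counts = Counter("a a b b".split())
ttr = type_token_ratio(counts)
```

With two word types of equal frequency, as in the usage line, the Gini coefficient is 0 and the entropy is exactly 1 bit.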

The data consist of several literary texts and a corpus of inaugural addresses of U.S. presidents from 1789 to 2009 (from NLTK); two additional novels (Twain's Huckleberry Finn and Hemingway's A Farewell to Arms); the Brown Corpus and its subsections; and the Freiburg-Brown Corpus of American English, available via Clarin-NO's Corpuscle tool.

Tool created by Steven Coats