COMPARE TO OTHER FREQUENCY LISTS (British National Corpus /
American National Corpus)
There are many English word lists and frequency lists on the Web.
Some are good, some are very bad. Not all frequency lists are created
equal.
One should be very suspicious of word lists that are taken from
samples of web data, outdated texts, or corpora that are too small to
effectively model what is happening in the real world. Worse still
are word lists that give you no idea what they are based on. As the
saying goes: "garbage in (bad texts), garbage out (frequency lists)".
Here are some questions you might ask yourself as you consider using
or downloading a word list:
Depth and accuracy.
Why do so many word lists on the web contain just the top 1000-3000
words of English? Why not the top 20,000 or 60,000? It's because
even a bad corpus (the collection of texts that the word lists are
based on) can produce a moderately accurate list for the most
frequent words. But because the corpus is neither deep nor balanced
enough, you start getting messy data for medium and lower frequency
words. Ask to see samples of the top 20,000 or 60,000 words (e.g.
every 7th or 10th word). If the provider cannot supply them, then you
should be very suspicious of that word list.
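
As a rough illustration of the "every 7th or 10th word" check, the
following Python sketch pulls an evenly spaced sample from a
rank-ordered list. The function name and the placeholder list are
hypothetical; any real list, sorted from most to least frequent, could
be substituted.

```python
def sample_every_nth(ranked_words, n=10, start=0):
    """Return every n-th entry of a rank-ordered word list."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return ranked_words[start::n]


# Hypothetical stand-in for a real list sorted by descending frequency.
ranked = [f"word{rank}" for rank in range(1, 20001)]

sample = sample_every_nth(ranked, n=10)
print(len(sample))   # number of sampled entries
print(sample[:3])    # ranks 1, 11, 21
```

Reading through such a sample at the lower ranks is a quick way to see
whether the medium and lower frequency entries still look sensible.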
Genres.
Does the corpus contain texts from a wide variety of genres --
spoken, fiction, popular magazines, newspapers, and academic
journals? Frequency lists that are based on just one of these may
only contain 40-50% of the words from a more balanced corpus. Our
frequency list is based on the Corpus of Contemporary American
English (COCA), which is almost perfectly balanced across genres.
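
One way to check a figure like the 40-50% above is to measure how much
of a balanced list also appears in a single-genre list. The sketch
below is only illustrative: the function name and the two tiny word
lists are made up, and it assumes both lists hold the same kind of
entry (for example, lowercased headwords).

```python
def coverage(single_genre_words, balanced_words):
    """Share of the balanced list that also appears in the single-genre list."""
    balanced = set(balanced_words)
    if not balanced:
        return 0.0
    return len(balanced & set(single_genre_words)) / len(balanced)


# Toy data, invented for illustration only.
balanced_list = ["the", "said", "study", "yeah", "senator", "hypothesis"]
newspaper_list = ["the", "said", "senator", "mayor"]

print(f"{coverage(newspaper_list, balanced_list):.0%}")
```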
Size. COCA contains more than 450 million words, and each of the top
20,000 words occurs at least 300 times. In a small 10-20 million word
corpus, some of these words would occur just 7-8 times. At that
point, some lower frequency words might make it into the list "by
chance", whereas others are left out. COCA does not have this
problem.
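
The arithmetic behind this can be sketched as follows, under the
simplifying assumption that a word keeps the same relative frequency
in a smaller corpus. The function name is hypothetical, and the corpus
sizes are round figures taken from the paragraph above.

```python
def expected_count(count, corpus_size, smaller_corpus_size):
    """Scale a raw count to a smaller corpus, assuming equal relative frequency."""
    return count * smaller_corpus_size / corpus_size


COCA_SIZE = 450_000_000  # "more than 450 million words"
MIN_COUNT = 300          # minimum count among the top 20,000 words

for size in (10_000_000, 12_000_000):
    print(size, round(expected_count(MIN_COUNT, COCA_SIZE, size), 1))
```

With counts this low, a handful of extra or missing occurrences is
enough to move a word in or out of the list.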
How recent is it?
Language change happens. If the word list is based on 15-20 year-old
texts (or much worse, 100-year-old public domain novels), then it
will be missing many of the words from the modern language. COCA is
based on texts from 1990-2012 (20 million words each year), or in
other words, virtually right up to the current time.
Is it just a bare wordlist? Word lists are nice, but to be really
useful (especially for language learning) there ought to be some
indication of what these words mean and how they are used. Most of
our frequency lists contain the top 20-30 collocates (nearby words)
for each word in the list, which creates a great "sketch" of each
word.
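
To make "collocates" concrete, here is a minimal Python sketch that
counts the words occurring within a few tokens of a given word. It is
not the method used to build our lists: the function name, the window
of four words on either side, and ranking by raw count are all
assumptions chosen for illustration.

```python
from collections import Counter


def collocates(tokens, node, window=4, top=20):
    """Count words within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts.most_common(top)


# Toy text, invented for illustration only.
tokens = "his watch broke and they watch tv".split()
print(collocates(tokens, "watch", window=1))
```

Real collocate lists are usually ranked with an association measure
rather than raw counts, so that very common words such as "the" do not
dominate.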
Are they just word forms? Do you really want to see the individual
frequency of shoe and shoes, or realize, realizes, realized, and
realizing? Do you want to have the combined frequency of watch as a
verb (they watch TV) and watch as a noun (his watch broke)? If the
lists are simply taken from pages that are "scraped" from the web,
they will just provide long lists of words, without grouping them
meaningfully (e.g. shoe/shoes), or separating them when necessary
(e.g. watch).
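
The difference between counting word forms and counting grouped,
tagged entries can be sketched in a few lines of Python. Everything
here is hypothetical: the small lemma table is hand-made, and the
part-of-speech labels are supplied by hand, whereas a real list would
rely on a tagger and lemmatizer.

```python
from collections import Counter

# Hypothetical, hand-made lemma table (word form -> headword).
LEMMAS = {
    "shoes": "shoe",
    "realizes": "realize",
    "realized": "realize",
    "realizing": "realize",
}


def lemma_frequencies(tagged_tokens):
    """Count (headword, part of speech) pairs instead of bare word forms."""
    counts = Counter()
    for form, pos in tagged_tokens:
        form = form.lower()
        counts[(LEMMAS.get(form, form), pos)] += 1
    return counts


# Toy input, tagged by hand for illustration only.
tagged = [
    ("they", "pron"), ("watch", "verb"), ("TV", "noun"),
    ("his", "det"), ("watch", "noun"), ("broke", "verb"),
    ("shoe", "noun"), ("shoes", "noun"),
]

freq = lemma_frequencies(tagged)
print(freq[("shoe", "noun")], freq[("watch", "verb")], freq[("watch", "noun")])
```

Here shoe and shoes are counted together, while watch as a verb and
watch as a noun are kept apart.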
Summary. There are many word frequency lists out on the web.
Some are just OK, and some are truly bad. The frequency lists that
we have created are the only ones that are based on a large, recent,
and balanced corpus of English, and which provide indications of the
meaning and use of each word.