Corpora: Number of pages on the Internet

From: AssocProf William H. Fletcher (
Date: Mon Dec 03 2001 - 17:39:29 MET

    There really is no way to measure the ever-growing number of
    pages on the Internet, and every study I have seen comes to very
    different conclusions, based on their sampling techniques and
    extrapolations from them. Reliable sources suggest that the
    number of pages publicly accessible from links (as opposed to say
    database queries) is on the order of 2-3 billion (i.e. 1000
    million=10^9), and that is perhaps 20% of all information online.
    Of these, about 58% are in English (according to Alex Franz of

    The last large-scale sampling I know of was done by Inktomi in
    Jan 2000. They counted 1.6 billion pages and showed this
    language distribution: English 86.55%, German 5.83%, French
    2.36%, Italian 1.55%, Spanish 1.23%, Portuguese 0.85%, Dutch
    0.54%, Finnish 0.50%, Swedish 0.36%, Japanese 0.34%. Since these
    ten figures alone add up to slightly more than 100% (100.11%)
    while excluding most languages, they obviously do not give the
    complete picture.
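    The arithmetic behind that remark can be checked with a short
    sketch. The percentages are the ones quoted above; the variable
    names are only illustrative, and rounding to two decimals is an
    assumption made to keep floating-point noise out of the result.

```python
# Language shares (percent) as quoted above for the Jan 2000 Inktomi sample.
inktomi_shares = {
    "English": 86.55,
    "German": 5.83,
    "French": 2.36,
    "Italian": 1.55,
    "Spanish": 1.23,
    "Portuguese": 0.85,
    "Dutch": 0.54,
    "Finnish": 0.50,
    "Swedish": 0.36,
    "Japanese": 0.34,
}

# Sum of the ten listed languages, rounded to two decimals (assumption:
# the published figures are themselves rounded to two decimals).
total = round(sum(inktomi_shares.values()), 2)

# Share left over for every language NOT in the list; a value at or
# below zero means the list cannot describe the whole Web.
left_for_others = round(100 - total, 2)

print(f"Listed languages: {total}%")
print(f"Left for all other languages: {left_for_others}%")
```

    Nothing is left over for Chinese, Korean, Russian and the rest,
    which is why the distribution can only describe the subset of
    pages whose language was among those counted.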

    I have spent many days looking into this and have always been
    disappointed by the inconclusive results. Perhaps the most
    interesting trend is revealed in this self-quote, echoed by others:

    Historically English-language users and content have overshadowed
    other languages on the Internet, but the trend away from the
    preponderance of English seems clear. Statistics compiled by
    Global Reach illustrate the long-term development. In 1996,
    four-fifths of the 50 million Internet users were native speakers
    of English. By September 2001 Anglophones constituted only 43% of
    the world's online population of 503 million. Global Reach
    expects their share to fall below 30% of the 850 million Web
    users projected for 2005. The anticipated phenomenal growth in
    this non-Anglophone Web population should spur tremendous
    expansion of online resources in tongues other than English,
    particularly the smaller non-Western ones, to the benefit of
    those who teach, learn, and investigate these languages. [Global
    Reach's current estimates of users by language: English 43%,
    Chinese 9.3%, Japanese 9.2%, Spanish 6.7%, German 6.7%, Korean
    4.4%, Italian 3.8%, French 3.3%, Portuguese 2.5%, Dutch 2.2%,
    Other 8.9%. ]
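    To make the shift concrete, the shares quoted above can be turned
    into absolute numbers of users. This is only a sketch: the
    function name split_users is hypothetical, four-fifths is taken as
    exactly 0.80, and the 2005 row uses 30% as an upper bound because
    Global Reach expects the English share to fall below that figure.

```python
def split_users(total_millions, english_share):
    """Return (English-speaking, other) users in millions."""
    english = total_millions * english_share
    return english, total_millions - english


# (label, total users in millions, English share) as quoted above.
snapshots = [
    ("1996", 50, 0.80),
    ("Sept 2001", 503, 0.43),
    ("2005, projected (English share is an upper bound)", 850, 0.30),
]

for label, total_millions, english_share in snapshots:
    english, other = split_users(total_millions, english_share)
    print(f"{label}: {english:.1f}M English, {other:.1f}M other")
```

    On these figures the non-Anglophone population grows from about
    10 million in 1996 to roughly 287 million in 2001, and to at least
    595 million if the 2005 projection holds.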

    If anyone has fresher reliable estimates I'd love to hear about
    them.

    Bill Fletcher

    Here are some sources, many based on / derived from each other.

    Excellent but dated study (explains how to sample and estimate--
    excellent background information):
    Lawrence, S. & C. L. Giles. (1999). Accessibility of Information
    on the Web. Nature, 400: 107-109. Summary, commentary, update and
    download at

    This study concludes 85% of information was from USA; authors do
    not plan to update the study:
    Moore, A. & Murray, B.H. (2000). Sizing the Internet. July 10,
    2000. Arlington, VA: Cyveillance, Inc. Retrieved 8 October 2000
    from the World Wide Web:

    Agence de la Francophonie's "L5 The Fifth Study on Languages and
    the Internet"
    studies the presence on the Internet of English, German, and the
    Romance languages

    Study of number of USERS per language; methodologically sound:
    Global Internet Statistics (by Language). San Francisco, CA:
    Global Reach. Retrieved 6 October 2001 from the World Wide Web:

    Comparable figures:
    Nua Internet How Many Online. Dublin: Nua Ltd. Retrieved 8
    October 2001 from the World Wide Web: and regional

    Lots of information gleaned from various sources:
    Estadísticas de Internet en el ámbito internacional. Madrid:
    Asociación de Usuarios de Internet. Retrieved 6 November 2001
    from the World Wide Web:

    Interesting methodology -- tries to estimate number of WORDS, not
    PAGES per language, but restricted to select Western European
    languages:
    Grefenstette, Gregory & Julien Nioche. (2000). Estimation of
    English and non-English Language Use on the WWW. RIAO 2000,
    Paris, 12-14 April 2000. Retrieved 12 October 2001 from the
    World Wide Web:

    Further quotes from my paper
    Concordancing the Web with KWiCFinder, William H. Fletcher,
    United States Naval Academy

    Submitted for publication in proceedings of
    North American Association for Applied Corpus Linguistics
    Third North American Symposium on Corpus Linguistics and Language
    Teaching, Boston, MA, 23-25 March 2001

    The World Wide Web is a wondrous place, with an overwhelming
    variety of information in countless languages and domains. Just
    how many webpages there are and how they are distributed by
    language and content are not easy questions to answer. The Web is
    constantly growing and changing, and even the best estimates can
    only approximate its extent and composition. Studies of the
    nature of the Web echo the story of the blind men and the
    elephant: each extrapolates from different samples of an
    ever-evolving entity taken at different times and by divergent
    means. The most reliable estimates suggest that the number of
    publicly-indexable webpages in mid-2001 falls in the range of two
    to five billion (i.e. thousand million = 10^9), a number projected
    to grow to 10-15 billion by mid-decade.

    These two billion-plus pages are only the visible tip of the
    iceberg. For a page to be indexable, there must be a valid link
    to it from another publicly accessible site, which excludes the
    many pages with restricted access. Far larger is the vast
    "invisible web" of content in databases, which can only be evoked
    by entering relevant queries in a text box, and text materials
    stored in formats which are not typically indexed, such as word
    processor, PostScript and Adobe Acrobat files.

    Despite the overall size of this corpus, one language, English,
    continues to predominate. Studies conducted in 2000 by Inktomi
    and Cyveillance conclude that over 85% of publicly-accessible
    webpages are in English, but here again even the best-informed
    estimates vary widely. In the summer of 2001 the Agence de la
    Francophonie released L5: the Fifth Study of Language and the
    Internet, based on these studies and the one by Global Reach
    cited below, complemented by research into the numbers of
    webpages in various languages returned by search engines. This
    report investigates the relative presence of the Romance
    languages, German, and English among online documents. It shows
    strong growth among the non-English languages in the proportion
    of webpages found relative to English, concluding that the number
    of webpages in each is roughly proportional to the number of Web
    users with that language as native tongue. Data from these and
    other studies of linguistic diversity on the Web are summarized
    in this note.

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

      William H. Fletcher (410) 293-6362
      Associate Professor of German and Spanish (410) 293-2729
      Language Studies Department
      US Naval Academy
      589 McNair Road
      Annapolis, MD 21402 - 5030

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    This archive was generated by hypermail 2b29 : Mon Dec 03 2001 - 18:45:21 MET