Dean's Digital World

Sources 58

By Dean Tudor
Journalism Professor Emeritus Ryerson University
www.deantudor.com

Lately, there has been some researcher press about Deep Webs. Now, these are not your Deep Throat type of sources. Rather, they are part of the Web that does not turn up in search engines like Google or Yahoo, although together the major search engines can crawl through almost one-third of these sites. Deep Web was once called "Invisible Web" (coined by Dr. Jill Ellsworth in 1994), which I think is still a better term. Deep Web actually means an accessible Web site that is not indexable or queryable by conventional search engines. Invisible Web means not visible, and has wider coverage; it includes both accessible Web sites (but not indexable or queryable) and non-accessible Web sites (Intranets, passworded, non-coded).

Normally, indexed Web sites are displayed when you retrieve pages from the various Web search engines. Subject directories such as Yahoo will also display these sites. Some people call this the Surface Web. The Deep Web (or Invisible Web) is what you cannot retrieve, because it never shows up in your search results, plus of course all the URLs contained in these types of sites. There are seven main divisions here:

1) Content served over HTTP is but a subset of Internet content. There are also FTP (file transfer), E-mail, news, Telnet, and pre-Web Gopher, which are not searchable.

2) Excluded pages: not every Web site wants to be totally included in search engine reports. In the HTML source code there is a provision and procedure for turning away the spider search bots so that they don't report that particular page. In other words, the URL has been turned off by the code.

3) Databases: many Web sites front thousands of specialized databases that you can search via the Web. You can get results from these databases, but only in answer to your one specific query. You cannot access the whole database. In fact, the database may not even be stored online. You only wake it up when it is needed for a response. "It is easier and cheaper to dynamically generate the answer page for each query than to store all the possible pages containing all the possible answers to all the possible queries people could make to the database" (Berkeley, see below). And thus the search engines cannot find or create these pages. Tabular formats are a bitch to display without appropriate software to do so. Even simple layouts such as crossword puzzles have some display component, and hence are normally not indexed (check out www.ecrostic.com). Databases with tables created by Access, Oracle, SQL Server and DB2 are accessible only by query. There's a lot of information out there on the Web via databases. Content on the Deep Web may be 500 times larger than the normal Google-searchable Web. The 60 largest Deep Web sources contain 84 billion pages of content. That's about 750 terabytes of information. Top dogs are the US National Climatic Data Center, 366,000 GB, 42 billion records; US NASA EOSDIS, 219,000 GB, 25 billion records; and US National Oceanographic Data Center, a mere 33,000 GB, 4 billion records. By contrast, Google indexes only about 6 billion pages. And 95% of the Deep Web is publicly accessible information, not subject to fees or subscriptions. Awesome…

4) For a variety of technical reasons (easy to understand, but long and cumbersome to explain), there are "static" pages on the Web. These reside on servers, waiting to be retrieved when their URL is used in an HTTP command. But they are not linked, and hence spiders do not find them.

5) Some sites require a password or loginID, and these sites are closed to spiders. Passworded sites include indexing services, encyclopedias, directories, Lexis and Nexis. In fact, any site that is not free requires a password. There are thousands of such sites, although some will let you in with teasers or partials, such as the Wall Street Journal. Yahoo in 2005 made a small part of the Deep Web searchable by creating Yahoo Subscriptions which searches through a few subscription-only Web sites.

6) Non-HTML formatted pages: these include files whose coding is incompatible with HTML; the links to them can be indexed, but not the actual pages. Search engines have a hard time with Adobe .pdf files (although Google has a reformatting tool), image databases, spreadsheets (.xls files), multimedia files, PostScript (.ps), Flash, Shockwave, PowerPoint (.ppt), and even word-processing files (Word .doc, WordPerfect .wp). There is no problem downloading these materials once they are found; the major trick is finding them in the first place!

7) Script-based pages with a ? (question mark) in their URL: these are particularly devilish for spiders to locate. Most spiders do not return the URL because of script problems and, believe it or not, spider traps.
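
Several of the seven divisions above leave fingerprints that a program can check. What follows is only a rough sketch in Python, not a description of how any real search engine works. It assumes that the "provision" mentioned in division 2 is the robots meta tag carrying a noindex value, and it takes its file extensions straight from division 6. The function names (page_opts_out, deep_web_hints) are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# File extensions named in division 6 (taken from the article, not exhaustive).
NON_HTML_EXTENSIONS = (".pdf", ".xls", ".ps", ".ppt", ".doc", ".wp")


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def page_opts_out(html):
    """Division 2: True if the page asks spiders not to index it.

    Assumption: the opt-out is expressed as a robots meta tag with "noindex".
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def deep_web_hints(url):
    """Return the reasons, if any, why a URL looks like Deep Web material."""
    parts = urlparse(url)
    hints = []
    # Division 1: anything that is not HTTP (FTP, Gopher, Telnet, ...).
    if parts.scheme and parts.scheme not in ("http", "https"):
        hints.append("non-HTTP protocol")
    # Division 6: formats that are not HTML.
    if parts.path.lower().endswith(NON_HTML_EXTENSIONS):
        hints.append("non-HTML format")
    # Division 7: script-based pages with a question mark in the URL.
    if "?" in url:
        hints.append("script-based URL")
    return hints
```

A URL that trips none of these checks can still be invisible, of course: unlinked static pages and passworded sites look perfectly ordinary from the outside.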

The basic Invisible Web, of course, is made up of the various Intranets put up by businesses, governments, and universities. These are locally connected Web sites meant for just the corporation's use; sometimes passwords are required. All manner of documents, many unclassified, are posted - terabytes of information. And they are a major concern for internal security since they can be hacked and also accessed by rogue employees. There is no outside index to these sites, since they are just local. All are hidden behind firewalls. I cannot tell you how many times people have told me that a particular document is on a Web site - "just go over and follow the links" - only to find out that it is on their Intranet and hence inaccessible to me. Actually, I can tell you: about a score of times…

Other major invisible content includes static online library catalogues, hidden portions of major Web sites, schedules and maps, complete databases, tables of statistics especially in spreadsheets, phone books, people finders (lists of professionals), patents, laws, dictionaries, Web store or Web-auction products, newspaper archives, many blogs, multimedia and graphical files. The Invisible Web is the fastest growing category of new information on the Internet.

Also, "dynamically changing new information" will be part of the Invisible Web. This includes news, job postings, travel data (airline flights, hotels, etc.), and stock market postings.

How do you find the Deep Web? One way is through academic search tools such as Infomine, Librarians Index, and AcademicInfo. You could try Direct Search at www.freepint.com/gary/direct.htm. There are also www.profusion.com and www.completeplanet.com. Another way is through your usual search engine. Just type in a short subject term with the word "database" (e.g., biomedical database). If the database's page includes the word "database", then bingo! (Bob's your uncle?). If you drill through a directory such as Yahoo, then be sure to also use the term "database": this will pick up additional listings. Many search engines feature searchable databases as part of their service. Google, for example, has separate searches for audio-visual material, images, news, and non-HTML formats. These are just one click away from the main HTML search.
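
The "database" trick is easy to automate if you run a lot of these searches. Here is a minimal sketch in Python, assuming the search engine takes its query in a q parameter, as Google's search page does. The function name database_query and its base_url default are illustrative choices only.

```python
from urllib.parse import urlencode


def database_query(subject, base_url="https://www.google.com/search"):
    """Build a search URL for a subject term plus the word "database".

    base_url is an assumption; substitute the search engine you normally use.
    """
    terms = subject.split()
    # Add "database" only if the searcher has not already typed it.
    if "database" not in (term.lower() for term in terms):
        terms.append("database")
    return base_url + "?" + urlencode({"q": " ".join(terms)})
```

Calling database_query("biomedical") builds a search for "biomedical database", the same example as above.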

Some interesting Deep Web sites include:

* AnimalSearch (animalsearch.net): family-safe animal-related sites, search by group, type, and regions.

* Educator's Reference Desk (www.eduref.org): contains 2000+ lesson plans, 3000+ links to value-added online education information, and 200+ question archive. It also provides access to the ERIC database -- the world's largest source of information on education research & practice, including free, full-text expert digest reports, and it also links you to the Gateway to Educational Materials (GEM), which "provides quick and easy access to over 40,000 educational resources found on various US federal, state, university, non-profit and commercial Internet sites."

* NatureServe Explorer (www.natureserve.org/explorer): "information on more than 65,000 plants, animals, and ecosystems of the United States and Canada. Explorer includes particularly in-depth coverage for rare and endangered species."

* Nuclear Explosions Database (www.ga.gov.au/oracle/nukexp_form.jsp): Geoscience Australia's database provides location, time, & size of explosions worldwide since 1945. Click on "databases".

* On-Line Encyclopedia of Integer Sequences (www.research.att.com/~njas/sequences): "Type in a series of numbers and this database will complete the sequence and provide the sequence name, along with its mathematical formula, structure, references, and links."

* PubMed (www.ncbi.nlm.nih.gov/entrez/query.fcgi): access to 16 million+ MEDLINE citations, including links to full text articles & related resources. PubMed Central (PMC) is an e-archive of free, full text articles from 200+ life sciences journals, as well as Bookshelf, "a growing collection of [full text] biomedical books (50+) that can be searched directly." Plus the global NCBI 'Entrez' search engine for their many life sciences databases.

* FindArticles (www.findarticles.com): now searching 10 million+ articles from "leading academic, industry and general interest publications."

* MagPortal.com (magportal.com): freely available magazine articles on the Web, using keyword searching or category browsing methods.

* Directory of Open Access Journals (www.doaj.org): one-stop open access directory, providing no-cost access to the full text of over 2,000 journals, with over 500 journals searchable on the article level (over 83,000 articles available) in the sciences and humanities/social sciences.

* Cryptome (cryptome.org): specializes in posting previously classified or under-publicized US federal documents, along with similar documents from other jurisdictions. There could be half-a-dozen posted every business day. Just go over to the site, and the home page lists the latest docs. Typical titles include "Expansion of the Strategic Petroleum Reserve", "Calendar of 2,482 US Military dead in Iraqi War", "Security Measures for Radioactive Materials", "Outer Continental Shelf Polluters Fined", "CIA Creation Documents". There is also an index to off-site documents, dealing with topics such as the Israeli Lobby and US foreign policy, Al Qaeda documents, New York City public safety.
Cryptome is a true Web site, with multiple links to other similar document retrieval efforts. You could do worse than beginning with Cryptome for searches involving nefarious actions of government. The site also has a searchable data DVD of its archives of over 33,000 files (since June 1996), just under 3 GB worth.


There is actually a firm promising to locate Deep Web material. It is BrightPlanet (www.brightplanet.com). Their mission statement: "BrightPlanet applies unique and fully automated technology to internet search site identification, retrieval, qualification, classification, summarization, characterization and publication". Currently, BrightPlanet software is configured to query 70,000+ Deep Web sources. It'll even walk your dog…

For the immediate future, you should expect a big impact from two sources.

One is the court system. The Canadian Judicial Council (the organization of Canada's top judges) has recommended that access to court records via the Internet be restricted. Many of these may be moving over to the local intranet and never be accessible via the open Internet. You'll soon have to visit your nearest courthouse to view legal documents, much as you have to now just to view the paper copies.

Remote access would still be available for judicial decisions and case information, but not for affidavits, motion records, and pleadings. All in the interests of privacy and preventing identity theft. It is one thing to have publicly available documents at the courthouse (you must first determine which courthouse and you must ask for the right piece of paper), but it is quite another thing to have publicly available documents floating out on the Internet where just anybody can read them. Yes, they are public, but only in paper form and locally disseminated… U.S. courts are still more open. Documents from scores of federal courts can be downloaded through PACER (Public Access to Court Electronic Records) for a small fee, by the page.

Another is change of ownership. While most of the databases within the Deep Web are government-owned or non-profit, there are still vast areas such as E-mail and FTP which are in private hands. Every time someone buys an Internet property, there are policy changes. What should we expect with the newest batch of dot com purchases by the media itself? How will this play out for searching for data? NBC Universal has bought iVillage, the top women-oriented site on the Internet, with over 30 million unique visitors a month. News Corp (Murdoch) has bought MySpace, the fastest growing social networking site on the Web, with about 50 million unique visitors a month. News Corp also bought IGN, a top gaming and entertainment site for young hot males, with under 20 million unique users a month. Viacom (owner of MTV and Paramount) has bought Neopets, a young person's community site with virtual pets. Viacom also has bought iFilm (where users track the film industry and post their own videos), GameTrailers (a competitor to IGN with more hot males), and GoCityKids (via Nickelodeon). The New York Times has bought About.Com, an online advice site with over 60 million unique users a month.

Other hot properties appear to be photo- and video-sharing sites. Murdoch still has $2 billion earmarked for these purchases, coming up real soon. The big audiences in all the new acquisitions can link to each other within and without their communities. And they could be susceptible to database searching by new owners or positioned for a sell off of contents to database searchers.

For more details on the Invisible Web and the Deep Web, try these URLs:

www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
http://www.internettutorials.net/deepweb.htm
www.brightplanet.com/deepcontent/deep_web_faq.asp


Published in
Sources 58, Summer 2006.

 

