It is well known that when you see an iceberg in the ocean, you are only seeing about 10% of its total mass. The remaining 90% extends below the water’s surface, reaching depths of 600 feet or more. The Internet is similar to an iceberg because only 10% of it is “visible” to most users. The part of the Web that is invisible to conventional search engines and inaccessible through traditional browsers is called the “deep Web.” Like the depths of the ocean, the deep Web is a little-known world that contains dazzling and shocking creations as well as vast swaths of emptiness.
The deep Web evades the notice of most traditional search engines. Even though Google may seem ubiquitous and omnipotent when a search returns millions of results, it cannot reach the majority of websites online. Search engines such as Google, Yahoo! and Bing use Web crawler (or spider) programs that browse the Internet and copy pages that the search engine later indexes and makes accessible to users. The crawlers follow URLs and scan each page for hyperlinks, adding the new ones to their “crawl frontier,” which maps out further websites and pages to index. Crawling the deep Web is a challenge for this type of software because it relies on static, conventionally formed URLs and on the presence of HTML tags or text. The companies that operate the popular search engines have become aware of the crawlers’ shortcomings and are looking for ways to improve them.
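To make the frontier mechanism concrete, here is a minimal sketch in Python of the crawl loop described above, using only the standard library. The seed URL and page limit are illustrative, and real crawlers add politeness delays and far more robust parsing; the point is simply that the frontier only ever grows from hyperlinks found on already-fetched pages, which is exactly why pages with no inbound links stay invisible.

```python
# Minimal frontier-based crawler sketch: fetch a page, extract its links,
# push new ones onto the frontier, repeat. Seed URL and max_pages are
# illustrative values, not taken from any real search engine.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])   # URLs waiting to be visited
    visited = set()            # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue           # unreachable or unreadable page: skip it
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return visited

print(crawl("https://example.com"))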
A significant reason crawler software has such difficulty with the deep Web is that a portion of it consists of dynamic pages that exist only when someone types a query into a database. This format makes such pages hard for a person, let alone mindless software, to discover. Some users report that the majority of the deep Web takes the form of these databases, but that claim is speculative. Popular search engines are also unable to index websites whose content is generated by scripts and appears dynamically (as with Flash) or is delivered in a format other than standard HTML. Websites can also choose to block crawlers, by requiring users to log in or by implementing features such as CAPTCHAs.
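As a small illustration of the opt-out mechanism mentioned above, a well-behaved crawler consults a site’s robots.txt file before fetching anything, and that file alone can keep pages off the surface Web. The sketch below uses Python’s standard urllib.robotparser; the URLs and user-agent string are hypothetical.

```python
# Before fetching a page, a polite crawler checks the site's robots.txt,
# which can forbid crawling some or all paths. The URLs below are examples.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()   # fetch and parse the robots.txt rules

# can_fetch() returns False when the rules disallow this user agent,
# so a well-behaved crawler never even requests the page.
if robots.can_fetch("MyCrawler/1.0", "https://example.com/private/"):
    print("allowed to crawl")
else:
    print("blocked by robots.txt")
```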
People who access the deep Web have reported their findings on forums on the “surface Web” like explorers returning from strange countries. Their tales generally fall into one of three categories. The first group was unimpressed with their virtual excursion and claims that the deep Web is just a 1990s version of the surface Web. The second group ended up unintentionally gazing into the dark abyss of pedophiles, assassins, crime rings, human experimentation, and the like, and warns everyone away from the deep Web. The third group has found genuinely interesting things, such as scientific papers, discussion groups, e-books, blogs, and tech communities, and wants to share them with those on the surface Web.
From my own forays into the deep Web, I can say that these impressions are all fairly accurate. Hunting for deep websites can be a frustrating experience because many of them are boring, badly scripted, or unfinished. Slow servers can make these sites take up to ten minutes to load, which only adds to the frustration. There are websites on the deep Web that are truly horrific and cater to the darker elements of our society and world. However, to avoid such sites, all a person has to do is take hyperlink descriptions seriously and resist clicking on those that seem too disturbing to be real. With enough persistence and caution, one can find a treasure trove of academic reference material, newspaper articles, maps, data engines, and more.
The companies that own the popular search engines are aware of the untapped resources the deep Web holds and are attempting to improve Web crawlers so they can reach its hidden websites. Understandably, some websites on the deep Web do not want to appear on the surface Web, because that kind of visibility would identify their owners and users. Like shining light into the darkness of the deep sea, making the deep Web accessible through Google would change its entire ecosystem. Perhaps the world would benefit from easy access to the information there, but at what cost?
For an answer to this question, stay tuned for the second part of “Diving the Deep Web.”