Diving the Deep Web: An In-depth Look

With this article, I intend to provide information about the Deep Web; however, it is not really possible to describe what qualifies as the Deep Web without first discussing how the Internet works on a technical level. It is also important to understand how the Internet came to be, because its current structure and way of transferring data were a solution to the financial problem of creating a connection between two servers that were hundreds of miles apart. As one might expect, I will explain how the Deep Web is different from the Internet with which we commonly interact, i.e. the Surface Web; however, the way in which it is different may be rather surprising. Therefore, I will explain the significance of search engines and their “web crawlers” and how these create the Surface Web. Finally, I will present the various types of networks that fall under the umbrella term of “the Deep Web.”

The Internet

When one refers to “the Internet,” one is actually referring to a global network of computer networks that follows the Transmission Control Protocol and the Internet Protocol (TCP/IP), which specify how data is formatted, addressed, sent, routed, and received. When a person sits down at a computer and opens a web browser to go to a specific webpage, he or she begins a series of data transfers. First, the request to go to a website is passed from the computer through copper cables, or to a wireless router that connects to a cable jack. The request is then sent to a modem in the person’s local Internet service provider hub. Next, the request leaves the hub and goes to a company’s server or to a data center that houses many smaller websites. The server hosting the requested website receives the request, which notifies it that a computer is asking for access and tells it that computer’s address. The website then sends the requested data back through a similar chain of connections in the form of data packets: TCP/IP breaks data up into smaller pieces that are known as packets. Each packet is labeled with the IP address that designates the destination computer and with the port number within that particular machine. The packets may not all follow the same path; routers along the way may send some packets by an alternate route through different servers and hubs if that route is likely to be quicker.
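To make the idea of packets more concrete, here is a minimal sketch in Python. It is not how TCP/IP is actually implemented; the `Packet` class, the 8-byte packet size, and the example address `203.0.113.7` are all assumptions chosen for illustration. It only shows the principle described above: data is cut into small labeled pieces that can arrive in any order and still be put back together.

```python
from dataclasses import dataclass
import random


@dataclass
class Packet:
    dest_ip: str    # address of the destination computer (illustrative value)
    dest_port: int  # port on the destination computer
    seq: int        # position of this piece within the original data
    total: int      # how many packets make up the whole message
    payload: bytes  # the piece of data itself


def split_into_packets(data: bytes, dest_ip: str, dest_port: int, size: int = 8) -> list:
    """Cut the data into pieces of at most `size` bytes and label each one."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)] or [b""]
    return [Packet(dest_ip, dest_port, seq, len(chunks), chunk)
            for seq, chunk in enumerate(chunks)]


def reassemble(packets: list) -> bytes:
    """Put the pieces back in order, whatever order they arrived in."""
    ordered = sorted(packets, key=lambda p: p.seq)
    if len(ordered) != ordered[0].total:
        raise ValueError("some packets are missing")
    return b"".join(p.payload for p in ordered)


message = b"GET /index.html HTTP/1.1"
packets = split_into_packets(message, "203.0.113.7", 80)
random.shuffle(packets)          # simulate packets taking different routes
restored = reassemble(packets)   # identical to the original message
```

The shuffle stands in for packets travelling along different paths; because every piece carries its own sequence number, the receiving side does not care which route each one took.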

As mentioned in the previous paragraph, packets of data are tagged with an address that directs them to the destination computer as well as to a port on that computer. Ports are constructs within a computer’s operating system that are tied to the specific software or applications responsible for specific processes. Data packets carry port numbers so that the information they contain reaches the appropriate application within a computer. A range of low port numbers (0 through 1023) is designated “well-known,” meaning that these numbers are registered and reserved for specific purposes across computer networks.
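The role of ports can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `deliver` function and the `listeners` table are hypothetical stand-ins for what an operating system does internally, and the small table of port numbers lists just a handful of widely known assignments.

```python
from typing import Callable, Dict

# A few widely known port assignments (a small sample, not the full registry).
WELL_KNOWN_PORTS = {22: "SSH", 25: "SMTP", 53: "DNS", 80: "HTTP", 443: "HTTPS"}


def deliver(port: int, payload: bytes,
            listeners: Dict[int, Callable[[bytes], str]]) -> str:
    """Hand the payload to whichever application is listening on the port."""
    service = WELL_KNOWN_PORTS.get(port, "unregistered")
    handler = listeners.get(port)
    if handler is None:
        return f"port {port} ({service}): no application listening, data dropped"
    return handler(payload)


# Hypothetical applications running on one computer.
listeners = {
    80: lambda data: f"web server received {len(data)} bytes",
    25: lambda data: f"mail server received {len(data)} bytes",
}

web_result = deliver(80, b"GET / HTTP/1.1", listeners)
closed_result = deliver(8080, b"hello", listeners)
```

Two packets addressed to the same computer end up in different places purely because of the port number they carry.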

A brief history of the Internet

Please see How Did the Internet Come to be?

The Significance of Search Engines

It is not possible to provide an adequate explanation of the Deep Web, and the variety of networks that fall under this catchall term, without discussing its counterpart, the Surface Web. The Surface Web, also known as the indexable Web, is the portion of the World Wide Web that conventional search engines, such as Google, Bing, Yahoo!, AOL, Ask, and Baidu, have made accessible through indexing.

Search engines construct databases of the Web by using programs called Web crawlers, or spiders, which operate as explorers and archivers. In doing so, they also create the Surface Web, which many people consider to be the entire Internet. However, because spiders are unable to access certain kinds of webpages and networks, a portion of the Internet cannot be reached through search. In other words, because of the limitations of Web crawling software, conventional search engines have unintentionally created virtual regions of the Internet, through omission, which we call the Deep Web. In addition, although there does not have to be any great technical difference between a website “in the Surface Web” and one “in the Deep Web” (they could both be on the same server), the inaccessibility of these websites has contributed to the idea of the Deep Web being a den of nefarious, illegal, and/or exciting activity.

How Web crawlers work

Web crawlers, or spiders, are programs that browse the Internet and copy pages that their parent search engines will later index and make accessible to users. Spiders go through uniform resource locators (or URLs, also known as web addresses especially when used with HTTP) and search for all hyperlinks on a given webpage. The spiders then “crawl” via hyperlinks to all linked pages, pulling keywords and metadata for future indexing. In addition, by following the links between pages, the spiders create a map of the accessible Internet.
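The crawling process described above can be sketched in Python. This is a simplified illustration, not the way any particular search engine works: the four pages in `PAGES` are a made-up, in-memory stand-in for real websites, and the names `LinkExtractor` and `crawl` are my own. A real spider would download pages over HTTP and would also collect keywords and metadata.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# A tiny imaginary website, keyed by URL (assumption for illustration).
PAGES = {
    "/home": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/home">Home</a>',
    "/news": '<a href="/archive">Archive</a>',
    "/archive": "No links here.",
}


def crawl(start, fetch):
    """Follow hyperlinks breadth-first and return a map of page -> links found."""
    seen = {start}
    queue = deque([start])
    link_map = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be retrieved
            continue
        parser = LinkExtractor()
        parser.feed(html)
        link_map[url] = parser.links
        for link in parser.links:
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    return link_map


site_map = crawl("/home", PAGES.get)
```

Starting from a single page, the spider reaches every page that is connected to it by a chain of hyperlinks, and the resulting `site_map` is a small version of the “map of the accessible Internet” mentioned above.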

The limitations of Web crawlers

Spiders’ attempts to crawl the Web can be hampered by a number of factors. Naturally, they are unable to find webpages that are not linked from any other page. Web crawlers also rely on conventionally formed URLs, i.e. web addresses that are relatively short and contain real words. In addition, they are often unable to pull keywords from content that is generated by scripts, such as JavaScript, rather than written in HTML. Most significantly, they are unable to access databases that require users to input search terms, and they cannot get past password protection or CAPTCHAs. As a result, websites that prevent spiders from “crawling” their pages do not become accessible via a search engine and, therefore, are not on the Surface Web.
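A short Python sketch can show how these limitations split one website into a “surface” part and a “deep” part. Everything here is an assumption made for illustration: the `SITE` dictionary, its page names, and the `requires_login` flag are invented, and a real crawler faces many more obstacles than the three modeled.

```python
# An imaginary website: each page lists its outgoing links and whether a
# password is needed to view it (all names are made up for this example).
SITE = {
    "/home": {"links": ["/catalog", "/members"], "requires_login": False},
    # The catalog page has a search form, but no hyperlinks to any results.
    "/catalog": {"links": [], "requires_login": False},
    # This page only exists for someone who types "tv guide" into the form.
    "/catalog?q=tv+guide": {"links": [], "requires_login": False},
    "/members": {"links": ["/members/files"], "requires_login": True},
    "/members/files": {"links": [], "requires_login": True},
    # No page links to this one.
    "/orphan": {"links": [], "requires_login": False},
}


def reachable_by_crawler(start):
    """Return the pages a link-following crawler without a password can index."""
    indexed, stack = set(), [start]
    while stack:
        url = stack.pop()
        page = SITE.get(url)
        if url in indexed or page is None or page["requires_login"]:
            continue
        indexed.add(url)
        stack.extend(page["links"])
    return indexed


surface = reachable_by_crawler("/home")
deep = set(SITE) - surface
```

In this sketch only `/home` and `/catalog` end up in `surface`. The form result, the password-protected pages, and the unlinked page all land in `deep`, even though they sit on the same imaginary server.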

The Deep Web

Having discussed how the Surface Web is constructed from a technical standpoint, it is now possible to define the Deep Web by describing how it is different. The term “Deep Web” can be seen as an umbrella term to describe any network that is not accessible through the commonly used search engines. As discussed in The limitations of Web crawlers, there are many ways that websites resist, intentionally or unintentionally, web crawlers’ attempts to save and index their pages.

Other terms that people use to describe these regions of the Internet are: Dark web, Dark Net, the Invisible Web, the Undernet or the Hidden Web. These terms describe something different from the “dark Internet” or a “Darknet.” As I will discuss in the subsequent subsections, these types of networks are part of the Deep Web, but are not synonymous with it. For the sake of clarity, I will refer to the unindexed portion of the Internet, as well as not easily accessible computer networks, only as the Deep Web, as opposed to the other possible names, and will describe the different types of networks that fall under this category.

The Deep Web is not actually synonymous with nefarious activity in the way that people have constructed it to be. In reality, the Deep Web itself contains online TV guides (which are only accessible by entering information into search boxes), price comparison websites (requiring user input), driving direction websites (also requiring users to fill out mini-forms), services that track the value of stocks, information databases that require a subscription, and academic services that require a username, password, or VPN. In other words, these websites, or parts of websites, only exist on the Deep Web because web crawlers are unable to submit basic forms on behalf of users or get around subscription requirements.

“dark Internet”

The dark Internet consists of websites that are cut off from the Internet (not just the Surface Web) due to wrongly configured routers and hackers abusing loopholes in net software. A large portion of websites that are in the dark Internet are old military websites that were part of the ARPANET. These websites are no longer accessible because they use old IP addresses that routers no longer reference. Routers contain lengthy lists of network addresses and pass information packets between each other, attempting to find the shortest path to the destination address.

For instance, suppose I try to go to a website whose address my ISP’s routers do not have. These routers coordinate with other routers that do have the desired address, so instead of sending my request directly to the server that the website is on, my ISP sends it to another router that is able to pass it along. However, in the case of the ARPANET-era websites, no router anywhere knows of their addresses, and therefore they are inaccessible.

However, if I were to know the IP address of a website in the dark Internet and it did not have some sort of security or encryption, it is possible that I would be able to access it by connecting directly to its host server.
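The way a router decides where to send a request, and what happens when no route exists, can be sketched with Python’s standard `ipaddress` module. This is a simplified model under stated assumptions: the routing table `ROUTES`, the router names, and the addresses (taken from ranges reserved for documentation) are invented, and real routers exchange and update their tables using dedicated routing protocols.

```python
import ipaddress

# A made-up routing table: each entry says "addresses in this block are
# reachable through that neighboring router".
ROUTES = [
    (ipaddress.ip_network("198.51.100.0/24"), "router-B"),
    (ipaddress.ip_network("198.51.100.128/25"), "router-C"),
    (ipaddress.ip_network("203.0.113.0/24"), "router-D"),
]


def next_hop(destination, routes):
    """Pick the most specific matching route, or None if no route is known."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None  # no router knows this address: the site is "dark"
    # The entry with the longest prefix is the most specific match.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


known_site = next_hop("198.51.100.200", ROUTES)   # matched by two entries
other_site = next_hop("198.51.100.10", ROUTES)    # matched by one entry
dark_site = next_hop("192.0.2.55", ROUTES)        # matched by none
```

The last lookup returns `None`: the server may still be running, but because no entry in the table covers its address, a request has nowhere to go.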

Darknet

A Darknet is an anonymity network whose purpose is to facilitate friend-to-friend (F2F) file sharing. This is different from the more common peer-to-peer (P2P) networks because the users’ IP addresses are not accessible to the public. In a P2P file exchange, it is possible for government agencies or corporations to gather identifying information about the people involved in the exchange. In the case of true Darknets, people share files directly with each other and not through an intermediary network. This makes the exchange extremely secure from surveillance; however, it requires a great deal of trust between those engaged in the file sharing.

Examples of Darknets are the WASTE and RetroShare networks. These networks are accessible through the Surface Web in the sense that the projects behind them have websites from which users can download the file-sharing software. However, the P2P or F2F connections that users establish are in the Deep Web because they are inaccessible through conventional search methods.

TOR network

The TOR Network is an anonymity network that whistle-blowers, political dissidents, terrorists, governments, spies, and various kinds of criminals use to communicate without revealing their identities to each other or to people potentially surveilling them. This network, and its reputation, is often conflated with the entire concept of the Deep Web.

TOR is an acronym for “the onion router.” The TOR software directs traffic through a free, worldwide volunteer network consisting of more than four thousand relay points on personal computers and servers. The TOR network passes data packets in much the same way as conventional computer networks; however, it encrypts the data before it passes through the network. The onion router network gets its name because the data packets are nested within layers of encryption, which anonymize the content of the data packet as well as the relay information. Instead of simply choosing the relay path that is the shortest distance to the desired server, the TOR network requires that the data pass through three relay points before reaching the final destination.

When a user requests data, and thereby provides user and location information, the TOR software wraps the request in three layers of encryption, each of which provides directions to a randomly chosen relay point. As the information passes to the first point, that relay decrypts one layer and learns only where to send the data next. It knows which computer handed it the data, but not where the data is ultimately going or what it contains. The second relay point decrypts the second layer, which sends the data on to the third point. The third point decrypts the final layer and sends the data to the final destination without knowing who originally requested it. The desired website or server receives the data, but is unable to identify the user or the location of his or her computer. Although there are vulnerabilities in servers themselves, the TOR network is widely regarded as one of the more secure anonymity networks available.
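The layering idea can be sketched in Python. This is emphatically not the TOR software or its cryptography: the `keystream_xor` function is a toy cipher built from SHA-256 that should never be used for real security, and the relay names, keys, and destination are invented. The sketch also does not model whether the message itself is separately encrypted. It only demonstrates how each relay can peel off one layer and learn nothing but the next hop.

```python
import hashlib
import json


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher (NOT secure): XOR the data with a SHA-256-derived keystream.
    Applying it twice with the same key restores the original data."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(key: bytes, next_hop: str, inner: bytes) -> bytes:
    """Add one layer: the next hop plus the inner blob, encrypted for one relay."""
    layer = json.dumps({"next": next_hop, "data": inner.hex()}).encode()
    return keystream_xor(key, layer)


def peel(key: bytes, blob: bytes):
    """Remove one layer: returns (next hop, remaining blob)."""
    layer = json.loads(keystream_xor(key, blob).decode())
    return layer["next"], bytes.fromhex(layer["data"])


# Hypothetical relays and keys; real keys are negotiated, not hard-coded.
RELAY_KEYS = {"relay1": b"key-one", "relay2": b"key-two", "relay3": b"key-three"}


def build_onion(message: bytes, path: list, destination: str) -> bytes:
    """Wrap the message from the inside out, so the first relay's layer is outermost."""
    hops = path[1:] + [destination]
    blob = message
    for relay, next_hop in reversed(list(zip(path, hops))):
        blob = wrap(RELAY_KEYS[relay], next_hop, blob)
    return blob


path = ["relay1", "relay2", "relay3"]
onion = build_onion(b"hello", path, "example-site")

trace = []          # what each relay learns: only where to send the data next
blob = onion
for relay in path:
    next_hop, blob = peel(RELAY_KEYS[relay], blob)
    trace.append((relay, next_hop))
```

After the loop, `trace` records that the first relay learned only about the second, the second only about the third, and the third only about the destination, while `blob` holds the original message ready to be delivered.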

The TOR Network was originally sponsored by the US Naval Research Laboratory and supported by DARPA. The software is now maintained and updated by the Tor Project, a non-profit organization that has received funding from the Electronic Frontier Foundation. The network not only provides anonymity to users, but also to websites. Websites hosted within the network do not disclose their IP addresses, so they are not accessible through an ordinary connection to the World Wide Web. One can only access them by knowing their “onion addresses,” which enable users to go around firewalls while protecting the identities of both the users and the websites.

The TOR Network has gained the attention of the mainstream media recently for three reasons: 1) Edward Snowden allegedly used the TOR Network to anonymously send millions of NSA documents to the Washington Post and the Guardian; 2) the owner of the Silk Road, a popular drugs and weapons website, was arrested; and 3) Bitcoin, the online P2P payment system and digital currency, was accepted as legitimate payment by some popular online services, and investors have become increasingly interested in the appreciation of this currency.

 

Comments on “Diving the Deep Web: An In-depth Look”

  2. A similar term I heard once is “dark social.” This refers to information exchanged between people in a way that does not leave easily read metadata. Sending a link to a youtube video to a friend by chat, or posting it in a chatroom or on a forum or mentioning it to someone else in the room or typing the url on someone else’s computer to show them the funny cat video can generate traffic that doesn’t get counted by facebook, twitter, or google.

    I found this idea funny bc it presumes that all social media activity can normally be seen clearly (by a few companies?), measured, and studied. Traditional forms of communication have been relabeled “dark” relative to certain standards of visibility, ie those of the dominant corporations and industries that have grown up around their services.

    • That is an excellent point! It is strange that forms of communication which are quite common are being labeled “dark” because a few companies are unable to gather data on them. It seems that when companies call these activities “dark social” it implies that there is something suspicious about what people are communicating.
