
Web Crawler, spider, ant, bot... how to make one?

Introduction

A web crawler is a program that browses the World Wide Web in a methodical, automated manner. It is also known as a web spider, web robot, ant, bot, worm, or automated indexer. The best-known web crawlers are those run by search engines.

They crawl the web continuously to keep their search data up to date. In general, a web crawler starts with a list of URLs to visit; this list is called the "seeds". As the crawler visits these URLs, it extracts all the hyperlinks in the visited pages and adds them to the list of URLs to visit. This list of URLs still waiting to be visited is called the crawl frontier.

In this four-part tutorial we will build a web spider application. Its main function is to traverse a complete web site, collect all the URLs inside it, and save all its pages to a database file. The input to the spider is the web site URL (the seed). The spider visits the given URL, gets all the hyperlinks in the visited page, adds the new links to the list of URLs to visit (the crawl frontier), and saves the page itself to a database file. It then picks the next URL from the list and repeats the same cycle until the list is empty.
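The cycle described above can be sketched in a few lines of Visual Basic. This is only a minimal illustration of the seed and crawl frontier idea, not the spider built in this series: the module name, the "LinkExtractor" delegate, and the "DemoLinks" function (which stands in for real page downloading with a tiny hard-coded site at the made-up address http://example.test/) are all assumptions made for the example.

```vb
Imports System
Imports System.Collections.Generic

Module CrawlSketch
    ' Hypothetical delegate: given a URL, returns the hyperlinks found on that page.
    Public Delegate Function LinkExtractor(ByVal url As String) As List(Of String)

    ' Visits every URL reachable from the seed and returns them in visit order.
    Public Function Crawl(ByVal seed As String, ByVal getLinks As LinkExtractor) As List(Of String)
        Dim frontier As New Queue(Of String)()             ' URLs still to visit
        Dim seen As New Dictionary(Of String, Boolean)()   ' URLs already queued
        Dim visited As New List(Of String)()

        frontier.Enqueue(seed)
        seen(seed) = True

        While frontier.Count > 0
            Dim current As String = frontier.Dequeue()
            visited.Add(current)   ' a real spider would save the page here

            For Each link As String In getLinks(current)
                ' Only queue links that have not been seen before.
                If Not seen.ContainsKey(link) Then
                    seen(link) = True
                    frontier.Enqueue(link)
                End If
            Next
        End While

        Return visited
    End Function

    ' Stand-in for a real downloader: a made-up three-page site.
    Public Function DemoLinks(ByVal url As String) As List(Of String)
        Dim links As New List(Of String)()
        Select Case url
            Case "http://example.test/"
                links.Add("http://example.test/a")
                links.Add("http://example.test/b")
            Case "http://example.test/a"
                links.Add("http://example.test/")
                links.Add("http://example.test/b")
        End Select
        Return links
    End Function
End Module
```

Calling Crawl("http://example.test/", AddressOf DemoLinks) visits the home page, then /a, then /b, and visits each page only once even though the pages link back to each other. A spider limited to one web site would also skip links that point outside the seed's host.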

To build this scenario as an application, you first need to learn some new libraries and namespaces; then you will be ready to build your spider. That is what we will do in this series of web crawler tutorials.

So, let's start.

What Do We Really Need?

First of all we will create our application using Microsoft Visual Studio 2005 as the development environment, Visual Basic as the language, and Microsoft Access as the backend database.

To make it work, we will need to add some libraries and namespaces to our application: the "System.Net", "System.IO", "System.Data.OleDb", and "System.Data" namespaces, the "mshtml" DLL, and the "Microsoft Web Browser" COM component.

In this part of the tutorial we will talk about the first four (the namespaces), and in the next tutorial we will explain the remaining ones in more detail.

"System.Net" Namespace

This namespace is the main programming interface for most of the commonly used network protocols. By using classes like "WebRequest" and "WebResponse" you can access an internet resource without worrying about the underlying network protocol. These classes carry out all the tedious communication details for you, letting you focus on the main task you need to do. They are the basis of the so-called pluggable protocols, which let the same code access a resource whatever protocol its URL uses. We will use these two classes to access each URL in the URLs list and get the text stream that represents the page at that URL.
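As a rough sketch of how these two classes fit together, the function below downloads a resource and returns its content as text. The module and function names are made up for this example, and error handling (timeouts, missing pages, non-text content) is left out.

```vb
Imports System
Imports System.IO
Imports System.Net

Module FetchSketch
    ' Downloads the resource at the given URL and returns its body as text.
    Public Function FetchPage(ByVal url As String) As String
        ' WebRequest.Create picks the protocol handler from the URL scheme.
        Dim request As WebRequest = WebRequest.Create(url)

        Using response As WebResponse = request.GetResponse()
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

Because the protocol is chosen from the URL, the same function can read an "http://" address or a "file://" address without any change, which is the point of the pluggable protocol design.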

"System.IO" Namespace

This namespace provides everything you need to deal with files, directories, and data streams. With it you can perform file operations such as open, write, read, rename, copy, and delete. You can also use it to check whether a file or folder exists, or whether a path name is valid. If you need to deal with files, folders, or data streams for any reason, you will need to import this namespace into your application. In our program we will use it to check for file existence and to copy files, as we will see.
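The two operations mentioned above, checking for a file and copying it, might be combined as in the sketch below. The scenario is an assumption made for illustration: it supposes an empty template database file that is copied to the output location the first time the program runs. The module, function, and parameter names are hypothetical.

```vb
Imports System
Imports System.IO

Module FileSketch
    ' Copies templatePath to outputPath unless outputPath already exists.
    ' Returns True when a copy was made, False when the output was already there.
    Public Function EnsureDatabaseFile(ByVal templatePath As String, ByVal outputPath As String) As Boolean
        If File.Exists(outputPath) Then Return False

        If Not File.Exists(templatePath) Then
            Throw New FileNotFoundException("Template file not found.", templatePath)
        End If

        File.Copy(templatePath, outputPath)
        Return True
    End Function
End Module
```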

"System.Data" Namespace

This namespace provides the classes and types that represent the ADO.Net architecture. ADO.Net is a set of software components that gives the programmer the ability to access and modify data stored in relational and non-relational database systems. Architecturally, ADO.Net consists of two main parts: the data provider and the data set.

Data provider classes provide access to data sources like Microsoft SQL Server, Microsoft Access, or Oracle databases. Although they share a common set of utilities, each kind of data provider has its own set of "Connection", "Command", "Parameter", and "DataReader" classes. These classes provide, respectively, the connection used to communicate with the data source, the command to be carried out against the data, the parameters of that command, and a reader that walks through a large list of results one record at a time.

DataSet is a group of classes that form an in-memory representation of a relational database. Each data set can contain a set of related tables (belonging to the same database project), the relationships between these tables, and a data view for each table. Each table has its own data columns and data rows. Using this set of classes you can create your database tables and construct their columns and relationships, or fill the data set with the schema of an existing data source. To transfer data between the in-memory representation and the actual data source stored on disk, you use the "DataAdapter" class. This class populates the data set with data from the data source and updates the data source with the modified data. In our program we will construct an in-memory table to manage our URLs list, and we will use data commands to insert new rows, each representing a page, into our database output file.
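An in-memory table for the URLs list could look like the sketch below. The table name, the column names ("Url" and "Visited"), and the helper functions are assumptions chosen for this example; the table used later in this series may be laid out differently.

```vb
Imports System
Imports System.Data

Module UrlTableSketch
    ' Builds an empty in-memory table with one row per URL.
    Public Function CreateUrlTable() As DataTable
        Dim table As New DataTable("Urls")
        table.Columns.Add("Url", GetType(String))
        table.Columns.Add("Visited", GetType(Boolean))

        ' Making Url the primary key lets Rows.Find look a URL up directly.
        table.PrimaryKey = New DataColumn() {table.Columns("Url")}
        Return table
    End Function

    ' Adds a URL as "not visited yet". Returns False if it is already in the table.
    Public Function AddUrl(ByVal table As DataTable, ByVal url As String) As Boolean
        If table.Rows.Find(url) IsNot Nothing Then Return False
        table.Rows.Add(url, False)
        Return True
    End Function
End Module
```

Using the URL as the primary key means a link found on many pages is stored only once, which keeps the spider from visiting the same page repeatedly.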

"System.Data.OleDb" Namespace

This namespace is the ADO.Net data provider for OLE DB data sources. Like "System.Data", it provides what you need to deal with databases; the difference is that it provides these services for OLE DB data sources only. In this namespace you will find a class named "OleDbCommand" that represents a command to an OLE DB data source, "OleDbConnection" that represents a connection to one, and so on. We will use this namespace because our database engine is Microsoft Access, and the suitable data provider for it is the OLE DB data provider.
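To give an idea of how these classes are used together, here is a minimal sketch that inserts one page into an Access file. The table name "Pages", the column names "Url" and "Html", and the function names are assumptions made for this example, and the connection string assumes an .mdb file opened through the Jet 4.0 provider. Very long page text may need an explicit parameter type instead of the inferred one.

```vb
Imports System
Imports System.Data.OleDb

Module SavePageSketch
    ' Builds the INSERT command. OLE DB parameters are positional ("?"),
    ' so they must be added in the same order as the placeholders.
    Public Function BuildInsertCommand(ByVal url As String, ByVal html As String) As OleDbCommand
        Dim command As New OleDbCommand("INSERT INTO Pages (Url, Html) VALUES (?, ?)")
        command.Parameters.AddWithValue("@Url", url)
        command.Parameters.AddWithValue("@Html", html)
        Return command
    End Function

    ' Opens the database file, runs the INSERT, and closes the connection.
    Public Sub SavePage(ByVal databasePath As String, ByVal url As String, ByVal html As String)
        Dim connectionString As String = _
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & databasePath

        Using connection As New OleDbConnection(connectionString)
            connection.Open()
            Using command As OleDbCommand = BuildInsertCommand(url, html)
                command.Connection = connection
                command.ExecuteNonQuery()
            End Using
        End Using
    End Sub
End Module
```

Passing the URL and page text as parameters, rather than pasting them into the SQL string, avoids problems with quotation marks inside the downloaded HTML.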

To download the complete program, just click here.

For further information

Refer to the online copy of Microsoft Developers Network at http://msdn.microsoft.com or use your own local copy of MSDN.

