Thursday, June 12, 2008

HTML Screen Scraping using C# .Net WebClient

What is Screen Scraping?

Screen scraping means reading the contents of a web page. Suppose you go to yahoo.com: what you see is the interface, which includes buttons, links, images and so on. What you don't see is the target url of the links, the names of the images, or the method used by a button, which can be POST or GET. In other words, you don't see the HTML behind the page. Screen scraping pulls the HTML of the web page, and this HTML includes every tag that is used to make up the page.

Why use screen scraping?

The question that comes to mind is why we would ever want the HTML of a web page. Screen scraping does not stop at pulling out the HTML; it can display it too. In other words, you can pull the HTML from any web page and display that web page inside your own page. This works much like a frame, but because the scraped HTML becomes part of your own page, it does not depend on the browser supporting frames.

Also, sometimes you go to a website which has many links that say image1, image2, image3 and so on. In order to see those images you have to click on each link, and the image opens in the parent window or a new window. By using screen scraping you can pull all the images from a particular web page and display them on your own page.

Displaying a web page on your own page using Screen Scraping:

Let's see a small code snippet which you can use to display any page on your own page. First make a small interface as I have made below. As you can see, the interface is quite simple. It has a button which says "Display WebPages below" and a label, and, believe it or not, the web page will be displayed in place of the label. All the code is written in the Button Click event. Below you can see the "Button Click Code".



C# Button Click Code:

private void Button1_Click(object sender, System.EventArgs e)
{
    // WebClient lives in System.Net, UTF8Encoding in System.Text
    WebClient webClient = new WebClient();
    const string strUrl = "http://www.yahoo.com/";
    byte[] reqHTML;
    // download the raw bytes of the page
    reqHTML = webClient.DownloadData(strUrl);
    UTF8Encoding objUTF8 = new UTF8Encoding();
    // decode the bytes as UTF-8 text and show the HTML in the label
    lblWebpage.Text = objUTF8.GetString(reqHTML);
}

Explanation of the Code Snippet in C#:

As you can see, the code is only a few lines long. This is because the .NET Framework has a very strong set of class libraries that make the task easier for the developer. If you were trying to achieve the same result in classic ASP you might have to write a lot more code. I guess that's good for all the coders out there in the programming world.

In the first line I made an object of the WebClient class. The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI.

In the next line we define a constant string strUrl which holds the url of the web page we wish to use in our example.

Then we declared a byte array reqHTML which will hold the bytes transferred from the web page.

The next line downloads the data in the form of bytes and puts them in the reqHTML byte array.

The UTF8Encoding class represents the UTF-8 encoding of Unicode characters.

In the next line we use the GetString method of the UTF8Encoding class to turn the bytes into a string, and finally we assign the result to the label.

When the button is clicked, this code downloads the www.yahoo.com homepage and assigns its HTML to the label, so the whole Yahoo page is displayed in place of the label.
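
The snippet above assumes the page is encoded as UTF-8. As a rough sketch only, here is one way you might pick the encoding from the Content-Type header instead and fall back to UTF-8 when no usable charset is given. The class name HtmlDecoder and its two methods are made up for this illustration; they are not part of the .NET Framework.

```csharp
using System;
using System.Text;

// Hypothetical helper, not a framework class.
public static class HtmlDecoder
{
    // Returns the encoding named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or UTF-8 when no usable charset is found.
    public static Encoding PickEncoding(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string p = part.Trim();
                if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = p.Substring("charset=".Length).Trim('"', ' ');
                    try
                    {
                        return Encoding.GetEncoding(name);
                    }
                    catch (ArgumentException)
                    {
                        // unknown charset name: fall through to the UTF-8 default
                    }
                }
            }
        }
        return Encoding.UTF8;
    }

    // Decodes the downloaded bytes using the encoding picked above.
    public static string Decode(byte[] data, string contentType)
    {
        return PickEncoding(contentType).GetString(data);
    }
}
```

With WebClient, the header value should be available as webClient.ResponseHeaders["Content-Type"] once DownloadData has returned, so the label could be filled with HtmlDecoder.Decode(reqHTML, webClient.ResponseHeaders["Content-Type"]).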

The Generated HTML:

Curious readers may want to see what HTML was generated when the request was made. You can view it by looking at the source of the Yahoo page: in Internet Explorer go to View -> Source, and Notepad will open with the complete HTML of the page. Let's see a small screen shot of the HTML generated when we visit yahoo.com. As you can see, the generated HTML is quite complex. Wouldn't it be really cool if you could extract all the links from the generated source? Let's try to do that :)

Extracting Urls:

The first thing you need in order to extract all the urls from a web page is a regular expression. I am not saying you cannot do this without a regular expression; you can, but it will be much harder.

Regular Expression for Extracting Urls:

First you need to import the System.Text.RegularExpressions namespace. Next you need a regular expression that can extract all the urls from the generated HTML. There are many ready-made regular expressions which you can browse at http://www.regexlib.com/ . Your regular expression would look like this:

Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]* ))");

This matches every href attribute in the page source and captures its value, whether the value is wrapped in double quotes or left unquoted.
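
To see the idea in isolation, here is a minimal sketch that runs a pattern of this kind over an HTML string and returns the captured values. The class name LinkExtractor, the method ExtractHrefs and the group name "url" are my own choices for the example, and the pattern is a slightly tightened variant (an unquoted value stops at whitespace or ">"), not necessarily the exact expression used in this article. It does not handle single-quoted values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LinkExtractor
{
    // A double-quoted value is captured without its quotes; otherwise an
    // unquoted run of characters up to whitespace or '>' is captured.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*(?:\"(?<url>[^\"]*)\"|(?<url>[^\\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns the href values in the order they appear in the HTML.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // read only the named group, not the whole "href=..." match
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Calling LinkExtractor.ExtractHrefs(myHtml) on the downloaded page source gives you a plain list of strings that can be bound to a grid or written to a file.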
User Interface in Visual Studio .Net:

I am keeping the user interface pretty simple. It consists of a textbox, a datagrid and a button. The datagrid will be used to display all the extracted urls.

Here is a screen shot of the User Interface.

The Code:

Okay, the code is implemented in the button click event. But before that, let's see the important declarations. You also need to include the following namespaces:

using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections; // for the ArrayList
using System.IO; // if you plan to write to a file

// creates a button
protected System.Web.UI.WebControls.Button Button1;
// creates a byte array
private byte[] aRequestHTML;
// creates a string
private string myString = null;
// creates a datagrid
protected System.Web.UI.WebControls.DataGrid DataGrid1;
// creates a textbox
protected System.Web.UI.WebControls.TextBox TextBox1;
// creates the label
protected System.Web.UI.WebControls.Label Label1;
// creates the arraylist
private ArrayList a = new ArrayList();

Okay, now let's see the button click code that does the actual work.

private void Button1_Click(object sender, System.EventArgs e)
{
    // make an object of the WebClient class
    WebClient objWebClient = new WebClient();
    // gets the HTML from the url written in the textbox
    aRequestHTML = objWebClient.DownloadData(TextBox1.Text);
    // creates UTF8 encoding object
    UTF8Encoding utf8 = new UTF8Encoding();
    // decodes the downloaded bytes in aRequestHTML into a string
    myString = utf8.GetString(aRequestHTML);
    // this is a regular expression to check for the urls
    Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]* ))");
    // get all the matches depending upon the regular expression
    MatchCollection mcl = r.Matches(myString);

    foreach (Match ml in mcl)
    {
        // add only the captured url to the array list;
        // Groups[0] would be the whole "href=..." match
        a.Add(ml.Groups["url"].Value);
    }

    // assign arraylist to the datasource
    DataGrid1.DataSource = a;
    // binds the data to the datagrid
    DataGrid1.DataBind();

    // writes the extracted urls to a file named test.txt, one per line
    using (StreamWriter sw = new StreamWriter(Server.MapPath("test.txt")))
    {
        foreach (string url in a)
        {
            sw.WriteLine(url);
        }
    }
}

The MatchCollection mcl holds all the matches, and you can iterate through the collection to get every extracted url. Once you enter a url in the textbox and press the button, the datagrid will be populated with the extracted urls. Here is a screen shot of the datagrid. The screen shot shows only a few of the extracted urls; there are at least 50 of them.

Final Note:

As you can see, it's simple to extract urls from any web page. You can also make the column in the datagrid a hyperlink column so that you can browse to the extracted urls.
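
One thing to keep in mind if you want the extracted urls to be clickable: many href values are relative, such as "/news" or "page2.html". As a small sketch, assuming you know the address of the page you scraped, something like the following could turn them into absolute urls and drop duplicates. The class LinkResolver and the method ToAbsolute are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class LinkResolver
{
    // Resolves each href against the address of the scraped page, keeps only
    // http/https results and skips urls that were already added.
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(pageUrl);
        var result = new List<string>();
        foreach (string href in hrefs)
        {
            Uri absolute;
            if (!Uri.TryCreate(baseUri, href, out absolute))
            {
                continue; // not a usable url
            }
            bool isWeb = absolute.Scheme == Uri.UriSchemeHttp
                      || absolute.Scheme == Uri.UriSchemeHttps;
            if (isWeb && !result.Contains(absolute.AbsoluteUri))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }
}
```

The resulting list can be bound to the datagrid in place of the raw href values.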
