From Web Site to Web Service


HTML screen-scraping is the practice of connecting to a web site, retrieving a result (usually HTML), and then sifting through the unstructured data to extract a useful value. For example, to find the current sales rank for this book on Barnes and Noble, you could look it up by ISBN (0-7645-5890-0) using http://shop.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=0764558900.

You could then view the HTML source and search for 'sales rank'. You would find something like:

 <font size="-1">sales rank: 1,823</font> 

In the past, to automate this through code, we would write a small application that connected to BarnesAndNoble.com through WinInet or the XMLHttp component and requested the document. Then we'd write some code to search the HTML for the string <font size="-1">sales rank: and for the closing string </font> immediately following it. Anything found between these two strings would be the result.

This isn't efficient, and the code to perform this match would be very fragile. If the HTML changed, your code would no longer function correctly.
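The fragility is easy to demonstrate. The following is a minimal sketch in Python (chosen only as a neutral illustration language, not the WinInet or XMLHttp code described above); the helper name extract_between and the altered markup are hypothetical:

```python
def extract_between(html, start, end):
    """Return the text between the first `start` marker and the next `end` marker, or None."""
    begin = html.find(start)
    if begin == -1:
        return None  # opening marker not found
    begin += len(start)
    finish = html.find(end, begin)
    if finish == -1:
        return None  # closing marker not found
    return html[begin:finish].strip()


START = '<font size="-1">sales rank:'
END = "</font>"

# The markup shown earlier in this section.
original = '<font size="-1">sales rank: 1,823</font>'
# Hypothetical redesign: only the font size attribute changes.
changed = '<font size="-2">sales rank: 1,823</font>'

print(extract_between(original, START, END))  # 1,823
print(extract_between(changed, START, END))   # None
```

A one-character change in the page markup is enough to make the literal search fail, which is the weakness the text describes.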

.NET makes a lot of this much easier. For example, it supports regular expression pattern-matching. Finding the sales rank value in the preceding string is now simply a matter of writing the appropriate regular expression and then searching the document with it:

  "size=.-1.>sales rank:.(.*?)</"  

When this pattern is matched against the page, its capture group, (.*?), contains the sales rank.
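To see what the pattern captures, here is a small sketch using Python's re module rather than the .NET regular expression classes. For a pattern this simple the two engines should behave alike, but treat that as an assumption. The function name sales_rank and the second test string are illustrative only:

```python
import re

# The pattern from the text. Each "." stands in for a character that may vary
# (the quotes around -1, the space after the colon); (.*?) lazily captures
# everything up to the next "</".
PATTERN = re.compile(r"size=.-1.>sales rank:.(.*?)</", re.IGNORECASE)


def sales_rank(html):
    """Return the captured sales rank text, or None if the pattern does not match."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


print(sales_rank('<font size="-1">sales rank: 1,823</font>'))  # 1,823
# Different quoting and letter case still match, thanks to "." and IGNORECASE.
print(sales_rank("<FONT SIZE='-1'>Sales Rank: 1,823</FONT>"))  # 1,823
```

The wildcards make the match slightly more tolerant than a literal string search, although it still depends on the overall shape of the markup.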

Note

You can find more information on regular expressions and pattern-matching in Chapter 16.

While regular expressions simplify searching for strings, you still need to write all the other code to access the site and return the HTML, as well as wrapping all of this in a friendly API.

ASP.NET Web services automate much of this by allowing you to build custom WSDL documents that specify the location, parameters, regular expression to match, and return types. You can then use a proxy generation tool, such as Visual Studio .NET or wsdl.exe, to generate a proxy object that encapsulates all of this.

Authoring the WSDL

Let's write an example that returns the sales rank for any book, for a provided ISBN. First, the WSDL:

  <?xml version="1.0"?>
  <definitions xmlns:s="http://www.w3.org/2000/10/XMLSchema"
               xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
               xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
               xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
               xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
               xmlns:s0="http://tempuri.org/"
               targetNamespace="http://tempuri.org/"
               xmlns="http://schemas.xmlsoap.org/wsdl/"
               xmlns:msType="http://microsoft.com/wsdl/mime/textMatching/">
    <types/>
    <message name="GetBookDetailsHttpGetIn">
      <part name="isbn" type="s:string"/>
    </message>
    <message name="GetBookDetailsHttpGetOut"/>
    <portType name="BarnesAndNobleHttpGet">
      <operation name="GetBookDetails">
        <input message="s0:GetBookDetailsHttpGetIn"/>
        <output message="s0:GetBookDetailsHttpGetOut"/>
      </operation>
    </portType>
    <binding name="BarnesAndNobleHttpGet" type="s0:BarnesAndNobleHttpGet">
      <http:binding verb="GET"/>
      <operation name="GetBookDetails">
        <http:operation location="/booksearch/isbnInquiry.asp"/>
        <input>
          <http:urlEncoded/>
        </input>
        <output>
          <msType:text>
            <msType:match name="Rank"
                          pattern="size=.-1.&gt;sales rank:.(.*?)&lt;/"
                          ignoreCase="true"/>
          </msType:text>
        </output>
      </operation>
    </binding>
    <service name="BarnesAndNoble">
      <port name="BarnesAndNobleHttpGet" binding="s0:BarnesAndNobleHttpGet">
        <http:address location="http://shop.barnesandnoble.com"/>
      </port>
    </service>
  </definitions>

This WSDL defines a <service> named BarnesAndNoble, which also names the end-point, http://shop.barnesandnoble.com, that we'll make queries against. This service does not use SOAP, but instead uses HTTP GET to make requests. There are two other areas of interest in the WSDL you just saw: <binding> and <message name="GetBookDetailsHttpGetIn">.

The <binding> element defines an operation, GetBookDetails, that further qualifies the end-point to which the HTTP GET request is sent (/booksearch/isbnInquiry.asp). It also defines, in the <output> section, a <msType:match ...> element. Within this <match> element, we declare the regular expression used for our string match. The value of the name attribute controls the name of the property that the proxy will create, which is used to access the sales rank of a given ISBN. The pattern value is the regular expression used to search the document and return results. Because the pattern is stored in an XML attribute, its > and < characters are written as &gt; and &lt;.
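As a quick check that the escaped pattern attribute decodes to the regular expression shown earlier, the sketch below parses just the <output> fragment of the WSDL with Python's standard XML parser. It only illustrates XML attribute decoding; it is not how wsdl.exe reads the document, and the function name read_matches is hypothetical:

```python
import xml.etree.ElementTree as ET

# The <output> section of the binding, with the msType prefix declared locally
# so that the fragment can be parsed on its own.
FRAGMENT = """
<output xmlns:msType="http://microsoft.com/wsdl/mime/textMatching/">
  <msType:text>
    <msType:match name="Rank"
                  pattern="size=.-1.&gt;sales rank:.(.*?)&lt;/"
                  ignoreCase="true"/>
  </msType:text>
</output>
"""

NS = {"msType": "http://microsoft.com/wsdl/mime/textMatching/"}


def read_matches(xml_text):
    """Return (name, pattern, ignore_case) for every msType:match element."""
    root = ET.fromstring(xml_text.strip())
    return [
        (m.get("name"), m.get("pattern"), m.get("ignoreCase") == "true")
        for m in root.iterfind(".//msType:match", NS)
    ]


for name, pattern, ignore_case in read_matches(FRAGMENT):
    # The parser has already turned &gt; and &lt; back into > and <.
    print(name, pattern, ignore_case)
# Rank size=.-1.>sales rank:.(.*?)</ True
```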

The other section of interest is the <message name="GetBookDetailsHttpGetIn"> element, which defines the parameters that we want to send as part of the HTTP GET request. The defined value, isbn (which is of type string), will be used to formulate a request such as http://shop.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=1861004753 or http://shop.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=1861004885.
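The way the three pieces of the WSDL combine into a request URL can be sketched as follows. This is an illustration in Python, not the proxy's actual code; build_request_url is a hypothetical helper:

```python
from urllib.parse import urlencode

ADDRESS = "http://shop.barnesandnoble.com"  # from <http:address location="..."/>
LOCATION = "/booksearch/isbnInquiry.asp"    # from <http:operation location="..."/>


def build_request_url(address, location, **params):
    """Join the service address, the operation location, and URL-encoded parameters."""
    return f"{address}{location}?{urlencode(params)}"


# The isbn parameter comes from the <part name="isbn"> of the input message.
print(build_request_url(ADDRESS, LOCATION, isbn="1861004753"))
# http://shop.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=1861004753
```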

We're now ready to build a proxy for this WSDL.

Building the Proxy

Save the WSDL as bn.wsdl, open a command prompt, and move to the directory where the file was saved. Then issue the following command:

  wsdl.exe /language:VB bn.wsdl  

If there were no errors in the WSDL, you should now have a VB.NET source file named BarnesAndNoble.vb. Let's look at the source of this file (comments, which are not relevant to this discussion, have been removed):

  Option Strict Off
  Option Explicit On

  Imports System
  Imports System.ComponentModel
  Imports System.Diagnostics
  Imports System.Web.Services
  Imports System.Web.Services.Protocols
  Imports System.Xml.Serialization

  <System.Diagnostics.DebuggerStepThroughAttribute(), _
   System.ComponentModel.DesignerCategoryAttribute("code")> _
  Public Class BarnesAndNoble
      Inherits System.Web.Services.Protocols.HttpGetClientProtocol

      Public Sub New()
          MyBase.New
          Me.Url = "http://shop.barnesandnoble.com"
      End Sub

      <System.Web.Services.Protocols.HttpMethodAttribute( _
          GetType(System.Web.Services.Protocols.TextReturnReader), _
          GetType(System.Web.Services.Protocols.UrlParameterWriter))> _
      Public Function GetBookDetails(ByVal isbn As String) As GetBookDetailsMatches
          Return CType(Me.Invoke("GetBookDetails", _
              (Me.Url + "/booksearch/isbnInquiry.asp"), _
              New Object() {isbn}), GetBookDetailsMatches)
      End Function

      Public Function BeginGetBookDetails(ByVal isbn As String, _
              ByVal callback As System.AsyncCallback, _
              ByVal asyncState As Object) As System.IAsyncResult
          Return Me.BeginInvoke("GetBookDetails", _
              (Me.Url + "/booksearch/isbnInquiry.asp"), _
              New Object() {isbn}, callback, asyncState)
      End Function

      Public Function EndGetBookDetails(ByVal asyncResult As System.IAsyncResult) _
              As GetBookDetailsMatches
          Return CType(Me.EndInvoke(asyncResult), GetBookDetailsMatches)
      End Function
  End Class

  Public Class GetBookDetailsMatches
      <System.Web.Services.Protocols.MatchAttribute("size=.-1.>sales rank:.(.*?)</", _
          IgnoreCase:=true)> _
      Public Rank As String
  End Class

This auto-generated source file contains two classes. The first class, BarnesAndNoble, has a function called GetBookDetails that accepts an ISBN as a parameter and returns an instance of GetBookDetailsMatches, along with the BeginGetBookDetails and EndGetBookDetails pair for calling it asynchronously. The returned type, GetBookDetailsMatches, is the second class defined in the source file. It contains a single member variable, Rank, which has an attribute applied to it that carries the regular expression declared in the WSDL.
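Conceptually, the generated proxy fetches the page and fills each field of the result class from its regular expression. The sketch below imitates that flow in Python. It is an analogy, not the .NET implementation: the class name BarnesAndNobleSketch, the injected fetch function, and the choice to return None when nothing matches are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlencode


@dataclass
class GetBookDetailsMatches:
    """Mirror of the generated result class: one field per <match> element."""
    Rank: Optional[str] = None


class BarnesAndNobleSketch:
    # Same pattern and ignore-case setting as the MatchAttribute on Rank.
    RANK_PATTERN = re.compile(r"size=.-1.>sales rank:.(.*?)</", re.IGNORECASE)

    def __init__(self, fetch: Callable[[str], str],
                 url: str = "http://shop.barnesandnoble.com"):
        self.url = url
        self._fetch = fetch  # injected so the sketch runs without network access

    def get_book_details(self, isbn: str) -> GetBookDetailsMatches:
        query = urlencode({"isbn": isbn})
        request_url = f"{self.url}/booksearch/isbnInquiry.asp?{query}"
        html = self._fetch(request_url)
        match = self.RANK_PATTERN.search(html)
        # Assumption: an unmatched pattern leaves the field empty.
        return GetBookDetailsMatches(Rank=match.group(1) if match else None)


def fake_fetch(url: str) -> str:
    """Stand-in for the HTTP GET request; returns canned markup."""
    return '<html><font size="-1">sales rank: 1,823</font></html>'


bn = BarnesAndNobleSketch(fake_fetch)
print(bn.get_book_details("0764558900").Rank)  # 1,823
```

Injecting the fetch function keeps the sketch testable offline. The real proxy issues the request itself through its HttpGetClientProtocol base class.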

Compile this source file by executing the following command (note that this command should be typed all on one line):

  vbc.exe /t:library /r:System.dll /r:System.Web.dll /r:System.Web.Services.dll /r:System.Xml.dll BarnesAndNoble.vb

This will generate an assembly named BarnesAndNoble.dll.

Using the Screen Scrape Proxy

Now that an assembly is built, you can deploy it to the \bin directory of a web application. You can then write the following ASP.NET page (using VB.NET here) that loads and uses the BarnesAndNoble proxy:

  <%@ Import Namespace="System.Net" %>

  <Script runat=server>
    Public Sub GetSalesRank(sender As Object, e As EventArgs)
      Dim bn As New BarnesAndNoble()
      Dim match As GetBookDetailsMatches
      match = bn.GetBookDetails(isbn.Value)

      rank.Text = match.Rank
    End Sub
  </Script>

  <font face=arial>
    <form runat=server>
      ISBN number: <input type=text id="isbn" runat=server/>
      <input type=submit id=submit onserverclick="GetSalesRank" runat=server/>
    </form>
    Sales Rank: <font color=red><b><asp:label id=rank runat=server/></b></font>
  </font>

Using C#, you would write:

  <%@ Page Language="C#" %>
  <%@ Import Namespace="System.Net" %>

  <Script runat="server">
    public void GetSalesRank(Object sender, EventArgs e) {
      BarnesAndNoble bn = new BarnesAndNoble();
      GetBookDetailsMatches match;
      match = bn.GetBookDetails(isbn.Value);
      rank.Text = match.Rank;
    }
  </Script>

  <font face=arial>
    <form runat=server ID="Form1">
      ISBN number: <input type=text id="isbn" runat=server NAME="isbn"/>
      <input type=submit id=submit onserverclick="GetSalesRank" runat=server NAME="submit"/>
    </form>
    Sales Rank: <font color=red><b><asp:label id=rank runat=server/></b></font>
  </font>

This ASP.NET page creates a new instance of the BarnesAndNoble proxy object in the GetSalesRank event handler, which runs when the submit button is clicked. The BarnesAndNoble instance, bn, is then used to call the GetBookDetails method, passing in the ISBN:

Figure 20-7:

In the screenshot shown in Figure 20-7 you can see that the ASP.NET page has successfully queried BarnesAndNoble.com for the sales rank of this book.

Note

This screen-scraping feature of ASP.NET Web services allows you to turn any web site into a web service. You can simply author the WSDL, and VS.NET or wsdl.exe takes care of the rest.




Professional ASP.NET 1.1