[TuT] Parsing HTML

Discussion in 'Programming General' started by speljohan, Mar 16, 2007.

  1. Unread #1 - Mar 16, 2007 at 5:08 AM
  2. speljohan

    In this tutorial we will learn how to parse HTML, and you will see how easy it really is.

    First: you should have basic knowledge of VB.NET before you begin. Basically, this application will:

    1. Download the HTML of a website
    2. Put the HTML in a variable
    3. Parse the HTML and get specific words from it
    4. Output it in the application

    To keep this tutorial simple, I decided to make it parse how many users are on a world.

    Ok, ready?

    Creating The Base


    1. Create a new function in the code. Let's call it GetPlayers:
    Code:
    Public Function GetPlayers(ByVal worldNumber As Integer) As Integer
    'our code will go here
    End Function
    2. Now, let's add some basic stuff to the GetPlayers function:
    Code:
    Public Function GetPlayers(ByVal worldNumber As Integer) As Integer
    Dim src As String
    Dim net As New Net.WebClient()
    src = net.DownloadString("http://www.runescape.com/slj.ws?lores.x=366&lores.y=232&plugin=0")
    End Function
    Ok, so now this code needs some explanation.

    Code:
    Dim src As String 'Declares the variable that will hold the source code
    Code:
    Dim net As New Net.WebClient() 'Creates a new instance of a WebClient
    Code:
    src = net.DownloadString("http://www.runescape.com/slj.ws?lores.x=366&lores.y=232&plugin=0") 'Downloads the html source and puts it in the variable src
    3. Great, now we have the whole source code :) Now, all we have to do is parse the data for the specified world.

    String Handling


    First, let me introduce you to some commonly used functions when parsing strings.
    Code:
    InStr
    Substring
    Split
    These are the three keys to successful parsing. So, what does each one do?

    InStr


    InStr returns the position of the first occurrence of one string inside another, counting from 1 (or 0 if it is not found). So, if we for example ran this code:
    Code:
    Dim myString As String = "Hi Mom! I am home"
    Dim pos As Integer
    pos =  InStr(myString, "home", CompareMethod.Text)
    MsgBox(pos)
    
    This would return 14, because the word "home" begins at character position 14.
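A few more InStr cases are worth trying before relying on it. The snippet below is only a small console sketch (the module name InStrDemo is made up for illustration, and it prints with Console.WriteLine instead of MsgBox) showing what comes back when the text is missing and how the compare mode affects case:

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module InStrDemo
    Sub Main()
        Dim myString As String = "Hi Mom! I am home"

        'Found: InStr counts from 1, so the "h" of "home" is position 14
        Console.WriteLine(InStr(myString, "home", CompareMethod.Text))

        'Not found: InStr returns 0 instead of raising an error
        Console.WriteLine(InStr(myString, "school", CompareMethod.Text))

        'CompareMethod.Text ignores case, CompareMethod.Binary does not
        Console.WriteLine(InStr(myString, "HOME", CompareMethod.Text))   'prints 14
        Console.WriteLine(InStr(myString, "HOME", CompareMethod.Binary)) 'prints 0
    End Sub
End Module
```

The 0 result matters later on: always check for it before feeding the position into Substring.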

    So, what can you use this for? Well, in the next example I will explain Substring, which can be combined with InStr.

    Substring


    Substring returns part of a string, starting at a given index (counting from 0) and running for a given number of characters. Here's an example that can be connected to the previous one:
    Code:
    Dim homeOnly As String
    Dim myString As String = "Hi Mom! I am home"
    
    homeOnly =  myString.Substring(InStr(myString, "home", CompareMethod.Text) - 1, 4) 
    MsgBox(homeOnly)
    
    Looks hard? It isn't actually. Let's go through it step by step:
    Code:
    myString.Substring( 'We are extracting a piece of myString
    InStr(myString, "home", CompareMethod.Text) - 1, 'The first argument of Substring is the index to start from. InStr counts from 1 but Substring counts from 0, so the -1 moves back one step to land on the "h" of "home"
    4) 'The second argument is the length: take 4 characters
    So, as you see, it's not very hard to pull a piece out of a string. Now, let's say we wanted to break a string into pieces at a certain character. That's what the Split method is for.
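One more InStr plus Substring combination before moving on: grabbing the text between two markers. This is a rough sketch only. TextBetween is a hypothetical helper name and the title tag string is just sample input, neither comes from the tutorial's own code.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SubstringDemo
    'Hypothetical helper: returns the text between startTag and endTag, or "" if either is missing
    Function TextBetween(ByVal source As String, ByVal startTag As String, ByVal endTag As String) As String
        Dim startPos As Integer = InStr(source, startTag, CompareMethod.Text)
        If startPos = 0 Then Return ""

        'InStr is 1-based and Substring is 0-based, so (startPos - 1) is the index of startTag itself;
        'adding startTag.Length skips past it to the first character we want
        Dim contentStart As Integer = startPos - 1 + startTag.Length

        'Search for endTag from just after startTag (the start argument of InStr is 1-based too)
        Dim endPos As Integer = InStr(contentStart + 1, source, endTag, CompareMethod.Text)
        If endPos = 0 Then Return ""

        Return source.Substring(contentStart, (endPos - 1) - contentStart)
    End Function

    Sub Main()
        'Prints: Hello World
        Console.WriteLine(TextBetween("<title>Hello World</title>", "<title>", "</title>"))
    End Sub
End Module
```

Returning an empty string when a marker is missing is just one possible choice; the point is that the 0 from InStr gets checked before Substring is called.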

    Split


    Split is used to split a string into an array. Let's look at this example:
    Code:
    Dim toSplit As String = "Hello,World!"
    Dim output() As String
    output = toSplit.Split(","c) 'The c suffix makes "," a Char, which is what Split expects
    MsgBox(output(0))
    MsgBox(output(1))
    That code would put "Hello" in output(0) and "World!" in output(1). Why? Well, because we used Split on the string with the character ",". Split removes the "," and puts the pieces on either side of it into the array.
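Split gets more useful when there are several separators. Here is a small console sketch with made-up sample text (the module name SplitDemo and the values in the string are only for illustration) that splits a line into four pieces and walks through them by index:

```vbnet
Imports System

Module SplitDemo
    Sub Main()
        Dim toSplit As String = "12,true,online,World"
        Dim parts() As String = toSplit.Split(","c) 'the c suffix makes "," a Char literal

        'Length is the number of pieces; indexes run from 0 to Length - 1
        Console.WriteLine(parts.Length) 'prints 4
        For i As Integer = 0 To parts.Length - 1
            Console.WriteLine(i & ": " & parts(i))
        Next

        'A separator that never occurs gives back the whole string as a single piece
        Console.WriteLine("no commas here".Split(","c).Length) 'prints 1
    End Sub
End Module
```

Checking Length before reading a fixed index such as parts(4) avoids an IndexOutOfRangeException when the text has fewer pieces than expected.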

    Now you know the most important functions (at least the ones I use most), so let's move on and parse that damn HTML to get our function working :)

    3. Parsing the actual Runescape HTML to retrieve the number of users on a world


    Before you begin, study the HTML code of the website, and see if you can locate the world numbers in it. At the moment they're all in there as JavaScript. Each world currently has a layout like:
    Code:
    e(worldNumber,members,status,prefix,usersOnline,country/flag);
    So as you see, we've already found what we need to parse: usersOnline is the fifth value. So let's continue with the code we started at the beginning of this tutorial, shall we? First, let's establish some facts. The easiest way to parse this is to cut the source up somehow. What do we find repeatedly throughout the source? "e(" is very common. So let's look for "e(" followed by our world number, cut the source off at that point, and split what follows on every "," :D

    Code:
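    Public Function GetPlayers(ByVal worldNumber As Integer) As Integer
    Dim src As String
    Dim tmp() As String 'Holds the comma-separated values of the world entry
    Dim net As New Net.WebClient()
    src = net.DownloadString("http://www.runescape.com/slj.ws?lores.x=366&lores.y=232&plugin=0")
    
    'Cut off everything before the entry for this world, then split its values
    src = src.Substring(InStr(src, "e(" & worldNumber & ","))
    tmp = src.Split(","c)
    
    Return CInt(tmp(4)) 'usersOnline is the fifth value, so index 4
    End Function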
    Now, if you read my whole tutorial you should understand the code above. It's one of the simplest cases of string parsing from Runescape.com. The same code could be used to retrieve other values about a world easily, by returning some other index of tmp (if the value is not a number, change the return type of the function to String as well). Example:
    Code:
    Return tmp(3)
    would return the prefix for that world.
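To try the parsing without downloading anything, the same steps can be run against a string. What follows is a sketch under assumptions, not the tutorial's own code: ParsePlayers is a hypothetical helper, the sample text is made up to match the layout described above (it is not real data from the site), and returning -1 for "not found" is an arbitrary choice.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module WorldParserDemo
    'Hypothetical helper: same parsing steps as GetPlayers, but on a string passed in.
    'Returns -1 if the world is not found or the value is not a number.
    Function ParsePlayers(ByVal src As String, ByVal worldNumber As Integer) As Integer
        Dim pos As Integer = InStr(src, "e(" & worldNumber & ",")
        If pos = 0 Then Return -1

        'Substring(pos) starts right after the "e", so the first piece is "(" plus the world number
        Dim tmp() As String = src.Substring(pos).Split(","c)
        If tmp.Length < 5 Then Return -1

        Dim players As Integer
        If Integer.TryParse(tmp(4).Trim(), players) Then Return players
        Return -1
    End Function

    Sub Main()
        'Made-up sample in the e(...) layout, not real data
        Dim sample As String = "e(1,0,1,""A"",250,""US"");e(2,1,1,""B"",731,""UK"");"
        Console.WriteLine(ParsePlayers(sample, 2))  'prints 731
        Console.WriteLine(ParsePlayers(sample, 99)) 'prints -1
    End Sub
End Module
```

Without the pos = 0 check, a world number that is not in the source would make Substring(0) return the whole page, and the function would then try to parse whatever happens to be the fifth comma-separated piece.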

    Hope you enjoyed learning about this stuff, and if you have any questions, just post them here.
     
  3. Unread #2 - May 12, 2007 at 3:47 PM
  4. 5cript

    Thanks a lot :D I've been looking around for a tutorial like this for a LONG time.
     