[Tutorial] PHP Scraping 101

Discussion in 'Web Programming' started by Viral_, Jul 21, 2017.

  1. Unread #1 - Jul 21, 2017 at 9:31 PM
  2. Viral_

    I am going to show you the basics of scraping with PHP. For the demonstration, we will scrape player stats from the RuneScape website.

    The basics of this all go through cURL. If anyone would like a tutorial on cURL I would be happy to give you the rundown, but for this demonstration we are going to use a library:

    GitHub - FriendsOfPHP/Goutte: Goutte, a simple PHP Web Scraper

    I am very fond of this library. I have built my own scraper, but for ease of use I will be using Goutte in this example.

    Requirements
    • Web server with a LAMP stack installed
    • Composer
    If you do not have a web server, you can follow my tutorial on how to set one up in this thread:
    [Tutorial] [Free] LAMP web server

    If you need help setting up Composer, go to this link:
    Composer


    Tutorial

    In the folder where you are putting your PHP scraping file, run:

    composer require fabpot/goutte

    Then start your PHP file:

    Code:
    <?PHP
    
    
    ?>
    Whether you are building on an existing platform or starting from scratch, you always want some sort of debugging so you can see any errors.

    PHP:
    <?php

    $debug = 1;

    if ($debug == 1) {
        // Report every error and print it to the page while developing.
        error_reporting(E_ALL);
        ini_set('display_errors', '1');
    }

    ?>
    This lets us see any errors as they happen. Now let's add the autoloader to make sure we are pulling in the Goutte library.

    PHP:
    <?php
    $debug = 1;

    if ($debug == 1) {
        error_reporting(E_ALL);
        ini_set('display_errors', '1');
    }

    // Composer's autoloader loads Goutte and its dependencies on demand.
    require_once('vendor/autoload.php');

    ?>
    Now, to instantiate the Goutte Client, we need to add a use statement for it.

    PHP:
    <?php
    $debug = 1;

    if ($debug == 1) {
        error_reporting(E_ALL);
        ini_set('display_errors', '1');
    }

    require_once('vendor/autoload.php');

    // Import the Goutte client class so it can be referenced as Client.
    use Goutte\Client;

    $username = 'zezima';

    $client = new Client();

    // Form fields sent by the hiscores lookup page.
    $post_values = array('user1' => $username, 'submit' => 'Search');

    $crawler = $client->request('POST', 'http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws', [], [], ['HTTP_CONTENT_TYPE' => 'application/x-www-form-urlencoded'], http_build_query($post_values));

    ?>
    Here we start the client, define our POST parameters, and then do our crawl. I found the POST parameters by going to the hiscores lookup and typing in a username with Chrome's inspect element console open, then looking at the request and the parameters it sent. If you need more details on this, a tutorial can be made.
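
    To make the POST body a bit more concrete, here is a small sketch of what http_build_query() produces for the same field names used above. It runs on its own without Goutte. The second username is made up purely to show how a space gets encoded.

```php
<?php
// Same field names as the tutorial's $post_values array.
$post_values = array('user1' => 'zezima', 'submit' => 'Search');

// http_build_query() URL-encodes each value and joins the pairs with "&".
$body = http_build_query($post_values);
echo $body, PHP_EOL; // user1=zezima&submit=Search

// Hypothetical username with a space: by default the space is encoded as "+".
$spaced = http_build_query(array('user1' => 'iron man', 'submit' => 'Search'));
echo $spaced, PHP_EOL; // user1=iron+man&submit=Search
```

    This string is what gets passed as the last argument of the request() call, which is why the content type is set to application/x-www-form-urlencoded.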

    Now let's go ahead and build an each loop over the elements matching #contentHiscores tr. We want to loop through every tr and pull out its values. The syntax is kind of like jQuery.

    PHP:
    <?php
    $debug = 1;

    if ($debug == 1) {
        error_reporting(E_ALL);
        ini_set('display_errors', '1');
    }

    require_once('vendor/autoload.php');

    use Goutte\Client;

    $username = 'zezima';

    $client = new Client();

    $post_values = array('user1' => $username, 'submit' => 'Search');

    $crawler = $client->request('POST', 'http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws', [], [], ['HTTP_CONTENT_TYPE' => 'application/x-www-form-urlencoded'], http_build_query($post_values));

    $acc_array = array();

    $crawler->filter('#contentHiscores tr')->each(function ($node, $k) use (&$acc_array) {
        // Skip the first four rows (indexes 0 to 3) and any row without exactly five cells.
        if (!in_array($k, array(0, 1, 2, 3)) && $node->children()->count() == 5) {
            // Cell 1 is the skill name; cells 2 to 4 are rank, level and xp.
            $acc_array['skills'][strtolower(trim($node->children()->eq(1)->text()))] = array(
                'rank'  => $node->children()->eq(2)->text(),
                'level' => $node->children()->eq(3)->text(),
                'xp'    => $node->children()->eq(4)->text(),
            );
        }
    });
    ?>
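
    If you want to try the row-filtering logic without hitting the live site, here is a rough sketch that applies the same two checks (skip indexes 0 to 3, keep only rows with five cells) to a made-up table. Note that it uses PHP's built-in DOMDocument and DOMXPath instead of Goutte so it runs with no dependencies, and the sample HTML is an assumption, not a copy of the real hiscores markup.

```php
<?php
// Made-up markup standing in for the hiscores table; the real page may differ.
$html = '<div id="contentHiscores"><table>'
    . '<tr><td colspan="5">Personal scores</td></tr>'
    . '<tr><td colspan="5">Navigation</td></tr>'
    . '<tr><td></td><td>Skill</td><td>Rank</td><td>Level</td><td>XP</td></tr>'
    . '<tr><td></td><td>Overall</td><td>100</td><td>2,277</td><td>1,000,000</td></tr>'
    . '<tr><td></td><td>Attack</td><td>1,234</td><td>99</td><td>13,034,431</td></tr>'
    . '<tr><td></td><td>Defence</td><td>2,345</td><td>90</td><td>5,346,332</td></tr>'
    . '<tr><td colspan="5">Minigames</td></tr>'
    . '</table></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$acc_array = array();

// Rough equivalent of $crawler->filter('#contentHiscores tr').
$rows = $xpath->query('//*[@id="contentHiscores"]//tr');

for ($k = 0; $k < $rows->length; $k++) {
    // Direct <td> children of this row.
    $cells = $xpath->query('./td', $rows->item($k));

    // Same checks as the tutorial: skip indexes 0 to 3, require five cells.
    if (!in_array($k, array(0, 1, 2, 3)) && $cells->length == 5) {
        $skill = strtolower(trim($cells->item(1)->textContent));
        $acc_array['skills'][$skill] = array(
            'rank'  => $cells->item(2)->textContent,
            'level' => $cells->item(3)->textContent,
            'xp'    => $cells->item(4)->textContent,
        );
    }
}

print_r($acc_array);
```

    With this sample only the Attack and Defence rows survive: the first four rows are dropped by index, and the last row is dropped because it has a single cell.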
    Let's now find the username and add it to $acc_array:
    PHP:
    <?php
    $debug = 1;

    if ($debug == 1) {
        error_reporting(E_ALL);
        ini_set('display_errors', '1');
    }

    require_once('vendor/autoload.php');

    use Goutte\Client;

    $username = 'zezima';

    $client = new Client();

    $post_values = array('user1' => $username, 'submit' => 'Search');

    $crawler = $client->request('POST', 'http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws', [], [], ['HTTP_CONTENT_TYPE' => 'application/x-www-form-urlencoded'], http_build_query($post_values));

    $acc_array = array();

    $crawler->filter('#contentHiscores tr')->each(function ($node, $k) use (&$acc_array) {
        if (!in_array($k, array(0, 1, 2, 3)) && $node->children()->count() == 5) {
            $acc_array['skills'][strtolower(trim($node->children()->eq(1)->text()))] = array(
                'rank'  => $node->children()->eq(2)->text(),
                'level' => $node->children()->eq(3)->text(),
                'xp'    => $node->children()->eq(4)->text(),
            );
        }
    });

    // Strip the leading "Personal ..." text so only the username remains.
    $acc_array['username'] = preg_replace('/Personal(.*?)('.$username.')$/', '$2', $crawler->filter('#contentHiscores tr td')->children()->eq(0)->text());
    ?>
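
    One thing to watch with the preg_replace above is that the pattern is case-sensitive. Here is a small sketch of how that plays out, plus a slightly more forgiving variant. The heading text is a guess at what the page shows, not something taken from the live site, and the variant with preg_quote() and the i flag is my own suggestion.

```php
<?php
$username = 'zezima';

// Hypothetical heading text; the wording on the live page is an assumption.
$heading = 'Personal scores for Zezima';

// Tutorial's pattern: case-sensitive, so "zezima" does not match "Zezima"
// and preg_replace() returns the heading unchanged.
$strict = preg_replace('/Personal(.*?)(' . $username . ')$/', '$2', $heading);

// Variant: preg_quote() escapes regex metacharacters in the name, and the
// "i" flag ignores case, so the captured name keeps the page's capitalisation.
$relaxed = preg_replace('/Personal(.*?)(' . preg_quote($username, '/') . ')$/i', '$2', $heading);

echo $strict, PHP_EOL;  // Personal scores for Zezima
echo $relaxed, PHP_EOL; // Zezima
```

    If the heading on the page already matches the case you typed, both versions give the same result.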
    To finish it off and show you the results, let's do a print_r.


    FINAL EXAMPLE:

    PHP:
    <?php
    $debug = 1;

    if ($debug == 1) {
        // Report every error and print it to the page while developing.
        error_reporting(E_ALL);
        ini_set('display_errors', '1');
    }

    require_once('vendor/autoload.php');

    use Goutte\Client;

    $username = 'zezima';

    $client = new Client();

    // Form fields sent by the hiscores lookup page.
    $post_values = array('user1' => $username, 'submit' => 'Search');

    $crawler = $client->request('POST', 'http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws', [], [], ['HTTP_CONTENT_TYPE' => 'application/x-www-form-urlencoded'], http_build_query($post_values));

    $acc_array = array();

    $crawler->filter('#contentHiscores tr')->each(function ($node, $k) use (&$acc_array) {
        // Skip the first four rows (indexes 0 to 3) and any row without exactly five cells.
        if (!in_array($k, array(0, 1, 2, 3)) && $node->children()->count() == 5) {
            $acc_array['skills'][strtolower(trim($node->children()->eq(1)->text()))] = array(
                'rank'  => $node->children()->eq(2)->text(),
                'level' => $node->children()->eq(3)->text(),
                'xp'    => $node->children()->eq(4)->text(),
            );
        }
    });

    // Strip the leading "Personal ..." text so only the username remains.
    $acc_array['username'] = preg_replace('/Personal(.*?)('.$username.')$/', '$2', $crawler->filter('#contentHiscores tr td')->children()->eq(0)->text());

    // <pre> keeps print_r's line breaks and indentation readable in the browser.
    echo '<pre>';
    print_r($acc_array);
    ?>
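
    Since text() returns strings, the rank, level and xp values come back as text. If you want numbers instead, something like the sketch below could be run on the array before printing it. The sample values and the to_int() helper are made up for illustration, and I am assuming the scraped numbers use commas as thousands separators.

```php
<?php
// Made-up values in the same shape the tutorial builds; real numbers will differ.
$acc_array = array(
    'skills' => array(
        'attack' => array('rank' => '1,234', 'level' => '99', 'xp' => '13,034,431'),
    ),
    'username' => 'zezima',
);

// Hypothetical helper: strip thousands separators and whitespace, then cast to int.
function to_int($text)
{
    return (int) str_replace(',', '', trim($text));
}

foreach ($acc_array['skills'] as $skill => $stats) {
    // array_map() keeps the 'rank', 'level' and 'xp' keys when given one array.
    $acc_array['skills'][$skill] = array_map('to_int', $stats);
}

print_r($acc_array);
```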
    Hope this tutorial was helpful. If you have any questions, feel free to reply or PM me and I will respond promptly.
     
  5. Unread #3 - Feb 16, 2018 at 3:42 PM
  6. Hamtower

    Thanks for this, very cool!
     
  7. Unread #4 - Feb 16, 2018 at 8:53 PM
  8. Viral_

    Glad you guys liked it :)
     