How can I deal with pagination when scraping information from grocery store flyer?

Discussion in 'Programming General' started by kmjt, Jul 16, 2015.

  1. Unread #1 - Jul 16, 2015 at 8:32 PM
  2. kmjt

    I am trying to get information such as the item title in the following grocery store flyer: http://www.yourindependentgrocer.ca/en_CA/[email protected]@896.html

    I would like to System.out.println() all of the item names in the Eclipse IDE. I am not sure how to get this information because the HTML from the above link does not seem to contain any items. On top of that, the URL does not change when switching pages in the flyer.

    To illustrate what I wish to accomplish, here is an example that I did for another grocery store: http://www.metro.ca/flyer/index.en.html

    However, this one was easy because I didn't have to worry about pagination. All of the information came with the HTML at the above URL. The following is my code to accomplish it:



    Code:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    
    import org.apache.commons.lang3.StringEscapeUtils;
    
    public class TestGroceryStore
    {
        public static void main(String[] args) throws Exception
        {
            URL flyerURL = new URL("http://www.metro.ca/flyer/index.en.html");
            URLConnection connection = flyerURL.openConnection();
            StringBuilder sbFlyer = new StringBuilder();
    
            // Read the whole page into memory; try-with-resources closes the stream.
            try (BufferedReader brFlyer = new BufferedReader(new InputStreamReader(connection.getInputStream())))
            {
                String flyerLine;
                while((flyerLine = brFlyer.readLine()) != null)
                    sbFlyer.append(flyerLine).append("\n");
            }
    
            String flyerHTML = sbFlyer.toString();
            System.out.println(flyerHTML);
    
            // GETTING ITEM NAME
            // The item data sits in the page as JSON-like text, so scan for each "shortDescription" key.
            String shortDescriptionTag = "\"shortDescription\":";
            String longDescriptionTag = ",\"longDescription\":";
            while(flyerHTML.contains(shortDescriptionTag))
            {
                String shortDescription;
                flyerHTML = flyerHTML.substring(flyerHTML.indexOf(shortDescriptionTag)+shortDescriptionTag.length());
    
                // Treat a missing or malformed value as "null" instead of throwing from substring().
                int end = flyerHTML.indexOf(longDescriptionTag);
                if(flyerHTML.startsWith("null") || end < 2)
                    shortDescription = "null";
    
                else
                    // Skip the opening quote and drop the closing quote before the next key.
                    shortDescription = StringEscapeUtils.unescapeHtml4(flyerHTML.substring(1, end-1));
    
                System.out.println("SHORT DESCRIPTION = " + shortDescription);
                System.out.println("");
            }
        }
    }





    I would like to do the same thing (printing the items in Eclipse) for grocery store flyers that use pagination, so I am trying to do it for this grocery store: http://www.yourindependentgrocer.ca/en_CA/[email protected]@896.html

    I do not wish to use any third-party web scraping application. I would like to know how to do it in Java.
     
  3. Unread #2 - Jul 16, 2015 at 10:28 PM
  4. Superfluous

    Going to be honest and say I haven't read your post fully, and I'm also ultra tired.

    I'm pretty sure you can do it with Jaunt, because that's what I used to make the Sythe scam report list thing: http://jaunt-api.com/jaunt-tutorial.htm. So maybe look into that briefly? I'll think about this more over the weekend if no one else has been able to help.

    There's also an online web scraper called Kimono that I've dabbled with. I think it's supposed to be very good at this kind of thing, but it's a third-party application, so obviously not preferable: https://www.kimonolabs.com/welcome.html
     
  5. Unread #3 - Jul 17, 2015 at 2:21 PM
  6. kmjt

    Thanks for your post. I only glanced over Jaunt, but it looks like it is more of a raw code parser? How can it handle dynamic content like AJAX? I am currently looking into Selenium. It requires an open browser, though, so that is a drawback.
     
  7. Unread #4 - Jul 17, 2015 at 7:06 PM
  8. SuF
    Joined:
    Jan 21, 2007
    Posts:
    14,212
    Referrals:
    28
    Sythe Gold:
    1,234
    Discord Unique ID:
    203283096668340224
    <3 n4n0 Two Factor Authentication User Community Participant Spam Forum Participant Sythe's 10th Anniversary

    SuF Legend
    Pirate Retired Global Moderator

    How can I deal with pagination when scraping information from grocery store flyer?

    You have to run the JavaScript in something, most likely a browser.
     
  9. Unread #5 - Jul 21, 2015 at 3:51 PM
  10. 70i

    The item name and price are in the HTML. If you save the page as HTML and open it in your browser, you'll see the product and image.
    For example, one item is PC® or Blue Menu® Smokies™ for $6.99, which you can find in the snippet below. Dreamweaver helped isolate this: I clicked on a product and it took me to that code.


    Code:
    <div class="card product" tabindex="0"> <span href="#"><img src="./Your Independent Grocer Store Flyers _ Your Independent Grocer_files/214x214" class="product-image" alt="PC® or Blue Menu® Smokies&#8482;"></span>  <div class="footer"> <div class="details more"> <p class="price"><sup>$</sup>6<sup>99</sup></p>    <h3 class="title">PC® or Blue Menu® Smokies&#8482;</h3>  <div class="more"> <span style="color:#000000;">selected varieties 1 kg<br>
    or Webers beef burgers<br>
    frozen 612 g</span> </div> </div>    <div class="controls uniform no-correction-1" data-add-success-text="Item Added">  </div>  </div> </div>
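
    Assuming the saved page contains cards shaped like the one above, a minimal sketch of pulling the title and price out of a single card could look like this. The class name and both regular expressions are illustrative assumptions based only on this one snippet, so other cards may need looser patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class FlyerCardParser {
    // Captures the text inside <h3 class="title">...</h3>.
    private static final Pattern TITLE =
            Pattern.compile("<h3 class=\"title\">(.*?)</h3>", Pattern.DOTALL);
    // Captures dollars and cents from <p class="price"><sup>$</sup>6<sup>99</sup></p>.
    private static final Pattern PRICE =
            Pattern.compile("<p class=\"price\"><sup>\\$</sup>(\\d+)<sup>(\\d{2})</sup></p>");

    // Returns the card title, or null when no title element is found.
    static String extractTitle(String cardHtml) {
        Matcher m = TITLE.matcher(cardHtml);
        return m.find() ? m.group(1).trim() : null;
    }

    // Returns the price as "dollars.cents", or null when no price element is found.
    static String extractPrice(String cardHtml) {
        Matcher m = PRICE.matcher(cardHtml);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Trimmed-down copy of the card from the post (\u00AE is the registered sign).
        String card = "<div class=\"card product\" tabindex=\"0\">"
                + "<p class=\"price\"><sup>$</sup>6<sup>99</sup></p>"
                + "<h3 class=\"title\">PC\u00AE or Blue Menu\u00AE Smokies&#8482;</h3></div>";
        System.out.println(extractTitle(card) + " -> $" + extractPrice(card));
    }
}
```

    The title comes back with the `&#8482;` entity still escaped; `StringEscapeUtils.unescapeHtml4` from the first post's code could decode it.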
    
     
  11. Unread #6 - Jul 21, 2015 at 6:17 PM
  12. SuF

    The HTML is being dynamically generated by JavaScript.
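
    When the HTML is built by JavaScript, the item data often arrives through separate background requests. One option that avoids a browser, assuming such a request can be found in the browser's network tab, is to call it directly for each page until nothing comes back. The sketch below shows only the paging loop: the endpoint URL, the `page` parameter, and the "empty body means last page" rule are all hypothetical, since this thread never identifies the real request.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper, named for illustration only.
class FlyerPager {
    // Collects page bodies from page 1 upward until a page comes back empty
    // or maxPages is reached. fetchPage is injected so the loop can be tried offline.
    static List<String> fetchAllPages(IntFunction<String> fetchPage, int maxPages) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            String body = fetchPage.apply(page);
            if (body == null || body.trim().isEmpty()) {
                break; // assumed end-of-flyer signal
            }
            pages.add(body);
        }
        return pages;
    }

    // Downloads one page. The URL pattern is a placeholder, not a real endpoint.
    static String download(int page) {
        String address = "http://example.com/flyer/items?page=" + page; // hypothetical
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(address).openStream()))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append("\n");
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Offline demo with a fake three-page flyer instead of FlyerPager::download.
        String[] fake = {"page one items", "page two items", "page three items"};
        List<String> pages = fetchAllPages(p -> p <= fake.length ? fake[p - 1] : "", 50);
        System.out.println(pages.size() + " pages fetched");
    }
}
```

    Passing `FlyerPager::download` as the first argument would switch the loop to real requests once the placeholder URL is replaced. The `maxPages` cap is there so a server that never returns an empty page cannot keep the loop running forever.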
     
  13. Unread #7 - Jul 21, 2015 at 10:38 PM
  14. Blupig

    I mean, I haven't tried it myself, but if he's right, you could just download the page source and parse it normally that way. There'll just be that intermediate step and the slowness of IO added to the mix, unless the file is loaded into RAM first.
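
    Assuming the saved file really does contain product cards like the one 70i posted, a rough sketch of that approach is to read the file into memory once and scan the resulting string. The class name, the `flyer.html` file name, the UTF-8 encoding, and the title pattern are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class SavedFlyerTitles {
    // Captures the text inside every <h3 class="title">...</h3>.
    private static final Pattern TITLE =
            Pattern.compile("<h3 class=\"title\">(.*?)</h3>", Pattern.DOTALL);

    // Reads the whole saved file into RAM in one call, then scans the string,
    // so the disk is touched only once. UTF-8 is an assumed encoding.
    static List<String> titlesIn(Path savedPage) throws IOException {
        String html = new String(Files.readAllBytes(savedPage), StandardCharsets.UTF_8);
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    public static void main(String[] args) throws IOException {
        // "flyer.html" is a placeholder for wherever the browser saved the page.
        Path saved = Paths.get(args.length > 0 ? args[0] : "flyer.html");
        for (String title : titlesIn(saved)) {
            System.out.println(title);
        }
    }
}
```

    Each flyer page would still have to be saved by hand from the browser, so this covers the parsing step but not the pagination itself.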
     