How can I deal with pagination when scraping information from grocery store flyer?

Discussion in 'Programming General' started by kmjt, Jul 16, 2015.

  1. kmjt

    kmjt -.- The nocturnal life chose me -.-
    Banned

    Joined:
    Aug 21, 2009
    Posts:
    14,450
    Referrals:
    8
    Sythe Gold:
    449

    I am trying to get information such as the item title in the following grocery store flyer: http://www.yourindependentgrocer.ca/en_CA/[email protected]@896.html

    I would like to System.out.println() all of the item names in the Eclipse IDE. I am not sure how to get this information, because the HTML from the above link does not seem to contain any items. On top of that, the URL does not change when switching pages in the flyer.

    To illustrate what I wish to accomplish, here is an example that I did for another grocery store: http://www.metro.ca/flyer/index.en.html

    However, this one was easy because I didn't have to worry about pagination: all of the information came with the HTML at the above URL. The following is my code to accomplish it:



    Code:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    
    import org.apache.commons.lang3.StringEscapeUtils;
    
    public class TestGroceryStore 
    {
        public static void main(String[] args) throws Exception
        {
            URL flyerURL = new URL("http://www.metro.ca/flyer/index.en.html");
            URLConnection connection = flyerURL.openConnection();
            StringBuilder sbFlyer = new StringBuilder();
    
            // Read the whole page into memory; try-with-resources closes the reader
            try(BufferedReader brFlyer = new BufferedReader(new InputStreamReader(connection.getInputStream())))
            {
                String flyerLine;
                while((flyerLine = brFlyer.readLine()) != null)
                    sbFlyer.append(flyerLine).append("\n");
            }
    
            String flyerHTML = sbFlyer.toString();
            System.out.println(flyerHTML);
    
            // GETTING ITEM NAME
            // The code assumes each item appears as "shortDescription":"...","longDescription":...
            String shortDescriptionTag = "\"shortDescription\":";
            String longDescriptionTag = ",\"longDescription\":";
            while(flyerHTML.contains(shortDescriptionTag))
            {
                String shortDescription;
                // Drop everything up to and including the tag, so the value comes first
                flyerHTML = flyerHTML.substring(flyerHTML.indexOf(shortDescriptionTag)+shortDescriptionTag.length());
    
                if(flyerHTML.startsWith("null"))
                    shortDescription = "null";
    
                else
                {
                    int valueEnd = flyerHTML.indexOf(longDescriptionTag);
                    if(valueEnd < 2)
                        break; // no closing tag for this entry: stop instead of throwing
    
                    // Skip the opening quote and drop the closing quote
                    shortDescription = StringEscapeUtils.unescapeHtml4(flyerHTML.substring(1, valueEnd-1));
                }
    
                System.out.println("SHORT DESCRIPTION = " + shortDescription);
                System.out.println("");
            }
        }
    }





    I would like to accomplish this (outputting the items in Eclipse) for grocery store flyers that use pagination, so I am trying to do it for this grocery store: http://www.yourindependentgrocer.ca/en_CA/[email protected]@896.html

    I do not wish to use any third-party web scraping application. I would like to know how to do it in Java.
     
  2. Superfluous

    Superfluous Rainbet.com Casino & Sportsbook
    Crabby Retired Global Moderator

    Joined:
    Jul 5, 2012
    Posts:
    18,939
    Referrals:
    5
    Sythe Gold:
    9,135
    Discord Unique ID:
    247909953925414913
    Discord Username:
    .superfluous.

    Going to be honest and say I haven't read your post fully, and I'm also ultra tired.

    I'm pretty sure you can do it with Jaunt, because that's what I used to make the Sythe scam report list thing: http://jaunt-api.com/jaunt-tutorial.htm. So maybe look into that briefly? I'll think about this more over the weekend if no one else has been able to help.

    There's also an online web scraper called Kimono that I've dabbled with. I think it's supposed to be very good at this kind of thing, but it's a 3rd-party application, so obviously not preferable: https://www.kimonolabs.com/welcome.html
     
  3. kmjt

    kmjt -.- The nocturnal life chose me -.-


    Thanks for your post. I only glanced over Jaunt, but it looks like it is more of a raw code parser? How can it handle dynamic content like AJAX? I am currently looking into Selenium. It requires an open browser, though, so that is a downside.
     
  4. SuF

    SuF Legend
    Pirate Retired Global Moderator

    Joined:
    Jan 21, 2007
    Posts:
    14,212
    Referrals:
    28
    Sythe Gold:
    1,234
    Discord Unique ID:
    203283096668340224

    You have to run the JavaScript in something, most likely a browser.
     
  5. 70i

    70i Forum Addict
    Banned

    Joined:
    Jan 11, 2014
    Posts:
    462
    Referrals:
    0
    Sythe Gold:
    174

    The item name and price are in the HTML. If you save the page as HTML and open it in your browser, you'll see the product and image.
    For example, the snippet below is PC® or Blue Menu® Smokies™ for $6.99. Dreamweaver helped isolate this: I clicked on a product and it took me to that code.


    Code:
    <div class="card product" tabindex="0"> <span href="#"><img src="./Your Independent Grocer Store Flyers _ Your Independent Grocer_files/214x214" class="product-image" alt="PC® or Blue Menu® Smokies&#8482;"></span>  <div class="footer"> <div class="details more"> <p class="price"><sup>$</sup>6<sup>99</sup></p>    <h3 class="title">PC® or Blue Menu® Smokies&#8482;</h3>  <div class="more"> <span style="color:#000000;">selected varieties 1 kg<br>
    or Webers beef burgers<br>
    frozen 612 g</span> </div> </div>    <div class="controls uniform no-correction-1" data-add-success-text="Item Added">  </div>  </div> </div>
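
    As a rough sketch of how the title and price could be pulled out of a card like the one above using only the standard library: the class name `FlyerCardParser`, both regular expressions, and the assumption that every card has exactly this markup are illustrative guesses, not something confirmed for the live site. A small helper decodes numeric entities such as `&#8482;` so the output reads naturally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes every product card looks like the saved snippet.
class FlyerCardParser {
    // The price markup followed by the <h3 class="title"> element of one card.
    private static final Pattern CARD = Pattern.compile(
        "<p class=\"price\"><sup>\\$</sup>(\\d+)<sup>(\\d{2})</sup></p>\\s*"
            + "<h3 class=\"title\">(.*?)</h3>",
        Pattern.DOTALL);

    // Decimal character references such as &#8482;
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d{1,6});");

    /** Replaces decimal character references with the characters they stand for. */
    static String decodeNumericEntities(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String ch = new String(Character.toChars(Integer.parseInt(m.group(1))));
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Returns one "title - $dollars.cents" string per product card found. */
    static List<String> extractItems(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = CARD.matcher(html);
        while (m.find()) {
            String title = decodeNumericEntities(m.group(3).trim());
            items.add(title + " - $" + m.group(1) + "." + m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        // Shortened copy of the card above; \u00AE is the registered-trademark sign.
        String html = "<div class=\"card product\"> <div class=\"details more\"> "
            + "<p class=\"price\"><sup>$</sup>6<sup>99</sup></p>    "
            + "<h3 class=\"title\">PC\u00AE or Blue Menu\u00AE Smokies&#8482;</h3> </div> </div>";
        for (String item : extractItems(html)) {
            System.out.println(item);
        }
    }
}
```

    Regular expressions are brittle against markup changes, so this only makes sense if the card layout stays as regular as it is in the saved page.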
    
     
  6. SuF

    SuF Legend

    The HTML is being dynamically generated by JavaScript.
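
    If the JavaScript builds the flyer from data it requests one page at a time (an assumption; nothing in this thread confirms how the site loads it), the pagination itself could be sketched as a loop that keeps asking for the next page until one comes back empty. The `fetchPage` function below is a hypothetical stand-in for however the rendered page or its data is actually obtained (a browser driver, a saved file, a direct request), and `PagedFlyerScraper` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: walks numbered flyer pages until one has no items.
class PagedFlyerScraper {
    // Assumes item names sit in <h3 class="title"> elements, as in the saved card.
    private static final Pattern TITLE =
        Pattern.compile("<h3 class=\"title\">(.*?)</h3>", Pattern.DOTALL);

    /** Collects every title found in one page of HTML. */
    static List<String> titlesIn(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    /**
     * Calls fetchPage for page 1, 2, 3, ... and stops at the first page that is
     * null or contains no titles, or after maxPages pages (a safety limit).
     */
    static List<String> scrapeAllPages(IntFunction<String> fetchPage, int maxPages) {
        List<String> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            String html = fetchPage.apply(page);
            if (html == null) {
                break;
            }
            List<String> titles = titlesIn(html);
            if (titles.isEmpty()) {
                break;
            }
            all.addAll(titles);
        }
        return all;
    }

    public static void main(String[] args) {
        // Stand-in for the real source: two fake pages in memory, then empty pages.
        String[] fakePages = {
            "<h3 class=\"title\">Item A</h3><h3 class=\"title\">Item B</h3>",
            "<h3 class=\"title\">Item C</h3>"
        };
        IntFunction<String> fetchPage =
            page -> page <= fakePages.length ? fakePages[page - 1] : "";
        for (String title : scrapeAllPages(fetchPage, 50)) {
            System.out.println("TITLE = " + title);
        }
    }
}
```

    Keeping the page source behind a function means the loop stays the same whichever way the JavaScript-generated HTML ends up being produced.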
     
  7. Blupig

    Blupig BEEF TOILET
    $5 USD Donor

    Joined:
    Nov 23, 2006
    Posts:
    7,145
    Referrals:
    16
    Sythe Gold:
    1,609
    Discord Unique ID:
    178533992981594112

    I haven't tried it myself, but if he's right, you could just download the page source and parse it normally that way. There'll just be that intermediate step, plus the slowness of disk IO unless the file is loaded into RAM.
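
    A minimal sketch of that approach, assuming each flyer page has already been saved by hand as an `.html` file in one folder: every file is read into memory in a single call and then parsed as a string. The class name `SavedFlyerPages`, the folder layout, the UTF-8 encoding, and the `<h3 class="title">` pattern are all assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: parses flyer pages that were saved to disk beforehand.
class SavedFlyerPages {
    // Assumes item names sit in <h3 class="title"> elements.
    private static final Pattern TITLE =
        Pattern.compile("<h3 class=\"title\">(.*?)</h3>", Pattern.DOTALL);

    /** Collects every title found in one page of HTML. */
    static List<String> titlesIn(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    /** Reads each *.html file in dir fully into memory (in name order) and parses it. */
    static List<String> titlesFromDirectory(Path dir) throws IOException {
        List<Path> pages = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.html")) {
            for (Path page : stream) {
                pages.add(page);
            }
        }
        Collections.sort(pages); // Path is Comparable, so this orders by file name

        List<String> titles = new ArrayList<>();
        for (Path page : pages) {
            // One read per file; UTF-8 is an assumption about how the page was saved
            String html = new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
            titles.addAll(titlesIn(html));
        }
        return titles;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java SavedFlyerPages <folder-with-saved-pages>");
            return;
        }
        for (String title : titlesFromDirectory(Paths.get(args[0]))) {
            System.out.println("TITLE = " + title);
        }
    }
}
```

    Reading each file in one call keeps the disk IO to a single pass per page; after that, all of the parsing works on strings in RAM.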
     