How can I deal with pagination when scraping information from a grocery store flyer?

kmjt · Jul 16, 2015

I am trying to get information such as the item title in the following grocery store flyer: http://www.yourindependentgrocer.ca/en_CA/[email protected]@896.html

I would like to System.out.println() all of the item names in the Eclipse IDE. I am not sure how to get this information because the HTML returned from the above link does not seem to contain any items. On top of that, the URL does not change when switching pages in the flyer...

To illustrate what I wish to accomplish, here is an example that I did for another grocery store: http://www.metro.ca/flyer/index.en.html

However this one was easy because I didn't have to worry about pagination. All of the information came with the above URL's HTML. The following is my code to accomplish it:
Code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import org.apache.commons.lang3.StringEscapeUtils;

public class TestGroceryStore
{
    public static void main(String[] args) throws Exception
    {
        URL flyerURL = new URL("http://www.metro.ca/flyer/index.en.html");
        URLConnection connection = flyerURL.openConnection();
        StringBuilder sbFlyer = new StringBuilder();

        // Read the whole page into memory; try-with-resources closes the stream afterwards
        try (BufferedReader brFlyer = new BufferedReader(new InputStreamReader(connection.getInputStream())))
        {
            String flyerLine;
            while ((flyerLine = brFlyer.readLine()) != null)
                sbFlyer.append(flyerLine).append("\n");
        }

        String flyerHTML = sbFlyer.toString();
        System.out.println(flyerHTML);

        // GETTING ITEM NAME
        String shortDescriptionTag = "\"shortDescription\":";
        String longDescriptionTag = ",\"longDescription\":";
        while (flyerHTML.contains(shortDescriptionTag))
        {
            // Drop everything up to and including the next "shortDescription": tag
            flyerHTML = flyerHTML.substring(flyerHTML.indexOf(shortDescriptionTag) + shortDescriptionTag.length());

            String shortDescription;
            if (flyerHTML.startsWith("null"))
                shortDescription = "null";
            else
            {
                // The value is a quoted string that ends just before ,"longDescription":
                int end = flyerHTML.indexOf(longDescriptionTag);
                if (end < 2)
                    break; // unexpected layout, stop instead of throwing from substring()

                shortDescription = StringEscapeUtils.unescapeHtml4(flyerHTML.substring(1, end - 1));
            }

            System.out.println("SHORT DESCRIPTION = " + shortDescription);
            System.out.println("");
        }
    }
}
I would like to accomplish this (outputting the items in Eclipse) for grocery store flyers that deal with pagination so I am trying to do it for this grocery store: http://www.yourindependentgrocer.ca/en_CA/[email protected]@896.html

I do not wish to use any third-party web scraping application. I would like to know how to do it in Java.

Superfluous · Jul 16, 2015

going to be honest and say i haven't read your post fully, and i'm also ultra tired.

I'm pretty sure you can do it with jaunt because that's what i used to make the sythe scam report list thing: http://jaunt-api.com/jaunt-tutorial.htm. so maybe look into that briefly? i'll think about this more over the weekend if no one else has been able to help

there's also an online web scraper called kimono that i've dabbled around with. i think it's supposed to be very good at this kind of thing but it's a 3rd party application, so obviously not preferable: https://www.kimonolabs.com/welcome.html

kmjt · Jul 17, 2015

Thanks for your post. I only quickly glanced over jaunt, but it looks like it is more of a raw code parser? How can it handle dynamic content like AJAX? I am currently looking into Selenium. It requires an open browser, though, so that is a downside.

SuF · Jul 17, 2015

You have to run the JavaScript in something, most likely a browser.

70i · Jul 21, 2015

The item name and price are in the HTML. If you save the code as HTML and open it in your browser, you'll see the product and image.
In this case it is PC® or Blue Menu® Smokies™ for $6.99, which you can get from the code below. Dreamweaver helped isolate this: I clicked on a product and it took me to that code.
Code:
<div class="card product" tabindex="0"> <span href="#"><img src="./Your Independent Grocer Store Flyers _ Your Independent Grocer_files/214x214" class="product-image" alt="PC® or Blue Menu® Smokies&#8482;"></span>  <div class="footer"> <div class="details more"> <p class="price"><sup>$</sup>6<sup>99</sup></p>    <h3 class="title">PC® or Blue Menu® Smokies&#8482;</h3>  <div class="more"> <span style="color:#000000;">selected varieties 1 kg<br>
or Webers beef burgers<br>
frozen 612 g</span> </div> </div>    <div class="controls uniform no-correction-1" data-add-success-text="Item Added">  </div>  </div> </div>
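
For anyone who wants to pull the title and price out of markup shaped like the card above without a third-party library, here is a minimal sketch using only java.util.regex. The class name FlyerCardParser and the method extractItems are made up for illustration, and the pattern assumes every card uses exactly the price and title layout shown in the snippet, which may not hold for every product (multi-buy prices, for example). Regex-based HTML parsing is fragile, so treat this as a starting point rather than a robust parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
class FlyerCardParser
{
    // Matches a price paragraph followed by the next title heading, as in the card above.
    // DOTALL lets ".*?" cross line breaks between the two elements.
    private static final Pattern CARD = Pattern.compile(
        "<p class=\"price\"><sup>\\$</sup>(\\d+)<sup>(\\d+)</sup></p>.*?<h3 class=\"title\">(.*?)</h3>",
        Pattern.DOTALL);

    /** Returns one "title = $dollars.cents" entry per price/title pair found in the HTML. */
    static List<String> extractItems(String html)
    {
        List<String> items = new ArrayList<>();
        Matcher m = CARD.matcher(html);
        while (m.find())
            items.add(m.group(3).trim() + " = $" + m.group(1) + "." + m.group(2));
        return items;
    }

    public static void main(String[] args)
    {
        // Shortened copy of the card from the post above
        String sample =
            "<div class=\"details more\"> <p class=\"price\"><sup>$</sup>6<sup>99</sup></p>"
            + "    <h3 class=\"title\">PC\u00AE or Blue Menu\u00AE Smokies&#8482;</h3> </div>";

        for (String item : extractItems(sample))
            System.out.println(item);
    }
}
```

HTML entities such as &#8482; are left as they are here; something like the StringEscapeUtils.unescapeHtml4 call from the first post could be applied to each title if Commons Lang is acceptable.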

SuF · Jul 21, 2015

The HTML is being dynamically generated by JavaScript.
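
When a page is built by JavaScript, the script often requests the data for each flyer page from the server separately, and those requests can usually be spotted in the browser's network tab. Whether this particular flyer works that way is an assumption nobody in the thread has confirmed. If it does, the pagination part reduces to a loop like the sketch below. FlyerPager, collectAll and the fetchPage hook are hypothetical names; the actual request and parsing code would go behind fetchPage, and the sample item names are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper: walks numbered pages until one comes back empty.
class FlyerPager
{
    /**
     * fetchPage takes a 1-based page number and returns the item names on that page.
     * maxPages is a safety cap so a misbehaving source cannot loop forever.
     */
    static List<String> collectAll(IntFunction<List<String>> fetchPage, int maxPages)
    {
        List<String> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++)
        {
            List<String> items = fetchPage.apply(page);
            if (items == null || items.isEmpty())
                break; // assumption: an empty page means we are past the last one
            all.addAll(items);
        }
        return all;
    }

    public static void main(String[] args)
    {
        // Stand-in for real network code: two fake pages, then nothing
        IntFunction<List<String>> fakePages = page -> {
            if (page == 1) return Arrays.asList("apples", "bread");
            if (page == 2) return Arrays.asList("cheese");
            return Collections.emptyList();
        };

        for (String item : collectAll(fakePages, 50))
            System.out.println(item);
    }
}
```

Keeping the fetching behind a function makes it possible to test the loop without touching the network, and to swap in a browser-driven fetcher later if plain requests turn out not to work.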

Blupig · Jul 21, 2015

I mean, I haven't tried it myself, but if he's right, you could just download the page source and parse it normally that way. There'll just be that intermediate step and the slowness of I/O added to the mix, unless it's loaded into RAM.
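
A rough sketch of that idea, assuming the rendered page has already been saved from the browser to a local file: read the file into memory once, then scan the string for title headings. SavedFlyerReader, titlesFromFile and the default file name flyer.html are placeholders, the UTF-8 encoding is an assumption about how the page was saved, and the `<h3 class="title">` marker is taken from the snippet 70i posted. A saved copy most likely only contains the flyer page that was on screen, so this would need repeating for each page.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for parsing a page saved from the browser.
class SavedFlyerReader
{
    private static final String OPEN = "<h3 class=\"title\">";
    private static final String CLOSE = "</h3>";

    /** Loads the saved page into RAM in one read, then collects every title it contains. */
    static List<String> titlesFromFile(Path savedPage) throws IOException
    {
        // Assumption: the browser saved the file as UTF-8
        String html = new String(Files.readAllBytes(savedPage), StandardCharsets.UTF_8);

        List<String> titles = new ArrayList<>();
        int from = 0;
        while (true)
        {
            int start = html.indexOf(OPEN, from);
            if (start < 0)
                break; // no more titles
            start += OPEN.length();

            int end = html.indexOf(CLOSE, start);
            if (end < 0)
                break; // truncated markup, stop rather than guess

            titles.add(html.substring(start, end).trim());
            from = end + CLOSE.length();
        }
        return titles;
    }

    public static void main(String[] args) throws IOException
    {
        // "flyer.html" is a placeholder; pass the real path as the first argument
        Path savedPage = Paths.get(args.length > 0 ? args[0] : "flyer.html");
        for (String title : titlesFromFile(savedPage))
            System.out.println(title);
    }
}
```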
