Friday, August 12, 2011

Parsing HTML on BlackBerry



This work is licenced under a Creative Commons Licence.

Parsing HTML is not as easy as one might think when first starting out. HTML looks a lot like XML, after all, and XML is easily parsed, so an XML parser should be able to do the job. Unfortunately, even the best HTML is rarely well-formed XML, so the parser blows up. Web browsers parse HTML all the time, but importing a major portion of a modern browser's code base into a simple application is not very practical, especially on a limited device like a BlackBerry® smartphone. Wouldn't it be nice if we could ask the BlackBerry browser to parse the HTML for us and give us a nicely structured DOM Document with the page data in it? Well, starting with OS version 5.0 we can, with net.rim.device.api.browser.field2.BrowserField. As a side benefit, we can also show the web page to our users.
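
To make the problem concrete, here is a minimal sketch in desktop Java (not BlackBerry code) that feeds two snippets to the standard JAXP parser. The class name WellFormedCheck and the sample markup are made up for illustration. The only point is that ordinary HTML with an unclosed br tag is rejected, while the XHTML-style equivalent parses.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheck {
    /** Returns true if the markup parses as well-formed XML. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (Exception e) {
            // Any parser failure means the markup is not usable as XML
            return false;
        }
    }

    public static void main(String[] args) {
        // XHTML style: every element is closed
        System.out.println(parsesAsXml("<p>one<br/>two</p>"));
        // Typical HTML: the br element is never closed
        System.out.println(parsesAsXml("<p>one<br>two</p>"));
    }
}
```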

I'm not a big fan of page scraping. Not only is it difficult, tedious and error-prone to code, it is also problematic to maintain, because the web page being scraped can change at the whim of the site owner. There are usually better options. The site I'm going to use as an example, OurAirports.Com, will happily hand over all of its back-end data, already in nice well-formed XML, although one look at the size of the files will tell you they aren't really suitable for a BlackBerry application. Ultimately, though, I find page scraping in most cases to be somewhat dishonest. Someone has gone to the trouble of setting up a web page. Taking bits and pieces out of their work, behind their back, does not seem a very nice thing to do.

There are times, though, when scraping can be the right thing to do. In the project I'm working on, I want to give my users the ability to search for airports in a very flexible way. The criteria a user applies to select an airport may vary from one search to the next. What I need is the geographic coordinates. As I said above, I can get the data and put it on the phone, perhaps on an SD card. I could save space on the phone by setting up my own web server to make the data available to my application. But OurAirports is such a fantastic tool that I want my users to be able to take advantage of it. Once they've used all the facilities of the web site to select their airport, I need only parse the page content for the data I want. This is how I do it:

/**
 * AirportInfoScreenCallback.java
 *
 * © Richard Buckley www.hrbuckley.net, 2011
 *
 * This work is licenced under the Creative
 * Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
 * To view a copy of this licence, visit
 * http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to
 * Creative Commons, 171 Second Street, Suite 300, San Francisco,
 * California 94105, USA.
 */

package flying.planning;


/**
 * The AirportInfoScreen callback
 */
public interface AirportInfoScreenCallback {
    public void airportSelected(String placename, double latitude, double longitude);
}

/**
 * AirportInfoScreen.java
 *
 * © Richard Buckley www.hrbuckley.net, 2011
 *
 * This work is licenced under the Creative
 * Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
 * To view a copy of this licence, visit
 * http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to
 * Creative Commons, 171 Second Street, Suite 300, San Francisco,
 * California 94105, USA.
 */

package flying.planning;

import java.io.IOException;
import net.rim.device.api.ui.container.MainScreen;
import net.rim.device.api.browser.field2.BrowserField;
import net.rim.device.api.browser.field2.BrowserFieldListener;
import net.rim.device.api.browser.field.ContentReadEvent;
import net.rim.device.api.script.ScriptEngine;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;


/**
 * Display a web page from ourairports.mobi
 * 
 * Parse the page for geographic location data for an airport. Return this data
 * if the user selects the airport.
 */
public class AirportInfoScreen extends MainScreen {
    private BrowserField    _browser;
    private double          _lat, _lon;
    private String          _placename;
    private boolean         _validAirport;
    
    private AirportInfoScreenCallback   _callback;

    /**
     * A BrowserFieldListener class to react to events as the page loads.
     */
    private class BrowserListener extends BrowserFieldListener {
        public void documentAborted(BrowserField field, Document document) {
            System.out.println("documentAborted");
            AirportInfoScreen.this._validAirport = false;
        }
        
        public void documentCreated(BrowserField field, ScriptEngine engine, Document document) {
            System.out.println("documentCreated");
            AirportInfoScreen.this.setTitle("Our Airports");
            AirportInfoScreen.this._validAirport = false;
        }
        
        public void documentError(BrowserField field, Document document) {
            System.out.println("documentError");
            AirportInfoScreen.this._validAirport = false;
        }
        
        /**
         * The document is loaded so this is the time to fetch the data
         * that I want.
         * 
         * Lucky for me ourairports.mobi puts all the information I need
         * in HTML META tags in the header.
         * 
         * @param field The BrowserField
         * @param document A w3c.dom.Document
         */
        public void documentLoaded(BrowserField field, Document document) {
            System.out.println("documentLoaded");
            try {
                // Get a list of all HTML META tags, then go through them
                // looking for data
                NodeList list = document.getElementsByTagName("META");
                for (int i = 0; i < list.getLength(); i++) {
                    Node node = list.item(i);
                    
                    // Get the attributes; skip tags without both
                    // a name and a content attribute
                    NamedNodeMap map = node.getAttributes();
                    if (map == null) continue;
                    Node name = map.getNamedItem("name");
                    Node content = map.getNamedItem("content");
                    if (name == null || content == null) continue;
                    
                    // I'm looking for META tags named 'geo.position'
                    if (name.getNodeValue().equalsIgnoreCase("geo.position")) {
                        
                        // the contents have latitude; longitude
                        String loc = content.getNodeValue();
                        int idx = loc.indexOf(';');
                        if (idx < 0) continue;
                        
                        // substring's end index is exclusive, so idx (not
                        // idx-1) keeps the whole latitude; trim() drops
                        // any padding around the separator
                        AirportInfoScreen.this._lat =
                            Double.parseDouble(loc.substring(0, idx).trim());
                        AirportInfoScreen.this._lon =
                            Double.parseDouble(loc.substring(idx+1).trim());
                        AirportInfoScreen.this._validAirport = true;
                        
                    // and 'geo.placename'
                    } else if (name.getNodeValue().equalsIgnoreCase("geo.placename")) {
                        
                        // the contents have the associated place name.
                        AirportInfoScreen.this._placename = content.getNodeValue();
                        AirportInfoScreen.this.setTitle(AirportInfoScreen.this._placename);
                    }
                }
            } catch (Exception e) {
                System.out.println(e.toString());
            }
        }
        
        public void documentUnloading(BrowserField field, Document document) {
            System.out.println("documentUnloading");
            AirportInfoScreen.this._validAirport = false;
        }
        
        public void downloadProgress(BrowserField field, ContentReadEvent event) {
        }
    }

    /**
     * The default constructor will just open the main page.
     */
    public AirportInfoScreen()  { 
        setTitle("Our Airports");
        construct("http://ourairports.mobi/");
    }
    
    /**
     * An alternate constructor will open the site with an initial search.
     * @param query The search parameter
     */
    public AirportInfoScreen(String query) {
        setTitle(query);
        construct("http://ourairports.mobi/search.html?q=" + query);
    }
    
    /**
     * Actually build the screen
     * @param url The url selected by the constructor
     */
    private void construct(String url) {
        _callback = null;
        
        // create new instance of the BrowserField
        _browser = new BrowserField();
        _browser.addListener(new BrowserListener());
        
        add(_browser);
        
        // request the content you wish to display
        // this method call is typically called once
        _browser.requestContent( url );
    }
    
    public void setCallback(AirportInfoScreenCallback callback) {
        _callback = callback;
    }
    
    public boolean onClose() {
        if (_validAirport) {
            this.setDirty(true);
        }
        
        return super.onClose();
    }
    
    public void save() throws IOException {
        if (_validAirport && _callback != null) {
            _callback.airportSelected(_placename, _lat, _lon);
        }
    }
}
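
The extraction logic can also be tried out away from the device. What follows is a rough sketch in desktop Java, assuming the page's meta tags look like the made-up sample in main (the coordinates and place name are invented, not taken from the site). The class GeoMetaSketch and its methods are hypothetical names, not part of the application above. It looks up "meta" in lowercase because the desktop XML DOM matches tag names case-sensitively and the sample markup is lowercase.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical desktop harness; not part of the BlackBerry application.
class GeoMetaSketch {
    /** Splits "latitude;longitude" into {lat, lon}; null if malformed. */
    static double[] parsePosition(String content) {
        int idx = content.indexOf(';');
        if (idx < 0) return null;
        try {
            return new double[] {
                Double.parseDouble(content.substring(0, idx).trim()),
                Double.parseDouble(content.substring(idx + 1).trim())
            };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Returns the content of the first meta tag with the given name, or null. */
    static String metaContent(Document doc, String wanted) {
        NodeList list = doc.getElementsByTagName("meta");
        for (int i = 0; i < list.getLength(); i++) {
            NamedNodeMap map = list.item(i).getAttributes();
            Node name = map.getNamedItem("name");
            Node content = map.getNamedItem("content");
            if (name != null && content != null
                    && name.getNodeValue().equalsIgnoreCase(wanted)) {
                return content.getNodeValue();
            }
        }
        return null;
    }

    /** Parses a well-formed XHTML string into a DOM Document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Made-up page; the values are samples only
        String page = "<html><head>"
            + "<meta name=\"geo.position\" content=\"45.5; -75.25\"/>"
            + "<meta name=\"geo.placename\" content=\"Sample Field\"/>"
            + "</head><body/></html>";
        Document doc = parse(page);
        double[] pos = parsePosition(metaContent(doc, "geo.position"));
        System.out.println(metaContent(doc, "geo.placename")
            + " " + pos[0] + " " + pos[1]);
    }
}
```

Keeping the string handling in a small method of its own makes it easy to check edge cases, such as a missing separator, without loading a page.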


And this is what it looks like:






