A few little projects I've been working on have involved ripping information from sites with multiple pages into a database or JSON file. I've done this kind of thing before and it usually ends up as the same kind of code: something that iterates over a queue of URLs, pulls the data down, parses it into a data structure and sends it off to where it needs to go.

I made a little package for managing all of the logic common to scraping information from websites, intended to be reused across different projects whenever I need it. In the process I got to learn some new things, like publishing an npm package to npmjs and running automated tests on every push with TravisCI.

So the package I've made for this is httprip. The module exposes a factory method for creating a ripper instance, and the instance lets you attach implementations for "processing" and "yielding". Processor implementations define how the HTML of each response is parsed and which elements of data you want extracted from it; those elements are then "yielded" to the data collectors of the ripper instance. Multiple processors and data collectors can be specified. The most basic example of usage is below.

var httprip = require("httprip");

var ripper = httprip()
    .processor(function(error, res, body, resolve) {
        // Perform parsing on body here.

        // Yield each item parsed from the body.
        ripper.yield("item" + Math.floor(Math.random() * 999));
        ripper.yield("item" + Math.floor(Math.random() * 999));

        // Resolve after we've finished processing.
        resolve();
    })
    .data(function(output) {
        console.log("Retrieved item:", output);
    });


// Queue requests.
ripper.enqueueRip({url: "http://google.com"});
ripper.enqueueRip({url: "http://yahoo.com"});
ripper.enqueueRip({url: "http://bing.com"});

// Wait for finish.
ripper.lastQueued().then(function() {
    console.log("done");
});

So as you can see, this queues up three URLs to be ripped. The processor function disregards the response body and just yields two random items for each request, and the data collector simply logs each item to the console. A handler is also attached to the Promise of the last queued request so that some logic can run once all queued entries have been completed. There is more detailed documentation over at the readme for the httprip project.
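In a real project the processor would actually parse the body instead of throwing it away. Below is a rough sketch of what that could look like, using cheerio to do the HTML parsing; cheerio, the ".headline" selector and the example URLs are not part of httprip, just stand-ins for whatever your target site needs.

var httprip = require("httprip");
// cheerio is just one option for parsing HTML here; any parser would do.
var cheerio = require("cheerio");

var ripper = httprip()
    .processor(function(error, res, body, resolve) {
        // Skip the page if the request failed.
        if (error) {
            return resolve();
        }

        // Parse the HTML and yield the text of each matched element.
        var $ = cheerio.load(body);
        $(".headline").each(function(i, el) {
            ripper.yield($(el).text().trim());
        });

        // Resolve once everything on this page has been yielded.
        resolve();
    })
    .data(function(output) {
        // This is where you'd write to a database or JSON file.
        console.log("Retrieved item:", output);
    });

ripper.enqueueRip({url: "http://example.com/page/1"});
ripper.enqueueRip({url: "http://example.com/page/2"});

ripper.lastQueued().then(function() {
    console.log("done");
});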

Creating NPM packages is easy

I was kind of expecting this whole process to be a lot more involved and complicated than it actually was, but it was seriously as easy as this:

  • Create an account at npmjs.com
  • Open up a command prompt and navigate to your code's folder
  • Run npm init and follow the prompts
  • Run npm login and enter your username and password
  • Run npm publish

I mean really, that was it. Every time you want to update your package, just increment the version number in package.json and run npm publish again. npm will automatically ignore anything in .gitignore (unless you add an .npmignore of your own), so don't worry about all the bloat you might have in your project; it won't get pushed to npm.
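For reference, npm init spits out a package.json along these lines; the values below are just placeholders, and the version field is the one to bump before each publish:

{
  "name": "httprip",
  "version": "1.0.1",
  "description": "Rips data from multi-page websites",
  "main": "index.js",
  "license": "MIT"
}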

Automated testing with TravisCI is easy too

I mean I only really did this to get that little "passing" badge on the github page, but it was pretty interesting to see how easy Travis is to set up. Once Travis is connected to a github account, it can already see all of your repositories. Setting it up for NodeJS is as simple as committing a .travis.yml file and activating the repository in Travis.

The .travis.yml file is just a YAML config file; it's how you let Travis know that you're testing a NodeJS project, as well as other things like which Node version to use for testing.

language: node_js
node_js:
  - "6"

Not a hell of a lot. By default, Travis will run npm test, meaning your package.json file needs to have a test script; httprip just uses mocha, nothing special. Once Travis detects a push to github, it grabs the latest code and runs the tests. Travis also gives you an image link for that "build passing" badge, provided your builds are passing.
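For completeness, the test setup in package.json is nothing more than a script entry pointing at mocha, something along these lines (the mocha version below is only illustrative):

{
  "scripts": {
    "test": "mocha"
  },
  "devDependencies": {
    "mocha": "^3.2.0"
  }
}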