coded – simply

Code just like your momma used to make.

So I’m Web Scraping! So What!

My buddy is a life insurance salesman and has a great little lead-in to his pitch. It goes a little something like this:

“You may not NEED life insurance, sir, but you may WANT it… and here’s why”

I feel a bit like an insurance salesman myself right now, because I’m going to try and sell you on some web scraping. You may not NEED to do it, but you may want to.

Don’t get me wrong, web scraping is a dirty, nasty little business. Pitfalls abound, because every website is implemented using a different stack and sometimes (OK, often) they are implemented poorly. To compound the difficulty, the tools available to access websites programmatically seem to be about 2-4 years behind the browser technologies that are being leveraged by web developers.

One tool that has made web scraping really viable in the modern browser age is HttpUnit. Although it is intended as a testing framework, it’s endlessly useful for scripting web page actions or accessing items behind login screens. Below is a quick sample I cooked up in about 20 minutes to list all of the music I’ve downloaded from my favorite Russian MP3 site. Code is available here


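import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;

import junit.framework.TestCase;

import org.apache.log4j.Logger;

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebRequest;
import com.meterware.httpunit.WebResponse;

public class AuthUtilsTest extends TestCase
{
	static Logger log = Logger.getLogger(AuthUtilsTest.class);

	// ends with a slash, so the paths appended below must not start with one
	static String baseURL = "http://www.mp3fund.com/";

	public void testIt() throws Exception 
	{
		// keep the password out of the source by reading it from a properties file
		Properties p = new Properties();
		InputStream in = new FileInputStream("src/main/resources/login.properties");
		try {
			p.load(in);
		} finally {
			in.close();
		}
		String password = p.getProperty("password");

		// fetch the login page and find the login form on it
		WebConversation wc = new WebConversation();
		WebRequest wreq = new GetMethodWebRequest(baseURL + "login.html");
		WebResponse wres = wc.getResource(wreq);

		WebForm wf = WebUtils.getFormByAction(wres, "/login.html");
		assertNotNull("login form not found", wf);
		wf.setParameter("login", "email_address");
		wf.setParameter("password", password);

		// the conversation holds on to the session cookies after the submit
		wres = wf.submit();

		// walk the paginated download history (136 pages at the time of writing)
		for(int i=1; i <=136; i++){
			wreq = new GetMethodWebRequest(baseURL + "downloads.html?p="+i);
			wres = wc.getResource(wreq);

			WebLink[] links = wres.getLinks();
			log.info("processing page: " + i);
			for(WebLink link : links){
				if(!"Download".equals(link.getText()))
					continue;
				// strip everything up to the last slash, leaving just the file name
				String url = link.getURLString();
				String clean = url.replaceAll(".*/", "");
				log.info(link.getText() + ": " + clean);
			}

		}

	}

}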


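import java.io.File;
import java.io.FileOutputStream;
import java.io.StringReader;

import org.apache.commons.io.IOUtils;

import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebResponse;

public class WebUtils 
{

	// returns the first form whose action attribute matches, or null if there is none
	public static WebForm getFormByAction(WebResponse wr, String actionName) throws Exception 
	{
		for(WebForm form : wr.getForms()){
			if(actionName.equals(form.getAction()))
				return form;
		}
		return null;
	}

	// dumps the response body to a file, handy for seeing what the site actually sent back
	public static void writeResp(WebResponse wr, String name) throws Exception
	{
		FileOutputStream fos = new FileOutputStream(new File(name));
		try {
			IOUtils.copy(new StringReader(wr.getText()), fos);
		} finally {
			fos.close();
		}
	}

}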
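
The one line in the scraper loop that is easy to misread is the `replaceAll` call that turns a download URL into a bare file name. Here is a minimal standalone sketch of just that step, so you can poke at it without logging in anywhere. The class name, method name and URLs are made up for illustration; only the regex idea comes from the code above.

```java
public class DownloadLinkNames {

    // Same idea as in the scraper loop: ".*/" greedily matches everything
    // up to and including the last '/', and we replace that with nothing.
    static String fileNameOf(String url) {
        return url.replaceAll(".*/", "");
    }

    public static void main(String[] args) {
        // Hypothetical URLs, for illustration only.
        System.out.println(fileNameOf("http://example.com/music/album/track01.mp3"));
        System.out.println(fileNameOf("track02.mp3"));
        System.out.println(fileNameOf("http://example.com/music/"));
    }
}
```

A URL with no slash comes back unchanged, and one that ends in a slash comes back empty. The regex knows nothing about query strings, so a URL with a slash inside its query string would get cut in the wrong place. That was good enough for a 20 minute script, but something like `java.net.URI` is probably the safer bet for anything longer lived.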

Simple Job Chaining

One of my most productive days was throwing away 1000 lines of code.

Ken Thompson

I have always felt that the best code I’ve ever written is code that never made it into production. I’m talking primarily about exploratory code – code that helps you to understand a framework or the problem domain which you will be working in. This is the code that helps you gather enough information to take you to the next logical step, which of course is deleting it and starting over. I think that this same mantra sometimes applies to frameworks as well. This occurred recently when I decided to toss Quartz – a job scheduling framework – in the trashcan in favor of my own solution.

I won’t go into too many specifics about Quartz, but the short and sweet version is that it was not the right tool for the job at hand. It just took me a while to understand that. Quartz is intended to do job scheduling in response to system events or at precise intervals. In essence it is a lot like having cron at your disposal in Java.

I don’t need a job scheduler. I need to poll web services with differing response times and update a user interface with the latest results. Scheduling polling jobs at tight, fixed intervals causes problems when one job runs long and into the next scheduled interval. When that happens things get messy and the UI becomes inconsistent with reality. What I needed was job chaining: running one job to completion and then running it again, until the chain is stopped. Here is my shot at a simple job chaining mechanism.

First, a Job interface:


public interface ChainedJob
{
	public void execute() throws Exception;
}

Next, something to actually run the job and manage the thread it runs on:

package com.codedsimply.jobchain;

import org.apache.log4j.Logger;

public class ChainedJobExecutor
{
	private ChainedJob job;
	private Thread jobThread;
	volatile boolean running=false;
	static Logger log = Logger.getLogger(ChainedJobExecutor.class);

	public ChainedJobExecutor(ChainedJob job) {
		this.job = job;
	}

	public void startJob(){
		log.info("starting job of type: " + job.getClass().getCanonicalName());
		running=true;
		jobThread = new Thread(new JobRunner(job, this));
		jobThread.start();
	}

	public void stopJob() throws Exception {
		running=false;
		// nothing to wait for if startJob() was never called
		if(jobThread != null)
			jobThread.join();
		log.info("stopped job: " + job.getClass().getCanonicalName());
	}

	public boolean isRunning() {
		return running;
	}

	public void setRunning(boolean running) {
		this.running = running;
	}

}

class JobRunner implements Runnable{
	ChainedJob job;
	ChainedJobExecutor executor;
	static Logger log = Logger.getLogger(JobRunner.class);

	public JobRunner(ChainedJob job, ChainedJobExecutor executor) {
		super();
		this.job = job;
		this.executor = executor;
	}

	public void run() {
		while(executor.isRunning()){
			try {
				job.execute();
			} catch (Exception e) {
				log.error("error running job: ", e);
			}
		}

	}

}

And finally, a job manager to keep track of all my chained jobs and manage starting and stopping them all:


package com.codedsimply.jobchain;

import java.util.HashMap;

import org.apache.log4j.Logger;

public class ChainedJobManager
{
	static Logger log = Logger.getLogger(ChainedJobManager.class);
	private HashMap<ChainedJob, ChainedJobExecutor> jobMap = new HashMap<ChainedJob, ChainedJobExecutor>();

	public void addJob(ChainedJob job){
		jobMap.put(job, new ChainedJobExecutor(job));
	}

	public void startJobs(){
		for(ChainedJob nxt : jobMap.keySet()){
			ChainedJobExecutor exec = jobMap.get(nxt);
			exec.startJob();
		}
	}

	public void stopJobs(){
		for(ChainedJob nxt : jobMap.keySet()){
			ChainedJobExecutor exec = jobMap.get(nxt);
			try {
				exec.stopJob();
			} catch (Exception e) {
				log.error("error stopping job: ", e);
			}
		}
	}

}

All this code and a unit test to run it are available here, if you care to use it.
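
If you just want to see the chaining idea run without pulling in log4j, here is a condensed, self-contained sketch. It is not the code from the download. The `Chain` class is a hypothetical stand-in that squashes `ChainedJobExecutor` and `JobRunner` into one, and the job is a made-up one that sleeps for 20 ms to fake a slow web service call.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ChainSketch {

    // Same shape as the ChainedJob interface above.
    interface ChainedJob {
        void execute() throws Exception;
    }

    // Condensed stand-in for ChainedJobExecutor and JobRunner:
    // one thread that re-runs the job back to back until stopped.
    static class Chain {
        private final ChainedJob job;
        private volatile boolean running;
        private Thread thread;

        Chain(ChainedJob job) {
            this.job = job;
        }

        void start() {
            running = true;
            thread = new Thread(() -> {
                while (running) {
                    try {
                        job.execute();
                    } catch (Exception e) {
                        // A failed run is reported and the chain keeps going.
                        System.err.println("error running job: " + e);
                    }
                }
            });
            thread.start();
        }

        // Clears the flag, then waits for the run in progress to finish.
        void stop() throws InterruptedException {
            running = false;
            thread.join();
        }
    }

    // Runs the made-up polling job for about `millis` ms and returns how
    // many times it completed.
    static int runChainFor(long millis) {
        AtomicInteger completed = new AtomicInteger();
        Chain chain = new Chain(() -> {
            Thread.sleep(20); // stands in for a slow web service call
            completed.incrementAndGet();
        });
        try {
            chain.start();
            Thread.sleep(millis);
            chain.stop();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed runs: " + runChainFor(200));
    }
}
```

Because there is exactly one thread per chain, a run can never overlap the next one, which is the whole point. The flip side is that the loop calls `execute()` again immediately, so a job that returns quickly will spin. A polling job probably wants its own pause inside `execute()`, like the sleep in the sketch.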

Hello World!

This is a blog about software development. Its primary purpose is to provide me with a landing place for software that could potentially be reused, in the rare cases that I write reusable software. I will try to let the software do most of the talking and only give a basic overview of whatever was tossing around in my head when the software was written. That being said, enjoy… or don’t. Doesn’t matter much to me :-)
