API

Most folks that are working with Mashups just assume that services and APIs will magically appear but unfourtunatly there are not that many public APIs around today. Just check out programmableweb and you will see. More and more are added every day, but it will bever reach the level that a majority of systems have an API, especially not if you think about systems within the coporate firewalls. Simply put there is a painful lack of APIs, and if that is not addressed it will stop the mashup wave in it’s tracks. Fortunatly there are already smart people working at this, and one of the solutions is to start using HTML as it is an API. That’s right, start using all the data and functionality that today is available in HTML to build new innovative mashups and solutions.

The potential of HTML

All new interesting applications (Skype being the exception that proves the rule) has an HTML interface. And this is true not just for the consumer facing applications, but for Enterprise level applications as well. So with the millions and millions of HTML pages in existance today it is not unlikely that HTML is one of the worlds most common data formats (I wonder how it compares to printed text and audio for example). The great thing with HTML is that it does not just contain data, it also is the interface to a whole lot of functionality (when you search Google you do that via HTML don’t you?). What if we could use HTML as one big API? That would make HTML the worlds most widespread API and that would give mashup developers and programmers access to more data and more functionality than ever before.

The problem with HTML

Almost not sites on the web today are following the HTML 4 standards. So todays browsers are very good at interpreting the tag soup that most pages consists of (ie broken HTML, missing end tags etc). Furthermore HTML is used to both mark up data in a document, for example with the <title> tag, and to mark up how the data should be presented, for example the <b> tag. All this together makes HTML documents unstructured documents (by implementation, not by nature) with data in very application specific formats (microformats will help here, but there will be some time before that is widespread enough to be really usefull).

Another problem is of course that there is fewer and fewer pages on the web that uses pure (albeit broken) HTML, there are more and more Javascript around. Especially in the Web 2.0 applications most of the really interesting functionality is available via AJAX. So it is not only HTML, but also Javascript that has to be taken into consideration when one wants to get to a web applications functionality.

Parsing

So we have huge amounts of data and functionality in HTML and we want to use it to make our latest funky Mashup. The good old approach is to try to parse the page in question using Perl, now it can be done pretty well using almost any modern programming language. There are several problems with parsing though:

It is damn complicated to get to work on serious web pages and once it is done it breaks easily
Good luck handling a real tag soup, already that breaks most parsers (since using XML parsers for this means that the parser simply stops at the first error it encounters)
It is boring to program those parsers (if you havent tried then lucky you)
Can not access functionality that uses javascript and AJAX
It is hard to handle things like login into a web application (ie session handling) and to navigate over several pages

Still this is a very usual approach to get to data and functionality in HTML. But there is a much easier way…

Web Scraping

I bet that a fair portion of the people reading the word “Web Scraping” think of old mainframe terminals and “Screen Scraping” and frown. Don’t worry, technology has moved forward lightyears since the days of mainframes. Web Scraping is to interact with HTML (including Javascript if it is a good scraper) and to either extract data from the HTML or repackage the functionality in the HTML. The data can be saved into a database or a file for example, and the functionality can be made available as a REST service, a programming language API or whatever else makes sense. Suddenly HTML is wide open. Just imagine that you wanted to get data from Digg (before the Digg API that came out a few weeks ago) for some reason, without an API that would be hard. But using a web scraper you could for example build a REST service out of the search on Digg only by accessing the HTML. Web Scrapers are used more and more for doing things like collecting large amounts of job ads or flight information and then repackage that data into sites that then allow users to search for a job or a cheap flight.

Openkapow

The web scraper of my choice is the one supplied on openkapow.com (disclaimer: I am working for Kapow Technologies, the company behind openkapow.com, but trust me in that I am not plugging openkapow to make my boss happy – it is really a great product). Using openkapow one can access data and functionality on any web page and access it as a REST service or and RSS/Atom feed. Of course JavaScript is handled automatically, it is possible to navigate multiple pages, login to restricted pages and have full control over the process flow with conditions and error handling. I recommend that anybody that is interested in how to use HTML as an API takes a look at openkapow.

An Eye Opener…

Thinking of HTML as an API does significantly expand your horizons as a developer. I have literaly seen a light go on in fellow geeks eyes when they realize the potential. Suddenly the web is really yours to use in your programs and mashups. When suddenly APIs and services are abundant then you can start using the other cool mashup tools around (Teqlo, jMaki etc).

Digitalistic

Mashup or die trying

HTML is the worlds most common API