Today's Web has terabytes of information available to humans, but hidden
from computers. It is a paradox that information is stuck inside HTML pages, formatted in
esoteric ways that are difficult for machines to process. The so-called Web 3.0, which is
likely to be a precursor to the real semantic web, is going to change this. What
we mean by 'Web 3.0' is that major web sites are going to be transformed into web
services - and will effectively expose their information to the world.
The transformation will happen in one of two ways. Some web sites will follow the
example of Amazon, del.icio.us and Flickr and will offer their information via a REST
API. Others will try to keep their information proprietary, but it will be opened via
mashups created using services like Dapper, Teqlo and Yahoo! Pipes.
The net effect will be that unstructured information will give way to structured
information - paving the road to more intelligent computing. In this post we will
look at how this important transformation is taking place already and how it is likely to
evolve.
The Amazon E-Commerce API - open access to Amazon's catalog
We have previously looked at Amazon's visionary WebOS strategy. The Seattle web giant is
reinventing itself by exposing its own infrastructure via a set of elegant APIs. One of
the first web services opened up by Amazon was the E-Commerce service. This service opens
access to the majority of items in Amazon's product catalog. The API is quite rich,
allowing manipulation of users, wish lists and shopping carts. However, its essence is
the ability to look up Amazon's products.
Why has Amazon offered this service completely free? Because most applications built
on top of this service drive traffic back to Amazon (each item returned by the service
contains the Amazon URL). In other words, with the E-Commerce service Amazon enabled
others to build ways to access Amazon's inventory. As a result many companies have come
up with creative ways of leveraging Amazon's information.
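To make the point about structured data concrete, here is a minimal sketch of how an application might consume a product lookup response. The XML below is a simplified, made-up shape for illustration, not the actual E-Commerce service schema; the element names `Items`, `Item`, `Title` and `DetailPageURL`, the sample products and the helper `parse_items` are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified response. A real service response would be
# richer and would follow the provider's own schema.
SAMPLE_RESPONSE = """
<Items>
  <Item>
    <Title>Example Book</Title>
    <DetailPageURL>http://www.amazon.com/dp/EXAMPLE1</DetailPageURL>
  </Item>
  <Item>
    <Title>Example Camera</Title>
    <DetailPageURL>http://www.amazon.com/dp/EXAMPLE2</DetailPageURL>
  </Item>
</Items>
"""


def parse_items(xml_text):
    """Return a list of {'title', 'url'} dicts, one per <Item> element."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "title": item.findtext("Title"),
            "url": item.findtext("DetailPageURL"),
        }
        for item in root.findall("Item")
    ]


if __name__ == "__main__":
    for product in parse_items(SAMPLE_RESPONSE):
        # Each item carries a link back to the retailer's site, which is
        # how an application built on the service sends traffic home.
        print(product["title"], "->", product["url"])
```

Because every field has a name, the consuming application needs no knowledge of page layout, and the link back to the product page travels with each item.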
The rise of the API culture
The Web 2.0 poster child, del.icio.us, is also famous as one of the first companies to
open a subset of its web site functionality via an API. Many services followed, giving
rise to a true API culture. John Musser over at ProgrammableWeb has been tirelessly
cataloging APIs and the mashups that use them. The site lists almost 400 APIs
organized by category, which is an impressive number. However, only a fraction of those
APIs are opening up information - most focus on manipulating the service itself.
This is an important distinction to understand in the context of this article.
The del.icio.us API offering today differs from Amazon's, because it does
not open the del.icio.us database to the world. What it does do is allow
authorized mashups to manipulate the user information stored in del.icio.us. For example,
an application may add a post, or update a tag, programmatically. However, there is no
way to ask del.icio.us, via the API, what URLs have been posted to it or what has been
tagged with the tag 'web 2.0' across the entire del.icio.us database. These questions are
easy to answer via the web site, but not via the current API.
Standardized URLs - the API without an API
Despite the fact that there is no direct API (into the database), many companies have
managed to leverage the information stored in del.icio.us. Here are some
examples...
How Web Scraping Works
Web Scraping is essentially reverse engineering of HTML pages. It can also be thought
of as parsing out chunks of information from a page. Web pages are coded in HTML, which
uses a tree-like structure to represent the information. The actual data is mingled with
layout and rendering information and is not readily available to a computer. Scrapers are
the programs that "know" how to get the data back from a given HTML page. They work by
learning the details of the particular markup and figuring out where the actual data is.
For example, in the illustration below the scraper extracts URLs from the del.icio.us
page. By applying such a scraper, it is possible to discover what URLs are tagged with
any given tag.
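The idea can be sketched in a few lines of Python using only the standard library. The HTML below is an invented sample, not actual del.icio.us markup, and the `bookmark` class name, `LinkScraper` and `scrape_urls` are assumptions made for illustration. A real scraper would first have to study the target page to learn which elements hold the data.

```python
from html.parser import HTMLParser


class LinkScraper(HTMLParser):
    """Collects href values from <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        # Keep only links marked as data; skip navigation and layout links.
        if self.css_class in classes and href:
            self.urls.append(href)


def scrape_urls(html, css_class="bookmark"):
    """Return the bookmarked URLs found in an HTML page."""
    scraper = LinkScraper(css_class)
    scraper.feed(html)
    return scraper.urls


# Invented sample page: data links are mixed in with navigation markup.
SAMPLE_PAGE = """
<html><body>
  <div id="nav"><a href="/help">help</a></div>
  <ul>
    <li><a class="bookmark" href="http://example.com/one">First link</a></li>
    <li><a class="bookmark" href="http://example.org/two">Second link</a></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for url in scrape_urls(SAMPLE_PAGE):
        print(url)
```

The fragile part is the knowledge baked into the scraper: if the site renames the class or restructures the list, the scraper silently stops finding data. That is one reason a published API is more dependable.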
This sounds great, but is this legal?
Scraping technologies are actually fairly questionable. In a way, they can be
perceived as stealing the information owned by a web site. The whole issue is complicated
because it is unclear where copy/paste ends and scraping begins. It is okay for people to
copy and save the information from web pages, but it might not be legal to have software
do this automatically. And scraping a page and then offering a service that
leverages the information without crediting the original source is unlikely to be
legal.
But it does not seem that scraping is going to stop, just as the legal issues with
Napster did not stop people from writing peer-to-peer sharing software, and just as the
more recent YouTube lawsuit is not likely to stop people from posting copyrighted videos.
Information that seems to be free is perceived as being free.
The opportunities that will come after the web has been turned into a database are
just too exciting to pass up. So if conversion is going to take place anyway, would it
not be better to rethink how to do this in a consistent way?
Why Web Sites should offer Web Services
There are several good reasons why Web Sites (online retailers in particular) should
think about offering an API. The most important reason is control. Having an API will
make scrapers unnecessary, but it will also allow tracking of who is using the data - as
well as how and why. Like Amazon, sites can do this in a way that fosters affiliates and
drives the traffic back to their sites.
The old perception is that closed data is a competitive advantage. The new reality is
that open data is a competitive advantage. The likely solution then is to stop
worrying about protecting information and instead start charging for it, by offering an
API. Having a small fee per API call (think Amazon Web Services) is likely to be
acceptable, since the cost for any given subscriber of the service is not going to be
high. But there is a big opportunity to make money on volume. This is what Amazon is
betting on with its Web Services strategy, and it is probably a good bet.
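The control and pricing argument can be illustrated with a small sketch: an API provider counts calls per API key, which shows who is using the data and how much, and bills a small fee per call. The fee, the key names, the call volumes and the `UsageMeter` class are all hypothetical and do not reflect any real provider's pricing.

```python
from collections import Counter
from decimal import Decimal

# Hypothetical fee per call, chosen only to make the arithmetic visible.
FEE_PER_CALL = Decimal("0.0001")


class UsageMeter:
    """Tracks API calls per key and computes what each subscriber owes."""

    def __init__(self, fee_per_call=FEE_PER_CALL):
        self.fee_per_call = fee_per_call
        self.calls = Counter()

    def record(self, api_key, count=1):
        """Register `count` calls made with the given API key."""
        self.calls[api_key] += count

    def bill(self, api_key):
        """Amount owed by one subscriber."""
        return self.calls[api_key] * self.fee_per_call

    def total_revenue(self):
        """Amount earned across all subscribers."""
        return sum(self.calls.values()) * self.fee_per_call


if __name__ == "__main__":
    meter = UsageMeter()
    meter.record("small-app", 2000)       # a hobby mashup
    meter.record("big-app", 5_000_000)    # a high-traffic affiliate
    for key in meter.calls:
        print(key, meter.calls[key], "calls, owes", meter.bill(key))
    print("total revenue:", meter.total_revenue())
```

Under these made-up numbers, the small subscriber's bill stays tiny while the provider's revenue grows with total call volume, which is the bet described above.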