Web Species blog

As you might know from our website we work with a lot of different technologies, tools and projects.
It only makes sense for us to share our experiences to help others build things faster and better.

Building the Edinburgh Festival API

A couple of months ago we started working on a very exciting project - building a data access API for the world’s largest cultural event, the Edinburgh Festival. It was quite a journey, and here I’m sharing how we built it and which stack we used. The goal is to show how we solved specific problems and how you might apply the same ideas to your own applications.

The problem

We started with a specification from Festivals Lab describing what this API was supposed to do, and on paper it was a really trivial task - outputting data from 7 different sub-festivals in one format. That is pretty much what the project was about. We took the data from various sources, processed it into formats we could understand, did some filtering, cleaning up and validation, and pushed that data out through the API.

Of course this is just a read-only API, and that changed the design a lot. There was no need to connect to an actual database from the API server; it talks to a search server instead, which is faster and more reliable for data lookups. Reliability was one of the requirements, because APIs should always work, in theory. Decoupling the database from the search server also increased fault tolerance and scalability.

What we tried to achieve is that even if all the servers died, we could spin up a new one, run some scripts and have the API back live. That meant data imports had to be deterministic and fast. Once we had that, we were certain that even if things went horribly wrong we could recover very quickly, while for smaller failures the redundancy of the database and search servers protects us.

The stack

Processing

Interestingly this is the part where we spent most of the time, because chasing data proved to be very challenging. However, once we had all the data in sensible formats, we wrote a collection of self-contained scripts to process it into our own formats.

The key lesson here was that Excel is used very commonly and that most Excel file parsers out there have problems, especially with Unicode characters, which worked fine in most cases but sometimes came out as unrecognizable characters. We solved this by exporting to CSV first and then reading that from the scripts.
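
A minimal sketch of the CSV route in Python, assuming the spreadsheet was exported as UTF-8 CSV. The column names and sample rows are made up for illustration, and an in-memory buffer stands in for the exported file:

```python
import csv
import io

# Hypothetical export: in practice this would be a file saved from Excel
# as UTF-8 CSV; an in-memory buffer keeps the sketch self-contained.
SAMPLE_CSV = "title,venue\nCafé Müller,Traverse Theatre\nÉtude , Usher Hall\n"


def read_events(stream):
    """Yield one dict per CSV row, keyed by the header line."""
    for row in csv.DictReader(stream):
        # Strip stray whitespace that spreadsheets tend to leave behind.
        yield {key: value.strip() for key, value in row.items()}


events = list(read_events(io.StringIO(SAMPLE_CSV)))
print(events[0]["title"])
```

With a real file you would pass `open(path, encoding="utf-8", newline="")` instead of the buffer, so the encoding is stated explicitly rather than guessed.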

Database server

Obviously data coming from the data sources needs to be stored somewhere. Overall the structure was very trivial, consisting only of events containing a list of performances, so MySQL could easily have worked here. The problem was that the data structure differed between festivals and was constantly evolving during development.

Hence we went with CouchDB because of its reliability and the fact that we can just store events as nested objects.

Search server

One of the key requirements of this API is being able to filter data efficiently. ElasticSearch was chosen because it integrates with CouchDB almost out of the box: setting it up was as hard as executing one query, and we had an (almost) real-time representation of the data stored in the database, searchable through the API.

ElasticSearch also lets you get started in 5 minutes without defining the documents’ structure up front; mappings can be added later to optimize performance. It supports anything Lucene supports too, so it wasn’t like we were throwing away possible features.

API

As I explained in the previous post, we use Python for all APIs, and this one wasn’t an exception. The actual API is just a couple of hundred lines of code - a very thin layer on top of the search server, which exposes an HTTP interface anyway, so there was no need for any client libraries either.
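
To give a feel for how thin such a layer can be, here is a rough sketch of turning API filters into a search request. It is not the project’s code: the index name, field names and host are hypothetical, and the query uses the bool/filter/term form of current Elasticsearch releases, which may differ from the version we ran. Nothing is sent over the network; the function only builds the URL and JSON body.

```python
import json
from urllib.parse import quote


def build_search_request(base_url, index, filters, size=25):
    """Return the (url, body) pair for a filtered search over HTTP.

    Nothing is sent here; the caller would POST `body` to `url`.
    """
    url = "{}/{}/_search".format(base_url.rstrip("/"), quote(index))
    body = {
        "size": size,
        "query": {
            "bool": {
                # One exact-match clause per filter, all of which must hold.
                "filter": [{"term": {field: value}}
                           for field, value in sorted(filters.items())]
            }
        },
    }
    return url, json.dumps(body)


# Hypothetical host, index and fields, purely for illustration.
url, body = build_search_request(
    "http://localhost:9200", "events", {"festival": "fringe", "year": 2011})
print(url)
```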

Nginx was used as a proxy in front of the API servers, each monitored by supervisor. Deployments are made using Ant and every task is a one-click operation, so any update can be rolled out in a matter of minutes.

The outcome

Well, it works. In our internal tests we tried to make it go down, but it sustained them all quite well. We can’t share performance characteristics of the application, but it’s pretty damn fast and the servers have no problems with the load.

There is logging for all incoming requests, storing information about which API user accessed what sort of information. This should produce some interesting results we might be able to share later. The API is going to be used in some mobile apps too, and we haven’t worked with those before, so we will learn how use cases for mobile applications differ from those for websites or desktop applications.

The Festival only runs for less than two months, so this project is quite short-lived, but there are big plans for next year to open the data even more. If it were up to us, we would make it fully public and not require any license agreements, but that is not the case, so we are going to push hard for it next year.

Conclusion

The clients seem to be very happy with the outcome, and we are happy because we had a chance to work on a very challenging and high-impact project. I made all those decisions, of course after discussions with the clients, and they seem to work so far. Any tips for future projects?

RESTful web services with Python. The easy way

More and more projects are exposing their functionality via REST APIs. We think APIs are awesome and it’s great what they’ve done for the web overall, but we also see a lot of examples of bad APIs, like the Twitter API. It might be that without the right tools it becomes hard to implement them correctly and quickly. Lately we have been working on a couple of APIs and I decided to share our experiences and why we went with Python in the end.

What should APIs do?

As little as possible, ideally. In most cases an API is just a layer on top of a database or search server, providing a RESTful way to access data and get it back in a form the client understands. The less code there is, the lighter it is and the easier it is to make changes. And with the growing popularity of exposing data and functionality through APIs, yours should follow REST and HTTP standards so it can be adopted in no time.

Depending on how you want to do it, you can go fully RESTful or just pretend that what you are doing is a REST API. Supporting hypermedia, for example, is something you are supposed to do, in theory. But at the very least the API should handle HTTP Accept headers correctly, use native HTTP authentication like Basic Auth and have meaningful resource URLs. Just that will make it much better than most APIs out there.

Importantly, web services should be as fast as possible and support huge numbers of requests per second. In most cases APIs are called from other applications, and the longer your API takes to respond the slower that application becomes. You should aim for no more than 10ms to fulfill a request. And if an application developer decides to retrieve some resources in a loop, your API server shouldn’t crash either.

Python or not

Although we might seem like a PHP company, we are not really - we use PHP for websites, where it works best, and nothing else. The reason we tend to go with Python is simply that it’s perfect for what I described above. There are libraries for pretty much anything when it comes to reading and writing data from any storage, and it’s super lightweight compared to a lot of other languages.

Don’t get me wrong - there is nothing wrong with other languages; it’s just that Python has worked really well for us. If you want to stay with PHP, for example, Frapi might be a good option for APIs, although a lot of things are not as easy to achieve as with Python, which is also a much more concise language. Performance is a debatable topic, but in our experience Python wins any day. “It scales”, that is.

From a functionality perspective, decorators let you achieve a lot without cluttering the application flow with endless listeners and callbacks. When I need to provide authentication for the API, I just wrap the application with @auth, and when data from some API call needs to be cached it just gets wrapped with @cache. This keeps the workflow really clear and doesn’t require nested if structures or duplicated logic. Decorators are used heavily in most Python web frameworks.
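
Here is roughly what such decorators can look like. This is a simplified sketch rather than our production code: the request is a plain dict, and the key store, exception and handler names are made up for the example.

```python
import functools


class Unauthorized(Exception):
    """Raised when a request carries no valid API key."""


VALID_KEYS = {"secret-key"}  # hypothetical key store


def auth(func):
    """Reject the call unless the request dict carries a known API key."""
    @functools.wraps(func)
    def wrapper(request, *args):
        if request.get("api_key") not in VALID_KEYS:
            raise Unauthorized("missing or invalid API key")
        return func(request, *args)
    return wrapper


def cache(func):
    """Memoise results by the arguments that follow the request."""
    store = {}

    @functools.wraps(func)
    def wrapper(request, *args):
        if args not in store:
            store[args] = func(request, *args)
        return store[args]
    return wrapper


calls = []


@auth
@cache
def get_event(request, event_id):
    calls.append(event_id)  # track real lookups for the demo
    return {"name": "Event " + str(event_id)}


request = {"api_key": "secret-key"}
get_event(request, 7)
get_event(request, 7)  # served from the cache, no second lookup
```

The order matters: with @auth on the outside, an unauthenticated caller is rejected before the cache is even consulted.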

API in a Bottle

Bottle is a one-file web framework based on WSGI, so it works just like any other Python framework. It’s not made specifically for APIs, but it works great for them. Its API looks a lot like Sinatra’s - it just maps routes to actions (functions). What is more, I find it allows very rapid development - in most cases I can write a whole API in less than a day.

Compared to other Python frameworks it doesn’t do anything that special, but where it shines is that a lot of the things can be either configured or swapped for different ones. It’s just a box of building blocks with some default behaviour, but from there you can really make it work in any way you want. If you need real-time web services allowing you to push data to clients, Tornado might be a better choice though.

Here is an example of a simple API, with the first function returning a plain string and the second returning a Python dictionary, which is automatically converted to a JSON string:

import bottle
from bottle import route, run

# Plain string responses are sent as-is
@route('/', method='GET')
def homepage():
    return 'Hello world!'

# Dictionaries are serialized to JSON automatically
@route('/events/:id', method='GET')
def get_event(id):
    return dict(name='Event ' + str(id))

bottle.debug(True)
run()

A lot of functionality can be tweaked using plugins, so rather than letting Bottle automatically convert data structures to JSON, you might want to use a plugin like this one to return data strictly in the type the client accepts:

import json

from bottle import HTTPError, request, response

SUPPORTED_TYPES = ['application/json', 'application/atom+xml']


class FormatPlugin(object):
    name = 'format'
    api = 2

    def apply(self, callback, route):
        def wrapper(*a, **ka):
            # Check if the requested data format is supported. This is an
            # exact match, so a header listing several types is rejected.
            accept = request.environ.get('HTTP_ACCEPT')
            if accept not in SUPPORTED_TYPES:
                return HTTPError(406, "Unsupported data format")

            # Execute the action
            rv = callback(*a, **ka)

            # Write out results in the format the client asked for
            response.content_type = accept
            if accept == 'application/json':
                return json.dumps(rv)
            # render_xml() is assumed to be defined elsewhere in the
            # application; it is not part of Bottle.
            return render_xml(rv)
        return wrapper
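
Real clients often send an Accept header listing several types with q-values, which an exact string comparison would reject. A possible helper for picking the best match is sketched below. The function name is made up, and it is deliberately simplified: it understands q-values and */*, but not partial wildcards like application/*.

```python
SUPPORTED = ["application/json", "application/atom+xml"]


def pick_format(accept_header, supported):
    """Return the supported media type the client prefers, or None."""
    candidates = []
    for position, part in enumerate((accept_header or "").split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type, quality = fields[0].lower(), 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if media_type:
            # Sort by quality first, then by the order the client listed them.
            candidates.append((-quality, position, media_type))
    for neg_quality, _, media_type in sorted(candidates):
        if neg_quality == 0:
            continue  # q=0 means "explicitly not acceptable"
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
    return None


chosen = pick_format(
    "text/html, application/atom+xml;q=0.9, application/json;q=0.5", SUPPORTED)
print(chosen)
```

A plugin could call this on the incoming header and answer with an error only when it returns None.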

When it comes to performance we haven’t maxed it out yet. Usually we run multiple Bottle applications behind an Nginx pool using uWSGI as the application server, which we chose because of the benchmarks done by Nicholas Piël, where uWSGI demonstrates amazing performance. For this article I ran some benchmarks on a VirtualBox VM with one core and was getting at least 2000 req/s from an API talking to MySQL and MongoDB servers to fetch different data.

Conclusion

If you haven’t tried Bottle before I suggest you do - installation is as easy as easy_install bottle and you can be up and running in just a few minutes. It worked well for us as it allowed creating web services quickly and customizing certain behaviours, but even with defaults it was suitable for most of the use cases. If you need help getting up and running, you can always get in touch with us.

Symfony2 - the best framework today?

I used to use Zend Framework extensively and still believe it’s the best framework for anything that doesn’t support PHP 5.3. However, a couple of months ago I started using Symfony2 for internal tools at Web Species and have stayed with it since. It has its problems and flaws, but let me give you some thoughts on why I think it’s the framework that is going to go big. Very big.

Frameworks are big creatures and listing their interesting features can take thousands of words, so this is just a short glimpse of the few things I personally find interesting. Obviously there is much more to it. Hopefully you won’t find yourself feeling like this guy:

I have nothing against Symfony2, I’ve been using it and it’s great. But this blog post is nothing but a gushing verbal sex exposition with Symfony as the subject.

What I like about it

First of all, once you grasp how it works, it starts to produce great results. I was really sceptical at first, because it seemed very complicated and over-engineered, but after a few days of work I started liking how it works. And that’s the thing about Symfony2 - after some time it starts to feel great to work with, and it even makes working with PHP interesting again. Once you know how to configure it properly it starts to play along.

It’s not a big secret that I’m quite a Doctrine user, and the fact that Symfony2 integrates with Doctrine2 out of the box just makes things easier. For example, working with forms is way simpler now because it can create forms from entity definitions and also populate entities with the submitted data. Integration with the Doctrine CLI is there too, so the only thing you need to do is specify connection properties and everything just works, in theory.

But that’s all code which is replicable in other frameworks; the most impressive bit is bundles. Bundles are small self-contained plugins that let you share functionality between different projects. GitHub is flourishing with all sorts of them, showing how many people are interested in this framework and actively using it. Using bundles keeps the Symfony2 core smaller, but also gives more flexibility in how you bootstrap a new project.

Business reasons

As much as I’m still involved in writing code, in the end I need to make business decisions about the tools, frameworks and languages we use. The tools I tend to go with are the ones I believe will stay around for years to come and are, obviously, high quality and popular. Without any doubt Symfony2 is the fastest growing framework and should quite quickly become the most popular of all.

There are a lot of lesser-known frameworks out there which are quite good and work reasonably well, but the reason I’m not using them is that they are not popular enough, which makes me question whether they are actually that good. It’s hard to find developers with experience, it’s hard to find blog posts and discussions online, and I’m unsure how long they are going to last with their limited list of contributors. Symfony2 makes me feel safe so far.

It’s hard to say why, but to me Symfony2 looks like the most professional and professionally developed framework. Partly because I personally know a lot of the companies and people working on it and I know they can deliver, but also because the quality of the code they produce is amazing. And with the number of developers constantly contributing, I can see it being actively developed and becoming even more solid.

The bad parts

A lot of things in this framework have yet to be proven to be the right decisions. I know a lot of people who love Symfony2, but at the same time there is a big group who just hate it, for different reasons. Maybe because it has a very steep learning curve, at least for now, but also because of some of the design ideas.

I’ve mentioned this briefly in my blog post about frameworks: I see a lot of Java patterns and ideas being used in PHP. This is a big topic and I might blog about it sometime. I have nothing against Java (he he), but even things like dependency injection containers just don’t feel right when used in PHP.

A DiC works fine when you can load the object graph into memory and provide dependencies when needed, but the nature of PHP is very different from Java’s, and that graph needs to be built on each and every request. Of course the performance impact can be reduced in some ways, but the difference is still there. So as much as I like the idea, I’m not yet fully confident that it will work well.

This brings us to another problem with Symfony2 - the amount of configuration involved. When I was starting with Symfony2 it took me quite some time to figure out how to achieve even simple things. Maybe the framework went too far in removing magic; right now it feels quite demanding, asking for everything to be configured explicitly. And once an exception is thrown somewhere deep in the core, good luck figuring out why it is happening.

Conclusion

I think we made the right decision choosing this as the base for our web apps, and we are starting to receive more and more inquiries about the Symfony training courses and consulting we do, which just shows that its popularity keeps growing. It does have some issues and sometimes just feels clunky, but overall it allows us to produce high quality projects and be sure about the results. I feel confident investing in Symfony2.

Lazy evaluation with PHP

Recently I needed to process a huge array of data, and because of PHP’s somewhat inefficient variables, and especially arrays, that was resulting in “out of memory” errors. However, I couldn’t use any tools other than PHP, so I was forced to come up with a solution in it. Here is how I solved it using principles from functional languages.

In programming language theory, lazy evaluation or call-by-need is an evaluation strategy which delays the evaluation of an expression until its value is actually required (non-strict evaluation) and also avoids repeated evaluations (sharing). The sharing can reduce the running time of certain functions by an exponential factor over other non-strict evaluation strategies, such as call-by-name.

What you are about to read is not necessarily about gaining extra performance; most of the time that is going to stay the same. The goal is to minimize memory usage as much as possible, so if you have a million rows to process, each weighing 10kb, the memory usage always stays at around 10kb. In theory, that is, because PHP is not that perfect at cleaning up memory.

The problem

Let’s imagine that you need to process those million database records. How are you going to do it? Well, obviously the first step is to fetch them from a database. Stop! I can’t even remember how many times I’ve seen people making mistakes here… Why? Eager evaluation - look this term up. Wait, I’m going to make sure you know what I’m talking about, so read about it on Wikipedia.

It all comes down to using the fetchAll() method. In the case I was dealing with, the script was fetching all orders in a specified time range from a database. fetchAll() returns an array with all results matching the query, which requires quite some memory if the number of results is in the thousands. Later the script was doing some calculations and creating a second results array, now with processed data. At the end of the main loop memory usage was row count * size per row * 2, a very big number.

It’s very rarely beneficial to do it like this - most of the time data can be processed row by row, removing the need to store everything in memory. Sometimes the SQL queries won’t allow it, but even those can usually be changed so that you can abandon pre-computation and work with the data as a stream. Once you have a data stream, you can start working with it using pipelines.
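
To make the pipeline idea concrete, here is a small sketch in Python, where generators make it easy to express. The stage names and sample rows are invented for the example; the point is that each stage pulls one row at a time from the previous one, so nothing is collected into a big array along the way.

```python
def read_orders(rows):
    """Source stage: hand rows over one at a time."""
    for row in rows:
        yield row


def only_paid(orders):
    """Filter stage: drop orders that were never paid."""
    return (order for order in orders if order["paid"])


def with_total(orders):
    """Transform stage: add a computed total to each order."""
    for order in orders:
        yield dict(order, total=order["price"] * order["quantity"])


# Hypothetical rows standing in for a database cursor.
rows = iter([
    {"id": 1, "price": 5, "quantity": 2, "paid": True},
    {"id": 2, "price": 3, "quantity": 1, "paid": False},
    {"id": 3, "price": 4, "quantity": 3, "paid": True},
])

# Nothing runs until the final consumer starts asking for values.
pipeline = with_total(only_paid(read_orders(rows)))
grand_total = sum(order["total"] for order in pipeline)
print(grand_total)
```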

Functional languages

Before looking at PHP solutions for this problem, let’s analyse how it can be done in a functional language. Imagine a function returning all Fibonacci numbers; here is an implementation in Haskell (from the Haskell documentation):

magic :: Int -> Int -> [Int]
magic 0 _ = []
magic m n = m: (magic n (m+n))

Calling it with magic 1 1 returns the list [1,1,2,3,5,8,…]. The important part is that this list, the return value, is infinite. There is no boundary like “return 100 numbers”; it will actually return all possible numbers. You might say that’s impossible, and you are kind of right, especially if you haven’t worked with similar functional languages before.

Because Haskell uses lazy evaluation (with strict being an option), calling this function doesn’t actually compute anything. Instead it creates a generator-like resource from which you can read as many numbers as you want, and it computes the next one only as you read it. So with a function like:

getIt :: [Int] -> Int -> Int
getIt [] _ = 0
getIt (x:xs) 1 = x
getIt (x:xs) n = getIt xs (n-1)

We can get the Xth Fibonacci number - call it like getIt (magic 1 1) 5 and the output should be 5, because the 5th number of the sequence is 5. The important part here is that even though I’m passing the result of magic to getIt, magic, as mentioned above, doesn’t need to compute anything up front. getIt reads 5 numbers from that infinite list and terminates, returning the last one.
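
For readers who don’t speak Haskell, roughly the same idea can be sketched with a Python generator. It is not a line-for-line translation: the magic 0 _ base case is left out, and a generator can only be read once, whereas a Haskell list can be shared.

```python
from itertools import islice


def magic(m, n):
    """Yield the sequence m, n, m+n, ... forever, one value per request."""
    while True:
        yield m
        m, n = n, m + n


def get_it(numbers, position):
    """Return the number at a 1-based position, reading no further.

    Falls back to 0 when the sequence is too short, like getIt does.
    """
    return next(islice(numbers, position - 1, None), 0)


fifth = get_it(magic(1, 1), 5)
print(fifth)
```

Calling magic(1, 1) on its own does no work at all; values are only computed while get_it is pulling them.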

PHP way

Sadly, you can’t really do anything like that easily in PHP, because it doesn’t support lazy evaluation or generators. However, it’s possible to improvise and get a working solution. And the solution is… iterators, one of the most underused features in PHP.

Any class which implements the Iterator interface can be used in foreach as a data source. Let me give you a short example of an iterator generating all numbers from 0 to infinity:

<?php
class Numbers implements Iterator
{
    private $position = 0;
    
    function rewind() {
        $this->position = 0;
    }
    function current() {
        return $this->position;
    }
    function key() {
        return $this->position;
    }
    function next() {
        ++$this->position;
    }
    function valid() {
        return true;
    }
}

$n = new Numbers;

foreach($n as $value) {
    print $value . PHP_EOL;
}

Running this script will print a never-ending list of numbers. But you can also stop after 10 iterations and get the 10th number without ever needing to store the whole sequence in memory. Of course this is going to be slower than just computing the 10th number directly, but using a similar approach you can process data one atomic unit per cycle without ever storing all of it in memory.

If you need to render data in a view, again remember the memory - do not compute the result in a model (or, god forbid, a controller) and then pass it to a view. Create an iterator in the model and return it for consumption in the view. For the view it is the same thing - arrays and iterators are both iterable - but you are saving a lot of otherwise wasted memory.

Database side

Aggregation should always happen on the database side when possible. So if you need a total of all items sold for each order, calculate that using the SUM() function rather than doing it while iterating over the results, because to do it in code you need to look back or forward in the result set, and that breaks lazy evaluation immediately.

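First of all, fetch data with the normal fetch() method in a while loop, rather than iterating over the fetchAll() result. Nothing will break, hopefully, but instead of building an array of all results and then processing them, the script will process each row and release it from memory. MySQL returns a cursor to the result set and the PHP driver uses that to get row after row. Note that some drivers buffer the whole result set on the client by default (PDO’s MySQL driver does, unless PDO::MYSQL_ATTR_USE_BUFFERED_QUERY is turned off), so check that setting if memory usage doesn’t drop.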
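
The same fetch-in-a-loop pattern, sketched in Python with the built-in sqlite3 module standing in for MySQL. The table and data are made up, and how much a real driver buffers behind the scenes depends on the driver.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, i * 10) for i in range(1, 1001)],  # made-up sample data
)


def stream_orders(conn):
    """Yield rows one by one instead of building a list with fetchall()."""
    cursor = conn.execute("SELECT id, amount FROM orders ORDER BY id")
    for row in cursor:  # the cursor hands over the next row on demand
        yield row


processed = 0
for order_id, amount in stream_orders(connection):
    processed += 1  # stand-in for real per-row work; the row is then dropped
print(processed)
```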

You might wonder how to render pages like a list of orders with their items. It’s quite easy: just join the orders and items tables and order the results by order id. The data you get back will look something like this:

Order ID   Item ID   Item name
1          156       Milk
1          897       Bread
2          156       Milk

To render this, just iterate over the results and, as soon as the order id changes from the previous one, start a new HTML table for that order. Using this approach you can render the whole history of all orders, with information about items too, without ever using more memory than the size of one row. The only limiting factor is then the size of the response HTML.
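
A sketch of that rendering loop in Python, using the sample rows from the table above. The markup is only illustrative, and the rows must already be sorted by order id for the grouping to work.

```python
from html import escape
from itertools import groupby
from operator import itemgetter

# Joined rows, already sorted by order id (as ORDER BY would return them).
rows = iter([
    (1, 156, "Milk"),
    (1, 897, "Bread"),
    (2, 156, "Milk"),
])


def render_orders(joined_rows):
    """Yield one HTML table per order, reading rows strictly in sequence."""
    for order_id, items in groupby(joined_rows, key=itemgetter(0)):
        yield "<table><caption>Order {}</caption>".format(order_id)
        for _, item_id, name in items:
            yield "<tr><td>{}</td><td>{}</td></tr>".format(item_id, escape(name))
        yield "</table>"


# Collected into a list only for the demo; a real view would write each
# fragment to the response as it is produced.
fragments = list(render_orders(rows))
print("\n".join(fragments))
```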

Conclusion

Obviously PHP wasn’t really created for anything like this, but situations where you need to make it work do happen. Make sure you are not building up any data or fetching it eagerly, and release resources as soon as you are done. This solved all the problems for me, and all of a sudden the same script was processing all the data without running out of memory.

HTML5 History API - dynamic websites like never before

I have talked about this before, but JavaScript should not dictate content or website structure. It should only improve the UI, and even with JavaScript disabled the website should work. The new HTML5 History API lets you take that one step further - making dynamic websites behave like normal ones.

What is History API?

The History API is quite a simple concept - a JavaScript API you can use to control history state. Without it, if a user clicks on an image and you show a lightbox with an enlarged version, clicking Back sends the user to the previous page rather than closing the lightbox popup. It does this because there is no state information the browser can use to know how to close that popup window.

With the History API you can add an entry to the history stack once some dynamic content is loaded. When the user clicks Back, the browser goes back by one entry in the history stack and fires an event, which you can then handle by closing the popup window. The same thing happens when the user clicks Forward - the event fires and it’s up to your script to handle it gracefully.

All this behavior is very well explained in the “Dive into HTML5” book, but in short: when you load some content using Ajax, or move the user to a place in a page which you want to be linkable, use the History API. Of course this involves handling the links on the server side too, but that is quite trivial - build the website first, then make use of the History API to improve the interface. In my tests this improved the user experience a lot and allowed very rich user interfaces without destroying the website structure.

Of course, as with most new HTML5 functionality, this is not supported in all browsers. Most importantly it is not supported at all in any IE version, not even IE9, which was released only a couple of months ago. But you can handle this by having an improved UI for some users and falling back to normal non-JS behavior for IE; this is what GitHub does, for example (click on folders).

How to use it?

There is only one main method:

history.pushState(state, title, link);

and one event:

window.addEventListener("popstate", function(e) {
    alert("location: " + document.location + ", state: " + JSON.stringify(e.state));
});

State example from Mozilla documentation:

history.pushState({page: 1}, "title 1", "?page=1");
history.pushState({page: 2}, "title 2", "?page=2");
history.replaceState({page: 3}, "title 3", "?page=3");
history.back(); // alerts "location: http://example.com/example.html?page=1, state: {"page":1}"
history.back(); // alerts "location: http://example.com/example.html, state: null"
history.go(2);  // alerts "location: http://example.com/example.html?page=3, state: {"page":3}"
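
If the sequence of alerts looks surprising, it may help to think of the history as a list with a pointer into it. Below is a toy model of that in Python. It is not browser code: the class and method names are mine, the title argument is left out, and it only mimics the bookkeeping.

```python
class HistoryStack:
    """Toy model of a browser's session history for one tab."""

    def __init__(self, url):
        self.entries = [(url, None)]  # (url, state) pairs
        self.index = 0

    def push_state(self, state, url):
        # Pushing discards any "forward" entries, then appends a new one.
        self.entries = self.entries[:self.index + 1] + [(url, state)]
        self.index += 1

    def replace_state(self, state, url):
        # Replacing overwrites the current entry without growing the stack.
        self.entries[self.index] = (url, state)

    def go(self, delta):
        # Return the entry a popstate handler would see, or None when the
        # move would leave the stack and is therefore ignored.
        target = self.index + delta
        if not 0 <= target < len(self.entries):
            return None
        self.index = target
        return self.entries[self.index]

    def back(self):
        return self.go(-1)


history = HistoryStack("example.html")
history.push_state({"page": 1}, "?page=1")
history.push_state({"page": 2}, "?page=2")
history.replace_state({"page": 3}, "?page=3")
first = history.back()
second = history.back()
third = history.go(2)
```

Because replaceState overwrote the page 2 entry, the stack only ever holds three entries, which is why go(2) from the first one lands on page 3.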

However I would recommend using some abstraction for this, mainly because you need to do quite some manual work, which I hate doing. A great tool exists called History.js which does exactly that - abstracts History API and also has fall-back support for lame browsers like IE. I’ve used it extensively and it works great and it even has adapters for jQuery et al so you can use the same interface.

The most impressive part of it is the optional fall-back for older browsers. It uses a URL similar to a hashbang, like http://webspecies.co.uk/#/about, which it handles internally, and all methods for getting the current URL still return /about. If you open this URL in a modern browser, it detects support for the proper History API and redirects to the correct URL. All combined, this makes everything work nicely and future-proof; here is an example of a full script for an Ajax page.

How I used it

When building our Web Species website, the designer had the idea of putting all content on one scrollable page. If you go to the homepage and start scrolling you’ll notice that immediately - the menu stays in place while the content slides scroll as usual. Now even though this is very nice from the user’s perspective, it creates a problem on the content side. A very big problem.

The main problem is that there is no way to link people to a specific slide and that it’s hard to get good rankings on Google. The first problem can easily be fixed by using anchor URLs (like http://webspecies.co.uk/#about), but fixing SEO is, I believe, impossible. Impossible because one page contains a lot of content and you won’t get good rankings for any keywords.

If you look at the source of http://webspecies.co.uk/about you’ll see the correct title and only the “about” content, but as soon as you load it in a browser, AJAX loads all slides and replaces the existing slide with them. So from the user’s perspective the one-page effect is still there, but the actual pages can be requested directly (on the server side I have all the separate pages like about, training, clients etc. and one with all slides).

To fix linking I employed the History API and it worked out beautifully. If you click a menu item on the left side of the website, an AJAX event fires, resulting in two actions - pushState() and a scroll to the correct slide. The same thing happens on Back or Forward history actions - the content is scrolled to where it belongs. If you go to the news section, clicking on news items results in the same behaviour too.

Conclusion

As with all HTML5 functionality, the History API is still quite rarely used, but if you feel like building something awesome - I’d say give it a go. It works beautifully in modern browsers and there are tools which can make it work in older browsers too. From my point of view, it lets you achieve crazy UI ideas and still have semantically correct websites, which is always my goal.
