Web Species blog

As you might know from our website we work with a lot of different technologies, tools and projects.
It only makes sense for us to share our experiences to help others build things faster and better.

One month later... Azure+ is dead

What a ride this was… Just over a month ago I posted an article on the project we were secretly working on - Azure+, the PHP cloud platform built on top of Windows Azure infrastructure. Since then functionality was improved, amazing additions were planned, I travelled tens of thousands of miles and showed it to hundreds of people. And now I can announce that it’s dead. Here is what happened.

If you want to know the “Why?” part you can skip the first few sections altogether and just go straight to the sad part. However, this case is a bit complicated… Well ok, it’s very complicated. And I can’t talk about the main reason at all, so the only thing you will find is an “abstraction” of what happened. Something that I have no control over.

The launch

The blog post about the project has to be the most popular post on this blog; it received well over 10’000 unique visitors in a day. I think this is quite a big number, fueled mainly by social networks, as everyone wanted to see what can be done with Windows and PHP. That’s why it had the “Wait… what?” at the end of the title - I assumed a lot of people would question the value of this kind of project.

And oh my god, there were a lot of people who did. Microsoft guys were internet-high-five’ing us all the way, but the majority of everyone else was either still questioning the point of this or plainly calling it stupid. They had their reasons though - Windows and PHP are not really a welcome combination, mainly because of prejudice, since it works just fine, and a PaaS shouldn’t “expose” the OS it is running on anyway (as much as that’s possible). Also, similar projects like Orchestra.io or PhpFog had already taken off.

Ignoring the arguments online about whether it was a good project, the prototype that we launched worked flawlessly. One of the biggest demos was in my PHP in the Cloud keynote at the PHP Barcelona conference. Deploying an app on stage with 500 people watching… crazy, but it worked. Not once did it fail, even after showing it hundreds of times to all sorts of different people.

What are people looking for?

You might not realize this, but the PHP ecosystem is a bit different from that of any other language (as they are different from each other). PHP developers usually start their careers by just hacking on some code locally, because it only takes minutes to set up a PHP environment. And if something doesn’t work - fix it and refresh. No need to recompile or deal with DLL hell. This obviously brings some disadvantages, but this post is not about how good or bad PHP is.

Because of this upbringing, the tooling for PHP developers should follow the same idea. That’s what we were trying to achieve, while pushing the industry to do the same. One of the key elements of the platform was that you could always push code directly. As awesome as pulling from Git sounds, you can do way more than that by just executing one terminal command. And if it doesn’t work - push again and refresh.

A lot of Microsoft folks asked us about all sorts of different enterprise behaviours and how we were planning to support them. We were not, because most PHP (and even non-PHP) projects do not need them. Need a custom PHP setup - build your own servers, you are obviously qualified enough. The same applies to all sorts of custom or niche functionality and technologies. That’s why we didn’t overcomplicate our solution: we solved the problem for the majority of folks, and for anyone who needs flexibility we don’t support - platform as a service is not right for you.

Future architecture

Azure+ wasn’t about being an abstraction on top of Windows Azure, even if initially it looked like one. The reason apps would take 15 minutes to create was that we were spinning up new instances for every new app, which takes some time. Nonetheless, the goal of Azure+ was to deploy apps to a “shared server”, similarly to how Heroku does it (in quotes because it was not trying to be a shared hosting platform).

Imagine that each server has N slots; each slot can host an app in a fully isolated and secured environment. If it happens that your app needs more resources than the slot can provide (the limit is what stops it from killing other apps), you get more slots - on the same server or on other servers. Apps can scale beautifully, no extra modifications are required, and the price of the service can match any other PaaS and beat it. If that’s too crazy for you - you would still be able to choose dedicated boxes.

Once we had that and the MySQL setup done, I’d say it would have been ready for the first public beta. We weren’t that far off actually - it would only have required some hardening of the web server and PHP, a bunch of reverse proxies and a DNS server. This setup works beautifully because applications can be scaled very dynamically by us and there is a cluster supporting the infrastructure, so it’s highly reliable. But then some things happened…

Good bye

I was put in a position to kill it. Not forced, but continuing would make no sense. That’s as much as I’m going to say - I was advised to keep my mouth shut by some clever people in suits. Of course I’m sad to let it go after so much effort and time we have put into this, but sometimes ideas just don’t work out.

P.S.
The name Azure+ was never an issue; it was chosen as a codename and would have been changed as the project progressed.

We built a cloud platform for PHP. Wait… what?

We built a cloud platform for PHP. Yep, you heard it correctly. We see a huge opportunity in the market and are willing to work hard to make deploying PHP projects very easy. However, this one is different, and here is the story behind it and what it can do for you.

We call it Azure+. Just as Notepad++ relates to Notepad, Azure+ is Azure done right and usable. This is a code name though, which might change once this goes to production. As will the design, which currently works as a good basis and is built on the great Twitter Bootstrap framework.

Why Azure?

Current workflow with Azure, original from XKCD

There is nothing specific about Azure that we wanted to leverage, but because so many existing PaaS providers are built on Amazon cloud it just made sense to try something else. Furthermore, I have a lot of experience with Windows and PHP so it all felt like a good plan. I think we are awesome enough to make Azure rock for PHP, because…

Azure is just impossible to use for PHP today. This is a fact. Doesn’t matter which way you look at it, it just su.. isn’t particularly good. The number of steps you need to take, the knowledge you need to have and the fact that you can only deploy from a Windows host are some of the things which make it a very painful experience. I had enough of this pain.

Most importantly, I find Microsoft’s approach and tooling lacking in so many areas that the only way I knew how to fix this was to build a service on top, rather than release Azure+ as a product or open source project. There was and still is no way I can change the 15-20 min. deploy time (try debugging a non-working app when you have to wait half an hour before every retry), so we built something which overcomes it.

Oh God no, Windows?!

It’s not a big surprise that Azure is running on top of Windows; it’s a Microsoft cloud at the end of the day. I know a lot of PHP developers feel very negative about Microsoft and Windows specifically. Well, Internet Explorer 6 specifically, but Windows is not much better either. But that is something you would only care about if this were an infrastructure service.

Azure+ is Platform as a Service, or PaaS for short. What that means is that you deploy apps to a cloud black box and the infrastructure it is running on is completely irrelevant to you. There is more work to be done to make it truly PaaS, but our goal is to make deploying to this service completely headache-free and to just make everything work*.

An important fact to note: this is not developed under any collaboration or affiliation with Microsoft, and thus it’s our own decision where we’ll take it from here. I think PHP support on Windows is as good as on any other OS and all the PHP apps I tried (Zend Framework, Symfony2, Lithium) worked pretty much out of the box.

Features

First of all, PHP developers start by writing PHP code, because to start learning PHP you only need Apache installed and that’s it. Hack on some code, click refresh and you see the result. That’s what PHP is. That’s why a wait of at least 15 minutes is just something a PHP developer wouldn’t want to put up with. We made it faster. How about 5 sec. or less deployment time?

Furthermore, in the core we have mechanisms which allow us to support and change the PHP configuration and version in the same short time. So you can try different PHP versions with one mouse click or switch off display_errors when your app is ready to go live. Currently you can only choose from two PHP versions and the error reporting mode, but there is more to come.

Speed of deployments and configuration freedom are a good base to build on. But there is more baked in, like an API which allows pushing code directly and a service which will pull from a specified Git repository automatically. Right now we are working on adding MySQL support, so you can port pretty much any existing app. It’s a great core platform which allows adding new functionality very, very easily.

Reception

It has been an unbelievable journey so far and we learned an insane amount about Azure itself and how to make PHP deployments blazing fast. Some things required hours to tackle, but in the end we made sure that our users are never going to have to deal with them. And believe me, there are a lot of ways to shoot yourself in the foot when working with Windows.

This is a project which needs feedback, especially from people who know PHP, the cloud stack etc. really well. I was running demos and giving access to some people I know and, I think, they were really impressed with the stack. Also, because it relies heavily on the Microsoft stack, I spent the past two weeks demoing it to a select group of Microsoft friends and so far the reception has been amazing. To quote one:

I think you could single highhandedly revolutionize Azure

I think this is a great achievement for the PHP community too, because a lot of the functionality we support is not available in some of the leading services, so this should kick their asses a bit. We want to stay competitive and keep pushing the PHP ecosystem further, but when it comes to standards, we’ll adopt any upcoming specifications for PHP platforms.

Conclusion

Currently a group of 15 or so people is actively testing this and is sending us valuable feedback. Nevertheless it’s quite close to production-quality service and you’ll hear more about it very soon. If you feel like you’d like to test this (completely free of charge) and would be able to provide some good thoughts, feel welcome to write to me. You can find more details about Azure+ here.

Dependencies management in PHP projects

A project rarely lives by itself, especially in the days of frameworks. Furthermore, there are a lot of great open source libraries you might want to use to save time. But all of this raises a new problem - how do we manage all those dependencies? Here are some thoughts on this problem and how you might want to solve it without shooting yourself in the foot, a situation commonly known as DLL Hell.

Usually SVN- or Git-integrated external reference management tools are used for this. But… Version control systems are not made for managing dependencies. Period. They can be made to do so, but sooner or later they are going to fail at it. This is a fact and there is no way of avoiding it; if you don’t trust me on this, here is some proof.

Version control systems

The most popular one a couple of years ago was svn:externals for SVN, which is quite similar to git submodule for Git. The first obvious problem is that they both only support referencing repositories of the same type, that is, you can’t include a Git dependency in an SVN project. Today this is very problematic, because you might still be using SVN (although I’m not sure why you would) while a lot of open source projects have moved on to GitHub.

If you are fine with the above, I think you will quite quickly be annoyed by the fact that those sub-folders you are automatically populating are in fact full checkouts by themselves, and thus not read-only. This is potentially a very risky design characteristic, because most of the time you aren’t supposed to commit from those checkouts, even if you have changed something there.

Git users might be disappointed to know that submodules do not support partial checkouts, that is, you can only check out full repositories. This works fine most of the time, but quite often you’d like to check out a sub-folder of the repository (for example only the library folder from Zend Framework). There is a solution for that called subtree merge, but I find it way too complicated for my liking and I have only used it a handful of times.

How far you want to go

The most obvious use of external dependencies is to get a copy of the framework you are using. This is quite a simple task because it can even be solved by just downloading a copy of the framework and sticking it in the project folder. Easy enough to manage, although not ideal. If you have fewer than, say, 5 such dependencies then any way you choose to manage them is going to be fine. As long as they don’t have dependencies themselves…

Dependencies are actually much more complicated than that. If you are using truly componentized libraries, those are going to have some dependencies of their own. This introduces the transitive dependencies problem, which you can’t easily solve. This is not such a big deal for PHP projects, because the biggest place where such libraries exist is PEAR (pear.php.net) and the tools there will help you with that. Anyhow, keep this in mind.

As you can see, depending on what sort of external code you are trying to pull in, there are different problems attached to it. From my experience simple management of the dependencies is enough, because I’m yet to see a big number of libraries with clearly defined dependencies. So unless this changes soon, I just use the simplest tools available.

Tools made for this

One of the best known tools is Apache Maven, especially if you have Java experience. It does everything you’d want from a dependency manager and probably more, but having used it for a couple of projects I think it overcomplicates what we need for our projects. Maybe because I haven’t worked on projects complicated enough, but more likely because I just don’t find tools like this attractive and valuable.

You might also want to use PEAR for dependency management, although it requires external libraries to be stored in PEAR repositories. Similarly there is the Composer project, which tries to solve a lot of dependency problems and can resolve them from various different sources, but it still seems to be in development and I haven’t played with it enough. I think Composer might be the one to watch.

Symfony2 has an interesting approach of just having a deps file which defines where all the dependencies are and where to place them. Think of it as a very light build recipe. Following a similar approach I have extended it and added support for different repository types and sub-repository checkouts. Having just one script to run, ./bin/vendors install, looks great to me.

Things you shouldn’t forget

With the growing popularity of Symfony2, there are more and more bundles floating around on GitHub. However, I recently had a case where a new developer checked out a project and it wasn’t fully working. Apparently, within a couple of weeks one of the bundles had been refactored, resulting in all previous integrations failing completely.

It was my stupidity not to lock in a specific revision of the bundle (which you can do using the deps.lock file), but it is likely that you will repeat this mistake too. The fact you need to understand is that most of the time you will be pulling in 3rd party dependencies which you have no control over, and if you depend on them heavily, which you probably do, you need to know the specific version you want to use. Point to a stable version if they have one, because it’s extremely bad to just point your reference to a master branch.
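
A small check run from a build script can catch unpinned dependencies before a new developer does. This is only a sketch: it assumes an INI-style deps file in which each section may carry a version key, and the function name findUnpinnedDeps and the ExampleBundle entry are made up for illustration.

```php
<?php
// Hypothetical helper: list dependencies that are not locked to a version.
// Assumes an INI-style deps file where each [section] may have a "version" key.
function findUnpinnedDeps(string $depsIni): array
{
    // INI_SCANNER_RAW keeps values such as URLs exactly as written.
    $deps = parse_ini_string($depsIni, true, INI_SCANNER_RAW);
    if ($deps === false) {
        throw new InvalidArgumentException('Could not parse the deps file');
    }

    $unpinned = [];
    foreach ($deps as $name => $settings) {
        $version = $settings['version'] ?? '';
        // No version at all, or a moving target such as a branch name.
        if ($version === '' || in_array($version, ['master', 'origin/master'], true)) {
            $unpinned[] = $name;
        }
    }
    return $unpinned;
}

$example = <<<INI
[symfony]
    git=http://github.com/symfony/symfony.git
    version=v2.0.4

[ExampleBundle]
    git=http://example.com/ExampleBundle.git
INI;

print_r(findUnpinnedDeps($example)); // lists ExampleBundle, which has no version
```

Failing the build when this list is not empty is a cheap safety net; it does not replace a proper lock file.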

Furthermore, a library can go away completely. If it’s hosted on an obscure small website you just found, a month down the road there could be no trace of it anymore. Even GitHub doesn’t protect from this - a repository can be deleted and will be gone forever. So you need to make a choice - how much do you trust the authors - and preferably back up the source code locally (or set up a mirror repository).

Conclusion

One way or another you will have to solve this problem. I’d say the easiest way to start with and have some room to grow is to integrate this with your build scripts, even as a simple bash script. Don’t try to reinvent the wheel if you need something more sophisticated - working tools exist already, so just give some of them a go. Just make sure you know what version you want to depend on and have some safety nets against disappearing sources.

Never trust your sources

Data validation sounds like an obvious thing and it appears that everyone is doing it, but here are some ideas on where you might be doing it wrong. This is not primarily a practical examples article, I’d assume those are pretty easy to figure out; it is more about the implications and causes of various validation errors. All of them are ones we have suffered from before, so make sure not to repeat the same mistakes.

This post is not about security, although security is probably one of the most important users of validation. Here I’d like to talk about other use cases of validation, mainly how to make sense of the data you are receiving and make sure it’s not breaking your applications.

Obvious rules

If you expect an integer, check if it’s an integer. If you expect a date, check if it’s a date. It doesn’t matter if it is an admin interface and

“only ourselves will be using it, so we will be always entering valid data”

This is not phpMyAdmin you are building (and even that validates what you have entered before storing it in a database). Making sure there is no way to mess up the database from any app will save you time. And grey hair.

More than once I have seen applications not checking what they accept as the price of a product and then failing to render any subsequent screens because math operations on it are invalid. It’s especially bad when users can’t fix it themselves and need to contact you, the developer, to handle that from the other end. If I enter 1’000’000’000 into the stock quantity field, make sure the whole app doesn’t explode trying to insert that many rows into a database.

Make assumptions

Programming, I believe, is about logic. So don’t be an idiot and use some of it. Ask questions about the input you are receiving, be it user entered data or auto-imports from external sources, and lock down the expectations. Here are some basic rules, just examples of course, sounding so natural, but still I rarely see them in practice:

  • Product price larger than 0, smaller than 1000
  • Person’s age is between 0 and 150
  • Stock quantity is between 0 and 50
  • Order cannot have date in the future
  • Etc.

Obviously it depends on the application you are building and these might change, but most likely they won’t, and allowing arbitrary data to be entered can create huge problems. This is better than the blacklist approach, which is not really practical, as it requires specifying what your data can’t be rather than what it can be.
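
To make the list above a bit more concrete, here is a minimal sketch of such whitelist checks in PHP. The function names are mine, and the limits are just the example ranges from the list, so adjust both to your application.

```php
<?php
// Whitelist checks: each function returns the clean value, or null if rejected.
// The limits are the example ranges from the list above, not universal rules.

function validPrice($value): ?float
{
    $price = filter_var($value, FILTER_VALIDATE_FLOAT);
    return ($price !== false && $price > 0 && $price < 1000) ? $price : null;
}

function validIntInRange($value, int $min, int $max): ?int
{
    $int = filter_var($value, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => $min, 'max_range' => $max],
    ]);
    return $int === false ? null : $int;
}

function validPastDate(string $value, DateTimeImmutable $now): ?DateTimeImmutable
{
    // "!" resets the time to midnight; comparing the formatted value
    // rejects impossible dates such as 2011-02-31.
    $date = DateTimeImmutable::createFromFormat('!Y-m-d', $value);
    if ($date === false || $date->format('Y-m-d') !== $value || $date > $now) {
        return null;
    }
    return $date;
}

var_dump(validPrice('19.99'));                  // accepted
var_dump(validIntInRange('200', 0, 150));       // age out of range: rejected
var_dump(validIntInRange('1000000000', 0, 50)); // stock quantity: rejected
var_dump(validPastDate('2999-01-01', new DateTimeImmutable())); // future order: rejected
```

Returning null instead of silently clamping the value keeps the decision about what to do with bad input in one place, the caller.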

If you are importing data from an external source, does it make sense for the result to be empty? If it’s a list of events it cannot be empty, unless the whole thing got cancelled. So discard such an import and log that you got 0 events when you expected at least a dozen… do not delete all events from your database. That is - do not trust the source 100%, use your, or your code’s, head too.
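
A guard like this takes only a few lines. The sketch below uses a made-up function name, acceptImport, and the minimum of 12 is just the "at least a dozen" from the text; pick a threshold that fits each source.

```php
<?php
// Hypothetical guard: decide whether an imported batch is safe to apply.
// $minExpected is an assumption you set per source ("at least a dozen events").
function acceptImport(array $events, int $minExpected): bool
{
    if (count($events) < $minExpected) {
        // Log and keep the existing data instead of wiping it.
        error_log(sprintf(
            'Import discarded: got %d events, expected at least %d',
            count($events),
            $minExpected
        ));
        return false;
    }
    return true;
}

$imported = []; // e.g. the source returned nothing today
if (acceptImport($imported, 12)) {
    // only here would the events in the database be replaced
}
```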

Structure

This one is so easy to check that it’s not even funny when your data imports go wrong. It can be an Excel spreadsheet, some custom format or XML serialized data. All of those have structure, which you should be able to rely on. Personally, if the data format changes, I make sure that my code stops processing it immediately, because it no longer knows what any of it means.

For tabular data it’s very easy to check the table headers - how many there are and what their labels say. The order can change, and you can figure out how to handle this, but if some fields are missing it indicates that the actual data may be mixed up too. XML might be trickier to check as it has nested structures, but one could validate it against a DTD. If an additional price element is added the code might still work, but it no longer knows whether it’s using the right one.
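
For the tabular case, the header check could look roughly like this. It is a sketch with invented column names (sku, quantity, price) and a made-up helper name, rowsByHeader; rows are plain arrays here, as you would get from any CSV or spreadsheet reader.

```php
<?php
// Hypothetical helper: turn rows of a tabular import into associative arrays,
// refusing to continue if the header row is not exactly what we expect.
function rowsByHeader(array $rows, array $expectedHeaders): array
{
    $headers = array_map('trim', array_shift($rows) ?? []);

    $missing = array_diff($expectedHeaders, $headers);
    $unknown = array_diff($headers, $expectedHeaders);
    if ($missing || $unknown || count($headers) !== count($expectedHeaders)) {
        throw new UnexpectedValueException(sprintf(
            'Header mismatch (missing: %s; unknown: %s)',
            implode(', ', $missing),
            implode(', ', $unknown)
        ));
    }

    // Column order may change, so address cells by label, not by position.
    $result = [];
    foreach ($rows as $row) {
        if (count($row) !== count($headers)) {
            throw new UnexpectedValueException('Row has the wrong number of cells');
        }
        $result[] = array_combine($headers, $row);
    }
    return $result;
}

$rows = [
    ['price', 'sku', 'quantity'], // reordered, but all labels are present
    ['9.99', 'ABC-1', '3'],
];
print_r(rowsByHeader($rows, ['sku', 'quantity', 'price']));
```

Throwing on any mismatch is deliberate: the import stops as soon as the code no longer knows what the columns mean.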

There are cases when you might not know all possible data formats, like what I noticed recently when importing some data from Amazon reports. Everything seemed fine when we were testing, but once we launched, some products were reported with wrong quantities. The type field, which we stupidly ignored because it always showed ‘Sellable’, can apparently hold other values too, and when it does you should act differently. Because we ignored it, the data we imported didn’t make sense. What we should have done is validate the data with our assumption about the type field in place; this would have notified us about the unseen format.
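
Putting that assumption into code could be as simple as the sketch below. Only the ‘Sellable’ value comes from our case; the field names, the other type value and the function name are invented for the example.

```php
<?php
// Hypothetical check for a report where only some "type" values are understood.
// 'Sellable' is the value from the text; any other value is treated as unknown.
function splitByKnownType(array $rows, array $knownTypes = ['Sellable']): array
{
    $known = [];
    $unknown = [];
    foreach ($rows as $row) {
        if (isset($row['type']) && in_array($row['type'], $knownTypes, true)) {
            $known[] = $row;
        } else {
            $unknown[] = $row; // do not import; report it instead
        }
    }
    return [$known, $unknown];
}

[$known, $unknown] = splitByKnownType([
    ['sku' => 'ABC-1', 'type' => 'Sellable', 'quantity' => 3],
    ['sku' => 'ABC-2', 'type' => 'SomethingElse', 'quantity' => 1],
]);
printf("%d rows imported, %d rows need a human to look at them\n", count($known), count($unknown));
```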

Encoding

We had this issue when working with the Edinburgh Festival API while migrating some import sources to a new location. Everything seemed fine and data was successfully imported, but after some time users pointed out that some of the characters we were returning were invalid Unicode sequences. After some investigation we found out that converting from ISO-8859-1 to UTF-8 was in fact incorrect and Windows-1252 should have been used as the source encoding.
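
To show why the choice between the two matters, here is a small sketch (the sample bytes are mine, not from the festival data). The two encodings agree on most bytes but differ in the 0x80-0x9F range, which is where Windows-1252 keeps its curly quotes and dashes.

```php
<?php
// Byte 0x93 is a curly opening quote in Windows-1252, but an invisible
// C1 control character in ISO-8859-1. Requires the mbstring extension.
$bytes = "\x93quoted\x94";

$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
$asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // starts with c293: U+0093, a control character
echo bin2hex($asCp1252), "\n"; // starts with e2809c: U+201C, a left double quote
```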

Obviously it’s very unlikely that the encoding of the data will change, but, you know, sometimes things happen when you least expect them. Annoyingly, encoding issues are easiest for humans to spot: any single-byte encoding can be applied and it will work, kind of, but only an actual person would notice that the resulting text doesn’t make sense. Luckily computers can now guess too, for example by running this:

<?php
$str = "\xE1\xE9\xF3\xFA"; // 'áéóú' as ISO-8859-1 bytes, spelled out so the file's own encoding doesn't matter
var_dump(mb_detect_encoding($str));

However, the results of this function are sometimes pretty unpredictable, although you can still use it to detect whether data is a valid Unicode string, for example:

<?php
$str = "\xE1\xE9\xF3\xFA"; // 'áéóú' as ISO-8859-1 bytes
var_dump(mb_detect_encoding($str, 'UTF-8')); // non-strict: 'UTF-8'
var_dump(mb_detect_encoding($str, 'UTF-8', true)); // strict: false

And if you are converting from a specific encoding to another one, say Unicode, test that the result doesn’t contain invisible characters or any other sequences that are impossible in human-readable text.
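
One way to automate that test is sketched below. The function name and the exact set of "suspicious" characters are my assumptions; mb_check_encoding is the strict validity check, and the regular expression looks for control characters that normal text shouldn’t contain.

```php
<?php
// Hypothetical sanity check for text after it has been converted to UTF-8.
// Requires mbstring; the set of "suspicious" characters is an assumption.
function looksLikeCleanText(string $text): bool
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return false; // not even valid UTF-8
    }
    // Control characters other than tab, line feed and carriage return
    // (including the C1 range U+0080-U+009F) should not occur in normal text.
    return preg_match('/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u', $text) === 0;
}

var_dump(looksLikeCleanText("caf\u{E9}"));    // valid text
var_dump(looksLikeCleanText("caf\xE9"));      // raw ISO-8859-1 byte: invalid UTF-8
var_dump(looksLikeCleanText("\u{93}quoted")); // C1 control left over from a wrong conversion
```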

Conclusion

I can’t stress enough how important data validation really is. There are so many attack vectors exploiting incomplete or faulty validation that you can never be 100% sure all cases are covered. But rather than building a blacklist, go with the whitelist approach, because most likely it’s going to be better, and if conditions change you can always fix it later.

Web Scraping is actually pretty easy

For some of our clients we worked on extracting or submitting data automatically from websites which didn’t have an API we could use. This and more is called web scraping. Since our announcement of SellerScout, which relies heavily on this, I have received a number of questions about how we actually do this. So here are some thoughts on how to get started in the interesting world of web scraping.

This article talks about some basics, which will work fine for most cases. This is probably not even remotely close to how Google does this or how we do it at SellerScout. The reason is that both of those systems work at a much larger scale and their use cases are different. For example, machine learning, text analysis and semantic search algorithms are all things you might rely on if you want to build something big.

Downloading the web

It's all just scraping

Spiders are the small applications you are going to be writing. Usually they are self-contained, CLI-friendly scripts which have some internal logic for extracting information from a specific website or websites. As an example, the script might go to a website’s homepage, download all the category pages, download the product list for each of them and extract a list of products in the store.

If you are a Python guy, you might want to look at Twisted or Scrapy, the latter being very easy to use. If it’s PHP you are using, a combination of cURL and libxml will allow you to do the same; I’m not aware of any PHP frameworks for this. For any other language, you should have a look on Google.

Depending on your task you will need to support different functionality. If the website is for logged-in users only, you should configure cURL to use a cookie jar and start the scraping with a request to the login page. If you need to extract thousands of documents, have some logic to pause and resume the script, so if it crashes it can start from the last completed document rather than from the beginning. In any case, try to replicate natural user behaviour on the site.
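
Both ideas fit in a few lines of PHP. Treat this as a sketch, not as our production code: the function names are made up, and the checkpoint is just a text file with one finished URL per line. The cURL part needs the curl extension.

```php
<?php
// Sketch only: function names and file layout here are placeholders.

// Fetch a page, sharing cookies between requests through a cookie jar file,
// so a session started by a login request is reused afterwards.
function fetchWithCookies(string $url, string $cookieFile, ?array $postFields = null): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // and read from here on the next request
    ]);
    if ($postFields !== null) {
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException(curl_error($ch));
    }
    return $body;
}

// Resume support: remember finished URLs in a plain text file, one per line.
function pendingUrls(array $urls, string $checkpointFile): array
{
    $done = is_file($checkpointFile)
        ? file($checkpointFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : [];
    return array_values(array_diff($urls, $done));
}

function markDone(string $url, string $checkpointFile): void
{
    file_put_contents($checkpointFile, $url . "\n", FILE_APPEND | LOCK_EX);
}
```

A spider would first call fetchWithCookies on the login page with the form fields, then loop over pendingUrls(...), calling markDone after each document is stored. If it crashes, the next run simply skips what is already in the checkpoint file.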

Is it legal? Depends. There is no strict answer and it varies with what data you are trying to extract. Some data can be copyrighted, for example original texts, so if you are scraping them and showing them on your website - you are being a bad person. Stop! Ideally you should discuss this with your lawyer, which we did, and get some thoughts on how to proceed.

Getting blocked

One of the decisions you will need to make is how you are going to identify the spider - you can either replicate a normal browser’s headers or introduce the spider by its name (e.g. ’googlebot’). The first will allow you to stay undetected, probably, while the latter is considered to be the correct way. From my experience, for anything small Firefox headers will work just fine.

Websites might still decide to block you though, and it’s something you might want to be prepared for. If you are identifying the spider by a name, you should respect robots.txt and stop crawling if you are being denied by that file. However the most likely blocking mechanism is to block your IP address, which is going to happen if you are being stupid. Really stupid.

You see, when people are browsing the web they request one page every 3 or 4 seconds, hence if you have a list of 1000 URLs to download and you just start iterating over them and issuing requests… Well, you are easy to catch by just looking at the access logs. Don’t do this. Rather have a queue of URLs to download and issue requests with a random delay in the range of 1 to 5 seconds. It’s going to take longer, but it will help to avoid problems.
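
The queue with random delays can be sketched like this. The names are hypothetical, and the fetch and sleep steps are passed in as callables so you can try the loop without hitting a real website; in a real spider the fetch callable would do the HTTP request and the default sleep() would be used.

```php
<?php
// Sketch of a polite download loop. $fetch and $sleep are injected callables
// (hypothetical names) so the loop can be tried without touching the network.
function crawlPolitely(array $urls, callable $fetch, ?callable $sleep = null, int $minDelay = 1, int $maxDelay = 5): array
{
    $sleep = $sleep ?? 'sleep';
    $queue = new SplQueue();
    foreach ($urls as $url) {
        $queue->enqueue($url);
    }

    $pages = [];
    while (!$queue->isEmpty()) {
        $url = $queue->dequeue();
        $pages[$url] = $fetch($url);
        if (!$queue->isEmpty()) {
            // Wait a random number of seconds before the next request.
            $sleep(random_int($minDelay, $maxDelay));
        }
    }
    return $pages;
}

$pages = crawlPolitely(
    ['http://example.com/1', 'http://example.com/2'],
    function (string $url): string { return "contents of $url"; },     // stand-in for a real HTTP request
    function (int $seconds): void { echo "would wait {$seconds}s\n"; } // stand-in for sleep()
);
```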

This doesn’t scale though, you might say. And in fact you are right, because a delay of up to 5 seconds between requests limits the amount of content you can download per day. Luckily for you, I have a tip here too - use proxy servers. It’s going to require writing a request scheduler, but if you need to download the same 1000 URLs you might as well distribute them over a list of proxies, each with its own delay times. The amount of content you can download increases linearly with the number of proxies you have.
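
The simplest possible scheduler just deals the URLs out round-robin. This is a sketch under that assumption; the proxy addresses are placeholders and the function name is made up.

```php
<?php
// Sketch: spread URLs over proxies round-robin. The proxy addresses are
// placeholders; each proxy's list can then be crawled with its own delays.
function assignToProxies(array $urls, array $proxies): array
{
    $proxies = array_values($proxies);
    if ($proxies === []) {
        throw new InvalidArgumentException('At least one proxy is required');
    }
    $plan = array_fill_keys($proxies, []);
    foreach (array_values($urls) as $i => $url) {
        $plan[$proxies[$i % count($proxies)]][] = $url;
    }
    return $plan;
}

$plan = assignToProxies(
    ['http://example.com/1', 'http://example.com/2', 'http://example.com/3'],
    ['10.0.0.1:8080', '10.0.0.2:8080']
);
print_r($plan);
// With cURL, a request would go through a proxy via:
// curl_setopt($ch, CURLOPT_PROXY, '10.0.0.1:8080');
```

As a rough calculation: with an average delay of 3 seconds, 1000 URLs take about 50 minutes from a single address, while 10 proxies working in parallel on 100 URLs each need about 5 minutes.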

Extracting the data

Once you have the HTML you want to process (to extract links to follow or to extract actual data), you might wonder how to actually do it. There are a couple of ways and libraries for this, however if you want to keep it simple, using XPath or CSS-like queries is going to work just fine. If you feel like it, and believe me sometimes there is no other way, you might go with regular expressions for this, but that’s got problems I’m going to talk about in just a second.

I tend to go with XPath because it’s very easy to write and to debug. Furthermore, there are various extensions you can get for your browser which allow creating those queries and testing them on the actual website. I have worked on spiders for over 100 different websites and XPath worked fine all the time, as long as…
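
For a taste of what this looks like in PHP, here is a minimal sketch using the DOM extension. The markup and the products class name are invented for the example.

```php
<?php
// Sketch: pull product names and links out of a page with XPath.
// The markup and class names are invented; requires the DOM extension.
$html = <<<HTML
<html><body>
  <ul class="products">
    <li><a href="/p/1">Red shoes</a></li>
    <li><a href="/p/2">Blue shoes</a></li>
  </ul>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$products = [];
foreach ($xpath->query('//ul[@class="products"]/li/a') as $link) {
    $products[] = [
        'name' => trim($link->textContent),
        'url'  => $link->getAttribute('href'),
    ];
}
print_r($products);
```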

The problem you will need to solve is how to process invalid HTML or XHTML markup. And from my experience, I’m yet to see a website with all pages being 100% valid. The more invalid it is, the harder it is to fix those problems. There are libraries though, most famously BeautifulSoup, which will try to process invalid markup. They do have performance implications, but keep them in mind because you won’t be able to issue XPath queries on invalid syntax.
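
In PHP, one option (an assumption on my side, other tools exist) is DOMDocument::loadHTML(), which relies on libxml’s forgiving HTML parser. The sketch below feeds it a list with missing closing tags and collects the parser warnings instead of printing them; the markup is invented.

```php
<?php
// Sketch: DOMDocument::loadHTML() uses libxml's forgiving HTML parser,
// so moderately broken markup can still be queried. Invented markup below.
$broken = '<ul><li><a href="/a">First</a><li><a href="/b">Second</a></ul>';

$previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($broken);
$problems = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$hrefs = [];
foreach ($xpath->query('//li/a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}
print_r($hrefs);
printf("%d parser warnings collected\n", count($problems));
```

Keeping an eye on the number of collected warnings per page is a cheap signal that a site’s markup has got worse and your queries deserve a second look.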

Now let’s get back to regular expressions. In theory they might look awesome, because they can extract data even if the HTML markup is invalid, however the problem is that they soon get complicated and very easy to break. XPath queries work on a DOM tree, hence if the website structure changes they just stop working completely. Regexps on the other hand might still work, but produce very unpredictable results.

Conclusion

We, as a company, have a lot of experience scraping data from the web and it’s actually very, very easy. As long as you follow the logical rules and don’t try to over-complicate the data extraction, you could easily extract all news items, products or blog posts in 30 or so lines of spider code. I can talk about this for hours or days, so I might write more on this soon, because this is just the tip of the iceberg.
