Wingify
Engineering

We, at Wingify, handle not just our own traffic but also the traffic of major websites, such as Microsoft, AMD, Groupon, and WWF, that implement Visual Website Optimizer (VWO) for website optimization. VWO allows users to A/B test their websites and optimize conversions. With an intuitive WYSIWYG editor, you can easily make changes to your website and create multiple variations to A/B test. When a visitor lands on your website, VWO selects one of the variations created in the running campaign(s), and the JavaScript library makes the required modifications to render the selected variation based on the URL the visitor is on. Furthermore, VWO collects analytics data for every visitor interaction with the website and generates detailed reports to help you understand your audience's behavior and gain deeper insight into your business results.

Here is a very high-level overview of what goes on behind the scenes:

How it started

Back in the day, we had one server deployed in the United States running the standard LAMP stack. The server stored all changes made to a website using the VWO app, served our static JS library, collected analytics data, captured visitor data, and saved it all in a MySQL database.

This implementation worked perfectly for us initially, when we were serving a limited number of users. However, as our user base kept growing, we had to deploy additional load balancers and Varnish cache servers (eight of them, each with 32 GB of RAM) to make sure that we cached the content for every requested URL and served it back in the least possible time.

Gradually, we started using these servers only for serving JS settings and collecting analytics data, and moved to Amazon's CloudFront CDN for serving the static JS library.

Issues we faced

This worked great for a while, until our traffic crossed 1,000 requests per second. With so much traffic coming in and an increasing number of unique URLs being tested, the system started failing. We experienced frequent cache misses, and Varnish required more and more RAM to cope with the new requirements. We knew we had hit a wall and quickly realized that it was time for us to stop everything, get our thinking caps back on, and redesign the architecture. We now needed a scalable system that was easier to maintain and would cater to the needs of our users from various geo locations.

The new requirements

Today, VWO uses a Dynamic CDN built in-house that can cater to users based in any part of the world. The current implementation offers us the following advantages over other available CDNs:

Implementation challenges and technicalities

The core issue we had to resolve was to avoid sending the same response for every request coming from a domain or a particular account. In the old implementation, we were serving the JSON for all the campaigns running in an account, irrespective of whether a campaign was actually running on that URL. This loaded unnecessary JS code that might not be useful for a particular URL, thereby increasing the load time of the website. We knew how crucial page-load time is for online businesses and how directly it impacts their revenue: users are far less likely to make purchases on a slow-loading website than on a fast-loading one.

It is important to make sure that we only serve relevant content based on the URL of the page. There are two ways to do this:

With the approach identified, we started looking for nodes that could do everything for us - generate dynamic JSON on the basis of the request, serve the static JS library, and handle data acquisition. Another challenge was to make these nodes part of a distributed system spread across different geographies, with no dependency on one another, while making sure that each request is served from the closest location instead of only from nodes in the US. We had written a blog post earlier to explain this to our customers. Read it here.

OpenResty (aka ngx_openresty), our current workhorse, is a full-fledged web application server created by bundling the standard Nginx core with various third-party Nginx modules and their external dependencies. It also bundles Lua modules that allow writing URL handlers in Lua; the Lua code runs within the web server.
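As a concrete (and hypothetical) illustration of how this fits together, an OpenResty deployment declares a shared dictionary in nginx.conf and attaches a Lua file as the handler for a location. The endpoint, file path, and dictionary name below are ours for illustration only, not VWO's actual configuration.

http {
    # in-memory store shared by all Nginx worker processes
    lua_shared_dict campaign_settings 128m;

    server {
        listen 80;

        # hypothetical endpoint that returns campaign settings as JSON
        location /settings {
            default_type application/json;
            content_by_lua_file /etc/nginx/lua/settings.lua;
        }
    }
}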

We have gone from one server running Apache + PHP, to multiple nodes involving Nginx (load balancer) -> Varnish (cache) -> Apache + PHP (for cache misses + data collection), to the current system where each node is itself capable of handling every type of request. We serve our static JS library and the JSON settings for every campaign, and we also use these servers for analytics data acquisition.

The following section briefly describes the new architecture of our CDN and how VWO servers handle requests:

  1. We use the Nginx-Lua shared dictionary, an in-memory store shared among all the Nginx worker processes, to store campaign-specific data. Memcached is the first fallback if we have to restart the OpenResty server (which resets the shared dictionary), and our second fallback is our central MySQL database. If a lookup misses at any level, the data is fetched from the layer below and written back to all the levels above so that it is available for the next request. (A simplified Lua sketch of this layered lookup appears after this list.)
  2. Once a request hits our server to fetch the JSON for the campaigns running on a webpage, VWO runs a regex match of the requested URL against the list of URL regex patterns stored in the Nginx-Lua shared dictionary (the key being the account ID; an O(1) lookup, FAST!). This returns the list of campaign IDs valid for the requested URL. All the regex patterns are compiled and cached at the worker-process level and shared among all requests.
  3. Next, VWO looks up the campaign IDs (returned after matching the requested URL) in the Nginx-Lua shared dictionary, with the account ID and campaign ID as the key (again an O(1) lookup). This returns the settings for all matching campaigns, which are then combined and sent in the response along with some additional request-specific data such as geo-location data, third-party integration code, etc. We ensure that the caching layer never holds stale data and is updated within a few milliseconds, which gives us an advantage over the cache-invalidation times of most available CDNs.
  4. To ensure that each request is served from the server closest to the visitor, we use a managed DNS service (DynECT) that keeps a check on the response times from various POPs and replies with the best possible server IPs (in terms of both health and distance). This helps us ensure a failsafe delivery network.
  5. To capture analytics data, all data related to visitors, conversions, and heatmaps is sent to these servers. We use OpenResty with Lua for collecting all incoming data. Everything received at the OpenResty end is pushed to a Redis server running on each of these machines; Redis absorbs the writes as fast as possible, reducing the chance of data loss. Next, we move the data from the Redis servers to a central RabbitMQ, where it is consumed by multiple consumers in various ways and stored in multiple places for different purposes. You can check our previous post, Scaling with Queues, to understand more about our data acquisition setup.
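To make the first three steps concrete, here is a simplified Lua sketch of the layered lookup and URL matching. It only illustrates the approach: the dictionary name, key formats, pattern fields, and the memcached address are hypothetical, and the final MySQL fallback is only indicated in a comment.

-- Runs inside an OpenResty worker; names below are illustrative, not VWO's actual code.
local cjson     = require "cjson"
local memcached = require "resty.memcached"        -- first fallback after the shared dict

local dict = ngx.shared.campaign_settings           -- declared with lua_shared_dict in nginx.conf

-- Fetch URL patterns for an account: shared dict -> memcached -> central MySQL.
local function get_url_patterns(account_id)
    local key = "patterns:" .. account_id
    local cached = dict:get(key)                    -- O(1), in-memory, shared by all workers
    if cached then
        return cjson.decode(cached)
    end

    local memc = memcached:new()
    memc:set_timeout(50)                            -- keep fallbacks fast
    if memc:connect("127.0.0.1", 11211) then
        local val = memc:get(key)
        if val then
            dict:set(key, val)                      -- repopulate the layer above
            return cjson.decode(val)
        end
    end

    -- Last resort: query the central MySQL database (omitted here) and write the
    -- result back to memcached and the shared dict for the next request.
    return nil
end

-- Steps 2 and 3: match the requested URL, then pull settings per matching campaign.
local function settings_for_url(account_id, url)
    local settings = {}
    for _, p in ipairs(get_url_patterns(account_id) or {}) do
        -- the "o" flag compiles and caches the regex at the worker-process level
        if ngx.re.match(url, p.regex, "jo") then
            local s = dict:get(account_id .. ":" .. p.campaign_id)
            if s then
                settings[#settings + 1] = cjson.decode(s)
            end
        end
    end
    return settings
end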

As our customer base and traffic keep growing, we will be able to judge our system better: how well it scales and what problems it has. And as VWO grows and becomes better and better, we will keep working on our current infrastructure to improve it and adapt it to our needs. We would like to thank agentzh (Yichun Zhang) for building OpenResty and for helping us out whenever we were stuck with our implementation.

We work in a dynamic environment where we collaborate and work towards architecting scalable and fault-tolerant systems like these. If these kinds of problems and challenges interest you, we would be happy to work with you. We are hiring!


We are excited to announce our sponsorship of The Fifth Elephant - a popular conference around the Big Data ecosystem. The conference will be held in Bangalore, India from 23rd to 26th July.

Our engineers will be present at the conference. If you are interested in our work, want to know more about what we are doing, want to work with us (we’re hiring), get some cool goodies or just want to say Hi!, please visit our booth (B7) or catch any of our team members. We’d love to talk to you!

We look forward to meeting you in Bangalore!


In November last year, I started developing an infrastructure that would allow us to collect, store, search and retrieve high volume data. The idea was to collect all the URLs on which our homegrown CDN would serve JS content. Based on our current traffic, we were looking to collect some 10k URLs per second across four major geographic regions where we run our servers.

In the beginning we tried MySQL, Redis, Riak, CouchDB, MongoDB, and ElasticSearch, but nothing worked out for us at that rate of high-speed writes. We also wanted our system to respond very quickly: under 40 ms between internal servers on our private network. This post talks about how we built such a system using C++11, RocksDB, and Thrift.

First, let me share the use cases of such a system in VWO. The following screenshot shows a feature where users can enter a URL to check whether VWO Smart Code is installed on it.


VWO Smart Code checker

The following screenshot shows another feature where users can see a list of URLs matching a complex wildcard pattern, regex pattern, string rule etc. while creating a campaign.


VWO URL Matching Helper

I reviewed several open-source databases but none of them fit our requirements except Cassandra. In a clustered deployment, reads from Cassandra were too slow, and they got slower as the data size grew. After understanding how Cassandra works under the hood, in particular its LevelDB-like log-structured storage, I started playing with open-source embeddable databases that use a similar approach, such as LevelDB and Kyoto Cabinet. Around that time I found RocksDB, an embeddable persistent key-value store library built on LevelDB. It was open-sourced by Facebook and had a fairly active developer community, so I started playing with it. I read the project wiki, wrote some working code, and joined their Facebook group to ask questions about prefix lookup. The community was helpful, especially Igor and Siying, who gave me enough hints around prefix lookups, custom prefix extractors, and bloom filters to help me write something that actually worked in our production environment for the first time. Explaining the technology and jargon is out of scope for this post, but I would encourage readers to read about LevelDB and to go through the RocksDB wiki.


RocksDB FB Group

For capturing the URLs at a peak velocity of up to 10k serves/second, I reused our distributed queue-based infrastructure. For storage, search, and retrieval of URLs, I wrote a custom datastore service called HarvestDB using C++, RocksDB, and Thrift. Thrift provides the RPC mechanism, so the system runs as a distributed service accessible to various backend sub-systems. The backend sub-systems use client libraries generated by the Thrift compiler to communicate with the HarvestDB server.

The HarvestDB service implements five remote procedures - ping, get, put, search and purge. The following Thrift IDL describes this service.

namespace cpp harvestdb
namespace go harvestdb
namespace py harvestdb
namespace php HarvestDB

struct Url {
    1: required i64    timestamp;
    2: required string url;
    3: required string version;
}

typedef list<Url> UrlList

struct UrlResult {
    1: required i32          prefix;
    2: required i32          found;
    3: required i32          total;
    4: required list<string> urls;
}

service HarvestDB {
    bool ping(),
    Url get(1:i32 prefix, 2:string url),
    bool put(1:i32 prefix, 2:Url url),
    UrlResult search(1:i32 prefix,
                     2:string includeRegex,
                     3:string excludeRegex,
                     4:i32 size,
                     5:i32 timeout),
    bool purge(1:i32 prefix, 2:i64 timestamp)
}

Clients use ping to check HarvestDB server connectivity before executing other procedures. RabbitMQ consumers consume the collected URLs and put them into HarvestDB. The PHP-based application backend uses a custom Thrift client library to get (read) and search URLs. A Python program runs as a periodic cron job and uses the purge procedure to remove old entries based on timestamp, which makes sure we don't exhaust our storage resources. The system has been in production for more than five months now and can handle a (benchmarked) workload of up to 24k writes/second while consuming less than 500 MB of RAM. Our future work will focus on replication, sharding, and fault tolerance for this service. The following diagram illustrates this architecture.


Overall architecture

Discussion on Hacker News


e2e (end-to-end) or UI testing is a methodology used to test whether the flow of an application performs as designed from start to finish. In simple words, it is testing your application from the user's end, where the whole system is a black box with only the UI exposed.

It can become quite an overhead if done manually and if your application has a large number of interactions/pages to test.

In the rest of the article I'll talk about using WebDriverJS and Jasmine to automate your e2e testing, a combination that isn't talked about much on the web.

What is WebDriverJS?

This was something that took me quite some time to wrap my head around, and I feel this was mostly due to the various options available for everything related to WebDriver.

So let's take it from the top and see what it's all about.

Selenium

As mentioned on the selenium website, Selenium automates browsers. That’s it.

Having support for almost all major browsers, it is a very good option for automating our tests in the browser. Whatever you do in the browser while testing your application (navigating to pages, clicking a button, typing text into input boxes, submitting forms, and so on) can be automated using Selenium.

WebDriver

WebDriver (or Selenium 2) basically refers to the language bindings and the implementations of the individual browser controlling code.

WebDriver introduces a JSON wire protocol for various language bindings to communicate with the browser controller.

For example, to click an element in the browser, the binding will send a POST request to /session/:sessionId/element/:id/click

So, at one end there is the language binding, and at the other a server, known as the Selenium server. The two communicate using the JSON wire protocol.

WebDriverJS

As mentioned, WebDriver has a number of bindings for various languages like Ruby, Python, etc. JavaScript, being the language of the web, is the latest one to make it to the list. Enter WebDriverJS!

So, as you might guess, WebDriverJS is simply a wrapper over the JSON wire protocol, exposing high-level functions to make our life easy.

Now if you search for WebDriver JS on the web, you'll come across two different bindings, namely selenium-webdriver and webdriverjs (yeah, lots of "driver"), both available as node modules. You can use either one you like, though we'll stick to the official one, i.e. selenium-webdriver.

Say you have a JavaScript project you want to automate e2e testing on. Installing the bindings is as simple as doing:

npm install selenium-webdriver

Done! You can now require the package and with a lil’ configuration you can open any webpage in the browser:

var webdriver = require('selenium-webdriver');

var driver = new webdriver.Builder().
    withCapabilities(webdriver.Capabilities.chrome()).
    build();

driver.get('http://www.wingify.com');

To run your test file, all you do is:

node testfile.js

Note: In addition to the npm package, you will need to download the WebDriver implementations you wish to use. As of 2.34.0, selenium-webdriver natively supports ChromeDriver. Simply download a copy and make sure it can be found on your PATH. The other drivers (e.g. Firefox, Internet Explorer, and Safari) still require the standalone Selenium server.

Difference from other language bindings

WebDriverJS has an important difference from the bindings in other languages: it is asynchronous.

So if you had done the following in Python:

// pseudo code
driver.get(page1);
driver.click(E1);

Both statements would have executed synchronously, as the Python API (like every other language binding) is blocking. But that isn't the case with JavaScript. To maintain the required sequence between actions, WebDriverJS uses promises. In short, a promise is an object that executes whatever you hand it once the underlying operation has finished.

But it doesn’t stop here. Even with promises, the above code would have become:

// pseudo code
driver.get(page1).then(function () {
	driver.click(E1);
});

Do you smell callback hell in there? To make it neater, WebDriverJS has a wrapper over promises called ControlFlow.

In simple words, ControlFlow prevents callback hell by maintaining a queue of scheduled actions: every WebDriver command is implicitly appended to the queue and runs only after the previous one has finished, so you don't have to chain the then() calls yourself.

And so, it enables us to simply do:

// pseudo code
driver.get(page1);
// Implicitly add to previous action's then()
driver.click(E1);

Isn’t that awesome!

ControlFlow also provides an execute function to push your own function into the execution queue; the value returned by that function is used to resolve or reject that particular task. So you can use promises and do any asynchronous work in your custom code:

var flow = webdriver.promise.controlFlow();

flow.execute(function () {
	// do_anything_async() is a placeholder for any asynchronous operation
	var d = webdriver.promise.defer();
	do_anything_async().then(function (val) {
		d.fulfill(val);
	});
	// the returned promise resolves (or rejects) this task in the flow
	return d.promise;
});

Quick tip: Documentation for the JavaScript bindings isn't that readily available (at least I couldn't find it), so the best resource I found was the actual WebDriverJS code. It is heavily commented and is very handy when looking for specific methods on the driver.

Combining WebDriverJS with Jasmine

Our browser automation is set up with Selenium. Now we need a testing framework to handle our tests. That is where Jasmine comes in.

You can install jasmine-node for your JavaScript project through npm:

npm install jasmine-node

If we were to convert our earlier testfile.js to check for the correct page title, here is what it might look like:

var webdriver = require('selenium-webdriver');

var driver = new webdriver.Builder().
    withCapabilities(webdriver.Capabilities.chrome()).
    build();

describe('basic test', function () {
	it('should be on correct page', function () {
		driver.get('http://www.wingify.com');
		driver.getTitle().then(function(title) {
			expect(title).toBe('Wingify');
		});
	});
});

Now the above file needs to be run with jasmine-node, like so:

jasmine-node testfile.js

This will fire up the browser and perform the mentioned operations, but you'll notice that Jasmine won't give any results for the test. Why?

Well… that happens because Jasmine finishes executing before any expect statement runs, since the expectation is inside the asynchronous callback of the getTitle function.

To handle such asynchronicity in our tests, jasmine-node provides a way to mark a particular it block as asynchronous: the specification (the it function) accepts a done callback, and Jasmine waits until done() is called. So here is how we fix the above code:

var webdriver = require('selenium-webdriver');

var driver = new webdriver.Builder().
    withCapabilities(webdriver.Capabilities.chrome()).
    build();

describe('basic test', function () {
	it('should be on correct page', function (done) {
		driver.get('http://www.wingify.com');
		driver.getTitle().then(function(title) {
			expect(title).toBe('Wingify');
			// Jasmine waits for the done callback to be called before proceeding to next specification.
			done();
		});
	});
});

Quick tip: You might want to tweak the time allowed for tests to complete in Jasmine like so:

jasmine.getEnv().defaultTimeoutInterval = 10000; // in milliseconds.

Bonus for Angular apps

The Angular framework has been very testing-focused from the very beginning. Needless to say, the team has devoted a lot of time to e2e testing as well.

Protractor is a library by the Angular team that wraps WebDriverJS and Jasmine and is specifically tailored to make testing Angular apps a breeze.

Check out some of the neat add-ons it gives you:

  1. Apart from querying elements by id, CSS selector, XPath, etc., it lets you query on the basis of binding, model, repeater, etc. Sweet!

  2. It has Jasmine's expect function patched to accept promises. So, for example, our previous test where we were checking the title:

driver.getTitle().then(function (title) {
	expect(title).toBe('Wingify');
});

can be refactored to a much cleaner:

expect(driver.getTitle()).toBe('Wingify');

And more such cool stuff to make end-to-end testing for Angular apps super-easy.

In the end

e2e testing is important for the apps being written today, and hence it is important that it be automated while remaining fun and easy to perform. There are numerous tools available to choose from, and this article talked about one such combination.

Hope this helps you get started. So what are you waiting for, let's write some end-to-end tests!

Using an e2e testing stack you want to share? Let us know in the comments.


Our home-grown CDN, built on a geo-distributed architecture, allows us to deliver dynamic JavaScript content with the minimum latencies possible. We use the same architecture for data acquisition as well. Over the years we have made a lot of changes to our backend; this post talks about some scaling and reliability aspects and our recent work on building a fast and reliable data acquisition system using message queues, which has been in production for about three months now. I'll start by giving some background on our previous architecture.

Web beacons are widely used for data acquisition: the idea is to have a webpage send us data via an HTTP request, to which the server replies with some valid object. There are many ways to do this. To keep the size of the returned object small, we return a tiny 1x1-pixel GIF image for every HTTP request, and our geo-distributed architecture, along with our managed Anycast DNS service, helps us do this with very low latencies; we aim for less than 40 ms. When an HTTP request hits one of our data acquisition servers, OpenResty handles it and our Lua-based code processes the request in the same process thread. OpenResty is an Nginx distribution which, among many other things, bundles LuaJIT, allowing us to write URL handlers in Lua that run within the web server. Our Lua code does some quick checks and transformations and writes the data to a Redis server, which is used as a fast in-memory data sink. The data stored in Redis is later moved, processed, and stored in our database servers.
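Here is a rough sketch of what such a beacon handler can look like inside an OpenResty content_by_lua handler. The Redis list name and payload handling are hypothetical; only the overall flow (quick checks, push to a local Redis sink, return a 1x1 GIF) mirrors what is described above.

local redis = require "resty.redis"

-- the well-known 1x1 transparent GIF, kept small so the response stays tiny
local TRACKING_GIF = ngx.decode_base64(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

-- visitor/conversion data arrives as query parameters on the beacon URL
local args = ngx.req.get_uri_args()

-- (quick checks and transformations on `args` would happen here)

local red = redis:new()
red:set_timeout(100)                          -- Redis is a local, fast in-memory sink
local ok, err = red:connect("127.0.0.1", 6379)
if ok then
    -- push the raw event onto a (hypothetical) list; it is drained and processed later
    red:lpush("beacon_events", ngx.encode_args(args))
    red:set_keepalive(10000, 100)             -- return the connection to the pool
else
    ngx.log(ngx.ERR, "redis connect failed: ", err)
end

-- always answer with the tiny image so the visitor's page is never blocked
ngx.header["Content-Type"] = "image/gif"
ngx.print(TRACKING_GIF)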


Previous Architecture

This was the architecture when I joined Wingify a couple of months ago. Things were going smoothly, but the problem was that we were not quite sure about data accuracy and scalability. We used Redis as a fast in-memory data sink, from which our custom-written PHP-based queue infrastructure would read; our backend would process the data and write it to our database servers. The PHP code was not scalable, and after about a week of hacking and exploring options we found a few bottlenecks and decided to redo the backend queue infrastructure.

We explored many options and decided to use RabbitMQ. We wrote a few proof-of-concept backend programs in Go, Python and PHP and did a lot of testing, benchmarking and real-world load testing.

Ankit, Sparsh and I discussed how we should move forward and we finally decided to explore two models in which we would replace the home-grown PHP queue system with RabbitMQ. In the first model, we wrote directly to RabbitMQ from the Lua code. In the second model, we wrote a transport agent which moved data from Redis to RabbitMQ. And we wrote RabbitMQ consumers in both cases.

There was no lua-resty library for RabbitMQ, so I wrote one using the cosocket API that could publish messages to a RabbitMQ broker over the STOMP protocol. The library, lua-resty-rabbitmqstomp, was open-sourced for the hacker community.
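For reference, publishing with the library looks roughly like the sketch below. This is a hedged example based on the usual lua-resty-* client conventions (new/connect/send/set_keepalive); the credentials, destination, and payload are made up, and the library's README remains the authoritative source for its exact API.

local rabbitmq = require "resty.rabbitmqstomp"
local cjson    = require "cjson"

local mq, err = rabbitmq:new({ username = "guest", password = "guest", vhost = "/" })
if not mq then
    ngx.log(ngx.ERR, "failed to create STOMP client: ", err)
    return
end
mq:set_timeout(10000)

-- 61613 is the default port of the RabbitMQ STOMP adapter
local ok, err = mq:connect("127.0.0.1", 61613)
if not ok then
    ngx.log(ngx.ERR, "STOMP connect failed: ", err)
    return
end

-- destination and message are illustrative only
local headers = {
    destination      = "/exchange/beacons/data",
    persistent       = "true",
    ["content-type"] = "application/json",
}
ok, err = mq:send(cjson.encode({ account = 123, url = "http://example.com/" }), headers)
if not ok then
    ngx.log(ngx.ERR, "publish failed: ", err)
end

mq:set_keepalive(10000, 100)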

Later, I rewrote our Lua handler code using this library and ran a loader.io load test; we performed the test on a small 1 GB DigitalOcean instance for both models. This first model failed due to very low throughput: for us, the STOMP protocol and the slow RabbitMQ STOMP adapter were the performance bottlenecks. RabbitMQ was not as fast as Redis, so we decided to keep Redis and work on the second model. For our requirements, we wrote a proof-of-concept Redis-to-RabbitMQ transport agent called agentredrabbit to leverage Redis as a fast in-memory storage sink and RabbitMQ as a reliable broker. The POC worked well in terms of performance, throughput, scalability, and failover. In the next few weeks we were able to write a production-level, queue-based pipeline for our data acquisition system.

For about a month, we ran the new pipeline in production against the existing one, to A/B test our backend :) To do that, we modified our Lua code to write to two different Redis lists: the original list was consumed by the existing pipeline, and the other by the new RabbitMQ-based pipeline. The new consumer would process the data and write it to a new database. This allowed us to compare real-time data from the two pipelines. During this period we tweaked our implementation a lot, rewrote the producers and consumers three times, and went through two major phases of refactoring.


A/B testing of existing and new architecture

Based on the results against a 1 GB DigitalOcean instance (as with the first model) and on the real-time A/B comparison with the existing pipeline, we migrated to the new RabbitMQ-based pipeline. Other concerns such as HA, redundancy, and failover were addressed in this migration as well. The new architecture has no single point of failure and has mechanisms to recover from failures and faults.


Queue (RabbitMQ) based architecture in production

We have open-sourced agentredrabbit, which can be used as a general-purpose, fast, and reliable transport agent for moving data in chunks from Redis lists to RabbitMQ, subject to some assumptions and queue-naming conventions. The flow diagram below hints at how it works; check out the README for details.


Flow diagram of "agentredrabbit"


Discussion on Hacker News