DOM Comparator is a JavaScript library that analyzes and compares two HTML strings and returns a diff of the two: an array of operation objects describing the changes.

DOM Comparator on GitHub

Here’s a simple example:

var stringA = '<ul><li class="active">list item 1</li><li>list item 2</li></ul>';
var stringB = '<ul><li>list item 1</li><li>list item 2</li></ul>';

// Compare the two strings
var result = VWO.DOMComparator.create({
    stringA: stringA,
    stringB: stringB
});

// Expect an array of VWO.Operation objects to be returned,
// the first one of which looks like below:
expect(result[0]).toEqual({
    name: 'removeAttr',
    selectorPath: 'UL:first-child > LI:first-child',
    content: {
        class: 'active'
    }
});

Motivation

The Campaign Builder is one of the core components of our A/B testing software, VWO. It allows you to make changes to any website on the fly. Assuming the target website has a small snippet of VWO Smart Code (JavaScript) inserted, the changes made by the user are applied when the A/B test is run. These changes are little snippets of jQuery operations that are applied on the client side.

One of the major problems with applying such changes was that they did not account for dynamic content that might have been rendered by the client website’s backend. Let us consider a simple example:

Imagine somebody wanting to run an A/B test on all the product pages of an eCommerce website. He wants to modify the “Add to Cart” button on all such pages and make it appear bigger and bolder, so that it captures the end-user’s attention better. He navigates to some product page, selects the button and tries to edit it. Assume that the button has markup like the following:

<a href="javascript:addToCart(16);" class="add_to_cart">Add to Cart</a>

The Campaign Builder provides an “Edit” operation that opens a rich text editor for the user to make changes to any element with ease. Assuming he makes the text of the button bolder and changes its color to a bright red, here’s what the resulting markup would look like:

<a href="javascript:addToCart(16);" class="add_to_cart" style="font-weight:bold;color:red;">Add to Cart</a>

Internally, an Edit operation is identified by the element the operation is applied on and the new markup provided by the user, which in this case is the code above. This means that wherever the “Add to Cart” button is found, it will be replaced with that code. The jQuery code for such an operation would look something like this:

// A unique selector path to identify the element
var selector = '#product_description > P:first-child + P > A:first-child';
$(selector).replaceWith('<a href="javascript:addToCart(16);" class="add_to_cart" style="font-weight:bold;color:red;">Add to Cart</a>');

Notice how this would not only add the styles to that element, but also change its href to always execute addToCart(16); regardless of the product page the user is on. Essentially, the dynamic content rendered by the client’s backend has now been replaced with static content.

DOM Comparator to the Rescue

With DOM Comparator in place, the initial markup of the Edit operation above is compared with the final one, and a difference is returned. The difference contains the minimal changes that need to be made to the target element, thereby impacting dynamic content as little as possible.

For the above example, here’s what the list of resulting operations would look like:

[{
    "name": "css",
    "selectorPath": "#product_description > P:first-child + P > A:first-child",
    "content": {
        "font-weight": "bold",
        "color": "red"
    }
}]
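
In practice, each such operation maps onto a small jQuery call on the client. The dispatcher below is only an illustrative sketch; the applyOperation helper and the mapping from operation names to jQuery methods are assumptions made for this example, not DOM Comparator’s actual applier:

// Illustrative only: apply DOM Comparator operations using jQuery.
// applyOperation and the name-to-method mapping are hypothetical, not part of the library.
function applyOperation(op) {
    var $el = $(op.selectorPath);
    switch (op.name) {
        case 'css':        // content is a map of style properties, e.g. { "color": "red" }
            $el.css(op.content);
            break;
        case 'removeAttr': // content lists the attributes to remove, e.g. { "class": "active" }
            Object.keys(op.content).forEach(function (attr) {
                $el.removeAttr(attr);
            });
            break;
        // ...other operation names would be handled along the same lines
    }
}

// result is the array returned by VWO.DOMComparator.create()
result.forEach(applyOperation);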

Live Demo

Click here to view a live demo.

What’s Next

The library is currently in a pre-alpha state. It works well for a good number of cases, but not for many others, and for certain complex cases it may not be performant enough.

Our current plans focus on improving the library according to the following priorities:

  • Correctness: For almost all cases, the first priority is to get the output as close to the expected result as possible. This has been our prime focus thus far.
  • Performance: Once the common cases produce correct output, the next task is to profile and optimize for performance. Since tree comparison is a fairly complex operation, we will be looking into possibilities like spawning a worker to perform the heavy tasks, or delegating the comparison to a Node.js server.
  • Readability: For a complex algorithm, it is equally important for the code to be readable. In the coming versions, certain complex logic, especially in the classes VWO.DOMMatchFinder and VWO.StringComparator, will be refactored with readability in mind.
  • Documentation: Writing documentation is as hard as writing code, if not harder, as I have realized while documenting this project. Over time, we will spend time improving the documentation, and also release a reference manual for the classes used.

Contributing

If you are interested in contributing to the project, we would love to hear from you. Just fork the repository and submit a pull request.

Further Reading

Head over to the documentation if you’d like to know more.


The Fifth Elephant is a popular conference in India around the Big Data ecosystem. It took place last week in Bangalore, and we were proud to sponsor it. To represent Wingify at the event, Ankit, Varun and I (Vaidik) attended. This blog post is an account of what we did at the event and what we learned by sponsoring, attending and being present at the conference as an organization.

The conference

The organizers reported that there were 960 registrations for the conference - quite a big number for a conference around something like Big Data. It seems most of the attendees were from Bangalore and not many had come from Delhi. There were talks around infrastructure for big data systems, machine learning, data mining, etc. A few of the talks we attended were really good. There were a few that we wanted to attend but couldn’t, because we were busy talking to people who wanted to know more about us. Well, that’s what conferences are about - attend a few good talks and meet a lot of people. Fortunately, HasGeek will put out the recorded videos of those sessions soon. Some talks were around interesting topics and some not so much. Shailesh’s talk on The Art of Data Mining was certainly the best of all the talks that I managed to attend (personal opinion).

There were poster sessions for talks that the editorial panel found interesting, but not interesting enough to be selected as full sessions. My proposal for Using Elasticsearch for Analytics (presentation slides) was selected for a poster session, which I presented on the 2nd day of the conference. It was interesting to see that multiple teams within Flipkart, Red Hat, a couple of other product startups and services companies were interested in doing the same. So we ended up having long discussions about how we are using Elasticsearch for analytics at Wingify and how they can use Elasticsearch to solve similar problems.

Our presence at the conference

We had a desk/booth which we managed to prepare nicely to catch everyone’s attention. We got a display next to our desk on which we played the super cool video that we put on vwo.com (a few people said they really liked it). I think that caught a lot of people’s attention. We also strategically placed our standees where they were clearly visible. Looking at the other standees, I think ours was one of the best in design, if not the best. We distributed our t-shirts and stickers, which seemed to attract a lot of people (some came more than once for the free t-shirt). A few people complimented us on the A4 insert we distributed to all the participants at the time of registration. Thanks to Paras for helping out with the content and the design.

On the first day, an overwhelming number of people walked up to our booth. They were mostly unaware of our existence and our product’s. Many were blown away by the idea; some not so much. But after talking to so many people, we figured that we had not been entirely right in dismissing this conference as a place to promote the product and find prospective clients. Other than engineers, there were decision makers from large companies like Amazon, Citibank, eBay and Lenovo whom Ankit got a chance to speak with.

We were primarily at The Fifth Elephant to establish the Wingify engineering brand and to hire. We were able to spread the word about what we do and got people interested in our product and work. So on the front of establishing our engineering brand, we were somewhat successful - this was evident from the kind of conversations we had with people at our booth and the number of people who shared their contact details with us (though this is not always conclusive, as free goodies attract people and the numbers can be deceiving). A lot of people were interested in understanding what kind of roles we are hiring for; these people were interested in data science and software engineering. Fingers crossed - we might get some applications soon and the opportunity to work with some amazing people. People found our product cool - many did not know that something of this sort existed, and I think that was one of the things that got them interested. However, some were sad to learn that we are based out of Delhi instead of Bangalore; as I mentioned earlier, most of the attendees seemed to be from Bangalore. That just says that the community in Delhi needs to work together to make Delhi more exciting for engineers.

Community

We engaged with the community through our booth and at other moments, like lunch, during the conference. We got the opportunity to make connections with people from startups and companies such as Jabong, BloomReach, SupportBee, InMobi, Flipkart, GlusterFS (and Red Hat), Qubole, Myntra, Aerospike and Slideshare. I might have missed some unintentionally. We got to learn about what they are doing and we shared what we are doing. Discussions were usually about the product, engineering, our tech stack, specific engineering problems, the team, work culture, the community in Delhi, etc. People were excited to learn about our stack and what we are doing with it. We always knew that our tech stack is not the conventional one, but we realized just how uncommon and cool it is.

In the process, we had some good discussions and connected with good people who we hope will help us solve some of the problems we are working on - making friends of Wingify :)

We learned about new things

We have not always been very focused on doing a lot with the data we collect. With our latest release, we have come up with a number of features that make use of the large amounts of data our systems collect, and our solutions to these problems were rather unconventional. With the latest release and our plans, we will be making more and more use of collected data to derive useful insights for our customers and to build new features that will help them optimize and increase their conversions. Since we have plans to work with data, The Fifth Elephant was an important conference for us to learn about what exactly is going on in the Big Data universe and how we can make use of all that at Wingify. The Aerospike NoSQL database, CacheBox and InMobi’s Grill project are just a few things we got introduced to, and we may explore them in the future for our varied use cases. It was interesting to see people trying to leverage SSDs for solving different kinds of problems. Aerospike is a NoSQL database optimized for SSDs and claims to be extremely performant (200,000 QPS per node). CacheBox is an advanced caching solution that leverages flash storage to improve database performance.

Other than these systems, there were some learnings around building big data infrastructure, real-time data pipelines and data mining. There was a talk on Lessons from Elasticsearch in Production by Swaroop CH from Helpshift. We have been using Elasticsearch at Wingify, and it was interesting to see that we were not facing the same problems they were. We took that as a sign to be cautious and prepared for firefighting such problems. These were around using Elasticsearch’s Snapshot and Restore API (they say it doesn’t work) and performing rolling upgrades (which is the recommended way of doing upgrades). We never had such problems, but we are now aware that others have, and we can be better prepared.

To sum up

It was a great experience being at this conference. It was the first time we attended a Big Data conference. The ecosystem in India seems to be big and growing, and hopefully there will be even better content at conferences like these in the future. Thanks to HasGeek for taking the initiative. We hope that the conference will continue to happen in the years to come.

If you were present at the conference and met us there, please do not hesitate to connect with us. If you have any questions to ask regarding our experiences, go ahead and leave comments and we will get back to you. If you like what we do at Wingify and want to join the force, we will be happy to work with you. We are hiring!

Photo Credits

The beautiful photographs in this post have been provided by HasGeek. You can find more photographs of The Fifth Elephant at the following links:


We, at Wingify, handle not just our own traffic, but also the traffic of major websites such as Microsoft, AMD, Groupon, and WWF that implement Visual Website Optimizer (VWO) for their website optimization. VWO allows users to A/B test their websites and optimize conversions. With an intuitive WYSIWYG editor, you can easily make changes to your website and create multiple variations you can A/B test. When a visitor lands on your website, VWO selects one of the variations created in the running campaign(s), and the JavaScript library makes the required modifications to generate the selected variation based on the URL the visitor is on. Furthermore, VWO collects analytics data for every visitor interaction with the website and generates detailed reports to help you understand your audience’s behavior and gain deeper insight into your business results.

Here is a very high-level overview of what goes on behind the scenes:

How it started

Back in the day, we deployed one server in the United States that ran the standard LAMP stack. The server stored all changes made to a website using the VWO app, served our static JS library, collected analytics data, captured visitor data, and saved it in a MySQL database.

This implementation worked perfectly for us initially, when we were serving a limited number of users. However, as our user base kept growing, we had to deploy additional load balancers and Varnish cache servers (each with 32 GB of RAM; we needed 8 such servers to meet our requirements) to make sure that we cached the content for every requested URL and served it back in the least possible time.

Gradually, we started using these servers only for serving JS settings and collecting analytics data, and began using Amazon’s CloudFront CDN for serving the static JS library.

Issues we faced

This worked great for a while, until our traffic crossed 1k requests per second. With so much traffic coming in and the increasing number of unique URLs being tested, the system started failing. We experienced frequent cache misses, and Varnish required ever more RAM to cope with the new requirements. We knew we had hit the limits of that setup and quickly realized that it was time for us to stop everything and get our thinking caps back on to redesign the architecture. We now needed a scalable system that was easier to maintain and would cater to the needs of our users across various geographic locations.

The new requirements

Today, VWO uses a dynamic CDN built in-house that can cater to users based in any part of the world. The current implementation offers us the following advantages in comparison with other available CDNs:

  • Capability of handling almost any volume of requests at average response times of 50 ms
  • Handles 10k+ requests/sec per node (8 GB RAM); we have benchmarked this system to handle 50k requests/sec per node in our current production scenario
  • 100% uptime
  • Improved response time and data acquisition as the servers are closer to the user, thus minimizing the latency and increasing the chances of successful delivery of data
  • Considerable cost savings as compared to the previous system
  • Freedom to add new nodes without any dependencies on other nodes

Implementation challenges and technicalities

The core issue we had to resolve was to avoid sending the same response for all the requests coming from a domain or a particular account. In the old implementation, we were serving JSON for all the campaigns running in an account, irrespective of whether a campaign was actually running on the requested URL. This loaded unnecessary JS code that might not be useful for a particular URL, thereby increasing the load time of the website. We knew how crucial page-load time is for online businesses and how directly it impacts their revenue. In the marketing world, users are less likely to make purchases from a slow-loading website than from a fast-loading one.

It is important to make sure that we only serve relevant content based on the URL of the page. There are two ways to do this:

  • Cache the JSON for every URL using a cache like Varnish (the old system).
  • Cache each campaign running in an account and then build/combine the settings dynamically for each URL. This approach is the fastest way to do it and requires the least amount of resources.

With the approach identified, we started looking for nodes that could do everything for us - generate dynamic JSON on the basis of the request, serve the static JS library, and handle data acquisition. Another challenge was to make these nodes part of a distributed system spread across different geographies, with no dependency on each other, while making sure that each request is served from the closest location rather than from nodes in the US only. We had written a blog post earlier to explain this to our customers. Read it here.

OpenResty (aka ngx_openresty), our current workhorse, is a full-fledged web application server created by bundling the standard Nginx core with various third-party Nginx modules and their external dependencies. It also bundles Lua modules that allow URL handlers to be written in Lua, with the Lua code running inside the web server.

We have gone from 1 server running Apache + PHP, to multiple nodes involving Nginx (load balancer) -> Varnish (cache) -> Apache + PHP (for cache misses + data collection), to the current system where each node by itself is capable of handling all types of requests. We serve our static JS library and the JSON settings for every campaign from these nodes, and also use them for analytics data acquisition.

The following section describes briefly the new architecture of our CDN and how VWO servers handle requests:

  1. We use the Nginx-Lua shared dictionary, an in-memory store shared among all the Nginx worker processes, to store campaign-specific data. Memcached is used as the first fallback for when we have to restart the OpenResty server (which resets the shared dictionary). Our second fallback is our central MySQL database. If a lookup misses at any level, the data is fetched from the layer below, and the response is saved at all the levels above to make it available for the next request.
  2. Once a request hits our server to fetch the JSON for the campaigns running on a webpage, VWO runs a regex match of the requested URL against the list of URL regex patterns stored in the Nginx-Lua shared dictionary (the key being the account ID, an O(1) lookup, fast!). This returns the list of campaign IDs valid for the requested URL. All the regex patterns are compiled and cached at the worker-process level and shared among all requests.
  3. Next, VWO looks up the campaign IDs (returned after matching the requested URL) in the Nginx-Lua shared dictionary, with account ID and campaign ID as the key (again an O(1) lookup). This returns the settings for all matching campaigns, which are then combined and sent in the response along with some additional request-specific data such as geo-location data and code for third-party integrations. We ensure that the caching layer never holds stale data and is updated within a few milliseconds, which gives us an advantage over the invalidation times of most available CDNs. A conceptual sketch of steps 1-3 follows this list.
  4. To ensure that a request is served from the server closest to the visitor, we use managed DNS services from DynECT, which keep a check on the response times from various POPs and reply with the best possible server IPs (both in terms of health and distance). This helps us ensure a failsafe delivery network.
  5. To capture analytics data, all data related to visitors, conversions and heatmaps is sent to these servers. We use OpenResty with Lua for collecting all the incoming data. The data received at the OpenResty end is pushed to a Redis server running on each of these machines. The Redis server writes the data as fast as possible, thereby reducing the chance of data loss. Next, we move data from the Redis servers to a central RabbitMQ. This incoming data is then used by multiple consumers in various ways and stored in multiple places for different purposes. You can check our previous post Scaling with Queues to understand more about our data acquisition setup.
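
To make steps 1-3 above more concrete, here is a conceptual sketch of the lookup flow. The production handlers are written in Lua and run inside OpenResty; the JavaScript below, including the readThrough and settingsForUrl helpers and the cache key names, is purely illustrative and not our actual code:

// Conceptual sketch only - the real handlers are Lua code running inside OpenResty.
// "layers" is an ordered list of cache layers: shared dictionary, Memcached, MySQL.
function readThrough(layers, key) {
    for (var i = 0; i < layers.length; i++) {
        var value = layers[i].get(key);
        if (value) {
            // Step 1: backfill the faster layers so the next request is served from the top.
            for (var j = 0; j < i; j++) {
                layers[j].set(key, value);
            }
            return value;
        }
    }
    return null;
}

function settingsForUrl(layers, accountId, requestUrl) {
    // Step 2: regex-match the requested URL against the account's cached URL patterns.
    // (In production the compiled patterns are cached per worker process.)
    var patterns = readThrough(layers, 'patterns:' + accountId) || [];
    var campaignIds = patterns
        .filter(function (p) { return new RegExp(p.pattern).test(requestUrl); })
        .map(function (p) { return p.campaignId; });

    // Step 3: fetch the cached settings of each matching campaign; these are then
    // combined with request-specific data (geo-location, integrations, etc.) before responding.
    return campaignIds.map(function (id) {
        return readThrough(layers, 'settings:' + accountId + ':' + id);
    });
}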

As our customers and our traffic keep growing, we will be able to judge our system better - how well it scales and what problems it has. And as VWO grows and becomes better and better, we will keep working on our current infrastructure to improve it and adjust it to our needs. We would like to thank agentzh (Yichun Zhang) for building OpenResty and for helping us out whenever we were stuck with our implementation.

We work in a dynamic environment where we collaborate and work towards architecting scalable and fault-tolerant systems like these. If these kinds of problems and challenges interest you, we will be happy to work with you. We are hiring!


We are excited to announce our sponsorship of The Fifth Elephant - a popular conference around the Big Data ecosystem. The conference will be held in Bangalore, India from 23rd to 26th July.

Our engineers will be present at the conference. If you are interested in our work, want to know more about what we are doing, want to work with us (we’re hiring), get some cool goodies or just want to say Hi!, please visit our booth (B7) or catch any of our team members. We’d love to talk to you!

We look forward to meeting you in Bangalore!


In November last year, I started developing an infrastructure that would allow us to collect, store, search and retrieve high volume data. The idea was to collect all the URLs on which our homegrown CDN would serve JS content. Based on our current traffic, we were looking to collect some 10k URLs per second across four major geographic regions where we run our servers.

In the beginning we tried MySQL, Redis, Riak, CouchDB, MongoDB and Elasticsearch, but nothing worked out for us at that kind of write speed. We also wanted our system to respond very quickly, under 40 ms between internal servers on the private network. This post talks about how we built such a system using C++11, RocksDB and Thrift.

First, let me share the use cases for such a system in VWO. The following screenshot shows a feature where users can enter a URL to check whether VWO Smart Code is installed on it.


VWO Smart Code checker

The following screenshot shows another feature where users can see a list of URLs matching a complex wildcard pattern, regex pattern, string rule etc. while creating a campaign.


VWO URL Matching Helper

I reviewed several open-source databases but none of them fit our requirements except Cassandra. In a clustered deployment, however, reads from Cassandra were too slow, and they became slower as the data size grew. After understanding how Cassandra works under the hood (such as its log-structured storage, similar to LevelDB), I started playing with open-source embeddable databases that use a similar approach, like LevelDB and Kyoto Cabinet. At that time, I found RocksDB, an embeddable persistent key-value store library built on LevelDB. It was open-sourced by Facebook and had a fairly active developer community, so I started playing with it. I read the project wiki, wrote some working code and joined their Facebook group to ask questions about prefix lookup. The community was helpful, especially Igor and Siying, who gave me enough hints about prefix lookup, custom prefix extractors and bloom filters to help me write something that actually worked in our production environment for the first time. Explaining the technology and jargon is out of the scope of this post, but I would encourage readers to read about LevelDB and the RocksDB wiki.


RocksDB FB Group

For capturing the URLs at a peak velocity of up to 10k serves/s, I reused our distributed-queue-based infrastructure. For storage, search and retrieval of URLs, I wrote a custom datastore service called HarvestDB using C++, RocksDB and Thrift. Thrift provided the RPC mechanism for implementing this system as a distributed service accessible by various backend sub-systems. The backend sub-systems use client libraries generated by the Thrift compiler to communicate with the HarvestDB server.

The HarvestDB service implements five remote procedures - ping, get, put, search and purge. The following Thrift IDL describes this service.

namespace cpp harvestdb
namespace go harvestdb
namespace py harvestdb
namespace php HarvestDB

struct Url {
    1: required i64    timestamp;
    2: required string url;
    3: required string version;
}

typedef list<Url> UrlList

struct UrlResult {
    1: required i32          prefix;
    2: required i32          found;
    3: required i32          total;
    4: required list<string> urls;
}

service HarvestDB {
    bool ping(),
    Url get(1:i32 prefix, 2:string url),
    bool put(1:i32 prefix, 2:Url url),
    UrlResult search(1:i32 prefix,
                     2:string includeRegex,
                     3:string excludeRegex,
                     4:i32 size,
                     5:i32 timeout),
    bool purge(1:i32 prefix, 2:i64 timestamp)
}
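
To illustrate how a client generated from this IDL talks to the service, here is a hypothetical Node.js sketch. Our actual consumers are the PHP backend, the Python cron job and the queue consumers described below; the generated module path, host, port and all argument values here are assumptions:

// Hypothetical Node.js client for HarvestDB - shown for illustration only.
// Assumes the IDL above has been compiled with: thrift --gen js:node harvestdb.thrift
var thrift = require('thrift');
var HarvestDB = require('./gen-nodejs/HarvestDB');

var connection = thrift.createConnection('127.0.0.1', 9090, {
    transport: thrift.TBufferedTransport,
    protocol: thrift.TBinaryProtocol
});
var client = thrift.createClient(HarvestDB, connection);

// Check connectivity first, then search an account's URLs. The prefix, regexes,
// result size and timeout below are placeholder values.
client.ping(function (err, alive) {
    if (err || !alive) {
        throw err || new Error('HarvestDB is not reachable');
    }
    client.search(1234, '^https?://example\\.com/products/.*', '', 100, 50,
        function (err, result) {
            if (err) { throw err; }
            console.log(result.found + ' of ' + result.total + ' URLs matched');
            connection.end();
        });
});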

Clients use ping to check HarvestDB server connectivity before executing other procedures. RabbitMQ consumers consume the collected URLs and put them into HarvestDB. The PHP-based application backend uses a custom Thrift-based client library to get (read) and search URLs. A Python program runs as a periodic cron job and uses the purge procedure to remove old entries based on timestamp, which makes sure we don’t exhaust our storage resources. The system has been in production for more than five months now and is capable of handling a (benchmarked) workload of up to 24k writes/second while consuming less than 500 MB of RAM. Our future work will focus on replication, sharding and fault tolerance of this service. The following diagram illustrates the architecture.


Overall architecture

Discussion on Hacker News