We use Elasticsearch to store analytics data. This data powers the Post Report
Segmentation feature in VWO, so the amount of data stored in Elasticsearch is
tied to the number of campaigns our customers are currently running. We often
need custom tooling to work with this data, and the requirements of such
tooling are uncommon. This blog post is about how we solved some issues by
building a few missing blocks on top of the official Elasticsearch Python
client while working on this project.
The code base that implements this feature (Post Report Segmentation) is
written entirely in Python. When we were starting out, we had to decide which
client to use because there were many out there. Eliminating some was easy
because they were tied to certain frameworks like Tornado and Twisted. We were
not sure which path to take initially, so we decided to keep things simple,
avoid early optimization and not use any framework that depends heavily on
non-blocking IO. If we needed any of that later, Gevent could be put to use
(in fact, that’s exactly what we did). Even for the simpler way there were
quite a few options. The deciding factors for us included:
Maintenance commitment from the author
Considering all these factors, we decided to go with the official Python
client for Elasticsearch. We did not come across any problems with it for our
simple requirements. It is fairly extensible and comes with some batteries
included. For everything else, you can extend it, thanks to its simple design.
It worked well for a while, until we had to add some internal tooling that
needed to work a lot with Elasticsearch’s Scroll API and Bulk API.
Elasticsearch’s Bulk API lets you club together multiple individual API calls
into one. This is used a lot in speeding up indexing and can be very useful if
you are doing a lot of write operations in Elasticsearch.
The way you work with Bulk APIs is that you
construct a different kind of request body for bulk requests and use the client
for sending that request data. The HTTP API that Elasticsearch exposes for bulk
operations is semantically different than the API for individual operations.
Consider this. If you were to index a new document, update an existing document
and delete another existing document in Elasticsearch, you can do it like so:
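A minimal sketch of the three individual calls, assuming an elasticsearch-py client of the 1.x era (where `doc_type` and `body` are accepted keyword arguments; newer releases differ). The index name, doc type, IDs and field values are hypothetical, and `es` can be any object exposing the client's `index`, `update` and `delete` methods.

```python
def run_individual_operations(es):
    """Index one document, update another and delete a third, one call each.

    `es` is expected to be an elasticsearch.Elasticsearch instance; the
    index, doc type, IDs and fields below are made up for illustration.
    """
    responses = []
    # Index a new document with an explicit ID.
    responses.append(es.index(index="visitors", doc_type="visitor", id=1,
                              body={"browser": "firefox"}))
    # Partially update an existing document.
    responses.append(es.update(index="visitors", doc_type="visitor", id=2,
                               body={"doc": {"browser": "chrome"}}))
    # Delete another existing document.
    responses.append(es.delete(index="visitors", doc_type="visitor", id=3))
    return responses
```

Each operation is a separate method call and therefore a separate HTTP request.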
If you were to achieve the same thing using Bulk APIs, you would end up writing
code like this:
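A rough sketch of what building a bulk request body by hand involves, using only the standard library. The helper name `build_bulk_body` and the sample index, type and field names are hypothetical; the newline-delimited layout (one action line, optionally followed by one source line) follows the Bulk API format.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) triples into a bulk request body.

    Each operation becomes one JSON line describing the action, optionally
    followed by one JSON line with the document source. The body has to end
    with a newline.
    """
    lines = []
    for action, metadata, source in actions:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # delete operations have no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


actions = [
    ("index", {"_index": "visitors", "_type": "visitor", "_id": 1},
     {"browser": "firefox"}),
    ("update", {"_index": "visitors", "_type": "visitor", "_id": 2},
     {"doc": {"browser": "chrome"}}),
    ("delete", {"_index": "visitors", "_type": "visitor", "_id": 3}, None),
]
body = build_bulk_body(actions)
# With a real client the body would then be sent as: es.bulk(body=body)
```

Note how the caller, not the client, is responsible for getting the line layout right.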
There is a big difference in how bulk operations work at the code and API
level compared to individual operations:
The request body is considerably different in Bulk APIs as compared to their
individual counterparts.
The responsibility of properly serializing the request body shifts to the
developer, whereas this could be handled at the client level.
The serialization format itself is a mix of JSON and newline characters.
If you depend a lot on bulk operations, these problems will bite you once you
start using them in many places in your code. The current support for Bulk
APIs does not give you the flexibility to manipulate bulk request bodies at
will. The official client does not take care of this issue either. We are not
blaming anyone, because the author’s objective is to be as unopinionated as
possible, and this also gave us the chance to do it our way instead of adopting
an existing implementation. We wanted to use the Bulk API the same way we would
use the individual APIs. And why shouldn’t it be the same? They are essentially
individual operations put together and executed on a different end-point.
Our solution was to provide a BulkClient that lets you start a bulk operation,
queue operations on it the same way you would execute individual operations
and, when you want to execute them together, builds the required request body
and uses the Elasticsearch client to make the request. Exposing bulk
operations so that they look semantically the same as individual operations
required us to implement, at a very high level, APIs in the BulkClient that
mirror the individual APIs.
This is how the BulkClient works:
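The idea can be sketched roughly as follows. This is a hypothetical stand-in written for illustration, not the actual BulkClient: the class name `BulkOperationSketch` and its method signatures are assumptions modelled on the individual APIs, and the wrapped client only needs a `bulk(body=...)` method such as the one on the official client.

```python
import json


class BulkOperationSketch(object):
    """Hypothetical stand-in for a bulk client; not the actual BulkClient."""

    def __init__(self, client):
        self._client = client  # anything with a bulk(body=...) method
        self._lines = []       # serialized action and source lines

    def _queue(self, action, index, doc_type, id=None, body=None):
        meta = {"_index": index, "_type": doc_type}
        if id is not None:
            meta["_id"] = id
        self._lines.append(json.dumps({action: meta}))
        if body is not None:
            self._lines.append(json.dumps(body))

    # Same call shape as the individual APIs, but nothing is sent yet.
    def index(self, index, doc_type, body, id=None):
        self._queue("index", index, doc_type, id, body)

    def update(self, index, doc_type, id, body):
        self._queue("update", index, doc_type, id, body)

    def delete(self, index, doc_type, id):
        self._queue("delete", index, doc_type, id)

    def execute(self):
        """Send all queued operations in one bulk request and reset the queue."""
        payload = "\n".join(self._lines) + "\n"
        self._lines = []
        return self._client.bulk(body=payload)
```

Calling `index`, `update` and `delete` on such an object only records the operations; a single `execute()` call serializes them and makes one bulk request.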
The next problem we faced was with the Scroll API.
According to the documentation:
While a search request returns a single “page” of results, the scroll API can
be used to retrieve large numbers of results (or even all results) from a
single search request, in much the same way as you would use a cursor on a
traditional database.
The Scroll API is helpful if you want to work with a large number of
documents, typically to get them out of Elasticsearch.
The problem with the Scroll API is that it requires a lot of bookkeeping. You
have to keep the scroll_id after every iteration to get the next set of
documents. Depending on your application, there is probably no workaround.
However, our use case was to get a large number of documents all together. You
can do that without the Scroll API as well, i.e. by using the size parameter to
tell Elasticsearch how many documents to return. You can ask it to return all
documents by using the Count Search API first and then passing the count as
size, but that will usually time out (or at least it did for us).
So what we did was scroll Elasticsearch in a loop and do the bookkeeping in
the code. That was simple as well until we had to do it in multiple places:
there was no uniform way to do it, which led to a lot of repeated code.
Our solution to this problem was to create a separate wrapper API only for this
purpose and use it everywhere in our project. So we wrote a simple function
that would do the bookkeeping for us and it could be used like so:
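A minimal sketch of such a wrapper, written as a generator. The name `iter_scroll` is hypothetical and this is not necessarily how our helper is implemented; it assumes a regular (non-scan) scrolled search and a client whose `search` and `scroll` methods accept the same keyword arguments as the official client.

```python
def iter_scroll(client, scroll="5m", **search_kwargs):
    """Yield every hit of a search, following scroll IDs behind the scenes.

    Hypothetical helper; `client` needs search() and scroll() methods with
    the same keyword arguments as elasticsearch-py's client.
    """
    response = client.search(scroll=scroll, **search_kwargs)
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            break  # an empty page means the scroll is exhausted
        for hit in hits:
            yield hit
        # Always use the most recent scroll ID to fetch the next page.
        response = client.scroll(scroll_id=response["_scroll_id"],
                                 scroll=scroll)
```

The caller can then write `for hit in iter_scroll(es, index="visitors", body=query): ...` and never touch a scroll ID.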
Iterator based Scrolling in elasticsearch-py
We must highlight that the official client later added support for iterator
based scrolling as a helper. We had already started using our solution in our
project, and we find that ours is slightly different from theirs. For more
details, read the elasticsearch-py helpers documentation.
SuperElasticsearch - elasticsearch-py with goodies!
Our solutions to both the problems described earlier were based on the official
Elasticsearch client. After having solved these two problems, we figured that
instead of passing the client object around to our new APIs, it would be nicer
if the new APIs felt like a part of the client itself. So we went ahead and
sub-classed the existing client class Elasticsearch to make the new APIs
easier to use. You can use the sub-classed client SuperElasticsearch like so:
This has also made it easy for us to do releases of SuperElasticsearch.
SuperElasticsearch does not depend on the official client in ways that would
break compatibility with new releases of the official client; if it ever does,
we can make the adjustments and come up with a new release. It has been
written to work with new versions of the official client with minimum
friction. If a new release of the official client comes out, you should be
able to upgrade to it without upgrading SuperElasticsearch. This way we can
develop SuperElasticsearch at its own pace and release only when we have new
features or when compatibility breaks. It also makes it easier for you to use
the new APIs because you get all of them with the client object itself.
After hosting the Meta Refresh Delhi Runup Event, it was time for us at Wingify to prep up for MetaRefresh. We were very excited to contribute back to the community by not just sponsoring MetaRefresh, but also by adding content to the conference through a talk and a workshop, both focused towards Web Performance.
We started off our journey from Delhi to Bangalore on 15th May, a day before the conference, and were welcomed by awesome weather at Bangalore. We set off early the next morning and grabbed our bags to march towards MLR Convention Centre, Bangalore to set up our company booth. The setup didn’t take much time, and we were ready to welcome fellow attendees and share more about Wingify through our stall.
While speaking to the attendees, many expressed their interest in interviewing at Wingify. Usually, we ask interested candidates to mail their resume to firstname.lastname@example.org and follow the procedure. This time, we gave it a unique touch by using a hack developed by Paras (our Founder & CEO) on a hack night. It was a mystery made of hints, each leading to the next clue, to be solved using the browser’s developer console. It was great fun to watch attendees trying their best to crack the hints and unravel the mystery, though only a few were able to solve it.
Some moments captured during Meta Refresh 2015:
A recurring issue in the majority of the talks was whether businesses should keep maintaining their mobile web version after successfully launching native apps on the most popular mobile platforms. Several supporting and opposing arguments were made, though the most logical ones favoured supporting the mobile web version as well. Several speakers shared their experience of the effort involved in maintaining the web version and making the web experience as good as the one delivered through the native apps.
Performance was another major topic discussed in several talks, covering not just the network performance of web applications but their rendering performance as well. Another big discussion revolved around achieving jank-free animations, not just in web applications but in games as well. Speakers shared several techniques and approaches from their quest to achieve 60fps in web applications.
We had a great time being part of MetaRefresh 2015, and look forward to more such events, so stay tuned with our different social media channels (Twitter, Facebook) to meet us at another conference.
Giving back to the community has always been a priority at Wingify, be it through open sourcing internal projects or via organizing / sponsoring community events, the most recent being Meta Refresh Delhi Run-up Event organized and hosted by Wingify on 21st March 2015. Tony Simon from HasGeek was present from the MetaRefresh Team to help us host this event and help us make it more awesome.
Siddharth Deswal speaking on “How to Communicate Better with Marketing, Sales and Other 'Business' Types”
The event started on time (10:30am) with Tony introducing MetaRefresh, HasGeek and Wingify to the attendees. Siddharth Deswal, Marketing Guru at Wingify kickstarted the event with his talk on “How to Communicate Better with Marketing, Sales and Other ‘Business’ Types”, along with shots of humour. The talk started with the narration of his own experience of wearing different hats at Wingify with him helping different departments. He concluded on a great note saying that different departments shouldn’t be isolated and must focus on sharing and imparting knowledge to people from other departments, especially the ones who are interdependent; the best example being that marketing team should also try to understand the technical aspects related to feature development.
Apoorv Saxena describing browser evolution in his talk on “Hacking to be Performant?”
A purely technical talk on web performance, it started off with a poll to find out how many of the participants measure performance regularly and have it as part of their deployment process. The feedback from the attendees showed that very few take measures to continuously monitor product performance. Next came a discussion of why performance matters, followed by a discussion of the various hacks that people employ to make their applications performant. The core part of the talk discussed the difference between using hacks and following a different approach during development, and how each of them pays off in the long run.
Vipul Taneja speaking on “Landing Pages Optimization”
The next talk was presented by Vipul Taneja from AdSparkx media on “Landing Pages Optimization - Things you can do to ‘Test’”. He began by briefing the attendees about his visit to Vegas and his observations from it. His talk covered various techniques that his company uses to maximize ROI on the landing pages of the businesses that hire them, including both White Hat and Black Hat techniques for increasing landing page conversions.
Taruna Manchanda speaking on “How to optimize your webpages - lessons learnt from 101 VWO customers' A/B tests”
Next speaker was Taruna Manchanda, who shared her experiences and learnings while taking care of all paid acquisitions and customer case studies, as part of the Digital Marketing Team at Wingify. The attendees gathered great insights about how to best A/B Test a webpage along with the focus on what needs to be measured and how.
It was a great experience hosting this event. Thanks to HasGeek for helping us with organizing the event. We hope that the conference will continue to happen in the years to come.
If you were present at the run-up event and met us there, stay tuned to our social media channels (Twitter, Facebook) to be a part of the next event we host. If you have any suggestions to make your experience better, go ahead and leave comments and we will get back to you. If you like what we do at Wingify and want to join the force, we will be more than happy to work with you. As always, we are looking for talented people to work with us!
We are proud to announce q-directives, a brand new and fast directive system for Angular.js, that takes the watcher optimization to a whole new level. It was a result of several jsperf tests and Chrome Timeline runs.
VWO is a single-page application made entirely in Angular.js. When designing a detailed reporting system for campaigns in Angular.js, we had trouble rendering large amounts of data using Angular directives. On one of the report pages, the application had registered 15,000+ watchers, largely due to the way ng-repeat works.
With q-directives and a revamped directive system, the number of watchers for a q-repeat directive (replacement for the ng-repeat directive) was brought down to just 1. So whenever the list changes, only one watcher gets fired.
The stats below are taken from the Chrome (version 37) timeline for the following use case:
A table containing 216 rows repeated by q-repeat. Each row has about 10 columns containing about 50+ Angular directives each (Original). The optimized version has those Angular directives replaced with q-directives, and ng-repeat is replaced by q-repeat.
Data is collected over 5 samples for both Original and Optimized situations.
[Tables not recoverable: timeline measurements for “Initial table render” and “Sorting the table”, including an “Optimized (+ disabling ngAnimate)” variant.]
Head over to this link for a usage documentation and API reference.
Elasticsearch is essentially a distributed search engine, but there has been
more than one example of companies and projects using Elasticsearch for
analytics instead of search. We, at Wingify, had similar requirements when we
decided to make our analytics more powerful to empower the customers of our
product, Visual Website Optimizer (VWO). This blog post is about how we
used Elasticsearch to make VWO’s user tracking a lot more powerful than it
was before.
For context, VWO is a tool that makes A/B testing of websites and mobile apps
so simple that no engineering intervention is needed to run new A/B testing
campaigns. Marketers and UI/UX designers do A/B testing to improve online
conversions and sales. VWO helps them perform these A/B tests with almost no
engineering knowledge.
Since VWO is at the center of optimizing websites and mobile apps, this makes
user tracking important for our product - our users make use of the data we
collect to understand how their users (different segments of users) behave and
make optimization decisions accordingly. For example, in an A/B test campaign
with three variations, variation 2 might be winning for all the goals but for all
the users coming from India, variation 3 might be winning for all or some of the
goals. It should be possible for our customers to generate custom segmented
reports and observe these different behaviours.
So let’s summarize how a campaign and its reporting should work:
A VWO customer may create multiple campaigns. These campaigns have more
than one variation (variations are variants of web pages or iOS apps with UI
changes) that our customer wants to A/B test against real traffic.
Every campaign has more than one goal (goals are events that you want to track,
such as visiting a particular page, clicking a DOM element, submitting a
us to track.
variation and sends this data to our data collection end-points.
Our data backend stores every visit and conversion for all the defined goals per
variation. This is stored on a day-wise basis.
When the campaign’s report is accessed, the day-wise visitor and goal conversion data
is used in the statistics that go behind generating the report.
Reports are generated considering behaviour of all the users who became a part
of the campaign. However, our customers should have the flexibility to segment
reports on the basis of parameters like location, browser, operating system,
time range, query parameters, traffic type, etc.
In the prehistoric times
We used to store only counters in our database (we use MySQL) i.e. for goal per
variation, we used to store number of visitors and conversions. Here is some
So when our customers want to view the report, our application’s backend
will run some queries to generate aggregated metrics like total visitors per
goal per variation, total conversions per goal per variation, etc. which could
be taken care of using MySQL’s built-in functions and then do some
statistics at the application level to decide winning variations per goal.
Notice that in our first table, where we store hits (visitors) and conversions,
we store total counters of these two metrics per goal per variation per day. In
the revenue table, we store every individual revenue per goal per variation
with the exact date it occurred on. We need these separately because we need to
calculate the sum of squares of every revenue generated, which is used in the
statistics. I am not going to delve into the statistics side of things because
that is out of the scope of this article.
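To illustrate why a sum of squares is worth keeping: the post does not spell out the statistics, but one textbook use is that the sample variance of individual revenues can be computed from just the count, the sum and the sum of squares. The sketch below assumes that use; the function name and the revenue values are made up.

```python
def variance_from_sums(n, total, total_of_squares):
    """Sample variance from count, sum and sum of squares.

    Uses the identity  s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1).
    """
    if n < 2:
        raise ValueError("need at least two observations")
    return (total_of_squares - total * total / n) / (n - 1)


revenues = [10.0, 20.0, 30.0]  # made-up revenue values
n = len(revenues)
total = sum(revenues)
total_of_squares = sum(r * r for r in revenues)
variance = variance_from_sums(n, total, total_of_squares)
```

With counters alone, the sum of squares cannot be recovered, which is why each individual revenue has to be available somewhere.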
This worked pretty well for us for a while. It was all very simple and we had
to deal with aggregated data most of the time, except in the case of revenue,
wherein we had to get every row of revenue for a particular campaign. At the
application level, it was essentially firing a few MySQL queries that would
give us the aggregated, day-wise data and then using that data to statistically
find winning variations per goal.
But this setup had a major drawback. Our customers were restricted to the view
of reports we exposed to them. It was not possible to drill down and
understand how different segments of users were behaving, as the complete
picture may not say it all about individual segments. For example, in an A/B
test campaign with three variations, variation 2 might be winning for all the
goals, but for all the users coming from India, variation 3 might be winning
for all or some of the goals. Finding this out was only possible by running
another campaign targeted at users from India, on the basis of a hunch, to see
if the results would differ. Many times the results would not differ and our
customers would lose visitors from their visitor quota.
Furthermore, our data storage had a few other problems. There was no
fine-grained control over date and time range (it was all day-wise). We also
stored all the counters according to our customers’ timezone (set at the time
of account creation), which means that changing the timezone later was
possible, but the data collected earlier would still be shown according to the
previously selected timezone. These were some major drawbacks to our way of
storing visitor and conversion data.
New Age Reporting
We knew that our existing MySQL based setup was not perfect but, more
importantly, we realized that it did not help our customers. We wanted to make
things simpler for our customers so that:
they could easily find important segments of users that behave
differently and run targeted campaigns for them if necessary.
they have finer control over date and time so that they can see reports
at different steps like months, days, hours, minutes, etc.
store everything in UTC so that we can take care of timezone changes at
Looking at our application requirements, we realized that we could not work with
just aggregated data any more. We needed to start storing individual visitors’
data and their corresponding conversions to achieve flexibility and to put the
power of slicing and dicing the data in our customers’ hands.
We are also a pretty small team, which means that we wanted fewer headaches
about ops and maintaining the entire system in production. We wanted things to
be simple and as self-managed as possible.
Our specific requirements were:
Allow storage of individual visitor data with a lot of properties for
Allow filtering on all the stored fields for performing segmentation.
Allow full text search on a few fields.
Capable of storing events for lifetime of a customer account. This means that
we cannot delete visitor data as long as our customer is with us.
Getting consumable data out should be fast, or let’s say not terribly slow. We
are okay with an average of 2-3 seconds to start with.
Fault tolerant system. Failing nodes should not bring the service down.
Scalable to handle our growing traffic, storage and other requirements.
We knew that Hadoop is the de facto system in the Big Data universe, but the
entire Hadoop ecosystem is so vast that getting started with it is not easy.
There are tons of different tools in the Hadoop ecosystem, and just selecting
the right tools for your use case may take a significant amount of research
time, leaving the implementation time aside. Also, running a Hadoop cluster
is no piece of cake. There are many moving parts that you are not completely
aware of when you start. Performing upgrades of systems that have more
systems running with them will always be problematic. Further, tuning all these
systems to give acceptable performance also seemed difficult for a team as
small as ours with no prior experience with such systems.
On top of the above problems, which we got to know about from our friends
working with Hadoop and from different blogs and websites, the task of
implementing the infrastructure requirements for Hadoop, building an
implementation, managing it in production and then repeating the cycle seemed
daunting for a team of 2 engineers.
We knew that life would be much easier if we kept things simple, so we started
looking at other options.
Elasticsearch to the rescue
Having worked with Elasticsearch before for a smaller project and remembering
that I had watched Shay’s talk from Berlin Buzzwords where he mentioned
that Elasticsearch was also being used for analytics, we started looking at
Elasticsearch to solve our problems.
Elasticsearch supports filtering which we could use to filter visitors and
their conversions on the basis of a lot of properties that we wanted to
collect for every visitor. Filtering would be fast in Elasticsearch because you
can have indexes on every field if you want and since Elasticsearch
uses Lucene under-the-hood, we were confident about its indexing
capabilities. Elasticsearch supports full text search out-of-the-box.
This fit well with our basic application requirements. On top of this,
Elasticsearch supported faceting (when we were evaluating, the aggregations
framework was not there yet), which we could exploit for analytics. That meant
we did not even have to get all the data out of Elasticsearch to our application
layer. Elasticsearch is capable of giving us an aggregated view of the data we
This was just amazing for us. We were able to build a PoC within two weeks. The
next couple of months were spent on understanding Elasticsearch better,
optimizing our implementation, testing Elasticsearch against production load
and tuning it for the same.
In the meantime, Elasticsearch released 1.0.0 with the aggregations framework
and we quickly moved from using Facets (see Faceted Search) to Aggregations.
Aggregations proved to be very useful with revenue goals as we could just
ask Elasticsearch to give us sum of squares of individual revenues without
getting individual revenues out of Elasticsearch.
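As a sketch of what such a request can look like: the `extended_stats` metric aggregation reports `sum_of_squares` alongside count, sum and average. The field names `goal_id` and `revenue` and the helper name are assumptions for illustration, and the sketch ignores the nesting of conversion data in the real documents.

```python
def revenue_stats_query(goal_id, revenue_field="revenue"):
    """Build a search body asking only for aggregated revenue statistics."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "query": {"term": {"goal_id": goal_id}},
        "aggs": {
            "revenue_stats": {"extended_stats": {"field": revenue_field}}
        },
    }


query_body = revenue_stats_query(goal_id=7)
# With a real client: es.search(index=..., body=query_body); the sum of
# squares is then found under aggregations.revenue_stats.sum_of_squares.
```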
As pointed out earlier, we need to track individual users. How we do this is
we create a document for a unique visitor per account per campaign in
Elasticsearch. This document stores user meta data, data for segmentation and
goal conversion tracking data. A typical visitor document looks like this:
_id is the UUID of the visitor. Most of the other fields have
information extracted out from the IP address, the User Agent, the URL and the
All the fields except a few are simple fields with their types set correctly.
Indexes are maintained on all of them so that visitor documents can be filtered
according to the values in these fields.
But there are a few fields that are interesting:
Let’s look at each of them one-by-one.
query_params is an array of objects for storing query parameters and their
respective values. This is of type nested because our customers may want to
find all visitors, and their conversions, who visited pages with certain query
parameters. Consider a scenario where you want to find all visitor documents
with query parameter param1 set to val2. If query_params was not nested, a
simple bool must query with term queries would return the above document,
because it would find one of the two query_params.param values to be equal
to param1 and one of the two query_params.val values to be equal to
val2, even though we know that param1 never had val2 as its value. This happens
because each object in the query_params array is not treated as an individual
component of the document. nested types solve this problem. Read more about
nested documents and relations in Elasticsearch in this blog post.
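The cross-matching described above can be simulated in plain Python. This is only an illustration of the behaviour, not Elasticsearch code: `flatten_object_array` mimics how a non-nested array of objects ends up as one flat list of values per field. The sample values (param1/val1 and param2/val2) are assumed, and `nested_query` shows the standard nested query shape for the same search.

```python
def flatten_object_array(objects):
    """Mimic how a non-nested array of objects is indexed: one flat list of
    values per field, with the pairing between fields lost."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat


def matches_flat(objects, param, val):
    """Match against the flattened view, as a plain bool/term query would."""
    flat = flatten_object_array(objects)
    return param in flat.get("param", []) and val in flat.get("val", [])


def matches_nested(objects, param, val):
    """Match each object on its own, as with the nested type."""
    return any(o.get("param") == param and o.get("val") == val
               for o in objects)


query_params = [
    {"param": "param1", "val": "val1"},
    {"param": "param2", "val": "val2"},
]

# The same search expressed as a nested query against the nested field.
nested_query = {
    "nested": {
        "path": "query_params",
        "query": {"bool": {"must": [
            {"term": {"query_params.param": "param1"}},
            {"term": {"query_params.val": "val2"}},
        ]}},
    }
}
```

Here `matches_flat(query_params, "param1", "val2")` reports a match even though no single object pairs param1 with val2, while `matches_nested` does not.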
converted_goals_info is also an array of objects, for storing information
about individual goal conversions. Here we store the goal_id of the converted
goal, the time of conversion as a DateTime field and another field that we
will discuss shortly. This field is also of nested type for the same reason as
query_params.
converted_goals_info.facet_term and variation_goals_facet_term need to
be discussed together because their values are constructed in a similar way.
They don’t hold any new information. At the beginning of the post, we saw how
we used to store aggregated visitor and conversion counts per goal per
variation per day. We still need that data out of Elasticsearch in a similar
way for our statistics. The day-wise problem gets solved by using day-wise
buckets in the aggregations framework. The next problem is getting visitor
counts per variation per goal. In MySQL terms, we would want to run a GROUP BY
query on the variation and goal_id columns. In Elasticsearch, we can do
something similar by using a Terms Aggregation with scripts. The problem
with this approach is that if you have a large number of documents, your
script will get evaluated on all of them, and Elasticsearch is not really a
script execution engine (no matter which scripting plugin you use). What you
can do instead is store the result of the script at indexing time and then
simply run a Terms Aggregation on it. We saw a massive performance boost from
this hack.
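A rough sketch of the idea, under stated assumptions: the exact format of our facet terms is not shown in this post, so the `"<variation>_<goal_id>"` format, the `visited_at` date field and both helper names are hypothetical. The aggregation body uses the `interval` parameter of `date_histogram` as in the 1.x releases; newer Elasticsearch versions name it differently.

```python
def facet_term(variation_id, goal_id):
    """Precompute the GROUP BY key at index time (hypothetical format)."""
    return "%s_%s" % (variation_id, goal_id)


def daily_counts_query(term_field="variation_goals_facet_term",
                       date_field="visited_at"):
    """Day-wise buckets, each split by the precomputed term."""
    return {
        "size": 0,
        "aggs": {
            "per_day": {
                "date_histogram": {"field": date_field, "interval": "day"},
                "aggs": {"per_term": {"terms": {"field": term_field}}},
            }
        },
    }


# The term is stored on the document when it is indexed, so no script has
# to run at query time.
visitor_doc = {
    "variation": 2,
    "variation_goals_facet_term": [facet_term(2, g) for g in (1, 5)],
}
```

The trade-off is that any grouping key not computed at index time requires re-indexing later.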
Every document gets saved under the doc_type for the account that campaign
belongs to i.e. every account on VWO has a separate doc type.
From a performance point of view, Elasticsearch has very fast indexing and
querying capabilities. It is a distributed system: you can deploy a cluster
of nodes in production which stores indexes in a distributed, fault-tolerant
way to give you performance benefits. Increase the number of replicas per shard
and you can scale reads and queries. This can be done after creating an index as
well. Elasticsearch does not allow changing the number of shards of an existing
index, though. But there is a neat workaround for that: create a new index with
more shards and use aliases, and you can now scale indexing as well.
From our experience of working on large data sets which need to be queried on
an ad-hoc basis and have low latency requirements, and from what we learned
from Shay’s talks (1, 2, 3), we understood that a data storage system meant to
store a lot of data will scale well for your read and query requirements if
you can shard your data according to the variable that determines the growth
of that data. For example, if you are using a database for storing machine
logs, you should probably shard your data according to time, because you
would mostly want to query the most recent data, and if you have to do that
against all the data you ever collected, then your old data will only become a
performance
bottleneck. So a possible sharding strategy could be sharding data according to
Our requirement was similar. We get visitor data which we could easily shard
on a monthly basis. And since this data would keep on growing, we can just add
new indexes every month and place the new data in these indexes. However, which
index a visitor document goes to is not determined by the timestamp of the
visit but by the creation date of the campaign. Why? Our customers view
campaign reports, i.e. when a campaign report is opened, we want to get data
for that campaign only. So it makes sense to have all the data for a campaign
reside in one index, because we wouldn’t want to look into multiple indexes to
generate the report of one campaign. If we had decided to put visitor documents
in different indexes depending upon the time of visit, we would have faced the
following problems:
A campaign may run for more than a month, so visitor documents for a campaign
may be in more than one index, and we would not have any way to know which
indexes have visitors for a given campaign without keeping track of that
separately. This would be painful.
Since visitors also convert goals and we store conversion data in visitor
documents, it would be very difficult for us to find which index to find the
visitor document in so that we can add conversion tracking related data in the
These problems get solved when we restrict all visitor data for a given campaign
to go in one index only. So for account_id 123 that has two campaigns -
campaign 1 (created in January 2015) and campaign 2 (created in February 2015),
the visitor documents for both will be created in the indexes for January 2015
and February 2015 respectively.
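The index selection rule can be sketched in a few lines. The `visitors-YYYY-MM` naming scheme and the function name are assumptions made for illustration; only the rule itself (pick the index from the campaign's creation date, not from the visit time) comes from the text above.

```python
from datetime import date


def index_for_campaign(campaign_created_on, prefix="visitors"):
    """Name of the monthly index chosen by the campaign's creation date."""
    return "%s-%04d-%02d" % (prefix, campaign_created_on.year,
                             campaign_created_on.month)


# Campaign 1 was created in January 2015, campaign 2 in February 2015.
campaign_1_index = index_for_campaign(date(2015, 1, 14))
campaign_2_index = index_for_campaign(date(2015, 2, 3))
# A visitor who arrives in March 2015 on campaign 1 is still written to
# campaign_1_index, because the visit date plays no role in the choice.
```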
Another big advantage of this is that we can adjust the number of shards every
month. So if we are seeing a trend of more visitors getting tracked month after
month, in the next month we can create a new index with more shards than the
previous month’s index.
Since documents are stored in a particular shard in an index, Elasticsearch
needs to decide which shard to put a document in. Elasticsearch uses a hashing
algorithm for shard selection and, by default, hashes the document’s ID to
determine which shard the document goes into. This is called routing a
document into a shard. This may work fine in some cases. But the drawback of
this default routing strategy is felt when you have a large number of shards
and also when you have to serve a lot of queries. The drawback is that
Elasticsearch now needs to search every shard in an index for all the documents
matching a given query, wait for the results, aggregate them and then return the
final result. So for a given query, all shards get busy.
This can be controlled by using a better routing strategy. In our case, we
generate reports of a campaign of a given account. It would be ideal that one
account does not limit report generation of another account. So instead of going
with the default routing strategy, we decided to route documents on the basis of
account_id. So now, when a campaign report is generated for a given account,
the query hits only a single shard, leaving all other shards available for
serving other queries and also freeing up CPU resources. After moving to this
routing strategy, we saw a significant reduction in CPU usage in our cluster.
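The effect of the routing value can be illustrated with a small simulation. Shard selection follows the pattern `hash(routing) % number_of_primary_shards`; in the sketch below `zlib.crc32` is only a stand-in for Elasticsearch's own hash function, and the shard count, account ID and visitor IDs are made up.

```python
import zlib


def shard_for(routing_value, number_of_shards):
    """shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is only a stand-in; Elasticsearch uses its own hash function.
    """
    digest = zlib.crc32(str(routing_value).encode("utf-8")) & 0xFFFFFFFF
    return digest % number_of_shards


NUMBER_OF_SHARDS = 5
account_id = 123

# Default routing: the document ID is hashed, so one account's visitors
# are spread over many shards and a report query has to ask all of them.
shards_by_doc_id = {shard_for("visitor-%d" % i, NUMBER_OF_SHARDS)
                    for i in range(1000)}

# Custom routing: account_id is hashed, so every document of the account
# lands on the same shard and a report query can target just that shard.
shards_by_account = {shard_for(account_id, NUMBER_OF_SHARDS)
                     for _ in range(1000)}
```

With the official client, the routing value is passed through the `routing` parameter of the index and search calls; the same value has to be used when querying as when indexing.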
From an operations and management point of view, Elasticsearch is fault
tolerant: indexes can be sharded, replicated and distributed in a cluster.
Elasticsearch distributes shards and their replicas on different nodes in the
cluster so that if a node fails, Elasticsearch promotes replicas to be the
primary shards and moves shards and replicas in the cluster to balance the
cluster. What is really amazing is that Elasticsearch also gives control over
the placement of shards in a cluster, so that it is easy to separate hot data
from cold (historic) data. We have not had the need to use this feature yet,
but it is good to know that we can do this if historic data ever becomes a
performance problem. Chances are that it will become a problem, but probably
much later.
Although Elasticsearch made it really easy for us to push out something like
this (and remember, we had no experience building something like this before)
and we love Elasticsearch for that, we did find a few things about it that we
think limit us.
The facet term hack for avoiding running scripts works great but then it’s also
limiting if you want to add new features in your application that rely on
different scripts that were not added at the time of indexing. This means that
you will have to re-index all your data if you want to support this new
feature or just provide this feature on new data.
Lack of JOINS becomes limiting. As of now we push the conversion data in
visitor document. But it would have been ideal if we could independently index
conversions data in a separate index or doc type.
We don’t know how to solve these problems yet, or whether the Elasticsearch
team has any plans for bringing something new that fixes them. It would open
Elasticsearch to a lot more possibilities if JOINS were possible. But we also
understand that it’s not a simple problem to solve and that Lucene and
Elasticsearch were not made with these use cases in mind. Nevertheless, we hope
to see these improve in the future, especially because a lot of companies are
using Elasticsearch for analytics as well.
Elasticsearch has been great for us, and it proves that, depending upon your
requirements, you don’t always need Hadoop for building analytics. We feel
Elasticsearch shines when it comes to scaling with limited resources:
horizontal scaling is extremely simple. But whether it will work for you
depends entirely on your requirements.
Elasticsearch already works with Hadoop, and that integration is being further
developed to expand the use cases it can support. This gives us a lot of
confidence as we add more features to VWO’s user tracking in the future, and
we know that we will not be limited by our decision to use Elasticsearch.