At Wingify, we recently began an initiative by the name Engineering Talkies where our engineering teams share their experiences, repertoire of best practices, and learnings. Think of it as a knowledge sharing session between people within and across teams.

The last talk was about Web Application Security where I spoke about the best practices to follow for tightening the security of our web applications.

It started with a little introduction to my infosec journey and how companies deal with security researchers. Along the way I also touched upon why startups, specifically, should care about security.

Information Security is not just about following some best practices checklist, it’s all about lateral thinking.

We explored OWASP’s Top 10 Vulnerabilities considering possible attacks and techniques to shore up our defense against them. Next we went through some real world examples of common security threats and how the risks can be mitigated and flaws addressed.

In my preparation for the talk, I had conducted an internal security audit of our product application and the security measures we have put in place for VWO that should be followed for other products under the Wingify umbrella. Yes, we are coming up with some cool new products other than VWO, the beloved A/B Testing Tool, Stay tuned.

Slides from the talk on security at Wingify Engineering Talkies:

It’s very useful for companies to perform internal security audits and today we understand a little better about why we sometimes need to slow down feature development and clean up a bit.

We take security very seriously at Wingify. If you find any security vulnerability, please report it to [email protected]. We will respond as quickly as we can to any security issues identified.

Coding is always fun at Wingify, be it a Wingify Camp or a Fun Friday. And to add to the fun, in a Fun Friday Code In the Dark was organized by Wingify on 15th April 2016 in both Wingify’s Delhi and Pune office. The event is a front-end (HTML, CSS) competition created by the folks at Tictail, where each contestant compete to implement a website design given only a screenshot. The catch is, no previews of the results are allowed during the implementation, and no measuring tools can be used. There are some simple rules of the event:

  • Duration: 15 min
  • Technology: HTML/CSS
  • No previews
  • One champion

The event started at 5 o’clock in the evening and was divided into two slots, each slot having 9 participants. The rule was to choose two from each slot so that final four compete in the final round. Along with that, to add to the fun and frolic, a loud music was played to motivate the contestants to write the code as fast as they can. The lights were turned off to create an awesome atmosphere to complement the theme of the event. The only light flashing was that of TV and Laptop screens which added a magical appearance to the room.

Ankit Jain and Kushagra Gour starting the event
Contestants coding in the dark

After the completion of the allotted time, two participants from each team were selected based on the audience voting.

A final round was organized between selected “Fantastic Four” participants from the 2 slots which were Sparsh Gupta, Hemkaran Raghav, Ashish Bardhan, Dheeraj Joshi.

Meanwhile, some people were busy in eating delicious sandwiches and having some beer to quench their thirst.

Meanwhile in Wingify’s Pune office… /

Next was to decide the winner, and after voting by everyone finally we had the winners.

  1. Sparsh Gupta (Our CTO) - Winner
  2. Hemkaran Raghav - Runner Up

Also, The winners from Pune office were:

  1. Paras Chopra (Our CEO) - Winner
  2. Rachit Gulati - Runner Up

Yeah, our CEO & CTO are code-in-dark experts :) Checkout the awesome prizes that winners got:

Sparsh Gupta and Hemkaran Raghav getting his prize

It was a great experience to participate in the event. Thanks to Kushagra Gour for organizing the awesome blossom event. We hope that this type of events will continue to happen in the time to come.

You can watch a glimpse of the event here:

If you want to know more about the events and happenings at Wingify, follow us on (Twitter, or Facebook). If you have any suggestions for different type of events that we can organize, go ahead and leave comments and we will get back to you. If you like what we do at Wingify and want to join the force, we will be more than happy to work with you. As always, we are looking for talented people to work with us!

Few weeks ago, we did a redesign of our product - VWO. It wasn’t a complete overhaul from scratch, but some major design decisions were taken in the existing design based on the feedback we have received from users since we launched v3.0. This post is about a cool trick we used to achieve a task in that redesign project.

The task (or issue)

One of the most principle decisions we made was regarding the main layout of the app. It wasn’t about changes in placement of content, but actually about the UI semantics. It mostly translated to color changes to bring a sense of how any screen in the app is structured and how all components on the page relate to each other. Here is a comparison of the before & after designs:

Old Design

New Design

Note in the new design how different sections on the screen are more distinguished with definite boundaries and background in contrast to old design where all the page content was on a single grey surface. The old design reflected in the architecture as well - every main module got the complete page structure (except the main top header and left navigation) along with it. Eg. A Campaign module (page) in the above screenshot comprises markup of the page title section, tab menu, main content and sidebar. What I am trying to put forth is that a transition between modules causes the complete module content (mentioned sections) to disappear and appear again. This was fine with old design as we need to keep the base layout (the single grey surface) intact and custom content can transition over it. But the new design bought an issue with this approach. The base layout was no more just a single grey surface, rather it got split into 4 separate distinguishable sections:

  1. white page title section
  2. grey tab menu
  3. white main content section
  4. grey sidebar

And all these section’s markup being part of every main module’s markup, would fade in/out during page transitions which was unacceptable as the common page layout (white grey sections) would itself keep getting fade in/out along with the custom content inside them - bad experience!

The “Trick”

The most trivial approach to retaining the page layout sections during transitions would have been to create those sections in the main markup instead of every module bringing its own 4 sections. And every module change would have simply substituted appropriate custom content inside those constant 4 sections on the page. But this would have meant a major change in the module architecture increasing the scope of the redesign project. Heres how we tackled this issue…

We used the above mentioned solution but instead of dividing the content into 4 sections at root level, we created an illusion of having 4 sections always present on the page - using pseudo element & background gradients! Heres how:

So basically the pseudo structure always stays on the screen with all the custom content going and coming over it and giving an illusion that custom content renders inside those sections - just what we wanted for the end user!

Final result and code

In the End

This trick (or hack as one may call) helped us achieve the desired UX without actually modifying the base module architecture and it has been working really great so far without any compromises made. Hacks are not always bad after all…its just about evaluating what is best when.

In the GOF book, the interpreter pattern is probably one of the most poorly described patterns. The interpreter pattern basically consists of building a specialty programming language out of objects in your language, and then interpreting it on the fly. Greenspun’s Tenth Rule describes it as follows:

Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

In essence, the interpreter pattern consists of dynamically generating and transmitting code at run time instead of statically generating it at compile time.

However, I believe that modern functional programming provides us some alternatives that provide functionality approaching that of embedding a lisp interpreter in our runtime, but also with some measures of type safety. I’m going to describe Free Objects, and how they can be used as a substitute for an interpreter.

At Wingify, we have several important interpreters floating around in our (currently very experimental) event driven notification system. In this post I’ll show how Free Boolean Algebras can drastically simplify the process of defining custom events.

Algebraic structures

In functional programming, there are a lot of algebraic structures that are used to write programs in a type-safe manner. Monoids are one of the simplest examples - a monoid is a type T together with an operation |+| and an element zero[T] with the following properties:

a |+| (b |+| c) === (a |+| b) |+| c   //associativity
a |+| zero === a                      //zero

A type is a monoid if it contains elements which can be added together in an associative way, together with a zero element. A number of common structures form monoids - integers (with a |+| b = a + b ) are a simple example. For instance:

3 + (5 + 7) === (3 + 5) + 7 === 15
-17 + 0 === -17

But many other data structures also obey this law. Lists and strings, using |+| for concatenation and either [] or "" as the zero element, are also monoids. Monoids are commonly used as data structures to represent logs, for example.

Another algebraic structure is the Boolean Algebra. This is a type T with three operations - &, | and ~, with a rather larger set of properties:

a & (b & c) === (a & b) & c
a | (b | c) === (a | b) | c
~~a === a
a & (b | c) === (a & b) | (a & c)
a | (b & c) === (a | b) & (a | c)
a & a === a
a | a === a

A boolean algebra also has both a zero and one element, satisfying zero & _ === zero, zero | x === x, one & x === x and one | _ === one. There are many common boolean algebras as well - Boolean of course, but also fixed-length bitmaps (with operations interpreted bitwise), functions of type T => Boolean (here f & g = (x => f(x) && g(x)), etc).

There are quite a few more algebraic structures - monads provide another example. But I’m going to leave the trickier ones for another post.

Side note: Boolean Algebras are Monoids Too

One of the interesting facts about abstract algebra is that many of these structures interact with each other in interesting ways. For example, any boolean algebra also has two monoids built into it. The operations & and one satisfy the laws of a monoid:

a & (b & c) === (a & b) & c
a & one === a

Similarly, the operations | and zero also satisfy the monoid laws:

a | (b | c) === (a | b) | c
a | zero === a

Event predicates - take one

In our experimental (i.e., you can’t use it yet) event based targeting system, I wanted to create an easy way for users to trigger events. I.e., I want to be able to define a formula and evaluate whether it is true or false for some event. E.g.:

((EventType == 'pageview') & (url == ''))
   | ((EventType == 'custom') & (custom_event_name == 'pricing_popup_displayed'))

This can be represented in Scala pretty straightforwardly:

sealed trait EventPredicate {
  def matches(evt: Event): Boolean
  def &(other: EventPredicate) = And(this, other)
  def |(other: EventPRedicate) = Or(this, other)
case class EventType(kind: String) extends EventPredicate {
  def matches(evt: Event) = EventLenses.eventType.get(evt) == kind

We also need boolean operators:

case class And(a: EventPredicate, b: EventPredicate) extends EventPredicate {
  def matches(evt: Event) = a.matches(evt) && b.matches(evt)


Unfortunately, we have more than one type of predicate. We had quite a few requirements, in fact:

  1. We want to compile some predicates to Javascript so they can be evaluated browser side.
  2. We want to define compound predicates for the convenience of the user. E.g. GACampaign(utm_source, utm_campaign, ...) instead of URLParam("utm_source", "email") & URLParam("utm_campaign", "ilovepuppies") & ..., but we’d also like to avoid re-implementing in multiple places things like parsing URL params.
  3. We actually have multiple types of predicate - EventPredicate, UserPredicate, PagePredicate and we’d like to avoid duplicating code to handle simple boolean algebra stuff. We’d also like to avoid namespace collisions, so we’d need to do AndEvent, AndUser, etc.
  4. We also need to serialize these data structures to JSON, so it would be great if we could not duplicate code around things like serializing And___, Or___, etc.

The simple object-oriented approach described above doesn’t really satisfy all these requirements.

The Free Boolean Algebra

Ultimately, what I really want to do is the following. I want to define a set of objects, e.g.:

case class EventType(kind: String) extends EventSpec
case class URLMatch(url: String) extends EventSpec

Then I want to be able to build a boolean algebra out of them with some sort of simple type constructor. Given an object created with this type constructor, I then need to be able to make various algebra preserving transformations.

Luckily the field of abstract algebra provides a generic solution to this problem - the Free Object. A free object is a version of an algebraic structure which has no interpretation whatsoever - it’s a purely symbolic way of representing that algebra. But the important thing about the free object is that it gives interpretation almost for free.

More concretely, a Free Object is a Functor with a particular natural transformation. I.e., for any type T, there is a type FreeObj[T] with the following properties:

  1. For any object t of type T, there is a corresponding object t.point[FreeObj] having type FreeObj[T]. I.e., objects outside the functor can be lifted into it.
  2. Let X be another object having the same algebraic structure (e.g., X is any boolean algebra). Then for any function f: T => X, there is a natural transformation nat(f): FreeObj[T] => X with the properties that (a) nat(f)(t.point) = f(t) and (b) nat(f) preserves the structure of the underlying algebra.

Preserving the structure of the underlying algebra is important - this means that for a boolean algebra, nat(f)(x & y) === nat(f)(x) & nat(f)(y), nat(f)(x | y) === nat(f)(x) | nat(f)(y), etc.This property of preserving the structure is called homomorphism.

This bit of mathematics is, in programming terms, the API of our FreeObject. This API allows us to turn any type into a monoid/boolean algebra/etc, and it guarantees that no information whatsoever is lost by doing so.

How to use it

We’ve developed a library which includes Free Boolean Algebras called Old Monk. It’s named after the finest rum in the world. Old Monk also builds on top of Spire, which provides various abstract algebra type classes in Scala.

To create a simple Boolean algebra for events, we import the following:

implicit val freeBoolAlgebra = FreeBoolListAlgebra //There are multiple variants
type FreeBool = FreeBoolList
import spire.algebra._
import spire.implicits._

We then define our underlying type:

sealed trait EventSpec
case class CookieValue(key: String, value: String) extends EventSpec

Finally, we define the predicate type:

type EventPredicate = FreeBool[EventSpec]

Combining objects is now straightforward, and uses Spire syntax:

val pred = CookieValue("foo", "bar").point[FreeBool] | URLParam("_foo", "bar")
val pred2 = ...
val pred3 = ~pred & pred2

(There is an implicit in oldmonk which is smart enough to turn the URLParam object into an EventPredicate, but not smart enough to apply to the first one.)

That’s great - we’ve now got a boolean algebra. But how do we use it?

Evaluating a predicate

To evaluate a predicate, we need to use the nat operation. Recall that the type of nat is:

def nat[T,X](f: T => X): FreeBool[T] => X

So to use this, we simply need to define how our function f operates on individual EventSpec objects:

def evalEventSpec(evt: Event): (EventSpec => Boolean) = (e:EventSpec) =>
  e match {
    case CookieValue(k, v) => EventLenses.cookie(k).get(evt).getOrElse(false)
    case URLMatches(url) => EventLenses.url.get(evt) == url

Then by the magic of nat, we can evaluate predicates:

def evaluateEventPredicate(pred: EventPredicate, event: Event): Boolean =

The logic is that evalEventSpec(event) has type EventSpec => Boolean. The operation nat lifts this to a function mapping EventPredicate => Boolean, and then this function is applied to the actual predicate.

Due to the laws of the Free Boolean Algebra, we know that this method must evaluate things correctly. I.e., imagine we had a predicate a.point[FreeBool] | b.point[FreeBool].

By the second law of free Boolean algebras, the homomorphism property, we know that:

nat(f)(a.point[FreeBool] | b.point[FreeBool]) ===
  nat(f)(a.point[FreeBool]) | nat(f)(b.point[FreeBool])

By the first law, we know that:

nat(f)(a.point[FreeBool]) === f(a)
nat(f)(b.point[FreeBool]) === f(b)

Substituting this in yields:

nat(f)(a.point[FreeBool] | b.point[FreeBool]) ===
  nat(f)(a.point[FreeBool]) | nat(f)(b.point[FreeBool]) ===
  f(a) | f(b)

Thus, the nat function has faithfully created a way for us to evaluate our predicates.

Translating predicates

Consider one of our other requirements - we want to build convenience predicates for the user, but we don’t want to duplicate work to evaluate them.

To handle this case, we’d tweak the underlying definition of our predicates a bit:

sealed trait EventSpec

sealed trait PrimitiveEventSpec extends EventSpec
case class CookieValue(key: String, value: String) extends PrimitiveEventSpec

sealed trait CompoundEventSpec extends EventSpec
case class GACampaignMatches(source: String,campaign: String)

We’ll approach this problem in two ways. First, we’ll build a translation layer - a way to translate FreeBool[EventSpec] => FreeBool[PrimitiveEventSpec]. Then we’ll build the evaluation layer - a way to compute FreeBool[PrimitiveEventSpec] => Boolean. With this structure, we only need to define evaluation on the primitives.

The translation is actually very simple with nat. First we define a mapping from EventSpec => FreeBool[PrimitiveEventSpec], and then we use nat to lift this function to FreeBool:

def primitivizeSpec(es: EventSpec): FreeBool[PrimitiveEventSpec]) = (es: EventSpec) match {
  case (x: PrimitiveEventSpec) => x.point[FreeBool]
  case (c: CompountEventSpec) => c match {
    case GACampaignMatches(source, campaign) =>
      (URLParam("utm_source", source) : FreeBool[PrimitiveEventSpec]) & (URLParam("utm_campaign", campaign) : FreeBool[PrimitiveEventSpec])

val primitivize: FreeBool[EventSpec] => FreeBool[PrimitiveEventSpec] = nat(primitivizeSpec _)

Then we would define evaluation the same as above:

def evalPrimitiveEventSpec(evt: Event): (EventSpec => Boolean) = (e:EventSpec) =>
  e match {
    case CookieValue(k, v) => EventLenses.cookie(k).get(evt).getOrElse(false)
    case URLMatches(url) => EventLenses.url.get(evt) == url

def evaluatePrimitiveEventPredicate(pred: PrimitiveEventPredicate, event: Event): Boolean =

Finally we would define evaluation as:

def evaluateEventPredicate(pred: EventPredicate, event: Event): Boolean =

Partial Evaluation

Another cool trick this approach gives us is partial evaluation. Suppose we gain partial information about a predicate, but it’s incomplete. For instance, we know that evaluate(a) should be True but we don’t know what evaluate(b) should be.

Concretely, suppose we have a function:

def partialEvaluate(e: EventSpec): Option[Boolean] = ...

We can then partially evaluate our predicates:

def partiallyEvaluatePredicate(e: EventPredicate) =
  nat( (e:EventSpec) => {
    partialEvaluate(e).fold( e )(x => {
      if (x) { True } else { False }

Then, supposing we know a to be true but b is unknown, this will evaluate to:

partiallyEvaluatePredicate(a & b) ===
  partialEvaluate(a) & partialEvaluate(b) ===
  True & b ===

This is useful to us in a variety of cases. Often times we’ll have a predicate which combines information known server side, and other information which is only known in the browser. Partial evaluation lets us compute the server side information, substitute this result in, and have a resulting predicate which depends only on browser-side information.

The browser-side predicate can then be rendered to javascript and evaluated in the browser directly. This is pretty straightforward, in fact:

def evaluateServerSide(e: ServerSideEventSpec): Boolean = ...

def browserify(e: EventPredicate): BrowserSideEventPredicate =
  nat( (e:EventSpec) => e match {
      case (b:BrowserSideEventSpec) => b.point[FreeBool] : BrowserSideEventPredicate
      case (s:ServerSideEventSpec) => if (evaluateServerSide(s)) {
         TruePred : BrowserSideEventPredicate
       } else {
         FalsePred : BrowserSideEventPredicate


Free objects are a great way to build generalized interpreter patterns. Just as the FreeMonad (called simply Free in Scalaz) enables one to build generalized stateful computations, abstracting away the actual state, FreeBool allows us to build generalized predicates and manipulate them in a straightforward manner.

More generally, if you find yourself re-implementing the same algebraic structure over and over, it might be worth asking yourself if a free version of that algebraic structure exists. If so, you might save yourself a lot of work by using that.

Other Free Objects

One important free object is the FreeMonoid. It turns out that the functor List[_] is actually a Free Monoid. This can be shown by defining nat for a list:

def nat[A,B](f: A=>B)(implicit m: Monoid[B]): (List[A] => B) =
  (l: List[A]) =>,y) => m.append(x,y))

Essentially, the natural transformation consists of taking each element of the list, applying the function f to it, and then appending the elements.

A somewhat more interesting free algebra is the FreeGroup. A Group is a Monoid, but with an additional operation - inversion. Inversion - denoted by ~x - has the important property that for any x, (~x) |+| x = zero and x |+| (~x) = zero. I.e., appending two elements together can always be undone by appending a new element.

For an example of a group, consider the integers - x |+| y = x + y, and ~x = -x.

The type FreeGroup[A] then consists essentially of a List[A], with the caveat that a and ~a` cannot occur adjacent to each other in the list.

Similarly, a FreeMonad is a way of taking any Functor and getting an abstract monad out of it. This is implemented in scalaz, so the naming is a little different. Given an object x: Free[S,A] (for S[_] a Functor), x has the method foldMap[M[_]](f: S ~> M)(implicit M: Monad[M]): M[A]. This method implements the natural transformation. In the language we are using here, we could define nat as follows:

def nat[S,M,A](f: S[A] => M[A])(implicit m: Monad[M]): (Free[S,A] => M[A]) =
  (x:Free[S,A]) => x.foldMap(f)(m)

As an illustrated example of how free monads work, this article discusses how to represent a Forth-like DSL with the FreeMonad and then interpret it via a mapping from Free => State.

Free Monad. Free Monads are Simple Deriving the Free Monad

We have been using Elasticsearch for storing analytics data. This data stored in Elasticsearch is used in the Post Report Segmentation feature in VWO. So the amount of data getting stored in Elasticsearch is tied up to the number of campaigns currently being run by our customers. And often we need to have custom tooling to work with this data and the requirements of such tooling are also not common. This blog post is about how we solved some issues by building some missing blocks in the Official Elasticsearch Python client while working on this project.

The code base where implementation of this feature (Post Report Segmentation) lies is all written in Python. When we were starting out, we had to decide which client to use because there were many out there. Eliminating some was really easy because they were tied to certain frameworks like Tornado and Twisted. And we were not sure which path to take initially so we decided to keep things simple, avoid early optimization and not use any of these framework heavily dependent on Non-Blocking IO. If we needed any of that later, Gevent could be put to use (in fact that’s exactly what we did). Even for the simpler way there were quite a few options. The deciding factors for us were:

  1. Maintenance commitment from the author
  2. Un-opinionated
  3. Simple design

Considering all these factors, we decided to go with the Official Python Client for Elasticsearch. And we didn’t really come across any issues and problems according to our simple requirements. It is fairly extensible and comes with some standard batteries included with it. For everything else, you can extend it - thanks to its simple design.

It worked well for a while until we had to add some internal tooling where we needed to work a lot with Elasticsearch’s Scroll API and Bulk APIs.

Bulk API

Elasticsearch’s Bulk API lets you club together multiple individual API calls into one. This is used a lot in speeding up indexing and can be very useful if you are doing a lot of write operations in Elasticsearch.

The way you work with Bulk APIs is that you construct a different kind of request body for bulk requests and use the client for sending that request data. The HTTP API that Elasticsearch exposes for bulk operations is semantically different than the API for individual operations.

Consider this. If you were to index a new document, update an existing document and delete another existing document in Elasticsearch, you can do it like so:

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['localhost:9200'])
client.index(index='test_index_1', doc_type='test_doc_type',
client.update(index='test_index_3', doc_type='test_doc_type',
              id=456, body={
                  'script': 'ctx._source.count += count',
                  'params': {
                      'count': 1
client.delete(index='test_index_2', doc_type='test_doc_type',

If you were to achieve the same thing using Bulk APIs, you would end up writing code like this:

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['localhost:9200'])

bulk_body = ''

# index operation body
bulk_body += '{ "index" : { "_index" : "test_index_1", "_type" : "test_doc_type", "_id" : "1" } }\n'
bulk_body += '{ "key1": "val1" }\n'

# update operation body
bulk_body += '{ "update" : {"_id" : "456", "_index" : "test_index_3", "_type" : "test_doc_type"} }\n'
bulk_body += '{ "script": "ctx._source.count += count", "params": { "count": 1 } }\n'

# delete operation body
bulk_body += '{ "delete" : { "_index" : "test_index_2", "_type" : "test_doc_type", "_id" : "123" } }'

# finally, make the request

There is a ton of difference in how bulk operations work on the code and API level as compared to individual operations.

  1. The request body is considerably different in Bulk APIs as compared to their individual APIs.
  2. The responsibility of properly serializing request body is now shifted to the developer whereas this can be handled at the client level.
  3. Serialization format itself is a mixup of JSON and new-line character separated string.

If you are depending a lot on bulk operations, these problems will bite you when you start using it at a lot of places in your code. The flexibility of manipulating bulk request bodies at will lacks with the current support for Bulk APIs.

The official client as well does not really take care of this issue - not blaming because the author’s objective is to be as unopinionated as possible and this also gave us the chance to do it our way instead of adopt an existing implementation. We wanted to use Bulk API the same way we would use individual APIs. And why shouldn’t it be the same! They are essentially individual operations put together and executed on a different end-point.

Our solution for this was to provide a BulkClient which would allow you to start a bulk operation, execute bulk operations in a way that you would execute individual operations and then when you want to execute them together, it will make the required request body and use the Elasticsearch client to make the request. Exposing bulk operations in a way that semantically look the same as individual operations required us to implement APIs similar to individual APIs on a very high level in the BulkClient.

This is how the BulkClient works:

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['localhost:9200'])

bulk = BulkClient(client)
bulk.index(index='test_index_1', doc_type='test_doc_type',
bulk.delete(index='test_index_2', doc_type='test_doc_type',
bulk.update(index='test_index_3', doc_type='test_doc_type',
            id=456, body={
                'script': 'ctx._source.count += count',
                'params': {
                    'count': 1
resp = bulk.execute()

Scroll API

The next problem we faced was with Scroll API.

According to the documentation:

While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.

Scroll API is helpful if you want to work with a large number of documents - more like get them out of Elasticsearch.

The problem with Scroll API is that it requires you to do a lot of book keeping. You have to keep scroll_id after every iteration to get the next set of documents. Depending upon your application, there is probably no work around. However, our use-case was to get a large number of documents all together. You can do that without Scroll API as well i.e. by using the size parameter where you can tell Elasticsearch how many documents to return and you can ask it to return all documents by using the Count Search API first and then passing the size, but that will usually time out (or at least it did for us). So what we did was scroll Elasticsearch in a loop and do the book keeping in the code. And that was simple as well until we had to do it at multiple places

  • there was no uniform way to do that and a lot of code repetition was done as well.

Our solution to this problem was to create a separate wrapper API only for this purpose and use that everywhere in our project. So we wrote a simple function that would do the book-keeping for us and it could be used like so:

def scrolled_search(es, scroll, *args, **kwargs):
    Iterator for Elasticsearch Scroll API.

    :param es: Elasticsearch client object
    :param str scroll: scroll expiry time according to Elasticsearch Scroll API

    ... Note:: this function accepts `*args` and ``**kwargs`` and passes them
               as they are to :meth:`` method.


es = Elasticsearch(hosts=['localhost:9200'])
for docs in scrolled_search(es, '10m', index='tweets'):
    for doc in docs:
        print doc

Iterator based Scrolling in elasticsearch-py

We must highlight that the official client also added support for iterator based scrolling later in the official client as a helper. We had already started using our solution in our project and we find ours is slightly different than theirs. For more details, read the docs here.

SuperElasticsearch - elasticsearch-py with goodies!

Our solution to both the problems described earlier were based on the official Elasticsearch client. After having solved these two problems, we figured that instead of passing around the client object to our new API, it will be nicer if we can use the new APIs in a way that it feels a part of the client itself. So we went ahead and sub-classed the existing client class Elasticsearch to make it easier to use the new APIs. You can use the sub-classed client SuperElasticsearch like so:

from superelasticsearch import SuperElasticsearch

client = SuperElasticsearch(hosts=['localhost:9200'])

# Example of using Scrolled Search
for doc in client.itersearch(index='test_index', doc_type'tweets',
    # do something with doc here
    print doc

# Example of using Bulk Operations
bulk = client.bulk_operation()
bulk.index(index='test_index_1', doc_type='test_doc_type',
bulk.delete(index='test_index_2', doc_type='test_doc_type',
bulk.update(index='test_index_3', doc_type='test_doc_type',
            id=456, body={
                'script': 'ctx._source.count += count',
                'params': {
                    'count': 1
resp = bulk.execute()

This has also made it easy for us to do releases of SuperElasticsearch. SuperElasticsearch does not depend on the official client in ways that it will break compatibility with new releases of the official client, or if it will then we can make the adjustments and come up with a new release. Basically it has been written in a way to work with new versions of the official client with minimum friction. If a new release of the official client comes out, then you should be able to upgrade to the new Elasticsearch client without upgrading SuperElasticsearch. This way we can try to keep developing SuperElasticsearch at its own pace and release only when we have new features to release or when it breaks compatibility. It also makes it easier for you to use the new APIs because you get all of them with the client object itself.

SuperElasticsearch is available on Github.