I have been working with Apache Kafka for more than 4 years now and have seen it evolve from a basic distributed commit log service (something very similar to a transaction log or operation log) into a full-fledged data-pipelining tool and the backbone of data collection platforms. For those who don’t know about Kafka: it was developed by LinkedIn and open sourced in early 2011. It is a distributed pub-sub messaging system designed to be fast, scalable and durable. Like other pub-sub messaging systems, Kafka maintains streams of messages in topics. Producers write data to topics, while consumers read from topics to store the data or to extract some meaningful information that might be required at a later stage. Since Kafka is a distributed system, topics are partitioned and replicated across multiple nodes. Kafka lets you store streams of messages in a fault-tolerant way and allows processing these streams in near real time.

Apache Kafka has gone through various design changes since its inception. Kafka 0.9 came out with a new consumer API, which removed the consumer's dependency on Apache ZooKeeper. ZooKeeper is now only used to manage the metadata of topics created in Kafka; it also runs the leader election algorithm in a fair and consistent manner when a Kafka node goes down or a rebalance is triggered by the addition of new nodes. For versions before 0.9, ZooKeeper was also used for managing the offsets of consumer groups. Offset management is the mechanism that tracks how many records have been consumed from a partition of a topic by a particular consumer group. Kafka 0.10 came out with out-of-the-box support for stream processing. This streaming platform enables capturing a flow of events and the changes caused by those events, and storing them in other data systems such as an RDBMS, key-value stores, or a warehouse, depending on the use case. I took it for a run by doing some counting aggregations. The aggregation was fast and I had to write barely 50 lines for it; I was very happy and impressed with the results. I streamed around 2 million events in about a minute on my laptop with just a couple of instances. But I never got a chance to use it in production for a year or so.

Around 3 months back, when our team started stress testing the backend stores by generating a lot of data, the stores started to give up due to the high number of insertions and updates. We didn’t have the option to add more hardware as we were already using a lot of resources, and we wanted a solution that fit our current bill. Our data team had a lot of discussions, and I heard people talk about tools like Apache Samza, Apache Spark and Apache Flink. Because we have a small team, adding another component to the technology stack was not a good idea, and I didn’t want the team to spend time learning these technologies with a product release around the corner. Since our data pipeline is built around Kafka, I started playing around with the data. The idea was to convert multiple updates to the backend stores into a single update/insert so that the number of hits our DB takes is reduced. Since we process a lot of data, we thought about windowing our events based on time and aggregating them. I started to work on it and in a matter of hours my streaming application was ready. We started with a 1-minute window and were surprised with the result: we were able to reduce DB hits by 70%. YES 70 PERCENT!!!!!!

Here are the screenshots from one of our servers that show the impact of window aggregation.

Before Aggregation
After Aggregation

With streaming capabilities built into it, Apache Kafka has become one of the most powerful tools for storing and aggregating data at insane speed, and we’ll see its adoption grow in the coming years.

Let’s see how Kafka Streams works

Kafka Streams performs stream processing and hence requires some sort of internal state management. This internal state is managed in state stores, which use RocksDB. A state store can be lost on failure or restored fault-tolerantly after a failure. The default implementation used by the Kafka Streams DSL is a fault-tolerant state store using:

  • An internally created and compacted changelog topic (for fault-tolerance)
  • One (or multiple) RocksDB instances (for cached key-value lookups)

Thus, in case of starting/stopping applications and rewinding/reprocessing, this internal data needs to be managed correctly (a minimal configuration sketch is given below).
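
The application id configured for a Kafka Streams app is what prefixes both the internal changelog topics (named application.id-storeName-changelog) and the local directory that holds the RocksDB files. As a minimal sketch (the property values here are placeholders, not our actual setup):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// The application id also prefixes internal changelog topics and the consumer group name
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-aggregator");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Local directory that holds the RocksDB state stores (defaults to /tmp/kafka-streams)
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");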

KStream and KTable

KStream is an abstraction of a record stream of key-value pairs. So if you have a click stream coming in and you are trying to aggregate session-level information, the session id will be the key and the rest of the information the value. Similarly, for URL-level aggregation, a combination of URL and session will be the key.
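
For example, re-keying a raw click stream by session id might look like the sketch below (ClickStream and getSessionId come from the sample project shown later; the topic name "clicks" is a placeholder):

// Re-key the raw stream by session id so that all events of a session
// end up on the same partition and can be aggregated together.
KStream<String, ClickStream> clicks = kStreamBuilder.stream("clicks");
KStream<String, ClickStream> clicksBySession =
        clicks.map((key, click) -> new KeyValue<>(click.getSessionId(), click));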

KTable is an abstraction of a changelog stream from a primary-keyed table. Each record in this stream is an update on the primary-keyed table, with the record key as the primary key. The aggregation results are stored in a KTable. Intermediate aggregation uses a RocksDB instance as a key-value state store that also persists to local disk. Flushing to disk happens asynchronously to keep it fast and non-blocking. An internal compacted changelog topic is also created. The state store sends changes to the changelog topic in batches, either when a default batch size is reached or when the commit interval is reached. A pictorial representation of what happens under the hood is given below.

Kafka Streams Internal Functioning
*Image is taken from Apache Kafka documentation

Kafka Streams commits the current processing progress at regular intervals. When a commit is triggered, all state stores flush their data to disk and all internal topics are flushed to Kafka. Finally, all current topic offsets are committed to Kafka. In case of failure and restart, the application can resume processing from its last commit point.
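
The commit interval itself is configurable on the same properties object (a sketch; by default Kafka Streams commits every 30 seconds):

// Commit progress (flush state stores and changelogs, commit offsets) every 10 seconds
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);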

Let’s understand this with the help of an example.

Imagine a stream of such events coming to the servers of a very high-traffic website. Let’s assume there is a big web gaming platform where 50K-80K concurrent users generate about 80K-120K events per second, and there is a requirement to find the following things:

  • The number of clicks a user has made in a session
  • The total pages a user has viewed in a session
  • The total amount of time a user has spent in a session

Let the JSON structure be as follows:

{
  "uuid":"user id",
  "session_id": "some uuid",
  "event": "click/page_view",
  "time_spent":14
}
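
In the streaming application this JSON is deserialized into a small value object, which the snippets later in this post refer to as ClickStream. A minimal sketch (the getter names are assumptions, not the sample project's exact code):

public class ClickStream {

    private String uuid;        // user id
    private String sessionId;   // session_id in the JSON
    private String event;       // "click" or "page_view"
    private long timeSpent;     // time_spent in the JSON

    public String getSessionId() { return sessionId; }
    public String getEvent()     { return event; }
    public long getTimeSpent()   { return timeSpent; }
    // Constructor and setters omitted for brevity
}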

Ingesting events at the above-mentioned pace into a DB, or even just ensuring that these events get stored in the DB, is in itself a challenge, and a lot of hardware would be required to cope with this traffic as it is. Hence, it doesn’t make sense to store the data directly in the DB. A streaming application is a very good fit here. It can leverage the fact that, for most users, clicks and page views are concentrated in a time window: in 5 minutes a user might produce x clicks and y page views on average. We can introduce a 5-minute window and club these requests to form a single equivalent DB request, reducing (x+y) hits to 1 hit per window, i.e. to 1/(x+y) of the earlier traffic. For example, if a user averages 10 clicks and 5 page views in a window, those 15 writes collapse into a single upsert.

I have written a Sample Kafka Streams Project to make it easier for you to understand. Let’s take a look at the sequence diagram below, which shows how the various components of the sample project interact with each other.

Kafka Streams Sequence Diagram

All of this flow is defined with the help of the Kafka Streams DSL; the code snippet is given below:

//Defining Source Streams from multiple topics.
KStream<String, ClickStream> clickStream = kStreamBuilder.stream(stringSerde, clickStreamSerde,
     Main.TOPIC_PROPERTIES.getProperty("topic.click.input").split(","));

//Kafka Streams DSL in action with filtering and cleaning logic and passing it through aggregation collector
clickStream
     .filter((k,v) -> (v!=null))
     .map((k, v) ->
           new KeyValue<>(v.getSessionId(),v))
     .through(stringSerde, clickStreamSerde, Main.TOPIC_PROPERTIES.getProperty("topic.click.output"))
     .groupBy((k, v) -> k, stringSerde, clickStreamSerde)
     .aggregate(ClickStreamCollector::new, (k, v, clickStreamCollector) -> clickStreamCollector.add(v),
           TimeWindows.of(1 * 60 * 1000), collectorSerde,
           Main.TOPIC_PROPERTIES.getProperty("topic.click.aggregation"))
     .to(windowedSerde, collectorSerde, new ClickStreamPartitioner(), Main.TOPIC_PROPERTIES.getProperty("topic.click.summary"));
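
The aggregation logic itself lives in the ClickStreamCollector referenced above (via ClickStreamCollector::new and add). A minimal sketch of what such a collector might look like for the three metrics we care about, following the ClickStream sketch shown earlier (the field names are assumptions, not the sample project's actual code):

public class ClickStreamCollector {

    private long clicks;
    private long pageViews;
    private long timeSpent;

    // Called by the aggregate() step for every event that falls into the window
    public ClickStreamCollector add(ClickStream event) {
        if ("click".equals(event.getEvent())) {
            clicks++;
        } else if ("page_view".equals(event.getEvent())) {
            pageViews++;
        }
        timeSpent += event.getTimeSpent();
        return this;
    }

    // Getters omitted for brevity
}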

It’s worth noting that for each step we need to define a serializer and deserializer; a small example of defining such serdes follows the list below. In the above code snippet:

  • stringSerde: defines the serialization and deserialization for Strings
  • clickStreamSerde: defines the serialization and deserialization for the raw click data
  • collectorSerde: defines the serialization and deserialization for the RocksDB intermediate storage
  • windowedSerde: defines the serialization and deserialization for the windowed aggregation output
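
Built-in serdes come from the Serdes factory, and custom ones can be assembled from a serializer/deserializer pair. A sketch (ClickStreamSerializer and ClickStreamDeserializer are hypothetical class names, not part of the sample project):

// Built-in serde for String keys
Serde<String> stringSerde = Serdes.String();
// Custom serde built from a Serializer/Deserializer implementation pair
Serde<ClickStream> clickStreamSerde =
        Serdes.serdeFrom(new ClickStreamSerializer(), new ClickStreamDeserializer());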

It’s very easy to implement streams over Kafka, and they can be leveraged to reduce DB traffic and for other applications where windowing or sessionization makes sense. You can play around with this project, and in case you want to reach out to me or have any doubts, please drop your queries in the comments section.

Happy Streaming..!


For the past couple of years, we have been using require.js for module loading and Grunt for automating front-end tasks for one of the many projects we have at Wingify. The project has a huge code-base and many independent components inside it with some shared utilities. Also, there was no concrete build system that could scale as new components were added.

Require.js was being used for good code structuring and for managing modules and their loading. Each module had its own require-config.js file to define the rules for that particular module.

Grunt was being used for automating the different tasks required to speed up mundane work. We had a number of tasks like the require-amdclean task, concatenating different script/CSS files, minifying files, a cache-busting mechanism and so on.

Following are some benefits we were getting from the require-amdclean task:

  • We didn’t have to include require.js in production, thus saving some bytes.
  • Generation of a single JS file entirely in vanilla JavaScript.
  • It got rid of file-size and source-code readability concerns.
  • It was a great fit for a standalone JavaScript library, which is exactly our case.

Everything was working as expected, but maintenance, performance, and scale were the issues. We had many healthy discussions about improving things and thus thought of upgrading our tech stack too. Also, as I mentioned, we didn’t have a concrete build system; it was the right time to investigate further. We were ready to spend some quality time researching technologies which could fit into our build system. Gaurav Nanda and I took a break from our daily chores and read many articles/blogs and the not-so-useful official docs to get a good command over various technologies. Migrating from Grunt to Gulp wasn’t helping us since the build time was nearly the same. The task which took a lot of time was the require-amdclean task, taking around 10 seconds even for adding just a single character like ; while working in the development environment.

Migrating from NPM to Yarn - First step towards a new journey

After reading about Yarn, the team was really curious to play with this then-new package manager, a.k.a. dependency manager. When we benchmarked the results, we were literally stunned by the time difference between NPM and Yarn in fetching resources. Yarn achieves this speed by introducing parallelism, and gets its consistency and security from the yarn.lock file it maintains.

For a total of 34 packages, the following stats would please your eyes too :)

[email protected] [email protected]

Stats when we did a Fresh Install

Package manager                    Time taken
npm                                3 minutes 12 seconds
yarn (without yarn.lock file)      1 minute 33 seconds
yarn (with yarn.lock file)         16 seconds

Running the commands with already installed packages

Package manager                    Time taken
npm                                7 seconds
yarn (with yarn.lock file)         6 seconds

Yarn offers a lot more besides speed, security, and reliability. Check out the commands Yarn offers.

Since we were using bower too, our first step was to port all the dependencies and dev-dependencies listed in our bower.json file to package.json. This was a time-consuming task since we had a huge list of packages. After successfully porting the packages and validating their version numbers against the previous ones, we were all set to switch to Yarn. This also helped in keeping just one file for managing packages; we are no longer using bower. Even bower’s official site recommends using Yarn and Webpack :)

Why switch to Webpack 2

It wasn’t an easy task to accomplish since Webpack is a module bundler rather than a task runner. We were so accustomed to using task runners along with the old-fashioned require.js based module management that it took a good amount of time figuring out how to proceed with our mini-app’s new build system.

Apart from the numerous benefits of using Webpack, the most notable features, especially for our codebase and the build system, were:

  1. Easy integration with npm/yarn and seamless handling of multiple module formats. We now use two of them: UMD and the this target option (we have such a requirement).
  2. Single main entry and one single bundled output - exactly what we needed.
  3. Cache busting (hashing) - very easy to implement and benefit from.
  4. Building different, independent, and standalone modules simultaneously. Thanks to parallel-webpack!
  5. Using webpack-loaders -
    • babel-loader - so that we could start writing ES6 compatible code even with our require.js module management system.
    • eslint-loader - which allows identifying and reporting on patterns found in ECMAScript/JavaScript code
    • css-loader - for bundling CSS

Converting to Webpack 2 - A transcendent journey ahead

In the beginning, it looked like we just had to port the require.js configuration to Webpack and we’d be done. A big NO! That thought was absolutely wrong. There were many scenarios we had to deal with, which we will discuss in detail as we move along.

First things first: a clear understanding of what exactly Webpack is and how it bundles modules is a must. Simply copy-pasting the configuration file from the official website and tweaking it won’t help in the long run. One must be very clear about the fundamentals Webpack is built upon.

Problems which we needed to tackle were:

  1. Different modules in the same app, having different configuration files.
  2. The Webpack config should be modular in itself and able to run multiple configs at once, so that we can add/remove a module easily without affecting any existing one.

Installing Webpack

Via Yarn (recommended)

yarn add --dev webpack

Via NPM

npm install webpack --save-dev

Configuration -

A basic configuration file looks like:

// Filename: webpack.config.js

const path = require('path');
const webpack = require('webpack');
module.exports = {
  context: path.resolve(__dirname, 'src'),
  entry: {
    app: './app.js',
  },
  output: {
    path: path.resolve(__dirname, 'dist'),
    filename: '[name].bundle.js',
  },
};

Check this to know the role of each key.

Since we needed to support different modules, we had to have a different config file for each of our modules.

// Filename webpack.config.js

/**
 * Method to return a desired config with the necessary options
 * @param  {Object} options
 * @return {Object} - Desired config Object as per webpack 2 docs
 */
function executeWebpackConfig(options) {
  return {
    devtool: options.devtool === '' ? options.devtool : 'source-map',
    entry: options.entry,
    output: options.output,
    module: options.module,
    resolve: options.resolve,
    plugins: options.plugins || []
  };
}

// Add/remove different modules' corresponding config files
let multipleConfigs = [
  // For building single bundled JS file
  require('./build/module-A/webpack.main'),
  // Corresponding bundled CSS file
  require('./build/module-A/webpack.main.assets'),

  require('./build/module-B/webpack.main'),
  require('./build/module-B/webpack.main.assets'),

  require('./build/module-C/webpack.main'),

  require('./build/module-D/webpack.main'),
  require('./build/module-D/webpack-main.assets')
];

// Pass each config through the helper above (map returns a new array, so assign it back)
multipleConfigs = multipleConfigs.map((config) => {
  return executeWebpackConfig(config);
});

module.exports = multipleConfigs;

The above configuration is capable of handling any number of modules. Every module will have at least one bundled JS file as its output, but we also needed a bundled CSS file corresponding to each module. So, we decided to have two different config files for every module which has both JS and CSS bundling: one for bundling JS and the other for managing assets and bundling CSS files. Tasks like copying files from src to dist and updating the JS file name with a cache-busting hash (prod build) in the index.html file were taken care of inside the assets config file.

The above-mentioned break-down of a module into JS and CSS bundling helped us keep a clean, modular, and scalable approach for our new build system. We also used parallel-webpack to speed up our build by running independent modules in parallel. But be very careful using it, since it spawns a new worker for each task, which basically uses the different cores of a machine. Also, there should be a cap on the number of parallel tasks to prevent CPU usage from overshooting.

Extraction of common stuff for reusability and maintainability

Before advancing further with the creation of common webpack-configuration helper methods, let’s discuss Webpack module rules and resolve aliases, which play a significant role.

module rules - tell Webpack how a module should be read and which loaders should process it, while resolve aliases let us import or require certain modules more easily.

We used expose-loader and imports-loader depending on the use-case.

expose-loader - adds modules to the global object. This is useful for debugging or supporting libraries that depend on libraries in globals.

imports-loader - is useful for third-party modules that rely on global variables like $ or this being the window object. The imports loader can add the necessary require('whatever') calls, so those modules work with Webpack.

Obviously, we had the same third-party libraries, wrappers over external libraries, and self-baked utilities shared across different modules. This means that our module-specific webpack config files would have the same set of repeated rules and aliases. Code duplication might seem a good fit here for readability, but it is really painful to maintain in the long run.

Let’s discuss how we managed to share the common module rules and resolve aliases across the different modules.

Below is the code of a generic utility file which has two methods: one returns whether a passed argument is an object and the other returns whether it’s an array.

// Filename: GenericUtils.js

module.exports = {
    isObject: function (obj) {
        return Object.prototype.toString.call(obj) === '[object Object]';
    },
    isArray: function (arr) {
        return Object.prototype.toString.call(arr) === '[object Array]';
    }
};

Here’s a list of common rules and aliases defined explicitly in a separate file.

// Filename: webpack.common-module-rules-and-alias.js

const path = require('path');
let basePath = path.join(__dirname, '/../');

module.exports = {
    alias: {
        // Common third-party libraries being used in different modules
        'pubSub': basePath + 'node_modules/pubsub/dist/ba-tiny-pubsub.min',
        'select2': basePath + 'node_modules/select2/dist/js/select2.full.min',
        'acrossTabs': basePath + 'node_modules/across-tabs/dist/across-tabs.this',
        // ....more

        // Common self-baked utilities
        'utils': 'lib/player/utils',
        'storage': 'lib/player/storage',
        // ....more

        // Common services
        'auth': 'lib/Auth',
        'gaUtils': 'lib/GAUtils',
        'DOMUtils': 'lib/DOMUtils',
        'arrayUtils': 'lib/ArrayUtils',
        // ....more

        // Common constants
        'AnalyticsEventEnum': 'lib/constants/AnalyticsEventEnum',
        'MapTypeEnum': 'lib/constants/MapTypeEnum',
        'segmentAnalyticsUtils': 'lib/analytics/SegmentAnalyticsUtils',
        // ....more
    },

    rules: [
        { test: /jQuery/, loader: 'expose-loader?$' },
        { test: /pubSub/, loader: 'expose-loader?pubSub!imports-loader?jQuery' },
        { test: /select2/, loader: 'expose-loader?select2!imports-loader?jQuery' },
        { test: /acrossTabs/, loader: 'expose-loader?AcrossTabs' },
        // ....more

        { test: /utils/, loader: 'expose-loader?utils' },
        { test: /storage/, loader: 'expose-loader?storage' },
        // ....more

        { test: /auth/, loader: 'expose-loader?auth' },
        { test: /gaUtils/, loader: 'expose-loader?gaUtils' },
        { test: /DOMUtils/, loader: 'expose-loader?DOMUtils' },
        { test: /arrayUtils/, loader: 'expose-loader?arrayUtils' },
        // ....more

        { test: /AnalyticsEventEnum/, loader: 'expose-loader?AnalyticsEventEnum' },
        { test: /MapTypeEnum/, loader: 'expose-loader?MapTypeEnum' },
        { test: /segmentAnalyticsUtils/, loader: 'expose-loader?segmentAnalyticsUtils' },
        // ....more
    ]
};

We now had a common file where we could easily add/update/remove any rule and its corresponding alias. Next, we needed a utility which combines the common rules and aliases with the rules and aliases already defined in a particular module's config file.

// Filename: rulesAndAliasUtil.js

const moduleRulesAndAlias = require('./webpack.common-module-rules-and-alias');
const genericUtil = require('./genericUtil');

module.exports = {
    mergeRulesAndUpdate: function(testRules, config) {
        if (testRules && config && config.module && config.module.rules &&
            genericUtil.isObject(config) &&
            genericUtil.isArray(testRules)
        ) {
            // concat() returns a new array, so assign the result back
            testRules = testRules.concat(moduleRulesAndAlias.rules);
            for (let i = 0; i < testRules.length; i++) {
              config.module.rules.push(testRules[i]);
            }

            return config;
        }
        return config;
    },
    mergeAliasAndUpdate: function (aliases, config) {
        if (aliases && config && config.resolve &&
            genericUtil.isObject(aliases) && genericUtil.isObject(config)
        ) {
            let allAliases = Object.assign(aliases, moduleRulesAndAlias.alias);

            config.resolve.alias = allAliases;
            return config;
        }

        return config;
    }
};

Time to write our module-specific config files. We’ll demonstrate just one config file, i.e. for moduleA; the others look exactly the same except for the option values of each module.

Here’s the full webpack config file for moduleA.

// Filename: webpack.moduleA.js

const path = require('path');
const webpack = require('webpack');
const env = require('./../webpack.env').env; // Just to get the env(dev/prod), discussed in detail later

const rulesAndAliasUtil = require('./utils/rulesAndAliasUtil');

let basePath = path.join(__dirname, '/../');
let config = {
  // Entry, file to be bundled
  entry: {
    'moduleA': basePath + 'src/path/to/moduleA-entry.js',
  },
  devtool: env === 'build' ? 'source-map' : false,
  output: {
    // Output directory
    path: basePath + 'dist/moduleA',
    library: '[name]',
    // [hash:6] will add a 6-character content hash to the file name if the env is build
    filename: env === 'build' ? '[name]-[hash:6].min.js' : '[name].min.js',
    libraryTarget: 'umd',
    umdNamedDefine: true
  },
  module: {
    rules: []
  },
  resolve: {
    alias: {},
    modules: [
      // Files path which will be referenced while bundling
      basePath + 'src',
      basePath + 'node_modules',
    ],
    extensions: ['.js'] // File types
  },
  plugins: []
};

// Following the requirejs format - define how they will be exposed (via expose-loader or exports-loader) and their dependencies (via imports-loader)
let testRules = [
  { test: /jQuery/, loader: 'expose-loader?$' },
  { test: /base64/, loader: 'exports-loader?Base64' },
  { test: /ModuleSpecificEnum/, loader: 'expose-loader?ModuleSpecificEnum' }
];

// Following the requirejs format - define the paths of the libs/constants/vendor files specific to this moduleA only
let moduleAlias = {
  'jQuery': 'moduleA/vendor/jquery-3.1.0',
  'base64': 'moduleA/vendor/base64',
  'ModuleSpecificEnum': 'moduleA/constants/ModuleSpecificEnum'
}

config = rulesAndAliasUtil.mergeRulesAndUpdate(testRules, config);
config = rulesAndAliasUtil.mergeAliasAndUpdate(moduleAlias, config);

module.exports = config;

This is the complete webpack config file for bundling the JS of moduleA. While configuring it, we defined different options, each with its own purpose. To know more about each option, please refer to this.

Webpack loaders

Webpack enables the use of loaders to preprocess files. This allows us to bundle any static resource way beyond JavaScript.

We introduced two loaders for bundling JS resources inside our app.

  1. babel-loader - This package allows transpiling JavaScript files using Babel and Webpack. Thanks to babel-loader, we can fearlessly write ES6 code and update our mundane code.
  2. eslint-loader - This package allows identifying and reporting on patterns found in ECMAScript/JavaScript code.

Since we needed these two loaders for all our modules, we defined them in the same file we discussed earlier - rulesAndAliasUtil.js

// Filename: rulesAndAliasUtil.js

let defaultLoaders = [{
  enforce: 'pre', // to check source files, not modified by other loaders (like babel-loader)
  test: /(\.js)$/,
  exclude: /(node_modules|moduleA\/vendor|moduleB\/lib\/lodash-template.min.js)/,
  use: {
    loader: 'eslint-loader',
    options: {
      emitError: true,
      emitWarning: true,
      failOnWarning: true, // will not allow webpack to build if eslint warns
      failOnError: true // will not allow webpack to build if eslint fails
    }
  }
}, {
  test: /(\.js)$/,
  exclude: /(node_modules)/,
  use: {
    // babel-loader to convert ES6 code to ES5
    loader: 'babel-loader',
    options: {
      presets: ['env'],
      plugins: []
    }
  }
}];

And we updated the mergeRulesAndUpdate method as follows:

mergeRulesAndUpdate: function(testRules, config) {
    if (testRules && config && config.module && config.module.rules &&
        genericUtil.isObject(config) &&
        genericUtil.isArray(testRules)
    ) {
        // concat() returns a new array, so assign the result back
        testRules = testRules.concat(moduleRulesAndAlias.rules);
        for (let i = 0; i < testRules.length; i++) {
          config.module.rules.push(testRules[i]);
        }

        // Default babel-loader and eslint-loader for all js-modules
        config.module.rules = config.module.rules.concat(defaultLoaders);

        return config;
    }
    return config;
}

This was all about the bundling of JS modules; the same approach was followed for the other modules. Now we were left with the bundling of our CSS files and the obvious chores like copying, replacing, etc.

Webpack Bundling of CSS files

// Filename: webpack.moduleA.assets.js

const fs = require('fs');
const path = require('path');
const glob = require('glob-all');
const env = require('./../webpack.env').env;
const EnvEnum = require('./../constants/Enums').EnvEnum;

// To remove unused css
const PurifyCSSPlugin = require('purifycss-webpack');
// Copy Assests to dist
const CopyWebpackPlugin = require('copy-webpack-plugin');
// Generates a JSON manifest so that the appended hash can later be read by other files; e.g. one CSS file is used by multiple modules, so its hash needs to be stored somewhere and then replaced in the corresponding index.html files
const ManifestPlugin = require('webpack-manifest-plugin');
const CleanWebpackPlugin = require('clean-webpack-plugin');
// For combining multiple css files
const ExtractTextPlugin = require('extract-text-webpack-plugin')
// Minify css files for env=build
const OptimizeCssAssetsPlugin = require('optimize-css-assets-webpack-plugin');

// Replace filename if env=build since a hash is appended for cache busting
const replacePlugin = require('./../utils/webpack.custom-string-replace.plugin');

let buildPlugins = [];
let basePath = path.join(__dirname, '/../');

if (env === 'build') {
  // minify css files if env is build i.e. production
  buildPlugins.push(new OptimizeCssAssetsPlugin({
    cssProcessorOptions: {
      safe: true
    }
  }));
}

module.exports = {
  // Entry, files to be bundled separately
  entry: {
    'css-file-1': [
      basePath + 'src/styles/canvas/common.css',
      basePath + 'src/styles/canvas/mobile.css',
      basePath + 'src/styles/canvas/main.css'
    ],
    'css-file-2': [
      basePath + 'src/styles/app.css',
      basePath + 'src/styles/player/player.css',
      basePath + 'src/styles/mobile.css',
      basePath + 'node_modules/select2/dist/css/select2.min.css'
    ]
  },
  devtool: '',
  output: {
    // Output directory
    path: basePath + 'dist/styles/',
    // [hash:6] will add a 6-character content hash to the file name if the env is build
    filename: env === 'build' ? '[name]-[hash:6].min.css' : '[name].min.css'
  },
  // Rules for bundling
  module: {
    rules: [{
      test: /\.css$/i,
      use: ExtractTextPlugin.extract({
        use: [{
          loader: 'css-loader',
          options: {
            // css-loader (used via ExtractTextPlugin) tries to resolve urls like in background-image: url(); we need this option to stop that behavior
            url: false
          }
        }]
      })
    }]
  },
  resolve: {
    alias: {},
    modules: [],
    extensions: ['.css'] // only for css file
  },
  plugins: [
    // Cleaning this module's styles folder only, keeping other modules' dist intact
    new CleanWebpackPlugin([basePath + 'dist/styles'], {
      root: basePath
    }),
    // File generated so that the hash can be read later on
    new ManifestPlugin({
      fileName: 'manifest.json'
    }),
    // Copy css/images file(s) to dist
    new CopyWebpackPlugin([{
      from: basePath + 'src/images',
      to: basePath + 'dist/images/'
    }]),
    // Bundling of entry files
    new ExtractTextPlugin(env === 'build' ? '[name]-[hash:6].min.css' : '[name].min.css'),
    // To remove unused CSS by looking in corresponding html files
    new PurifyCSSPlugin({
      // Give paths to parse for rules. These should be absolute!
      paths: glob.sync([
        path.join(basePath, 'src/moduleA/*.html'),
        path.join(basePath, 'src/moduleA/canBeAnyFile.js'),
        path.join(basePath, 'src/moduleB/*.html'),
        path.join(basePath, 'src/moduleC/*.js')
      ]),
      purifyOptions: {
        whitelist: [ '*select2-*' ] // If classes are added on run-time, then based on the pattern, we can whitelist them, to be always included in our final bundled CSS file
      }
    })
  ].concat(buildPlugins)
};

The above configuration outputs two bundled CSS files, i.e. css-file-1.min.css & css-file-2.min.css, or css-file-1-8fb1ed.min.css & css-file-2-6ed3c1.min.css if it’s a prod build.

We are using ExtractTextPlugin, which extracts text from a bundle, or bundles, into a separate file, along with css-loader.

We faced a very weird issue which is worth mentioning here explicitly: the css-loader used with ExtractTextPlugin tries to resolve URLs, like the ones in background-image: url(), and we needed to stop that behavior by setting url: false inside its options like:

options: {
     url: false
}

A few more plugins that we are using are:

  1. CleanWebpackPlugin - to remove/clean the styles folder inside the build folder before building.

  2. ManifestPlugin - for generating an asset manifest file with a mapping of all source file names to their corresponding output files. This plugin generates a JSON file so that the hash appended in a prod build can later be read by other files; e.g. a CSS file shared among different modules gets an entry mapping css-file-1.min.css to css-file-1-8fb1ed.min.css, which the other modules read to update the hash in their corresponding index.html files.

  3. CopyWebpackPlugin - to copy individual files or entire directories to the build directory

  4. PurifyCSSPlugin - to remove unused selectors from the CSS. This plugin was a must for us. Earlier in this project we used to copy-paste the parent project's CSS file into this independent project. We followed the same approach because of time constraints, but found this amazing plugin which automatically removes the unused CSS from the bundled CSS files based on the paths of the files which use it. We can even whitelist selectors if classes are appended at run-time or for any other reason. It is highly recommended to use the PurifyCSS plugin with the Extract Text plugin which we discussed above.

  5. OptimizeCssAssetsPlugin - to optimize/minimize CSS assets

This was all about the bundling of CSS files.

Last step - Automated scripts and the provision to execute module-specific builds

First, we created a file to read the command-line arguments passed via package.json scripts so that they could be used in our webpack.config.js file.

// Filename: webpack.env.js

// Webpack doesn't pass the Webpack env object when using multiple configs, so we read the process arguments ourselves
let argv = process.argv || [],
  // Loop over process arguments and check for --env.mode
  envArgv = argv.filter(function (arg) {
    return arg.indexOf('--env.mode') > -1;
  }),
  targetModuleArgv = argv.filter(function (arg) {
    return arg.indexOf('--env.module') > -1;
  }),
  env, targetModules = '';

// If a match is found, split so that the exact value can be extracted, e.g. 'build'/'local'
if (envArgv && envArgv.length) {
  env = envArgv[0].split('=')[1];
}

if (targetModuleArgv && targetModuleArgv.length) {
  targetModules = targetModuleArgv[0].split('=')[1];
}

module.exports = {
  env,
  targetModules
};

We tweaked our main webpack.config.js to make it module-aware.

// Filename: webpack.config.js

const targetModules = require('./build/webpack.env').targetModules;

function executeWebpackConfig(options) {
  return {
    devtool: options.devtool === '' ? options.devtool : 'source-map',
    entry: options.entry,
    output: options.output,
    module: options.module,
    resolve: options.resolve,
    plugins: options.plugins || []
  };
}

// Module specific configuration files
let multipleConfigs = [];

if (targetModules) {
  let modules = targetModules.split(',');

  for (var i = 0; i < modules.length; i++) {
    if (modules[i] === 'moduleA') {
      multipleConfigs.push(require('./build/moduleA-tasks/webpack.moduleA'));
      multipleConfigs.push(require('./build/moduleA-tasks/webpack.moduleA.assets'));
    }
    if (modules[i] === 'moduleB') {
      multipleConfigs.push(require('./build/moduleB-tasks/webpack.moduleB'));
      multipleConfigs.push(require('./build/moduleB-tasks/webpack.moduleB.assets'));
    }
    if (modules[i] === 'moduleC') {
      multipleConfigs.push(require('./build/moduleC-tasks/webpack.moduleC'));
    }
    if (modules[i] === 'moduleD') {
      multipleConfigs.push(require('./build/moduleD-tasks/webpack.moduleD'));
       multipleConfigs.push(require('./build/moduleD-tasks/webpack.moduleD.assets'));
    }
  }
} else {
  multipleConfigs = [
    require('./build/moduleA-tasks/webpack.moduleA-main'),
    require('./build/moduleA-tasks/webpack.moduleA.assets'),

    require('./build/moduleB-tasks/webpack.moduleB'),
    require('./build/moduleB-tasks/webpack.moduleB.assets'),

    require('./build/moduleC/webpack.moduleC'),

    require('./build/moduleD-tasks/webpack.moduleD'),
    require('./build/moduleD-tasks/webpack.moduleD.assets')
  ];
}

// Pass each config through the helper above (map returns a new array, so assign it back)
multipleConfigs = multipleConfigs.map((config) => {
  return executeWebpackConfig(config);
});

module.exports = multipleConfigs;

In our package.json file, we created different scripts for running either a development build or a production-ready build (minification, cache-busting, and purification), and for running the build either for all modules or for selected modules only.

// Filename: package.json

"scripts": {
  "install":      "yarn install --ignore-scripts",
  "build":        "webpack --optimize-minimize --bail --env.mode=build",

  "dev":          "webpack --progress --colors --watch --env.mode=dev --display-error-details",
  "dev-nowatch":  "webpack --progress --colors --env.mode=dev --display-error-details",

  "dev-moduleA":  "webpack --progress --colors --watch --env.mode=dev --env.modules=moduleA",
  "dev-moduleB":  "webpack --progress --colors --watch --env.mode=dev --env.modules=moduleB",
  "dev-moduleC":  "webpack --progress --colors --watch --env.mode=dev --env.modules=moduleB",

  "dev-moduleAB": "webpack --progress --colors --watch --env.mode=dev --env.modules=moduleA,moduleB",
  "dev-moduleBC": "webpack --progress --colors --watch --env.mode=dev --env.modules=moduleB,moduleC",
  "dev-moduleAC": "webpack --progress --colors --watch --env.mode=dev --env.modules=moduleA,moduleC",

  "lint":         "eslint 'src/**/*.js'  --cache --config .eslintrc --ignore-path .eslintignore",
  "lint-fix":     "eslint 'src/**/*.js' --fix  --cache --config .eslintrc --ignore-path .eslintignore"
}

Upgrading to [email protected]

According to Sean T. Larkin in the release blog post: “webpack 3: Official Release!!”, migrating from webpack 2 to 3 should involve no effort beyond running the upgrade commands in your terminal. We are using [email protected] and [email protected] now :)

Last but not the least - Stepping towards a long journey

This was just the beginning of our journey of researching different technologies and upgrading our tech stack. We have now gradually started writing ES6 code for this particular project. The experience was tremendous, and the team is now evaluating other areas where similar changes could gradually take shape.

Helpful resources

Feedback

Should you have any feedback regarding this article, please share your thoughts via comments.

If you like this article, do share it :)


“What is the most resilient parasite? Bacteria? A virus? An intestinal worm? An idea. Resilient… highly contagious. Once an idea has taken hold of the brain it’s almost impossible to eradicate. An idea that is fully formed - fully understood - that sticks; right in there somewhere.” – Cobb (Leonardo DiCaprio), Inception

What is DevFest?

On September 9th we had the first instance of our Wingify DevFest. It started with a simple idea: to have a community of fellow techies where everyone could meet, learn something new, share ideas and inspire one another. But we didn’t want to stop there. We wanted a day where people could celebrate and have a good time. Thus, the Wingify DevFest was born.

How did we plan for it?

Though the DevFest happened on 9th September, the preparations had started much before that. In fact, the whole structure of the DevFest underwent drastic iterations since we’d first started working on it. Initially, we had simply planned on having a set of internal Wingify team members as speakers. The rationale was that, this being our first DevFest, having internal speakers would give us a good grasp of the speakers and their content. It would also be easier to organize because we could skip the overhead of finding external speakers. This idea was soon scrapped because we would have had to compromise the interest of our teammates, as most of the internal talks had already been watched by the team. The other extreme plan was to have all external speakers, which too was soon ruled out because of the logistics involved; we also knew that some of our own internal speakers had good content which the world should definitely see. Finally, we agreed upon having an all-external speakers list and keeping the internal speakers as backup, should the need arise at any time. And thank God we did, because as you’ll soon find out, we did have to use the fallback.

Amidst the initial confusion of finding the ideal number and type of speakers, there was still an extreme clarity within the organizing team about the other events that we wanted to have. More on that later.

Deciding on the theme

Organizing the first of a series always has its own set of challenges and uncertainties. For us, the main challenge, which was a crucial factor in almost all of our decision making, from the topic for the DevFest to even deciding what swag we should have, was identifying our target audience. Unlike some major tech cities like Bangalore and Hyderabad, where the majority of folks are working professionals, Delhi has a beautiful, eclectic mixture of working professionals and college students. In fact, the number of engineering colleges in Delhi is mind-blowing. This means that most meetups and communities have a mixture of both streams, so we too could expect a mixture of both classes. The challenge with that was to find a theme that would resonate with all the members. Performance, Reliability and Security was the perfect topic because everyone, at some point in their college/professional life, has needed to understand it more deeply. With a balanced set of talks on this theme, we could keep both parties interested.

Picking speakers

With the topic of the DevFest clear, finding speakers was the next challenge, or so we thought. On 27th July we started campaigns on several social media channels and meetups, and used word of mouth, to find the best tech speakers in Delhi. It was a 15-day campaign and by the time it ended we were ecstatic. There were more than 20 entries and some even tried to register after the deadline. Not bad for the first time 🙂. After several meetings and discussions, we finally narrowed it down to 3 final speakers. We had even sent them the invitation. Too easy, we thought. One week before the event, 2 of our speakers backed out because of unavoidable issues. There was a DEFCON 1 emergency declared in our nation! Everyone went on a rampage. Well, maybe I’m exaggerating a bit, it wasn’t DEFCON 1 because we didn’t have nuclear weapons, but you get the drift. In that frenzy we sought out the internal speakers. Things could’ve gone really south if we hadn’t had an existing plan B. We eventually ended up having four speakers instead of three because a speaker who had earlier backed out managed to rejoin, and we were more than happy to re-adjust the schedule. These were the speakers who finally spoke:

  1. Atul Agarwal (co-CEO, AdPushup) as the Keynote Speaker
  2. Saurabh Shandilya spoke on ToR 101
  3. Deepak Pathania spoke on Performance Optimization for the mobile web
  4. Neha Sharma spoke on Web apps and Performance
  5. Manish Gill spoke on Gyaan in Scalability

Organising interactive events

At Wingify, we frequently have internal technical events that keep our wits sharp. Since one of the inherent ideas of the DevFest was to keep it interactive for everyone, what better way than to include a few of these events in the schedule as well. Selecting the events was as easy as looking back at the list of the previous year's events and adding the ones which were liked by the majority of team members. The finalists were Code in the Dark and Capture the Flag.

The Day

The day before The Day, we stayed back late in the office. The previous week had been taxing because of the whole speakers-backing-out fiasco and also because the organising team had been really busy releasing the new VWO Conversion Optimization Platform to the general public! Thus, there were several logistics that had to be taken care of on the last day. Everyone left late that day yet returned early the next morning.

On September 9th, our spirits were high. No, we weren’t high (at least not until the events lasted), we were giddy. Everything was set. The initial slow pouring-in of attendees soon gained pace and by 11 am our office was packed and ready for some action. It was a good mixture of energetic college folks and knowledgeable professionals, each trying to find like-minded counterparts to talk ideas with. Thanks to Akash Tyagi, we had some really cool banners installed all over the place. In fact, right from the beginning he had been the guy who designed the banners, logos, social media cards etc., which everyone greatly admired.

Atul Agarwal had accepted our request to be the Keynote Speaker for the event. His talk on performance, reliability and security was full of wisdom that he had garnered on his journey to make AdPushup a successful and formidable ad-revenue optimization company in its space. He went on about how most companies, in a haste to launch feature after feature, often forget the aspects of performance, reliability and security, which later bite them back. Sometimes overlooking such aspects costs companies a fortune and, even worse, the respect of their clients.

Immediately succeeding him was Saurabh Shandilya, who spoke about the ToR network. His talk cleared some of the misconceptions that people have about ToR, and through his articulate speech he managed to convince many people to try it out. Not only that, he even managed to convince some folks who’d already tried it earlier and given up to give it another shot.

Next in line was Deepak Pathania. Although Deepak says that it was his first ever talk, we have our doubts. We’ve seen seasoned speakers get uncomfortable on stage, but Deepak didn’t break a sweat. He spoke about the Google AMP project and why it’s a viable optimization strategy for your mobile pages. He also gave some examples of how to quickly start a project with Google AMP.

After a quick lunch following Deepak’s talk, it was Capture the Flag time! Dheeraj Joshi from the organising team had managed to craft some mind-tickling questions for the participants to rack their brains on. For the next two hours everyone was glued to the event, trying to find ways to get to the hidden flags. At the end of the day, Capture the Flag turned out to be the star attraction of the DevFest.

Succeeding the CTF was Neha Sharma, who spoke about web apps and performance. Neha is a tech speaker and founder of the renowned JSLovers community, and we were lucky to have her in the list of speakers. Given the breadth of her topic and the limited time she had for her talk, she could only give an abridged snippet of how developers can improve their websites' performance by using several best practices.

After Neha it was Manish Gill’s turn. Manish is a fellow Wingifighter who rose to the challenge of speaking at the DevFest when some external speakers backed out. He works in the Data Layer team at Wingify, the team which manages the performance and scalability of the data collection and retrieval aspects of our application. Having worked on challenging scalability problems and having experience in giving public talks, he was the ideal candidate to represent Wingify. Manish delivered an insightful talk about how we’ve used Postgres and Kafka to scale to the tune of 20k requests per second.

We finally finished the day with Code in the Dark. It was a long, long day, and we’re glad we chose to end with it. Our in-house DJ, Ashish Bardhan, played the best of the best techno music that we could’ve asked for. The dark setting, along with the laser lights and the music, set the right ambience to get the adrenaline pumping. It was intense! By the time Code in the Dark ended, everyone was rejuvenated.

All that, in one day. Achievement level: 50,000.

How did we fare?

There were many things we did well, and there were many things we could’ve done better. Our sound system definitely frustrated some of the speakers and audience members; it malfunctioned multiple times and broke the flow of the speakers. We should’ve also provided a visual timer for the speakers so they could keep track of their talk time. It wasn’t the smoothest event, I agree, but what doesn’t kill you makes you stronger. With these learnings we’ll be better prepared to have a smoother DevFest next time.

Some moments captured during the DevFest:

Conclusion

Our quest to have a community of like-minded people has just started. The first instance of the DevFest has been a stepping-stone for us and it’ll only get better from here. Stay tuned for the next DevFest. It’s going to be legen….. wait for it!

PS: A big shoutout to the members of the organising team: Akash Tyagi, Dheeraj Joshi, Jatin Makhija, Kushagra Gour, Sahil Bathla, and also to the volunteers, for all the hard work they’ve put into making the DevFest a success.


Shipping a bug-free feature is always important in every release. To ensure this, we do quality analysis (QA) at various points of the feature cycle. To facilitate efficient QA, we also maintain several environments for our app, each serving a different purpose. To be specific, we have the following environments:

  1. Production - The actual live app.
  2. Staging - A replica of the production where final sign-off QA is done just before going live.
  3. Test - A quickly deployable environment which developers can use to share a WIP feature branch with anyone in the company or among other developers.

With multiple features in development simultaneously and multiple environments to deploy to, automated deployment becomes very important to ensure a frictionless and fast feature lifecycle. In this post, I’ll try to explain how we manage all these environment deployments through automation, especially for our product VWO.

Tests

As mentioned above, tests are very lightweight environments which developers generally create to share their WIP feature branch with other developers, QA or someone from marketing/product to gather feedback. Our app consists of various components: the frontend, the main backend and various other micro-services. So each test environment is a combination of different branches from each of the constituent components. For example, say our app has the following components: frontend, backend and service-1. Then our tests can look like:

Test #1 - master (frontend) + feature-notifications (backend) + master (service-1)

Test #2 - feature-auth (frontend) + feature-auth (backend) + master (service-1)

And as these tests should have a unique sharable URL, they can be given names like feat1.vwo.com or heatmap-optimizations.vwo.com.

Deployment

To deploy such a test, we have a job on Jenkins. As you may have guessed already, the inputs to this job are:

  1. Name of the test instance
  2. Frontend branch
  3. Backend branch
  4. Service-1 branch

Once this job runs, it pulls all of the above 3 branches on a remote server, makes some configuration changes and creates a virtual host so the test is served on testname.vwo.com.

More automation

Now, even this job would require the developer to open the Jenkins web app, go to the job page, put in the inputs and then run it. But we avoid that too - enter Ramukaka! Ramukaka is our Skype bot (which we have open-sourced as well) that we use for various grunt tasks, such as running a Jenkins job!

With Ramukaka in the picture, our test deployment looks like so:

Note: We have 3 components but only 2 branches are specified. That is because the developer can skip a component if the branch to be deployed is the default, i.e. master. Also, the same command just pulls the latest changes in case the test instance already exists.

Neat, right?

Staging

Staging has primarily 2 differences from test:

  1. There is a single staging unlike multiple tests.
  2. There are some more build steps involved compared to a test.

So it’s similar to a test deployment, except that before deploying, the developer is required to build his/her branch like so:

Note: While building a branch we also tell the job which environment to build for (e.g. stagingapp above), because right now the code needs to be tweaked a bit according to the domain it's deployed on.

And once Ramukaka confirms a successful build, the developer can deploy the staging with that branch:

Some more commands

As I mentioned, we have just one staging (the single gateway to production). Therefore, each deployment overwrites the previous deployment, and so it becomes important that developers do not overwrite each other’s deployment by mistake. To prevent this, we have an additional command in Ramukaka called currentBranch. Through this command anyone can check which branch is deployed for a particular component on the staging. E.g., if I need to check the frontend branch on staging, I would do so:

Now the developer can take appropriate actions based on the deployed branch.

Production

Production is no different from staging. Once the final round of testing is done by the QA team on staging, there are 3 things that need to be done to deploy the app on production:

  1. Build the branch
  2. Create a tag for release on master branch
  3. Deploy the tag on the server

All the 3 tasks are handled through a single command on Ramukaka:

And the frontend gets deployed on production, just like that!

Note: Right now only the frontend deployment is automated for production. But we plan to do it for all the components of the app.

Going Ahead

All this deployment automation saves us a huge amount of time, and we know we can save more. Using similar automation for every component of the app is something we plan to do next. Better logging and monitoring of these environments is also on the list.

How do you manage multiple environments? We would love to hear about your deployment techniques if you want to share in the comments.

Until next time!


About PyData

I recently got an opportunity to speak at PyData Delhi. PyData is a tech group, with chapters in New Delhi and other regions, where Python enthusiasts share their ideas and projects related to data analysis and machine learning.

Talks at PyData

There were three talks at PyData, namely Machine Learning using TensorFlow, Data Layer at Wingify, and mine, Learning Data Analysis by Scraping Websites. All the talks were thorough and excellent! In the talk Data Layer at Wingify, Manish Gill 🤓 talked about how we handle millions of requests at Wingify.

Some images from the PyData meetup hosted by Wingify.

Background About My Talk

Let me give you a little background. It was the Friday before the PyData meetup/conference, and our engineering team was doing its daily tasks. I had just grabbed a coffee to alleviate my laziness. Suddenly, our engineering lead came and asked whether anyone could present on a topic at the PyData meetup that we were to organise the very next day. An initial speaker, who had confirmed earlier, had backed out at the last moment because he had fallen sick. I could see that most of the team members tried to avoid volunteering at such short notice, and also probably because the next day was a Saturday (though this is my personal opinion). But I had something different on my mind, and during this planning, or confusion, I volunteered 🤓. I had a project that I had done back when I was learning Python, so I offered to present it. He agreed and asked me to keep the presentation ready.

Preparing the Project & Slides

That Friday night, I started searching for the old files which I had used. Finally, I found all of them on my website, downloaded them and ran the code. It worked like a charm 😍. Yeah! I quickly created the slides around it, and after finishing, smiled and went to sleep at 4.30 am.

A Little About the Basics of My Talk

The presentation that I gave was on Learning Data Analysis by Scraping Websites. During my college days, we heavily used the BeautifulSoup library in Python to scrape websites for many personal projects. During this project, I got the idea to scrape data from websites which aggregate movie-related data. By doing that, I thought I could create a list of all the movies that I must definitely watch. The movies had to satisfy the following criteria:

  1. Release date >= 2000
  2. Rating > 8

It was perhaps not the best idea to scrape websites and then analyse the data in DataFrames, but I learned a lot by scraping data from the website using BeautifulSoup, then analyzing the data using Pandas, visualizing it using Matplotlib (a Python library) and finally coming to a conclusion about my movie recommendations.

Coming back to the objective - finding and sorting the movies released between 2000 and 2017 in order of relevance (I didn’t want to watch movies released before 2000). Below is the code to scrape IMDb for movie data from 2000-2017.

# Note: this is Python 2 code (it uses urllib2); for Python 3 use urllib.request instead.
from bs4 import BeautifulSoup
import urllib2

def main():
    print("** ======  Data Extracting Lib -- by Promode  ===== **")
    # IMDb advanced search: top-1000 titles released between 2000 and 2017
    testUrl = "http://www.imdb.com/search/title?at=0&count=100&\
groups=top_1000&release_date=2000,2017&sort=moviemeter"
    pageSource = urllib2.urlopen(testUrl).read()
    soupPKG = BeautifulSoup(pageSource, 'lxml')
    # Each search result is rendered as a 'lister-item mode-advanced' block
    titles = soupPKG.findAll("div", class_='lister-item mode-advanced')
    mymovieslist = []
    for t in titles:
        mymovies = {}
        mymovies['name'] = t.findAll("a")[1].text
        mymovies['year'] = str(t.find("span", "lister-item-year").text)
        # The rating span text looks like "8.1/10"; strip the trailing "/10"
        mymovies['rating'] = float(str(t.find("span", "rating-rating").text)[0:-3])
        mymovies['runtime'] = t.find("span", "runtime").text
        mymovieslist.append(mymovies)
    print mymovieslist

if __name__ == "__main__":
    main()

Click here to have a look at the full source code.

You can see the trends, like Maximum Rating - Sorted by Rating and Year Vs Rating Trend, in the charts below.

DataFrame - Rating is Set as Index


Maximum Rating - Sorted by Rating


Year Vs Rating Trend

Takeaway from the Talk

With this method, you can pull out the winners from a data set. For example, suppose you want to create a cricket team (IPL T20) which has the maximum probability of winning a match: you can parse the IPLT20 website for the last 5 years' data and select the top 5 batsmen and 6 bowlers 😎.

Conclusion

I totally understand that this may not be the best project for data analysis. I am still learning, and I showed what I had done. I believe that it served my purpose.

I will be doing more research on data analysis in Python. Thanks for reading this. Below are my talk slides:
