CSV export of Ecto models

This post is about converting Ecto models to a CSV file using this CSV library.

Ecto is a database wrapper and query DSL for Elixir, used by the Phoenix framework. Recently we had a requirement to export one of our tables as CSV (you can never escape such export requirements as long as there is Excel, right?).

Here is the Ecto model we wanted to export:

  defmodule Message do
    use MyApp.Web, :model

    schema "messages" do
      field :message, :string
      field :status, :string
      field :uuid, Ecto.UUID
    end
  end

It is a pretty simple model backed by a table called “messages” which has the columns message, status and uuid.

Get the data to be exported from the database

messages = Message |> order_by(:inserted_at) |> Repo.all

And trying to convert it to CSV

  messages
  |> CSV.encode
  |> Enum.to_list

does not work because the CSV encoder does not know how to deal with the Message type. So, define an encoder function for the Message type

defmodule Message do
  defimpl CSV.Encode, for: Message do
    def encode(cm, env \\ []) do
      [cm.message, cm.status, cm.uuid]
        |> Enum.map(fn(v) -> CSV.Encode.encode(v, env) end)
        |> Enum.join(",")
    end
  end
end
We are creating a list of all the needed column values, encoding them individually, and then joining them to form the CSV line for the given Message struct.

But this is not enough. CSV.encode/2 expects a stream of data in a tabular format and encodes it to RFC 4180 compliant CSV lines. That means the data should be a list of rows, with each row itself being a list of values.


So, wrap the list of Ecto Message structs into a list of single-element lists, one Message per row.


Do that and pass it to the csv encoder

messages = Message
  |> order_by(:inserted_at)
  |> Repo.all
  |> Enum.map(fn(m) -> [m] end)
  |> CSV.encode
  |> Enum.to_list

The resulting list of lines is the CSV for the Ecto Message instances.

Rate Limiting the right way

While dealing with external services you would want to throttle requests. In Go, it's pretty simple to implement with something like this.

While this works for tens of operations a second, it does not scale to thousands of operations a second. We have been working with an AMQP queue that needs to be drained with back pressure. Our downstream system sends SMSs through a gateway that permits only a prescribed rate, but that rate is well beyond tens of operations a second. The supplementary golang.org/x/time/rate package provides rate limiting that works with token buckets.

The token bucket is an algorithm used in packet switched computer networks and telecommunications networks. It can be used to check that data transmissions, in the form of packets, conform to defined limits on bandwidth and burstiness (a measure of the unevenness or variations in the traffic flow). It can also be used as a scheduling algorithm to determine the timing of transmissions that will comply with the limits set for the bandwidth and burstiness. – Wikipedia

So you start off by creating a limit, which defines the interval between events. For example, suppose you would like to allow 2000 events a minute.

You would then create a limiter that sets up the burst rate per second. So in our case you would want something like this.

Then you can reserve a token and sleep for the duration indicated by the reservation.

You can verify if things are going fine by printing the value every second.

You can also reserve multiple tokens at a time and perform that many operations in one go. This avoids the per-operation scheduling overhead on the goroutine.

Customizing create_react_app without ejecting

As you might be aware, create-react-app is all the new hotness these days. While it eliminates the need for complex boilerplate, it leaves little room for customization. We are in a weird fix where we have to use our React app within a Rails view. The layout belongs to Rails but we want to host the React app within it. However, create-react-app assumes (rightly so) that it would be used in an independent static project.

We wanted to use it without running npm run eject on the project. All we needed was to generate a manifest as a part of the build so that we can use it to change the references in the CSS and JS tags in the layout. Fortunately there exists a webpack plugin called webpack-manifest-plugin for doing precisely that. So we wrote a build script that uses it and placed it in scripts/build.js within the React repo.

process.env.NODE_ENV = 'production'

var webpack = require('webpack');
var webPackProd = require('react-scripts/config/webpack.config.prod');
var ManifestPlugin = require('webpack-manifest-plugin');

webPackProd.plugins = [new ManifestPlugin()].concat(webPackProd.plugins);

// run the build with the patched config
webpack(webPackProd).run(function (err) {
  if (err) throw err;
});


We then changed package.json to run this script instead of the standard react-scripts build for the npm run build command.

{
  "name": "newreactapp",
  "version": "0.0.1",
  "private": true,
  "devDependencies": {
    "react-scripts": "0.6.1"
  },
  "dependencies": {
    "react": "^15.3.2",
    "react-dom": "^15.3.2",
    "webpack-manifest-plugin": "^1.0.1"
  },
  "scripts": {
    "start": "react-scripts start",
    "build": "node ./scripts/build.js",
    "test": "react-scripts test --env=jsdom",
    "eject": "react-scripts eject"
  }
}
Voila. Now we have a manifest.json in our build directory that looks a bit like this.

{
  "main.css": "static/css/main.9a0fe4f1.css",
  "main.css.map": "static/css/main.9a0fe4f1.css.map",
  "main.js": "static/js/main.247a9e3c.js",
  "main.js.map": "static/js/main.247a9e3c.js.map",
  "static/media/logo.svg": "static/media/logo.5d5d9eef.svg"
}

Of course this is a hack and would not work very well if the structure of the template changes. I would really like more knobs and dials to configure create-react-app than the current all or nothing approach.

Integrating React Native into an existing Swift project

React Native is really nice. Its Cmd-R style development is refreshing compared to Xcode's build, deploy and run cycle. But if you want to use things like Core Data, NSOperationQueue or any of the other awesome libraries in the iOS world, you have to write bridges. In this blog post we will focus on adding React Native to an existing Swift application. React Native's secret is in its RCTRootView. So just like in the browser, React focuses on being a good view library and can be mixed into your existing Swift application.

We’ll be using Xcode’s Single View Application Swift project template for demonstration. React Native’s site has excellent documentation on how you would go about such an integration. I add a few more things on top, like loading from a compiled / hosted JS bundle, caching the JS bundle and manually reloading the React code after refreshing the cache. You can download the sample code from GitHub.

After you have created your project and set it up with CocoaPods, go ahead and initialize your package.json. My package.json looks like this. I have placed it in the same location as my .xcodeproj and .xcworkspace.

Then go ahead and run npm install on the command line to install the necessary dependencies. This will create a node_modules folder. You can add this folder to your .gitignore file, much like the Pods directory for CocoaPods.

You would then have to add React and its subspecs to your Podfile. My Podfile looks like this.

This adds references to the React Native pods and links them to your project as frameworks. Don’t forget to run pod install after this. You can then add the entry point to the React Native view. This file will be called index.ios.js by convention and we will refer to it from our ViewController.

You can then start the React Native packager by running npm start on the command line from the folder that contains package.json. This starts a server that serves our React Native bundle from localhost:8081.

I have used a simple view controller which has two subviews: an RCTRootView that hosts the React Native app and a UIButton that lets me reload the code using the RCTBridge object of the RCTRootView. This is what my ViewController code looks like.

Running the app should present it with the React Native view loaded.

Bridging Swift views to React Native

Now this turned out to be trickier than I originally thought it would be. React Native has an RCTViewManager class that expects you to return a view. The view can be configured with properties, and you have to export the view using Obj-C macros. For the purposes of illustration, I am going to create a simple view that has a label and takes in a property called message. It then simply displays the value of the message in a label, concatenated with the word “Swifted”. Yeah… I made that nonsensical term up just now. I am going to create a view called CustomRNView.

Notice that I am exposing the Swift class to the Obj-C world using the @objc annotation. Now I have to write an RCTViewManager subclass that returns this view.

This class, CustomRNViewManager, simply has a view method that returns a CustomRNView object. Also note that this class has been annotated with @objc as well.

Now we have to expose it to the React world with the RCT_EXTERN_MODULE and RCT_EXPORT_VIEW_PROPERTY macros. You will have to create the CustomRNView.h and CustomRNView.m files. The header declares our custom UIView as an RCTView and the implementation exports our CustomRNViewManager.

You do not have to put anything in the bridging header as we are loading the rest of the React Native pods as frameworks.

Now, we have to get our CustomRNView to the JavaScript side. You can use the requireNativeComponent method to refer to the exported module.

It should now display the CustomRNView within the react native view.

Caching the JS bundle

For performance reasons, you might want to distribute the bundle with your app, or better still, load it the first time from a URL and then cache it in your documents directory. When you want to refresh, you can redownload the bundle and reload the code via RCTBridge. I have done precisely that in my ViewController.swift.

This will download the file once and then load it from the documents directory. You will also notice that the app launch time is significantly faster. You might want to consider using services like Microsoft CodePush to automatically deliver the latest code as you push changes. We are very excited about this approach as we can be more agile with our existing Swift apps by leveraging React Native and its code push strategies without having to go through the app compile, deploy and run cycle. Let us know your thoughts and opinions.

Refactoring Swift

We are building a mobile app for enterprise asset management. Our SyncManager class had a very complex merge method that merged the items in Core Data with the items received from the server. To give you some perspective, this is what the method looked like.

This method takes sorted business objects (work orders) from Core Data as the first argument and a JSON array sorted on work order id as the second argument, and does Swift’s equivalent of pointer increments to merge the two lists. The code has two vars and three while loops. While the code works per se, reasoning about a bug, say a forgotten pointer increment, is very hard.

The goal for our refactor is zero vars and a functional way to subtract the lists based on object keys. This is what the transformed functional code looks like.

By most measures, the refactored code looks a bit better than the original. We have removed a lot of complexity from various branching and looping and have made the interface consistent and readable.

Swift does not provide the subtract and indexUniqueOn methods, so I wrote a bunch of functions extending SequenceType.

The subtract method takes another collection and two lambdas that let it arrive at the common key. This lets us subtract the items of one sequence based on a derived key of the other sequence. Swift’s type system lets us set constraints on generics that help express this beautifully. And this is what the tests for the extension look like.

Experience Report – Introducing an experienced Ruby Developer to Clojure

I recently introduced Steve to clojure and wanted a good example to show its power. Steve is a lead developer for our client and has worked extensively with Ruby and values its expressiveness and productivity. He also has a working knowledge of Java and is interested in clojure, but not very convinced about its utility and generally wary of Lisps. I wanted to build an example that demonstrates the syntax, the expressiveness and the concurrency features of clojure. I also wanted to highlight aspects of the language like threading macros, function composition and so on which are either absent from Ruby or hard to express in it.

I have tried introducing clojure to other people before and have favored the approach of short examples like the ones on Clojure docs. For example, I always used a (Thread/sleep 1000) to demonstrate a long running job for pmap. In fact, the clojure docs on pmap use a similar example. While this convinces people that such concurrency features are important, they fail to see how they can be useful in a real world scenario.

So this time, instead of using several short academic examples, I decided to write a simple program that mimics a real world scenario that can leverage concurrency. The problem I chose was to make an API call to reddit to get the top content on the clojure subreddit, count the articles by domain, and extract the titles of the articles by scraping the HTML with enlive. This problem is not too complex, but at the same time it's not as trivial as (Thread/sleep 1000).

We started off by getting the reddit feed. I initially used slurp for its instant gratification value and showed that it was similar to Ruby's open method in the open-uri library. Reddit performs rate limiting on API requests and determines it based on the User-Agent header, so I used [clj-http](https://github.com/dakrone/clj-http) to fetch the JSON with a different user agent.

(require '[clj-http.client :as client])
(require '[clojure.data.json :as json])

(def ^:const REDDIT-URL "http://reddit.com/r/clojure.json?limit=100")
(def ^:const headers {:headers {"User-Agent" "showoffclojure.core by vagmi"}})

(def parse-json
  (comp #(json/read-str % :key-fn keyword)
        :body
        #(client/get % headers)))

(parse-json REDDIT-URL)

I wanted to demonstrate the functional nature of clojure. This shows how simple functions can be composed with comp into higher level functions and how they can be invoked. It also shows how keywords behave as functions in clojure, which is quite prevalent when dealing with maps. I also briefly talked about the IFn interface.

(let [reddit-data (-> (parse-json REDDIT-URL)
                      :data
                      :children)]
  (map :data reddit-data))

To extract the data out of the returned json, I used the -> threading macro. Although, I mentioned macros, I did not spend too much time on them in the interest of brevity.

(require '[net.cgrand.enlive-html :as html])

(defn fetch-url [url]
  (-> url
      (client/get headers)
      :body
      ;; Java interop happens seamlessly
      (java.io.StringReader.)
      (html/html-resource)))

(defn extract-title [url]
  (try
    (first (map html/text
                (-> (fetch-url url)
                    (html/select [:html :head :title]))))
    ;; Exception handling is as easy as this
    (catch Exception e "unknown")
    (catch Error e "unknown")))

We then wrote functions to extract the title of the page using enlive and I introduced how easy it was to do Java interop using Clojure.

;; this will have a structure like
;; {"infoq.com" 3 "self.Clojure" 4}
(defonce domain-summary (ref {}))

;; this will have the following keys
;; [{:url, :reddit-title, :actual-title}]
(defonce article-details (ref []))

(defn update-summary [entry]
  (let [title (extract-title (:url entry))
        domain-count (@domain-summary (:domain entry))]
    (dosync
      (if domain-count
        (alter domain-summary assoc (:domain entry) (inc domain-count))
        (alter domain-summary assoc (:domain entry) 1))
      (alter article-details conj {:url (:url entry)
                                   :reddit-title (:title entry)
                                   :actual-title title}))
    (print ".")))

I then talked a bit about how clojure handles values vs identities and its opinions on managing state in a program. We decided to have two data structures, domain-summary and article-details. domain-summary is a map with the domain as the key and the count of the entries from the API as the value, and article-details is a vector of maps with the url, reddit-title and actual-title of the page. For every entry in the reddit API list, we have to update both structures. This segued nicely into STM and how clojure supports it through refs. We print a pretty "." to indicate progress, which also let me point out that IO operations should stay out of dosync blocks.

(let [reddit-data (-> (parse-json REDDIT-URL)
                      :data
                      :children)]
  (doall (map (comp update-summary :data) reddit-data)))

We then changed the previous map expression to extract the summary as well. Now, the part that completely blew his mind was when I changed map to pmap and it magically utilized all the cores, performing ~7x faster as I have 8 cores on my machine.

By the end of this session, Steve was quite convinced about using clojure for our production projects and picked up enough clojure over the weekend to port some of our core parts off Ruby to clojure to evaluate it on a real world scenario. I also helped set up vim-fireplace on his machine.

I cleaned up the code a bit and checked it into GitHub. So if you would like to show off clojure to your colleagues or friends, please feel free to use it. If you have any suggestions to make things simpler or to demonstrate other concepts that would help sell clojure’s value proposition better, please send me a PR. I would like to keep the example simple (~50-60 lines).

List comprehension in Ruby

Having partitioned a large Postgres table into multiple partitions, I wanted a quick way to dump the partitioned data into CSV files using the \copy command. As the tables are named tablename_yyyymm, I wanted an easy way to generate the yyyymm sequences. One way to do it is nested loops.

(2011..2015).map { |y| (1..12).map {|m| (y*100+m).to_s } }

But that looks quite inelegant. Most languages have a way to do list comprehension. For example, if you have a set [1,2,3] and another set [4,5,6], we can produce a list like [[1,4],[1,5],[1,6],[2,4],[2,5],[2,6],[3,4],[3,5],[3,6]] using list comprehension. We can consider map to be a degenerate case of list comprehension.

Here are a few code samples from some of my favorite languages.

; Clojure
user=> (for [x (range 1 4) y (range 4 7)] [x y])
([1 4] [1 5] [1 6]
 [2 4] [2 5] [2 6]
 [3 4] [3 5] [3 6])

This is how it is done in Haskell.

-- Haskell
Prelude> [[x,y] | x <- [1..3], y <- [4..6]]

But Ruby does not support it. So, I decided to add it to Ruby. Well, it is not quite the same but it is enough to get my job done. True list comprehension would respect laziness and support filtering operations, but for my use case this is sufficient. Update: I have updated the code as per @sgporras' comment, to make the comprehension lazy and return an enumerator instead of a list.

def comprehend(*enums)
  cycles = enums.map(&:cycle)
  Enumerator.new do |comprehended|
    loop do
      value = cycles.map(&:peek)
      value = yield value if block_given?
      comprehended << value
      (cycles.length - 1).downto(0) do |index|
        cycles[index].next
        break unless cycles[index].peek == enums[index].first
        raise StopIteration if index == 0
      end
    end
  end.lazy # requires Ruby >= 2.0
end

You can call it with a block that will perform a map inline, or call it without a block to return just the generated tuples.

irb> comprehend(1..3,4..6).to_a
=> [[1, 4], [1, 5], [1, 6],
    [2, 4], [2, 5], [2, 6],
    [3, 4], [3, 5], [3, 6]]
irb> comprehend(1..3,4..6) { |(x,y)| "#{x}-#{y}" }.to_a
=> ["1-4", "1-5", "1-6", "2-4", "2-5", "2-6", "3-4", "3-5", "3-6"]

I thought of refining Array and adding a comprehend method to it, but I prefer a function implementation in a module over monkey patching Array. Notice that the arguments x and y are pattern matched in the block supplied to the comprehend method.

Managing big enough data in postgres

Postgres can work quite well for large databases without having to look at NoSQL solutions. But if you find your tables getting larger than physical memory, you might want to look for ways to optimize your performance. In my current consulting assignment, I have to deal with a table that is 104 gigabytes on Amazon RDS running Postgres. This table is at the heart of the system and any query that so much as glances at it sideways gets incredibly slow. To make matters worse, we had to run a migration on that table that changed the data type of one of the columns from bigint to varchar.

That would just not work. One way to solve such situations is to partition the table. PostgreSQL has a feature called table inheritance that allows child tables to share the same structure as the parent table, and the child tables can participate in queries against the parent. What is cool about this is that Postgres also supports tablespaces, which let tables live in separate directories or mount points. So technically, you can keep frequently accessed data on fast SSDs or an EBS volume with higher provisioned IOPS, and move older data to slower, low cost storage. This also has the additional benefit of reducing the size of the working set, allowing seq scans in memory rather than index scans on disk.

The data I was dealing with was largely time series data, and time seemed like the obvious way to partition it. The first order of business was to export the existing data to CSV files with the \copy command using limit and offset. The table contained close to 100 million rows with multiple JSON fields, so I exported them a million at a time.

The next step was to create a clone table that had the same structure as the large table.

create table large_table_clone as 
   select * from large_table where 1=2;

Then you can create child tables by specifying the check condition and the table you want to inherit from.

create table large_table_clone_201504 (
  check(start_time >= date '2015-04-01' and
        start_time < date '2015-05-01')
) inherits(large_table_clone);

We can then create a trigger on the parent table to route records to the child tables based on the inserted record’s start_time. The trigger function would look something like this.

CREATE OR REPLACE FUNCTION partition_function()
RETURNS TRIGGER AS $$
BEGIN
  IF ( NEW.start_time >= DATE '2015-04-01' AND
       NEW.start_time < DATE '2015-05-01' ) THEN
    INSERT INTO large_table_clone_201504 VALUES (NEW.*);
  ELSIF ( NEW.start_time >= DATE '2015-03-01' AND
          NEW.start_time < DATE '2015-04-01' ) THEN
    INSERT INTO large_table_clone_201503 VALUES (NEW.*);
  ELSE
    RAISE EXCEPTION 'Date out of range';
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER large_table_clone_partition
BEFORE INSERT ON large_table_clone
FOR EACH ROW EXECUTE PROCEDURE partition_function();

The only trouble was that I was not sure of the valid date range in my dataset, and there was no hope of getting it from the existing table. So I had to create tables on the fly as data came in.

create or replace 
function partition_function() 
returns trigger as $$
declare
  table_master varchar := 'large_table_clone';
  table_part varchar := '';
  start_date date := null;
  end_date date := null;
  table_count int;
begin
  table_part := table_master || 
                '_' || 
                to_char(NEW.start_time, 'YYYYMM');
  start_date := date_trunc('month',NEW.start_time)::date;
  end_date := (date_trunc('month',NEW.start_time) 
               + interval '1 month')::date;

  execute format('select tablename from pg_tables 
                  where tablename=%L limit 1',table_part);

  get diagnostics table_count = ROW_COUNT;

  if table_count < 1 then
    execute format(
     'create table if not exists %I (
        check( start_time <@ tsrange(%L, %L) )
      ) inherits (%I)', 
     table_part, start_date, end_date, table_master);
  end if;

  execute format(
   'insert into %I values (($1).*)', 
   table_part) using NEW;

  return null;
end;
$$ language plpgsql volatile
cost 100;

This dynamically creates tables and fills in the data as rows are inserted into large_table_clone. In case you were wondering, the COPY command fires the insert trigger, so you can bulk load the data exported via the CSV files into the new tables.

Since we have table inheritance set up, we can now run queries against the large_table_clone table and it will transparently run the queries across all the child tables and gather the results. But this alone puts us back where we started. If you really want the benefits of partitioning, you have to use constraint exclusion.

set constraint_exclusion=on;
select * from large_table_clone 
where start_time between '2015-01-01' 
                         and '2015-02-15';

This ensures that the query only touches two of the partitions rather than all of them. The next obvious step was to create an index on the large_table_clone_yyyymm partitions on start_time and cluster the tables using that index. This worked in our case as the records are usually immutable, so older partitions can be safely clustered. This made the large table a lot more manageable and we can feel safe archiving old data without impacting the performance of the production system.

Creating pivot tables with Postgres

I have been using PostgreSQL for a while now and I have been growing fond of it over time. As a part of the Rails app that I was working with, we have been logging API requests to a table. The target systems which call the API do so in bursts and we wanted to find the frequency of these calls based on the hour of the day. All the API calls were logged in a table with the following structure.

| id | xml_content   | created_at | updated_at |
| 1  | <somecontent> | timestamp  | timestamp  |
| 2  | <somecontent> | timestamp  | timestamp  |
| 3  | <somecontent> | timestamp  | timestamp  |
| 4  | <somecontent> | timestamp  | timestamp  |

To have a better sense of when we receive these requests, I wanted to segregate them by the hour for a given month. This is fairly simple and we can write a query as below.

select (date_part('year',created_at)*100 + 
        date_part('month',created_at)) yearmonth,
       date_part('hour',created_at) hour_of_day,
       count(1) no_of_requests
from call_logs
group by date_part('year',created_at)*100 + 
         date_part('month',created_at),
         date_part('hour',created_at)
order by yearmonth desc, hour_of_day;

The results of the query look a bit like this.

| yearmonth | hour_of_day | no_of_requests |
| 201504    | 00          | 2500           |
| 201504    | 01          | 2200           |
| 201504    | 02          | 2123           |
| ...       | ...         | ...            |
| 201503    | 11          | 1282           |
| ...       | ...         | ...            |

Although this gives us the data we need, it would be easier if the hours appeared as columns rather than rows. Our Postgres query will have to change the shape of the result set based on the number of hours returned. Postgres has an extension called tablefunc that provides crosstab functionality to create pivot-table-like structures. The syntax is a bit verbose, but essentially the crosstab function has the following signature. There are other crosstab signatures, but I will leave it as an exercise to the reader to explore them.

crosstab(dataset_query::text, category_query::text) 
-- returns setof records

We have already written our query to return the dataset. The dataset query should always produce rows with 3 columns. The first column carries the row key. The values in the second column become the actual columns on which the table is pivoted, and the values in the third column appear as cell values under the column bearing the same second-column value in the crosstab record set. As crosstab returns a set of records, we have to typecast the returned record set as a table. If you have a lot of categories returned by the category query, it becomes a bit of a pain to describe them all. In our case, since the second column holds hour numbers, it can take values from 0 to 23, so we use the generate_series function to generate them.

Since our query has single quotes, we need to escape them properly. Postgres has dollar quoted strings that let us express queries within $tag$ tags and it auto escapes the quotes for us. This makes the query quite readable. So the final resulting query looks a bit like this.

select * from crosstab(
  -- dataset query
  $q$select (date_part('year',created_at)*100 +
    date_part('month',created_at)) yearmonth, date_part('hour',created_at)
  hour_of_day, count(1) no_of_requests 
  from call_logs
  group by date_part('year',created_at)*100 + date_part('month',created_at),
  date_part('hour',created_at) order by yearmonth desc, hour_of_day$q$,
  -- category query
  $q$ select * from generate_series(0,23)$q$) 
as (yearmonth float, h00 bigint, h01 bigint, h02 bigint, 
                    h03 bigint, h04 bigint, h05 bigint, h06 bigint, 
                    h07 bigint, h08 bigint, h09 bigint, h10 bigint, 
                    h11 bigint, h12 bigint, h13 bigint, h14 bigint,
                    h15 bigint, h16 bigint, h17 bigint, h18 bigint, 
                    h19 bigint, h20 bigint, h21 bigint, h22 bigint, 
                    h23 bigint);

This produces a result that looks like this.

| yearmonth |  h00 | h01  | h02  | h03  | ... | h22  | h23  |
| 201504    | 2312 | 2323 | 2323 | 2311 | ... | 2223 | 1132 |
| 201503    | 2322 | 2323 | 2263 | 2311 | ... | 2223 | 1132 |

This extension makes it very easy to visualize tabular data with PostgreSQL using pivot tables.