The Path to Enlightenment in E-Commerce

Building an analytics organization is a long and difficult journey. Where should you invest your time and resources? Is it building a Data Warehouse, getting QlikView or Spotfire to host that shiny dashboard, or hiring crazy-smart data scientists to crunch your numbers?

There is no simple answer, but if it is a journey, then there must exist a map to give you a sense of direction. For e-commerce websites, it is possible to rank initiatives in the order of complexity:

  1. Report on your transactions.
  2. Understand the traffic that arrives at your website.
  3. Introduce Business Intelligence and Performance Reporting.
  4. Use Predictive Analytics to understand the drivers behind past, present and future events.
  5. Invest in Data Science to stay on the bleeding edge and push the boundaries.

Each area can be split into multiple levels of sophistication:

TRANSACTIONS:

  1. Use logs and transaction reports.
  2. Create relational databases to host your data.
  3. Build a centralized data warehouse.

TRAFFIC:

  1. Use server logs and simple metrics.
  2. Install specialized tracking tools like Google Analytics, Adobe Omniture etc.
  3. Attribution Modeling
  4. Click-stream analytics

BUSINESS INTELLIGENCE:

  1. Simple spreadsheet reporting (Excel FTW!).
  2. Invest in reporting tools (QlikView, Tableau, etc.) and build dashboards.
  3. Introduce a 3-tier reporting structure: Tactical, Operational and Strategic.
  4. Provide self-service tools to the business.

PREDICTIVE ANALYTICS:

  1. A/B testing and optimization
  2. Business performance forecasting
  3. Econometric modeling (e.g. for digital and offline channels).
  4. Customer and Marketing analytics

DATA SCIENCE:

  1. Algorithmic marketing
  2. Recommendation engines
  3. Customer clustering
  4. Content personalization
  5. Social network analysis
  6. Text mining and information retrieval
  7. Real-time analytics

You do not need to tick all the boxes in one area to move to the next, but try not to jump head first into Data Science without having a solid foundation.


Brains! Simulating a Zombie Outbreak

In my last blog post I wrote about SIR models for epidemiology. What better application than simulating a zombie outbreak?

I will use the following assumptions for my model:

  • 4 compartments: Survivors, Infected, Zombies, Dead.
  • 10 million people in the population. No births or natural deaths.
  • Average contact rate between individuals is 7 (average number of people one is in direct contact with during a single day).
  • Start with a single Infected.
  • Infected do not pass the Zombie virus on directly. They turn into brain-hungry Zombies after 2 days of incubation.
  • When a Zombie bites, it turns a Survivor into an Infected with a probability of 0.4.
  • Zombies kill Survivors and Infected with a probability of 0.4.
  • Most Survivors hide from Zombies, but some fight back. They will kill a Zombie with a probability of 0.7, but at a contact rate reduced to 40%.
  • 10% of Survivors will also hunt down Infected and kill them with a probability of 0.1.
  • Zombies "starve" to death after 14 days (actually, they die of exhaustion).

Let us turn these into model equations. Survivors become infected with rate \(\tau\) after being bitten by a Zombie, or die as a result of the Zombie attack with probability \(\kappa_{ZS}\):

\[\frac{dS}{dt} = -\tau \cdot S \cdot \frac{Z}{N} - \kappa_{ZS} \cdot S \cdot \frac{Z}{N}\]

The number of infected increases at the rate of \(\tau\), while Zombies and some Survivors hunt down Infected with rates \(\kappa_{ZI}\) and \(\kappa_{SI}\) respectively. Infected turn into zombies with rate \(\iota\).

\[\frac{dI}{dt} = \tau \cdot S \cdot \frac{Z}{N} - \iota \cdot I - \kappa_{ZI} \cdot I \cdot \frac{Z}{N} - \kappa_{SI} \cdot I \cdot \frac{S}{N}\]

Zombies are killed by survivors at the rate of \(\kappa_{SZ}\) and die of starvation at rate \(\sigma\):

\[\frac{dZ}{dt} = \iota \cdot I - \kappa_{SZ} \cdot Z \cdot \frac{S}{N} - \sigma \cdot Z\]

There are multiple ways to become Dead in our simulation: being killed by a Zombie, being killed by a Survivor if you are Infected, or starving to death as a brain-hungry monster:

\[\frac{dD}{dt} = \kappa_{ZS} \cdot S \cdot \frac{Z}{N} + \kappa_{ZI} \cdot I \cdot \frac{Z}{N} + \kappa_{SZ} \cdot Z \cdot \frac{S}{N} + \kappa_{SI} \cdot I \cdot \frac{S}{N} + \sigma \cdot Z\]

The code in R:

library(deSolve)
library(ggplot2)

N <- 10000000 # 10M in total population

i0 <- 1 # Initial infected group
s0 <- N - i0 # Initial survivor group (whole population minus infected)
z0 <- 0 # No zombies at the beginning
d0 <- 0 # Everyone alive at the beginning

total_contact_rate <- 7
transmission_risk <- 0.4 # Probability that a bite turns a Survivor into an Infected

parameters <- c(N = N, 
                transmission = total_contact_rate * transmission_risk, # tau
                ZkillS = total_contact_rate * 0.4,         # kappa_ZS - Zombies killing Survivors
                ZkillI = total_contact_rate * 0.4,         # kappa_ZI - Zombies killing Infected
                SkillZ = (total_contact_rate * 0.4) * 0.7, # kappa_SZ - Survivors killing Zombies
                SkillI = (total_contact_rate * 0.1) * 0.1, # kappa_SI - Survivors killing Infected
                starveRate = 1/14,                         # sigma - Zombie starvation rate
                incubation = 1/2)                          # iota - incubation rate
state <- c(S = s0, I = i0, Z = z0, D = d0) # Initial state of variables

# Diff. Equations
SIZD <- function(t, state, parameters) {
  with (as.list(c(state, parameters)), {
    dS <- -transmission * S * (Z/N) - ZkillS * S * (Z/N)
    dI <- transmission * S * (Z/N) - incubation * I - ZkillI * I * (Z/N) - SkillI * I * (S/N)
    dZ <- incubation * I - SkillZ * Z * (S/N) - starveRate * Z
    dD <- ZkillS * S * (Z/N) + ZkillI * I * (Z/N) + SkillZ * Z * (S/N) + SkillI * I * (S/N) + starveRate * Z
    
    list(c(dS, dI, dZ, dD))
  })
  
}

# Sampling time from 0 to 250 step each 1 day
times <- seq(0, 250, by = 1)

# Solve the equation
out <- ode(y = state, times = times, func = SIZD, parms = parameters)
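
The plotting step is not shown here; a minimal ggplot2 sketch (the package is already loaded above) that would produce plots like the two below, using the column names coming straight from the ode() output:

out_df <- as.data.frame(out)

# Infected in orange, Zombies in black
ggplot(out_df, aes(x = time)) +
  geom_line(aes(y = I), colour = "orange") +
  geom_line(aes(y = Z), colour = "black") +
  labs(x = "Day", y = "Population")

# Survivors in green, Dead in red
ggplot(out_df, aes(x = time)) +
  geom_line(aes(y = S), colour = "darkgreen") +
  geom_line(aes(y = D), colour = "red") +
  labs(x = "Day", y = "Population")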

[Plot: Infected in orange, Zombies in black.]

[Plot: Survivors in green, Dead in red.]

The plots above confirm what we already know from the movies: we are doomed to die in a zombie outbreak. There is, however, a way out: we just need to kill Zombies and Infected fast enough. Increasing the contact rate for Survivors killing Zombies from 40% to 50% keeps the outbreak under control, with approximately 70 casualties in total.
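
A minimal sketch of that survival scenario, reusing the objects defined above and simply raising the Survivor-versus-Zombie contact rate from 40% to 50% of the base rate:

# Survivors now engage Zombies at 50% of the base contact rate instead of 40%
parameters["SkillZ"] <- (total_contact_rate * 0.5) * 0.7

out_survival <- ode(y = state, times = times, func = SIZD, parms = parameters)
tail(out_survival[, "D"], 1) # total casualties at the end of the simulation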

The dynamics of the model are quite simple, with a very thin line between total apocalypse and survival. But then again, isn’t that exactly why we love the Z-genre?

Feature image cc-by Eneas De Troya


SIR Epidemic Model

Mathematical epidemiology models are fascinating. Describing the evolution of a disease with equations feels almost unreal, especially when you realize they play a key role in combating Ebola, influenza and many other diseases.

The predominant way of modelling the spread of a disease is based on splitting patients into disjoint compartments and describing the flow of people between them. This class of models is known as compartmental models.

The most basic one is the SIR Model, in which the population is split into: people Susceptible to the disease, the Infected, and the group of people that have Recovered. The model assumes that a susceptible person becomes infected through contact with someone who is already sick. The disease lasts for some time and then individuals recover (or die).
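
In its simplest form, the model comes down to three coupled differential equations, with transmission rate \(\beta\) and recovery rate \(\gamma\):

\[\frac{dS}{dt} = -\beta \cdot S \cdot \frac{I}{N}\]

\[\frac{dI}{dt} = \beta \cdot S \cdot \frac{I}{N} - \gamma \cdot I\]

\[\frac{dR}{dt} = \gamma \cdot I\]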

Continue reading


Quadratic Programming in Marketing

Mathematical optimization can help with marketing planning. Months ago I published a post on applying Linear Programming to marketing budget allocation. The key strength of Linear Programming is the ability to set investment constraints that are aligned with marketing strategy.

In the last few weeks I have been thinking about following up on the subject, this time looking into a more complex optimization method: Quadratic Programming. Linear Programming has an objective function that is a linear combination of parameters, subject to a set of linear constraints. In Quadratic Programming, we minimize a quadratic function under a similar set of linear constraints.
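
In its general form, the problem is to minimize a quadratic objective subject to linear constraints:

\[\min_{x} \; \frac{1}{2} x^{T} Q x + c^{T} x \quad \text{subject to} \quad A x \le b\]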

Continue reading


Key Highlights from “The Strategist”

I have just finished reading a great book, “The Strategist: Be the Leader Your Business Needs” by Harvard professor Cynthia Montgomery. Based on her legendary business strategy course, the book provides a unique framework for thinking as a strategist. Below you can find some key notes I made:

  1. Strategy is not about the destination, but about the journey.
  2. Understanding your industry is critical. A good strategy tells you how to respond to changes and forces in the market.
  3. Your strategy should focus on meeting an unsatisfied need or delivering something unique to the customer. Beating the competition is then a by-product.
  4. Strategy is about choice. What you decide to do, but more importantly, what you decide NOT to do.
  5. The simplest core strategy is about higher prices or lower costs.
  6. Ask yourself this question: if your company disappeared today, would people notice? Would the world be different?
  7. If your company is struggling – try going back to the core. Return to your roots.
  8. A great strategy is not a dream, but a consistent system of value creation with many mutually reinforcing parts.
  9. The primary job of a strategist is to set a plan in place and then create an organization to carry it out.
  10. Don’t expect today’s advantage to last forever. A strategist’s job is to constantly reinvent and optimize the organization.
  11. Very often people focus on the urgent but unimportant parts of the business. Instead, organize around the non-urgent but important tasks: building organizational capabilities, maintaining key relationships and inventing new strategies.
  12. As a leader you should “define reality, give hope”.

Creativity App

Ideas for a side project are the worst. You just cannot get them out of your head. A few weeks ago an idea for a Computer Assisted Innovation platform appeared in one of my conversations. It made me remember a podcast I listened to years ago about companies in need of structured innovation frameworks. Interestingly, there are a lot of tools and techniques for creativity and innovation out there: the Six Thinking Hats, for example, or the amazing selection in the Thinkertoys book by Michael Michalko.

All good, but I always wanted a piece of software to aid me in using them for idea generation. Nothing better than to build one… and maybe play with web app development at the same time. It is good to stretch that dev muscle from time to time (and build something more substantial than another statistical model). So head on to http://creativity.kbartocha.com to see a basic app. So far only the Six Thinking Hats technique is available, but I will add more with time. The project was bootstrapped with Yeoman and uses Grunt, Bower, Bootstrap and AngularJS. Enjoy.


New Year’s Resolutions and Game Theory

Reading New Year’s Resolutions posts and tweets made me think about an interesting story from “The Art of Strategy” by Avinash Dixit. All of the resolutions we make to diet more, be more fit, learn a new language etc. are a game our Present Self plays with our Future Self. In this game, the Present Self commits to a task/resolution and expects our Future Self to hold on to it. There is only one problem – usually the Future Self has little incentive to do so.

From a Game Theoretic point of view, the Future Self has two possible moves – keep the promise or break it. Each with a certain payoff, either emotional, health related or monetary. Unfortunately, most of our Future Selves will go for the immediate gratification of eating a piece of cake (one more won’t hurt), not going to the gym (nah, it’s raining anyway) or ignoring the “Learn German” book that’s been lying around for months (I have other things to do). We often try to fool ourselves and invent false “motivators”. How many of us signed up to a gym scheme and then used it only once or twice (I’m guilty of this myself. BTW – that’s how these fitness schemes make money)?

In essence, we have left our Future Self too many options to choose from. If you want to keep your New Year’s Resolutions, limit your Future Self’s moves and make the incentive to keep them bigger than the incentive to break them. The “Art of Strategy” tells an interesting story about an ABC Primetime show where dieters needed to lose 15 pounds (almost 7 kg) over 2 months. In the event they failed, their bikini photos would be published on Primetime and ABC’s website. Now that’s limiting your Future Self’s options! Since we cannot all take part in a TV show like this, we must come up with some other ways. Here are some ideas:

  1. When dieting, only keep healthy food in your fridge. Can you get someone else to do the shopping for you? Maybe your partner or best friend? Look around for companies that offer this service. If you are in London – these three came up at the top after a quick search on Google: Hello Fresh, Gusto and Pure Package.
  2. Try avoiding places that can tempt you – fast food restaurants, bars, pubs, that local patisserie.
  3. To boost your productivity block pages that eat up your time. Use browser plugins like StayFocusd.
  4. Staying on schedule to learn that second (or third, or fourth…) language is never easy. What if you adjust the website locking strategy? Change your computer and smartphone language to the one you are trying to learn. Block news websites other than those in the language you wish to learn.
  5. I don’t remember how many times I promised myself to wake up in the morning to jog before work (BTW, it turns out that’s a bad idea with sinusitis). A good tip: get someone else to jog with you. Set up a scheme where you both expect the other to show up. No one wants to let down a friend.
  6. Speaking of friends and some harder-to-crack motivational cases: money makes the world go round. Why not commit a medium-sized amount of money and have one of your best friends spend it on buying something completely useless if you break your resolution? How about 100 VHS tapes or cheesy CDs? Best friends are best for this, since they will often find sadistic pleasure in making sure you won’t beg that money back out of them. Oh! There are apps for this now, like Stickk.

All the best in 2015!

Image (cc by-nc 2.0) Chris Chabot: https://www.flickr.com/photos/chrischabot


PaaS Building Blocks: Linux Kernel Namespaces

Introduction

Recent weeks brought a number of interesting articles on Hacker News regarding Docker, an open-source project to develop lightweight app containers. The idea is quite elegant: isolate user applications, together with their run-time environments, into deployable packages. Such a setup makes it easy to distribute apps and to host a number of them inside a single environment. The technology to do it has been out there for a couple of years now and is the workhorse behind popular PaaS providers such as Heroku.

The traditional way of securely handling more than one application on one machine was to utilize tools like chroot or complex virtualization techniques along the lines of VMware ESX, KVM or Xen. Unfortunately, both models have serious limitations: chroot is far too simple, providing only a filesystem separation mechanism, while VMs are complex to maintain and carry substantial overhead from the guest OS.

Somewhere around the year 2007, in Linux kernels 2.6.19 and 2.6.24, work inside the kernel provided two foundation stones for modern OS-level virtualization: namespaces and cgroups. On top of that, LinuX Containers (LXC) were developed to run multiple isolated Linux systems (containers) on a single machine, and the most recent addition to the bunch, Docker, provides a nice and easy way to put your app with all its dependencies into a single, standard “container” ready to be deployed. This series of posts will try to uncover the details behind all of this, starting at the lowest level.

Kernel Namespaces

Kernel Namespaces made their first appearance in the Linux kernel as far back as 2002, with mount namespaces being introduced in kernel 2.4.19. The core concept is quite interesting: provide a way for different processes to have varying views of the system. In practice, this means the capability to put apps in isolated environments with separate process lists, network devices, user lists and filesystems. Such functionality can be implemented inside the kernel without the need for hypervisors or full virtualization. The OS needs to keep separate internal structures and make sure they remain isolated. The mechanism itself looks quite simple on paper, but takes expertise and extra care to implement correctly.

Kernel Namespaces are created and manipulated using 3 basic syscalls:

  • clone() which creates a new process and allows the child to share parts of the execution context with its parent. This low-level function sits behind the well known fork().

  • unshare() which allows processes to disassociate parts of their execution context. The main use of this syscall is to modify the execution context of a process without spawning a new child process.

  • setns() which changes the namespace of the calling process.

While clone() and unshare() are standard Linux syscalls, setns() is a newer addition dedicated solely to manipulating namespace membership. The namespace functionality has been introduced into both clone() and unshare() through a set of additional flags that indicate which namespaces are to be created for the child process (or which are to be disassociated from the caller).

Currently, the Linux Kernel implements 6 namespaces:

  • mnt – separates mountpoints and filesystems. It allows processes to have separate root partitions and mounts. Think of it as chroot on steroids. The clone() flag indicating that a mount namespace needs to be created is CLONE_NEWNS.

  • pid – separate process trees and environments. Each pid namespace has its own PID 1 init process. A single process can reside in only one pid namespace at a time, but note that pid namespaces create a hierarchy with processes in a given namespace being visible to processes in the pid namespace of the parent process. This also means that each process can have more than one PID number, one for each namespace it is visible in. The clone() flag to create a new pid namespace is CLONE_NEWPID. An important thing to remember is that some programs rely on the /proc contents to inspect running processes. In order to make that list reflect the true contents of a pid namespace, a new instance of procfs needs to be mounted over /proc.

  • net – a separate network stack. The default net namespace contains the loopback device and all physical cards present in the system, while newly created net namespaces contain only a single loopback device. To allow a namespace to reach other networks or use physical devices, a virtual ethernet (veth) device can be set up. A veth device is composed of a pair of network devices that are linked together. Usually both are created in the default net namespace and one member of the pair is then migrated into the newly created namespace to act as a tunnel endpoint. Net namespaces can be easily manipulated with the ip netns command.

  • ipc – an isolated System V IPC namespace with isolated objects and POSIX message queues.

  • uts – a hostname namespace; each uts namespace provides its own values for what the uname() call reports: domain name, host name, OS release etc. Quite useful when running multiple virtual servers.

  • user – a mechanism to have separate users, groups and capability lists. Each namespace holds a mapping of host system user ids to user ids in the namespace. This is the most recent addition to the Linux Kernel (since 3.8) and not all OSes are user-namespace aware.

Namespaces don’t have names. Instead, each gets a unique inode number at the time of its creation. In Linux Kernel 3.8 and higher, you can use the /proc filesystem to check in which namespace a given process resides. Simply go to /proc/<pid>/ns and you will see links for each namespace type.

User-space tools

Apart from the above mentioned syscall changes, a number of user-space tools have been provided.

The ip program provides a very elaborate mechanism to manipulate net namespaces from the commandline:

  • ip netns add <name> – will create a new net namespace and also triggers a creation of a file under /var/run/netns/<name> which can be used with setns().

  • ip netns del <name> – deletes a net namespace.

  • ip netns list – prints a list of all existing net namespaces.

  • ip netns pids <name> – lists all processes in a given net namespace.

  • ip netns exec <namespace> <program> – starts a program in a namespace.

Another important user-space tool to manipulate namespaces is the unshare command-line utility. It allows you to spawn a program in an isolated environment by specifying which namespaces are to be created, e.g. unshare -n bash will spawn a bash shell in a new net namespace.

 

Additional material

Check out the LWN series of articles to learn more: http://lwn.net/Articles/531114/

 


Agile Analytics – 11 tips to make your reporting/analytics more Lean

Agile and Lean methodologies have really revolutionized the way we build software… but what if we could apply the Agile Way to analytics and reporting?

  1. Treat each reporting/analysis request like a User Story. Identify the true goal of the report, imagine how you are going to present your results and what the end user is going to see.

  2. Work iteratively in sprints. Forget old “waterfall” models like CRISP-DM. Start small, talk to your end users, make sure you understand what they need and that they know what will be delivered at the end of each sprint. I have seen it time and again: a report or analysis gets delivered and the end user has no idea what to do with the result.

  3. Start with an MVP (Minimum Viable Product). Do you really need to use all of that customer data for a simple Customer Lifetime Value analysis? How about starting with purchase history as the only input to your model? Present your results and iterate adding more and more variables. This will make your analysis actionable ASAP and help identify data gaps.

  4. Make your analysis actionable. Get rid of vanity metrics. “Page Views” means nothing. Look for metrics that capture the true essence of a problem. Unique Customers? Close… how about New vs Returning Customers? Path Length? Avg. visit time? Time between individual purchases or Churn Rate?

  5. Apply the Lean Startup principle. Pose hypotheses, identify key metrics and invent data experiments to check if you were right. Share the knowledge by setting up an internal newsletter with results from last month’s work. Explain it clearly: get rid of all equations and put some pretty pictures in there… maybe you have someone in the team who can draw? How about turning your results into an infographic?

  6. Get a whiteboard and start a Kanban Board. Start with sections like “Backlog”, “Idea gathering”, “Data extraction”, “Crunching”, “Reporting” or “Deployment”, but feel free to invent your own. Make sure you put workload constraints according to your team’s size.

  7. Always ask the “Five Whys”. If someone requests an analysis of which devices your customers are using, power up your inner child and ask the “Why?” question over and over again. Drill down to the root of the problem. Maybe you can do a slightly different report that would provide more or better insight?

  8. Get at least one programmer on board and build a Continuous Report Delivery system. Write tools that do all the boring pull-transform-aggregate processes automatically. This might look like overkill, but trust me: it will pay off once you have 20 recurring reports to run every Monday. Use technologies and languages with rapid code delivery times (R, Python, Ruby or Perl)… I know C++ is fast, but maybe you don’t need to be that fast?!

  9. Host daily 10-minute stand-ups to catch up on things, but remember to regroup later with key people to discuss details. It is easy to get lost in the nitty-gritty and bore people to death. Also, include everyone… even the DBAs and The Excel Guy; it is important to work and learn as a team.

  10. Data can be inaccurate, but you are not tasked with landing on the Moon. Deliver the MVP (initial results) as quickly as possible, but refine in each iteration as the data becomes better or you are able to include additional variables. You don’t always need to hit the problem straight between the eyes to get rid of it.

  11. Try a Test Driven Analytics approach. Maybe you could generate a synthetic data set to test if your model really captures the effect? Try doing a sensitivity analysis by introducing random noise into the synthetic data. Does that make the result change?
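
A minimal sketch of the Test Driven Analytics idea from point 11, in R: generate a synthetic data set with a known effect, check that a simple model recovers it, and then inject extra noise as a crude sensitivity analysis (all numbers below are made up purely for illustration):

set.seed(42)

# Synthetic data with a known effect: spend drives sales with a true slope of 2
n     <- 500
spend <- runif(n, min = 0, max = 100)
sales <- 50 + 2 * spend + rnorm(n, sd = 10)

fit <- lm(sales ~ spend)
coef(fit)["spend"] # should land close to the true value of 2

# Sensitivity check: add extra noise and see how stable the estimate remains
sales_noisy <- sales + rnorm(n, sd = 50)
coef(lm(sales_noisy ~ spend))["spend"]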

If you have any more ideas – feel free to reach me on Twitter @WhiteRavenPL

 
