Kailash Mansarovar Yaatra 2013

Heavy rains have disrupted normal life in Uttarakhand, a Himalayan state of India. Many tourists and pilgrims are stranded around temples there because of flooding and continuous downpour. We often think of summer as the season to travel, but in India summer brings the monsoon with it. So it’s not the best weather for traveling, yet most religious events take place during these months because the mountains are inaccessible in winter. For example, every year millions of pilgrims visit shrines like Kedarnath, Badrinath and Hemkund Sahib. This year record rainfall has damaged these temples and caused an emergency in the state.

Mt. Kailash

Another pilgrimage, to Mt. Kailash and Mansarovar lake, happens during the same period. It is one of the most coveted trips, and very few permits are issued each year. Because Mt. Kailash is in China, the trip involves a lot of paperwork and red tape. There is an application process, and a lottery is drawn to select the travelers.

Given the bad weather right now, I was wondering how this trip was progressing. Travelers (yaatris) are sent in batches for logistical reasons, and each batch consists of around 60 people. Interestingly, I found a list of this year’s yaatris on the Ministry of External Affairs website. The list includes each person’s name, age and father’s name. I thought I would play around with it and draw some statistics. Ideally this data should have been kept private, as it contains personal information.

age distribution

1. Average age of travelers
The mean age is around 47 years, with a standard deviation of 11 years. The oldest person is around 70 years old and the youngest is 19.
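
For anyone who wants to reproduce these two numbers, here is a minimal Java sketch. The class name `AgeStats` and its methods are made up for this example (they are not from the starter code), and it assumes the ages have already been parsed into a non-empty list of integers. It uses the population standard deviation (dividing by n), which is also an assumption about how the figure above was computed.

```java
import java.util.List;

/** Hypothetical helper: summary statistics over a list of ages. */
final class AgeStats {
    /** Arithmetic mean; the list is assumed to be non-empty. */
    static double mean(List<Integer> ages) {
        double sum = 0;
        for (int age : ages) {
            sum += age;
        }
        return sum / ages.size();
    }

    /** Population standard deviation (divides by n, not n - 1). */
    static double stdDev(List<Integer> ages) {
        double mu = mean(ages);
        double squaredDiffs = 0;
        for (int age : ages) {
            squaredDiffs += (age - mu) * (age - mu);
        }
        return Math.sqrt(squaredDiffs / ages.size());
    }
}
```

Calling `AgeStats.mean(ages)` and `AgeStats.stdDev(ages)` on the parsed list would give figures comparable to the ones quoted above.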

most common first names

2. Most common first names
There are 21 people with the first name Ramesh, followed by Rajesh and Sunil. This suggests these names were popular around the time people of the mean age were born.
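
A frequency count like this can be sketched in a few lines of Java. The class `NameCounts` below is hypothetical, and treating the first whitespace-separated token as the first name is an assumption that, as the last-name section shows, does not hold for every region.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: counts the leading token of each full name. */
final class NameCounts {
    /** Maps each lower-cased first token to the number of times it occurs. */
    static Map<String, Integer> firstNameCounts(List<String> fullNames) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : fullNames) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank entries
            }
            // Assumption: the first token is the first name.
            String first = trimmed.split("\\s+")[0].toLowerCase(Locale.ROOT);
            counts.merge(first, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the key with the highest count, or null for an empty map; ties are broken arbitrarily. */
    static String mostCommon(Map<String, Integer> counts) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

The same counting works for last names by taking the final token instead, with the same caveat about name order.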

most common last name

most common father's last name

3. Most common last names
Patels are the largest group, with 118 people sharing this last name, followed by Sharma, Kumar and Gupta. When we looked for the most frequent father’s last name, Singh took the top spot. This looked amusing at first, but a manual check against the data set showed a big ambiguity in how names are written in India. In the north, people prefer to write their first name followed by their middle and last names. In central and western India, the last name precedes the first name. In the south there is no common family name as such; it’s your own name followed by your father’s or husband’s name. On some occasions a grandparent’s name is added as well. So there is no consensus on how to write a name. In this case we found that many people had written the last name before the first name in the father’s name field.

birthdays

4. Birthdays
Many people seem to share birthdays, with as many as 28 people sharing 7 dates among them. Interestingly, 62 people have their birthday on 01-June, followed by 01-July and 01-January (in different years). If birth dates were independent draws from a roughly uniform distribution over the year, we would not expect such spikes, so that assumption doesn’t look to hold here. It is possible that the birth dates in these documents are not real but were assigned later, when official records were prepared. In the plot, the first of each month looks more likely than any other day in that month. We also see a lot of people born around June and July. One explanation could be that most schools open around this time. For a lot of people, school admission could have been the first occasion on which they needed to officially declare a date of birth. So the first of June becomes the most likely day.
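
To put a number on the first-of-the-month effect, a small Java sketch like the one below could be used. `BirthdayCounts` is a hypothetical class, and it assumes the dates have already been parsed into `java.time.LocalDate` values. As a yardstick, if birthdays were spread evenly over the year, only about 12/365 (roughly 3.3%) of them would fall on the first of a month.

```java
import java.time.LocalDate;
import java.time.MonthDay;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for the birthday counts; not the original analysis code. */
final class BirthdayCounts {
    /** Counts how many people share each day of the year, ignoring the birth year. */
    static Map<MonthDay, Integer> countByDay(List<LocalDate> birthDates) {
        Map<MonthDay, Integer> counts = new TreeMap<>();
        for (LocalDate date : birthDates) {
            counts.merge(MonthDay.from(date), 1, Integer::sum);
        }
        return counts;
    }

    /** Fraction of people whose recorded birthday falls on the first of a month. */
    static double firstOfMonthShare(List<LocalDate> birthDates) {
        if (birthDates.isEmpty()) {
            return 0.0;
        }
        long firsts = birthDates.stream()
                .filter(date -> date.getDayOfMonth() == 1)
                .count();
        return (double) firsts / birthDates.size();
    }
}
```

A share far above that yardstick would support the idea that many of the recorded dates were assigned rather than real.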

5. Relations
We can also find relationships in this dataset. In some parts of India women change their name after marriage and take their husband’s first name as their middle name. In the south people generally have a first name and a last name, where the last name is their father’s or husband’s first name. So in a way we could build a whole family tree if we had all the names and dates of birth. We tried applying this rule to check whether we could find any relations. Surprisingly, all such relations appear in sequence in this list. We can also use the age difference to tell whether it is a parent-child relation, since children use their father’s name as their middle name too. This rule is less applicable to North Indian names; in such cases we used matching last names in consecutive entries, together with the age difference, to see whether they form a group. The caste system is still inherent in our society, and people usually marry within the same caste or religion. We also tried to find compatible last-name pairs that represent a common caste or group. However, because of the ambiguity in how the father’s name is written, we used both first and last names in the pair. From the results we can actually see such a pattern forming.

social cluster after inference from data

Sadly, as we hear in the latest news, batches 2-10 of the Mansarovar pilgrimage have been cancelled this year due to bad weather.

In case you want to do further analysis on this set, you may follow this starter code.

hadoopsummit13 meetup

Yesterday I attended a meet-up session on big data and machine learning. Hadoop Summit 2013 is kicking off in San Jose, and the event organizers were able to use it as an excuse to catch hold of some big names and vendors in this field.

Ted Dunning on Apache Mahout

The first speaker of the night was Ted Dunning, who as everyone knows is a guru in this field. He started with an introduction to Apache Mahout, pointing out areas where Mahout is good and comparable to the best-performing implementations on other platforms. He spoke about the different packages Mahout provides and how best to use them. For example, the recommendation package has a plethora of good online algorithms, but it performs poorly in classification tasks. He also spoke about the math library in Java, which can be used for vector and matrix manipulation much as in Python or Matlab. He also mentioned that these algorithms have both in-memory and distributed implementations, which will be something cool to check out. Link to his slides.


The second talk was from Alpine Data Labs, and it sounded almost like a sales pitch to me. They showed their parallel implementation of SVM, where the key was to apply an approximation technique to one of the computations of the Lagrange multiplier coefficients. It was a good descriptive talk and got many people thinking about the inner details of the algorithm.


0xdata started off with the theme of how they want to bring data science to the masses and help them avoid direct confrontation with the mathematics. Their product can interface with disparate sources like Excel, R and SAS and extend the in-memory implementations onto a distributed platform. They worked through an interesting proof of concept using an on-time airline dataset: http://stat-computing.org/dataexpo/2009/.

DMV Motorcycle written exam

This week I got my California driving license in exchange for my old NY license. California laws are stricter, and they make you take a written exam before issuing a license. You only need to answer 18 questions, with at most 3 incorrect answers. There are tons of sample papers available on the Internet to prepare for this exam. I also had to take another written exam for the motorcycle license, and there are very few resources for practicing for that one.

I found this link with a few test papers. I went through all of them, and many questions in the exam came from this set. However, there were still a few questions in the exam that I had no clue about. This exam has 25 questions, and you cannot pass if you make more than 4 mistakes. I failed my first attempt and only narrowly passed the second. Here are the two exams I took that day: (exam1) and (exam2). Hopefully they can help you prepare well for yours.

Decision Trees

Have you played that guessing game where it asks you 20 questions and guesses what you are thinking about? http://en.akinator.com/personnages/

In decision trees we use a criterion to split our data into parts and finally assign each sample to a class. As in Akinator, if we ask 20 questions and each question has 2 options (yes or no), we can distinguish at most 2^20 (about a million) people.

akinator

Questions are asked in the order that gives the best split. Say we have 10 people and want the tree to have minimum depth; we would then like to ask a question that splits the data into equal halves. In machine learning terminology the quality of a split is measured with entropy or information gain. Entropy is highest when both options are equally likely, and information gain is the drop in entropy achieved by a split. There are different ways to compute this splitting criterion, e.g. variance, entropy or Gini impurity. In layman’s terms, this function just returns how well a question will split the data at a particular node.
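
As a rough illustration of the entropy idea, here is a minimal Java sketch for a two-class (yes/no) node. The class and method names are made up for this example and are not taken from CART, ID3 or any particular library. Entropy is measured in bits, so a 50/50 node scores 1 and a pure node scores 0.

```java
/** Hypothetical helper: entropy-based split quality for a two-class problem. */
final class SplitQuality {
    /** Shannon entropy (in bits) of a node holding the given class counts. */
    static double entropy(int yes, int no) {
        int total = yes + no;
        if (total == 0) {
            return 0.0; // an empty node carries no uncertainty
        }
        return term(yes, total) + term(no, total);
    }

    /** One -p * log2(p) term; defined as 0 when the class is absent. */
    private static double term(int count, int total) {
        if (count == 0) {
            return 0.0;
        }
        double p = (double) count / total;
        return -p * Math.log(p) / Math.log(2);
    }

    /** Parent entropy minus the size-weighted entropy of the two children. */
    static double informationGain(int leftYes, int leftNo, int rightYes, int rightNo) {
        int left = leftYes + leftNo;
        int right = rightYes + rightNo;
        int total = left + right;
        if (total == 0) {
            return 0.0;
        }
        double parent = entropy(leftYes + rightYes, leftNo + rightNo);
        double children = (left * entropy(leftYes, leftNo)
                + right * entropy(rightYes, rightNo)) / total;
        return parent - children;
    }
}
```

With 10 people split 5/5 by class, a question that separates the classes perfectly has a gain of 1 bit, while a question that leaves each half nearly as mixed as before has a gain close to 0. A tree builder would pick the question with the highest gain at each node.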

Algorithms like CART and ID3 are popular for solving tree-based problems. This technique is particularly useful when we have nominal data, i.e. values that we have no way to measure or relate in terms of distance; they are incomparable. In a popular toy example, we are given the weather conditions (rain, sun) and ask how likely one is to go out to play. We enumerate all possible cases and the values associated with those combinations.

decision tree enumeration for toy example

Then we create a tree where the columns of the table become the questions we ask at each node. The sequence in which we ask the questions is learned from the training data.

Follow this tutorial for a computational example:
http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm

For working code in Java:
https://github.com/saebyn/java-decision-tree.git

Plot a route on a map

Today I was working on plotting a set of geo-locations on a Google map. These lat/lon pairs were part of a gpx file recorded on a motorcycle trip (auto-logged at a fixed interval). So here is how I formulated the problem statement.

gpx route plot on google map

Problem:

Plot all geo-coordinates on a google map, centered and scaled for the selected route.

Solution:

  1. Parse the gpx file and extract the required coordinates.
  2. Use the Google Map V3 JavaScript library to plot all the points.
  3. Step 2 does the plot, but the map also needs to be centered. To find the center location, take the min/max of lat/lon and use their midpoint.
  4. 	 
    // Needs java.util.List. Returns {centerLat, centerLon}: the midpoint of the route's bounding box.
    public static Double[] getCenterLatLon(List<TrackPoint> trackPoints){
    		Double[] center = new Double[2];
    		double minLat = 999.0;
    		double maxLat = -999.0;
    		double minLon = 999.0;
    		double maxLon = -999.0;
    		double lat, lon;
    		for(TrackPoint tp:trackPoints){
    			lat = tp.getLatitude();
    			lon = tp.getLongitude();
    
    			// Check min and max independently: with else-if, a point that
    			// lowers the minimum could never raise the maximum.
    			if(minLat>lat){
    				minLat = lat;
    			}
    			if(maxLat<lat){
    				maxLat = lat;
    			}
    
    			if(minLon>lon){
    				minLon = lon;
    			}
    			if(maxLon<lon){
    				maxLon = lon;
    			}
    
    		}
    		center[0] = (maxLat+minLat)/2;
    		center[1] = (maxLon+minLon)/2;
    
    		return center;
    	}
  5. Now our map needs to be scaled to fit the size of our map canvas. This can be done by setting the zoom option to the required level. Each zoom level scales the map by a factor of 2. Read here for more on how Google Maps zoom works. However, this technique requires us to find the radius of our plot first. Here is a detailed writeup on the relation between the earth’s curvature and lat/lons. We use the Haversine formula to compute the distance in miles.
    
    // Great-circle distance in miles between two lat/lng points (Haversine formula).
    public static double distFrom(double lat1, double lng1, double lat2, double lng2) {
    	double earthRadius = 3958.75; // earth radius in miles
    	double dLat = Math.toRadians(lat2-lat1);
    	double dLng = Math.toRadians(lng2-lng1);
    	double sindLat = Math.sin(dLat / 2);
    	double sindLng = Math.sin(dLng / 2);
    	// a = sin^2(dLat/2) + cos(lat1) * cos(lat2) * sin^2(dLng/2)
    	double a = Math.pow(sindLat, 2) + Math.pow(sindLng, 2)
    	            * Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2));
    	// c is the central angle between the two points, in radians
    	double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
    	double dist = earthRadius * c;
    
    	return dist;
    }
    
    

    Above snippet is taken from here.

  6. Once we have the radius, we can translate it into a zoom level on the map. I found this to be more of an approximation technique.
    
    public static long googleRadiusToZoomLevel(Double radius){
    	return Math.round(16-Math.log(radius)/Math.log(2));
    }
    
  7. We are all set. Additionally, you may want to scale the size of your markers as the zoom level changes. Here is one way of doing it.
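
To tie steps 3 to 6 together, here is a minimal self-contained Java sketch. It is not the code used for the plot above: the class `RouteZoom` is hypothetical, points are plain `{latitude, longitude}` arrays instead of `TrackPoint` objects, and taking the radius as the largest distance from the center to any point is an assumption about how the radius should be measured. The clamp to zoom levels 0 to 21 is also an assumed range, added so that a single-point route (radius 0) does not produce a nonsensical zoom. The bounding-box midpoint does not handle routes that cross the 180° meridian, and the point array is assumed to be non-empty.

```java
/** Hypothetical glue code: center, radius and zoom level for a route. */
final class RouteZoom {
    static final double EARTH_RADIUS_MILES = 3958.75; // same constant as the snippet above
    static final long MIN_ZOOM = 0;  // assumed lower bound
    static final long MAX_ZOOM = 21; // assumed upper bound

    /** Midpoint of the bounding box; each point is {latitude, longitude} in degrees. */
    static double[] center(double[][] points) {
        double minLat = Double.POSITIVE_INFINITY, maxLat = Double.NEGATIVE_INFINITY;
        double minLon = Double.POSITIVE_INFINITY, maxLon = Double.NEGATIVE_INFINITY;
        for (double[] p : points) {
            minLat = Math.min(minLat, p[0]);
            maxLat = Math.max(maxLat, p[0]);
            minLon = Math.min(minLon, p[1]);
            maxLon = Math.max(maxLon, p[1]);
        }
        return new double[] {(minLat + maxLat) / 2, (minLon + maxLon) / 2};
    }

    /** Haversine distance in miles between two points given in degrees. */
    static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
        double sinLat = Math.sin(Math.toRadians(lat2 - lat1) / 2);
        double sinLon = Math.sin(Math.toRadians(lon2 - lon1) / 2);
        double a = sinLat * sinLat + sinLon * sinLon
                * Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2));
        return EARTH_RADIUS_MILES * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    /** Largest distance from the center to any point of the route. */
    static double radiusMiles(double[][] points) {
        double[] c = center(points);
        double radius = 0;
        for (double[] p : points) {
            radius = Math.max(radius, distanceMiles(c[0], c[1], p[0], p[1]));
        }
        return radius;
    }

    /** Same approximation as step 6, clamped to the assumed zoom range. */
    static long zoomLevel(double radiusMiles) {
        if (radiusMiles <= 0) {
            return MAX_ZOOM; // a single point: zoom in as far as allowed
        }
        long zoom = Math.round(16 - Math.log(radiusMiles) / Math.log(2));
        return Math.max(MIN_ZOOM, Math.min(MAX_ZOOM, zoom));
    }
}
```

The values from `RouteZoom.center(points)` and `RouteZoom.zoomLevel(RouteZoom.radiusMiles(points))` are what would be passed as the map's center and zoom options. Since the formula is an approximation, the resulting zoom may still need a manual nudge of one level depending on the canvas size.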

Writing Business Plan

Last week I attended a session on writing a business plan, organized by our school of management. I found it very useful; here is what we learned.
 
Business Plan
A business plan is a kind of wiki document: it is dynamic, yet it covers all aspects of the product from a business perspective. One mistake people often make is to treat it as a grant proposal or a scientific paper detailing the specification of the product or idea. It should be simple yet informative, with the business as its focus.

In short, a business plan should specify the following:
1. Business description
2. What is the product or service
3. Who will buy it and why
4. How will you produce and develop it
5. What is your marketing strategy
6. Can you make a profit out of it
7. Is your team competent enough, and if not, what do you plan to do

 

You generally start with a mission statement and an objective. Don’t write about changing the world or making things beautiful (Mark Zuckerberg is an exception). Talk from a business perspective: the goal is to earn money. For instance, an objective like “Earning $$$$ by YYYY & being top Z in KLMN market share”.

 

Problem
A business is always about solving a problem, and you make money in exchange for that solution. In recent times, however, we have seen things change: markets that never existed have been created, the iPad being one example. Even so, you should still focus on customer problems and how your product can solve them or make existing solutions more efficient. As the author of UX Design for startups explains, always formulate your solution as Customer-Problem-Solution (CPS): start with who the customer is, then their problems, and then how your solution helps them. You may also include the profit attached to your solution for each problem.

 

Solution
Further, it should include the current stage of development; the further along you are, the better your chances of funding. Also include specific milestones; the more detailed and thorough, the better. On the internet the biggest risk is that anyone can copy your idea. Your strategy to protect it will play a crucial role and will give investors confidence in you.

 

Revenue
The revenue model is the single biggest point of interest for any investor. Be precise about who will pay for it (find your customers), how much they will pay, and which services or products they will pay for. How much profit do you really make?

 

Customers
Any business starts with customers. Before coming up with a solution, verify that there is a problem in the first place. To do this, carry out a market assessment. First find out how big the opportunity is, i.e. the upper bound. Then decide which portion of the market you are targeting. Don’t make the mistake of treating the whole market as your target. Focus on a small segment and define it clearly.

 

Once you identify your market, find out how it will accept your product. There are different levels at which you can measure the market’s acceptance of your product or idea:
1. You just believe
2. Validated with customer reviews/polls
3. Spoken to or interviewed potential customers/affiliates
4. Received purchase orders, or customers supported you on sites like Kickstarter
5. Generated $$$ in sales
 
Competition
Finding your competition and your strategy to stand out is another important aspect. Explain what is unique about your product or solution. Make a competitor matrix and show where you stand on it.

 

Marketing 
Now that you have done everything and have your product ready, how will your customers find you? Describe your marketing strategy. Think from a customer’s perspective and ask yourself how you search for a solution. If googling is one of the answers, then a higher page rank or SEO keyword techniques can get you to your customers. If it’s a mobile app, advertising on Facebook may find you more users. Be specific about geographic location, customer demographics and pricing strategy.
 

Operations
If you are an internet company, explain how you will develop your product, including code repository and server costs. Also include any additional workforce you will hire for sales or marketing.

 

Team
As they say, investors will always prefer an A team with a B idea over a B team with an A idea. So the team plays an important role. List relevant experience and accomplishments, and if there are any gaps, your strategy for filling them. List key advisers in your area.

 

Financial Plan
It should include an income statement, a balance sheet and cash flow statements, usually for 5 years. All your assumptions must be tied into the plan, showing how they affect your finances. In addition, if you are looking for funding, set milestones for when and how. There are two ways of growing, organic and disruptive. The former mostly runs on the founders’ money to start with and then grows with profits. The latter seeks investment from the market and gets things done quicker. Identify which type you are and how it impacts your business.

 

Writing Style
Write clearly and concisely. Do not overuse technical terms, and support your statements with facts. In short, treat this as your graduate school essay exam and be assertive and convincing. In addition, show your engagement with customers, your focus on money and how committed you are. Be optimistic and aim for a big market share. Never neglect or understate marketing by saying “The product is so cool that it will sell by itself”.

 

This is a living document; keep revising it every sprint. Always remember that the market keeps changing, so adapt and succeed.

 

A few useful links

Automate filters for candid shots

These days we see a lot of image-based apps in the market. In addition to improving image quality, they provide different filters that increase the overall appeal of an image.

I am currently facing a problem where we need to automate this process: we need to find out which filter best suits an image. This is different from Google Picasa’s “I’m Feeling Lucky” feature, where the goal is to fit a normal curve to the image exposure to get the best results. Here, however, it’s more of an artistic choice. If we have enough data on images and the filters applied to them, we can certainly build a model on top of it.

Can we find a setting that would appeal to most users? If we can, the more important step would be to personalize it: find the setting that a particular user would love the most.