Kailash Mansarovar Yaatra 2013

Heavy rains have disrupted normal life in Uttarakhand, a Himalayan state of India. Many tourists and pilgrims are stranded around temples there because of flooding and continuous downpours. We often think of summer as the season to travel, but in India summer brings the monsoon, so it is not the best weather for traveling. Still, most religious events take place during these months because the mountains are inaccessible in winter. For example, every year millions of pilgrims visit shrines such as Kedarnath, Badrinath and Hemkund Sahib. This year record rainfall has damaged these temples and caused an emergency in the state.

Mt. Kailash

Another pilgrimage, to Mt. Kailash and Lake Mansarovar, takes place during the same period. It is a highly coveted trip, and very few permits are issued each year. Because Mt. Kailash is in China, the trip involves a lot of paperwork and red tape. There is an application process, and pilgrims are selected by lottery.

Given the bad weather right now, I was wondering how this trip was progressing. Travelers (yaatris) are sent in batches for logistical reasons, and each batch consists of around 60 people. Interestingly, I found a list of this year’s yaatris on the Ministry of External Affairs website. The list includes each person’s name, age and father’s name. I thought I would play around with it and draw some statistics. Ideally this data should have been kept private, as it contains personal information.

age distribution

1. Average age of travelers
The mean is around 47 years, with a standard deviation of 11 years. The oldest person is around 70 years old and the youngest is 19.
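For anyone who wants to reproduce these numbers, here is a minimal sketch of the calculation. The class name `AgeStats` and the sample ages are made up for illustration; they are not the real list.

```java
import java.util.Arrays;

final class AgeStats {
    // Arithmetic mean of the ages.
    static double mean(int[] ages) {
        return Arrays.stream(ages).average().orElse(Double.NaN);
    }

    // Population standard deviation (divides by n, not n - 1).
    static double stdDev(int[] ages) {
        double m = mean(ages);
        double sumSq = 0.0;
        for (int a : ages) {
            sumSq += (a - m) * (a - m);
        }
        return Math.sqrt(sumSq / ages.length);
    }

    public static void main(String[] args) {
        int[] ages = {19, 35, 47, 52, 58, 70}; // made-up sample, not the real data
        System.out.printf("mean=%.1f sd=%.1f min=%d max=%d%n",
                mean(ages), stdDev(ages),
                Arrays.stream(ages).min().getAsInt(),
                Arrays.stream(ages).max().getAsInt());
    }
}
```

If you treat the list as a sample rather than the whole population, divide by n - 1 instead of n in `stdDev`.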

most common first names

2. Most common first names
There are 21 people with the first name Ramesh, followed by Rajesh and Sunil. We can infer that these names were popular around the time people of the mean age were born.
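A count like this takes only a few lines. The sketch below assumes the first whitespace-separated token of each entry is the first name, which is a naive assumption for Indian names; `NameCounts` and the sample entries are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class NameCounts {
    // Counts first names, taking the first token of each full name (naive assumption).
    static Map<String, Long> firstNameCounts(List<String> fullNames) {
        return fullNames.stream()
                .map(n -> n.trim().split("\\s+")[0].toUpperCase())
                .filter(s -> !s.isEmpty())
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
    }

    // The n most frequent names, ties broken alphabetically.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up entries, not taken from the real list.
        List<String> names = Arrays.asList("Ramesh Kumar", "Sunil Sharma", "Ramesh Patel");
        System.out.println(topN(firstNameCounts(names), 2));
    }
}
```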

most common last name

most common father's last name

3. Most common Last names
Patel is the most common last name, with 118 people, followed by Sharma, Kumar and Gupta. When we looked for the most frequent father’s last name, Singh took the top spot. That looked amusing at first, but a manual check against the data set showed that there is a lot of ambiguity in how names are written in India. In the north, people prefer to write their first name followed by the middle and last names. In central and western India, the last name precedes the first name. In the south there is often no common last name at all; it is your name followed by your father’s or husband’s name, and on some occasions a grandparent’s name is added too. So there is no consensus on how to write a name. In this case, we found that many people had written the last name first, followed by the first names, in the father’s name field.

birthdays

4. Birthdays
Many people seem to share birthdays, with as many as 28 people sharing 7 among them. Interestingly, 62 people have their birthday on 01-June, followed by 01-July and 01-January (in different years). If these were independent draws from a roughly uniform distribution over the year, we would not expect a few dates to dominate, which is clearly not the case here. It is possible that the birth dates in these documents are not real but were assigned later, when official records were prepared. In the plot, the first of each month looks more likely than any other day in that month, and a lot of people appear to be born around June and July. One explanation could be that most schools open during this time. For a lot of people, school admission could be the first instance when they needed to officially declare a date of birth, so the first of June becomes the most likely day.
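Here is a rough sketch of how the birthday counts can be computed, ignoring the year. `BirthdayCounts` and the sample dates are made up for illustration. As a yardstick, if birthdays were spread evenly over the year, only about 12/365 (roughly 3%) of people would fall on the first of a month.

```java
import java.time.LocalDate;
import java.time.MonthDay;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BirthdayCounts {
    // Counts how many people share each day-and-month, ignoring the year.
    static Map<MonthDay, Integer> countByDay(List<LocalDate> birthDates) {
        Map<MonthDay, Integer> counts = new TreeMap<>();
        for (LocalDate d : birthDates) {
            counts.merge(MonthDay.from(d), 1, Integer::sum);
        }
        return counts;
    }

    // Fraction of people whose recorded birthday is the 1st of any month.
    static double shareOnFirstOfMonth(List<LocalDate> birthDates) {
        if (birthDates.isEmpty()) {
            return 0.0;
        }
        long firsts = birthDates.stream().filter(d -> d.getDayOfMonth() == 1).count();
        return (double) firsts / birthDates.size();
    }

    public static void main(String[] args) {
        // Made-up dates, not taken from the real list.
        List<LocalDate> dates = Arrays.asList(
                LocalDate.of(1960, 6, 1), LocalDate.of(1972, 6, 1),
                LocalDate.of(1980, 3, 15), LocalDate.of(1955, 1, 1));
        System.out.println(countByDay(dates));
        System.out.println(shareOnFirstOfMonth(dates));
    }
}
```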

5. Relations
We can also look for relationships in this dataset. In some parts of India, women change their name after marriage and take their husband’s first name as their middle name. In the south, people generally have a first name and a last name, where the last name is the father’s or husband’s first name. So, in a way, we could build a whole family tree if we had all the names and dates of birth. We tried this rule to see if we could find any relations. Surprisingly, all such related people appear next to each other in the list. We can also use the age difference to check whether it is a parent-child relation, since children use their father’s name as their middle name too. This rule is less applicable to North Indian names; in such cases we used consecutive entries with the same last name, together with the age difference, to see if they form a group. There is also a caste system inherent in our society, and people usually marry within the same caste or religion. We tried to find compatible last-name pairs that represent a common caste or group. However, because of the ambiguity in how the father’s name is written, we used both first and last names in the pair. From the results we can actually see such a pattern forming.

social cluster after inference from data

Sadly, as we hear in the latest news, batches 2-10 of the Mansarovar pilgrimage have been cancelled this year due to bad weather.

In case you want to do further analysis on this data set, you may follow this starter code.

hadoopsummit13 meetup

Yesterday I attended a meet-up session on big data and machine learning. Hadoop Summit 2013 is kicking off in San Jose, and the event organizers were able to use it as an excuse to catch hold of some big names and vendors in this field.

Ted Dunning on Apache Mahout

The first speaker of the night was Ted Dunning, who, as everyone knows, is a guru in this field. He started off with an introduction to Apache Mahout, pointing out areas where Mahout is good and comparable to the best-performing implementations on other platforms. He spoke about the different packages Mahout provides and how best to use them. For example, the recommendation package has a plethora of good online algorithms, but Mahout performs poorly on classification tasks. He also spoke about its Java math library, which can be used for vector and matrix manipulations much as in Python or Matlab. He also mentioned that these algorithms have both in-memory and distributed implementations, which will be something cool to check out. Link to his slides.


The second talk was from Alpine Data Labs, and it sounded almost like a sales pitch to me. They showed their parallel implementation of SVM, where the key was to apply an approximation technique to one of the computations of the Lagrange multiplier coefficients. It was a good descriptive talk and got many people thinking about the inner details of the algorithm.


0xdata started off with the theme of how they want to bring data science to the masses and spare them a direct confrontation with mathematics. Their product can interface with disparate sources like Excel, R and SAS, and extend in-memory implementations onto a distributed platform. They worked through an interesting proof of concept using an airline on-time dataset http://stat-computing.org/dataexpo/2009/.

DMV Motorcycle written exam

This week I got my California driving license in exchange for my old NY license. California laws are stricter, and they make you take a written exam before issuing a license. You only need to answer 18 questions, with a maximum of 3 incorrect answers. There are tons of sample papers available on the Internet to prepare for this exam. I also had to take another written exam for the motorcycle license, and there are very few resources for practicing for that one.

I found this link with a few test papers. I went through all of them, and many questions in the exam came from this set. However, there were still a few questions in the exam that I had no clue about. This exam has 25 questions, and you cannot pass if you get more than 4 wrong. I failed my first attempt and only marginally passed the second. Here are the two exams I took that day: (exam1) and (exam2). Hopefully they can help you prepare well for yours.

Decision Trees

Have you played that guessing game where they ask you 20 questions and guess what you are thinking about? http://en.akinator.com/personnages/

In decision trees we use a criterion to split our data into parts and finally assign each sample to a class. As in Akinator, if they ask 20 questions and each question has 2 options (yes or no), we can distinguish at most 2^20 people, which is about a million.

akinator

Questions are asked in a way that results in the best split. Say we have 10 people and we want the tree to have minimum depth: we would like to ask a question that splits the data into equal halves. In machine learning terminology, we measure this with entropy or information gain. Entropy is highest when both options are equally likely. There are different functions that can be used here, e.g. variance, entropy or Gini impurity. In layman’s terms, the function just tells us how well a question will split the data at a particular node.
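To make this concrete, here is a small sketch of two such functions, computed from the number of samples of each class at a node. The class name `Impurity` is my own, and the code assumes the node holds at least one sample.

```java
final class Impurity {
    // Shannon entropy in bits, given the number of samples in each class.
    static double entropy(int... classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) {
                continue; // 0 * log(0) is taken as 0
            }
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Gini impurity: 1 minus the sum of squared class proportions.
    static double gini(int... classCounts) {
        int total = 0;
        for (int c : classCounts) {
            total += c;
        }
        double sumSq = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    public static void main(String[] args) {
        System.out.println(entropy(5, 5));  // an even split of 10 people
        System.out.println(entropy(10, 0)); // everyone in one class
        System.out.println(gini(5, 5));
    }
}
```

Both functions are at their maximum for an even split and drop to zero when all samples at the node belong to one class.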

Algorithms like CART and ID3 are popular for solving tree-based problems. The technique is particularly useful when we have nominal data, i.e. values that we have no way to measure or relate in terms of distance; they are incomparable. A popular toy example asks: given conditions such as rain or sun, how likely is someone to go out and play? We enumerate all possible cases and the values associated with those combinations.

decision tree enumeration for toy example

Then we create a tree where the columns of the table become the questions we ask at each node. The order in which we ask the questions is learned from the training data.
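As a rough illustration of how that order can be learned, the sketch below picks the column with the highest information gain, which is the idea behind ID3's choice of the root question. The class `InfoGain` and the four-row table are made up; the last column of each row is taken to be the class label.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InfoGain {
    // Shannon entropy (bits) of a list of class labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) {
            counts.merge(l, 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain from splitting on column `col`; the last column is the label.
    static double gain(String[][] rows, int col) {
        int labelCol = rows[0].length - 1;
        List<String> all = new ArrayList<>();
        Map<String, List<String>> byValue = new HashMap<>();
        for (String[] r : rows) {
            all.add(r[labelCol]);
            byValue.computeIfAbsent(r[col], k -> new ArrayList<>()).add(r[labelCol]);
        }
        double remainder = 0.0;
        for (List<String> subset : byValue.values()) {
            remainder += (double) subset.size() / rows.length * entropy(subset);
        }
        return entropy(all) - remainder;
    }

    // Index of the attribute column with the highest gain: the question to ask first.
    static int bestColumn(String[][] rows) {
        int best = 0;
        for (int c = 1; c < rows[0].length - 1; c++) {
            if (gain(rows, c) > gain(rows, best)) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up toy table: outlook, windy, play?
        String[][] rows = {
            {"sunny", "no", "yes"}, {"sunny", "yes", "yes"},
            {"rainy", "no", "no"}, {"rainy", "yes", "no"}
        };
        System.out.println("ask column " + bestColumn(rows) + " first");
    }
}
```

A full tree builder would apply the same choice again to each subset of rows until the subsets are pure.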

Follow this tutorial for computational example
http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm

For working code in Java
https://github.com/saebyn/java-decision-tree.git

Plot a route on a map

Today I was working on plotting a set of geo-locations on a Google map. These lat/lon points were part of a GPX file recorded on a motorcycle trip (it auto-logs at a fixed interval). Here is how I formulated the problem.

gpx route plot on google map

route plot

Problem:

Plot all geo-coordinates on a google map, centered and scaled for the selected route.

Solution:

  1. Parse the GPX file and extract the required coordinates.
  2. Use the Google Maps JavaScript API v3 to plot all the points.
  3. Step 2 does the plot, but the map also needs to be centered. To find the center location, find the min/max of lat/lon.
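  4. The center is the midpoint of that bounding box:

    // Needs java.util.List; TrackPoint is the point type produced by the GPX parser.
    public static Double[] getCenterLatLon(List<TrackPoint> trackPoints) {
        double minLat = Double.POSITIVE_INFINITY;
        double maxLat = Double.NEGATIVE_INFINITY;
        double minLon = Double.POSITIVE_INFINITY;
        double maxLon = Double.NEGATIVE_INFINITY;
        for (TrackPoint tp : trackPoints) {
            double lat = tp.getLatitude();
            double lon = tp.getLongitude();
            // Track min and max independently so that every point,
            // including the first, counts toward both.
            minLat = Math.min(minLat, lat);
            maxLat = Math.max(maxLat, lat);
            minLon = Math.min(minLon, lon);
            maxLon = Math.max(maxLon, lon);
        }
        // Midpoint of the bounding box: {latitude, longitude}.
        return new Double[] { (maxLat + minLat) / 2, (maxLon + minLon) / 2 };
    }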
  5. Now the map needs to be scaled to fit our map canvas. This can be done by setting the zoom option to the required level. Each zoom level scales the map by a factor of 2. Read here for more on how Google Maps zoom works. This technique requires us to find the radius of our plot first. Here is a detailed writeup on the relation between the earth’s curvature and lat/lons. We use the Haversine formula to compute the distance in miles.
    
    public static double distFrom(double lat1, double lng1, double lat2, double lng2) {
    	double earthRadius = 3958.75;
    	double dLat = Math.toRadians(lat2-lat1);
    	double dLng = Math.toRadians(lng2-lng1);
    	double sindLat = Math.sin(dLat / 2);
    	double sindLng = Math.sin(dLng / 2);
    	double a = Math.pow(sindLat, 2) + Math.pow(sindLng, 2)
    	            * Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2));
    	double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
    	double dist = earthRadius * c;
    
    	return dist;
    }
    
    

    Above snippet is taken from here.

  6. Once we have the radius, we can translate it into a zoom level on the map. I found this to be only an approximation.
    
    public static long googleRadiusToZoomLevel(Double radius){
    	return Math.round(16-Math.log(radius)/Math.log(2));
    }
    
  7. We are all set. Additionally, you may want to scale the size of your markers as the zoom level changes. Here is one way of doing it.
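To show how the steps fit together, here is a self-contained sketch that goes from a list of points to a center and a zoom level. It reuses the Haversine and zoom formulas from above. The `RouteFit` class and its `Point` type are stand-ins I made up for this example, and taking the radius as the distance from the center to the farthest point is my assumption about what "radius" means here.

```java
import java.util.Arrays;
import java.util.List;

final class RouteFit {
    // Minimal stand-in for a parsed GPX track point (hypothetical type).
    static final class Point {
        final double lat;
        final double lon;
        Point(double lat, double lon) { this.lat = lat; this.lon = lon; }
    }

    // Haversine distance in miles, same formula as in step 5.
    static double distFrom(double lat1, double lng1, double lat2, double lng2) {
        double earthRadius = 3958.75;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.pow(Math.sin(dLng / 2), 2)
                * Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2));
        return earthRadius * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    // Center of the bounding box: {lat, lon}.
    static double[] center(List<Point> pts) {
        double minLat = Double.POSITIVE_INFINITY, maxLat = Double.NEGATIVE_INFINITY;
        double minLon = Double.POSITIVE_INFINITY, maxLon = Double.NEGATIVE_INFINITY;
        for (Point p : pts) {
            minLat = Math.min(minLat, p.lat);
            maxLat = Math.max(maxLat, p.lat);
            minLon = Math.min(minLon, p.lon);
            maxLon = Math.max(maxLon, p.lon);
        }
        return new double[] {(minLat + maxLat) / 2, (minLon + maxLon) / 2};
    }

    // Radius in miles: distance from the center to the farthest track point.
    static double radiusMiles(List<Point> pts) {
        double[] c = center(pts);
        double r = 0.0;
        for (Point p : pts) {
            r = Math.max(r, distFrom(c[0], c[1], p.lat, p.lon));
        }
        return r;
    }

    // Zoom approximation from step 6.
    static long zoomLevel(double radius) {
        return Math.round(16 - Math.log(radius) / Math.log(2));
    }

    public static void main(String[] args) {
        // Made-up coordinates for illustration.
        List<Point> pts = Arrays.asList(new Point(37.0, -122.0), new Point(38.0, -121.0));
        double[] c = center(pts);
        System.out.println(c[0] + "," + c[1] + " zoom " + zoomLevel(radiusMiles(pts)));
    }
}
```

Two caveats: a route with a single point has radius 0, for which the logarithm blows up, so guard against that before calling `zoomLevel`; and averaging longitudes does not work for a route that crosses the 180° meridian.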

Writing Business Plan

Last week I attended a session on writing a business plan, organized by our school of management. I found it very useful; here is what we learned.
 
Business Plan
A business plan is a kind of wiki document: it is dynamic, yet it covers all aspects of the product from a business perspective. One mistake people often make is to think of it as a grant proposal or a scientific paper detailing the specification of the product or idea. It should be simple yet informative, with the business as its focus.

In short, a business plan should specify the following:
1. Business description
2. What is the product or service
3. Who will buy it and why
4. How will you produce and develop it
5. What is your marketing strategy
6. Can you make profit out of it
7. Is your team competent enough, if not what do you plan to do

 

You generally start with a mission statement and an objective. Don’t write about changing the world or making things beautiful (Mark Zuckerberg is an exception). Talk from a business perspective: the goal is to earn money. For instance, an objective like “Earning $$$$ by YYYY and being in the top Z by share of the KLMN market”.

 

Problem
A business is always about solving a problem, and you make money in exchange for that solution. In recent times we have seen things change: markets that never existed were created, the iPad for example. However, you should still focus on customer problems and on how your product can solve them or make customers more efficient. As the author of UX Design for startups explains, always formulate your solution as Customer-Problem-Solution (CPS): start with who the customer is, then their problems, and then how your solution helps them. You may also include the profit associated with your solution to each problem.

 

Solution
This section should also include the current stage of development; the further along you are, the better your chances of funding. Include specific milestones as well; the more detailed and thorough, the better. On the internet, a big risk is that anyone can copy your idea. Your strategy to protect it plays a crucial role and will give investors confidence in you.

 

Revenue
The revenue model is the single biggest point of interest for any investor. Be precise about who will pay (find your customers), how much they will pay, and which services or products they will pay for. How much profit do you really make?

 

Customers
Any business starts with customers. Before coming up with a solution, verify that there is a problem in the first place. To do this, carry out a market assessment. First find out how big the opportunity is, i.e. the upper bound. Then decide which portion of the market you are targeting. Don’t make the mistake of treating the whole market as your target. Focus on a small segment and define it clearly.

 

Once you identify your market, find out how the market will accept your product. There are different levels at which you can measure the market’s acceptance of your product or idea:
1. You just believe
2. Validated with customer reviews/polls
3. Spoken to or interviewed potential customers/affiliates
4. Received purchase orders, or customers supported you on sites like Kickstarter
5. Generated $$$ in sales
 
Competition
Identifying your competition and your strategy to stand out is another important aspect. Explain what is unique about your product or solution. Make a competitor matrix and show where you stand on it.

 

Marketing 
Now that you have done everything and have your product ready, how will your customers find you? Describe your marketing strategy. Think from a customer’s perspective and ask yourself how you would search for a solution. If googling is one of the answers, then a higher page rank or SEO keyword techniques can get you to your customers. If it is a mobile app, advertising on Facebook may find you more users. Be specific about geographic location, customer demographics and pricing strategy.
 

Operations
If you are an internet company, describe how you will develop your product, including code repository and server costs. Also include any additional workforce you will hire for sales or marketing.

 

Team
As they say, investors will always prefer an A team with a B idea to a B team with an A idea, so the team plays an important role. List relevant experience and accomplishments and, if there are any gaps, your strategy for filling them. List key advisers in your area.

 

Financial Plan
This should include an income statement, balance sheet and cash flow statement, usually for 5 years. All your assumptions must be tied into the plan, showing how they affect your finances. In addition, if you are looking for funding, set milestones for when and how. There are two ways to grow, organic and disruptive. The former mostly runs on the founders’ money to start with and then grows with profits. The latter seeks investment from the market and gets things done quicker. Identify which type you are and how it impacts your business.

 

Writing Style
Write clearly and concisely. Do not overuse technical terms, and support your statements with facts. In short, treat this like your graduate school essay-writing exam: be assertive and convincing. In addition, show your engagement with customers, your focus on money and how committed you are. Be optimistic and aim for a big market share. Never understate marketing by saying “The product is so cool that it will sell by itself”.

 

This is a living document; keep revising it every sprint. Always remember that the market keeps changing, so adapt and succeed.

 

Few useful links

Automate filters for candid shots

These days we see a lot of image-based apps in the market. In addition to improving image quality, they provide different filters that increase the overall appeal of an image.

I am currently facing a problem where we need to automate this process: we need to find out which filter best suits an image. This is different from Google Picasa’s “I am feeling lucky” feature. There, we try to fit a normal curve for image exposure to get the best results. Here it is more of an artistic choice. If we have enough data on images and the filters applied to them, we can certainly build a model on top of it.

Can we find a setting that would appeal to most users? If we can, the more important step would be to personalize it: find the setting that a particular user would love the most.

Why is it difficult for machines to comprehend images?

We have done considerable work with words. There are search engines that can find a match among billions of documents in the blink of an eye. However, the same is not true for images or videos. I think the reason is inherent in the representation.

A language consists of characters, words and some basic rules. A finite number of characters is used to represent any language, and a dictionary contains most of the words. Each word has a meaning on its own, and a contextual meaning when associated with other words. So, in a way, we have captured written language in its structural form and given machines a similar encoding, and the problem is deemed solved.

With images, however, there is no fixed set of rules at the macroscopic level; it is safe to say the set is infinite. At a broad level we currently consider a few categories, like trees, people, cars and houses, and try to label them. Now the question is why this is inherently difficult. How do we tell whether two visual references to an object refer to the same object, e.g. two images of a room, each having a bed, pillow, windows and carpet?

bed pillow window
How does the machine label something as a room, a bed or a pillow? One way is to recognize each object independently, using features encoded in the machine model, such as the shape of the object; e.g. a chair will have 4 legs and a back support. A second way is to use information from the spatial domain. An object may be difficult to recognize in itself, but it can be identified in relation to other objects. For example, a pillow has a rectangular shape in a 2D image, but when we see a room with a pillow on the bed, our confidence that it is a pillow increases. Now say there is a book and a pillow on the bed. By also encoding the sharpness of edges, we can distinguish a book from a pillow. Scale is another feature: a pillow will have a size proportional to the objects around it.

For a moment, let us think about how we see things. As we grow up, we build a prototype of the world as a visual reference in our head. We see cars, trucks and bikes when we go out on the road; the next time we see a road, we know what to anticipate. If we go inside a house, there are a number of objects we can expect. So when we see an image, there is a limited set of objects that it needs to be matched against.

Essentially, our machine models should make use of this context. But to use this context, the machine needs a similar model, say a prototype of the world like a 3D computer game in which each object is available in enough detail.

A generalized representation of the world with enough details. Image from : http://www.3dcity-world.com/3dcity/

Sometimes objects are occluded or only partly visible. Say a table and chair

With context it is much easier to predict. Say we see a room with a car in it. If the car is outside the window, on a road (and there is sufficient information that it is outside), it is a real car. However, if it is inside the room, it must be a toy car (a miniature model), or the room is a garage. In that case the room will not have the things we see in a normal room, and our model should choose garage.

At a low level, digital images are still a grid of pixels, each encoding an intensity (grayscale) or color value. We can iterate through an image by running two loops, one over the height and one over the width. Everything we do for matching, recognition and detection, to make machines understand our world and see things the way humans do, has to happen within that loop.
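For illustration, this is what that double loop looks like with Java's `BufferedImage`. The class `PixelScan` is a name I made up, and converting a color pixel to gray by averaging its three channels is a simplifying assumption.

```java
import java.awt.image.BufferedImage;

final class PixelScan {
    // Visits every pixel with two nested loops and returns the mean gray level (0 to 255).
    static double meanGray(BufferedImage img) {
        long sum = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                sum += (r + g + b) / 3; // simple channel average as the gray value
            }
        }
        return (double) sum / ((long) img.getWidth() * img.getHeight());
    }

    public static void main(String[] args) {
        // A tiny in-memory image: all black except one white pixel.
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFFFFFF);
        System.out.println(meanGray(img));
    }
}
```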

Camera, Images and EXIF

Last year I lost my camera during a trip north of San Francisco. I was sad for about a week, till those memory cells faded. Then it struck me: why do we lose things? There should be a way to locate them, like we do with mobile phones.

Another alternative is to do an exhaustive search of all the images shared on the internet and find any that were taken with your camera. Digital cameras store meta information in EXIF format, which can even include the serial number of your camera. A serial number can uniquely identify a device. Stolen Camera Finder works on this basis and claims to crawl all images. However, it returned zero results when I searched for mine. Although I have published many pictures taken with the same camera in my Picasa album, this search engine fails to identify them. There can be many reasons why this approach would fail.

1. Crawlers aren’t good enough. I wish Google had started this service; it would help so many people.

2. Images with missing EXIF info. I ran a test using EXIF-py, a Python library, on different images from Facebook, Picasa, Flickr and the web, and found that only very few preserved this information. Facebook seems to remove EXIF information from pictures. I saw the serial number for a few images on Picasa. In some cases only the model number is present, not the serial number.

So it seems most image-editing software does not write EXIF information, which makes it difficult to search for a lost camera by simply crawling images. We need to look for a different approach to solve this problem.

Writing experience on iPad

Last month I got my hands on an iPad. This device is wonderful. It opens up a lot of possibilities for how we can consume and interact with information in a more natural way. Since then, I have been installing numerous applications in order to be more organized. I would like to mention one app called “habit pro”. As we know, habits are formed by repeated actions, so in this app you jot down the things you would like to be part of your daily life. Later you mark whether you did them or not, and monitor your performance over time.

notebook vs ipad

I was also searching for a daily journal or diary kind of app, where I can do my everyday introspection and make a note of it (not just 4 or 5 lines of tweet-like messages). I have been using a conventional diary for about 11 years now. The problem is that I have a lot of them and lost a few while moving (our lives are so mobile these days). So I was looking for an app that would recreate the same experience for me. I found a few in which you can either type or write with a stylus, and I thought a stylus would be the more natural transition. In search of one, I started behaving like a consumer: I read about different kinds on Amazon and watched a couple of videos on YouTube. Finally I went to Fry’s yesterday evening to try them out. There were quite a few, ranging from $6 to $39. They were good for occasional marking or drawing, but when I started writing, they could not compare to the feel of writing with a pencil on paper.

First, the iPad screen is smaller than a conventional A4 sheet and the writing resolution is much coarser, so you can write only a few lines on one screen. On top of that, you are writing on glass, so the feedback is poor, and you might touch some part of the screen with your palm and activate some other widget, which is annoying.
