Kailash Mansarovar Yaatra 2013

Heavy rains have disrupted normal life in Uttarakhand, a Himalayan state of India. A lot of tourists and pilgrims are stranded around temples there due to flooding and continuous downpour. We often associate summer with travel; in India, however, summer brings the monsoon with it. It is not the best weather for traveling, yet most religious events take place during these months because the mountains are inaccessible in winter. For example, every year millions of pilgrims visit shrines like Kedarnath, Badrinath and Hemkund Sahib. This year a record rainfall has damaged these temples and caused an emergency in the state.

Mt. Kailash

Another pilgrimage, which heads to Mt. Kailash and Mansarovar lake, happens during the same period. It is one of the most coveted trips, and very few permits are issued each year. Since Mt. Kailash is in China, the trip involves a lot of process and red tape: there is an application, and a lottery is drawn to select the travelers.

Given the bad weather right now, I was wondering how this trip was progressing. Travelers (yaatris) are sent in different batches for logistical reasons, each batch consisting of around 60 people. Interestingly enough, I found a list of this year's yaatris on the Ministry of External Affairs website. The list includes each traveler's name, age and father's name. I thought of playing around with it and drawing some statistics. Ideally this data should have been kept private and not disclosed, as it contains personal information.

age distribution

1. Average age of travelers
The mean age is around 47 years, with a standard deviation of 11 years. The oldest traveler is around 70 years old and the youngest is 19.
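
For reference, here is a minimal sketch of how these two numbers can be computed, assuming the ages have already been parsed out of the list into an array:

    // Minimal sketch: mean and (population) standard deviation of ages.
    public static double[] meanAndStd(int[] ages) {
        double sum = 0;
        for (int age : ages) sum += age;
        double mean = sum / ages.length;

        double sqDiffSum = 0;
        for (int age : ages) sqDiffSum += (age - mean) * (age - mean);
        double std = Math.sqrt(sqDiffSum / ages.length);

        return new double[]{mean, std};
    }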

most common first names

2. Most common first names
There are 21 people with the first name Ramesh, followed by Rajesh and Sunil. We can infer that these names were popular among the masses when people around the mean age were born.
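
Counting these is a simple frequency tally. A minimal sketch, assuming the first names have already been extracted into a list:

    import java.util.*;

    // Minimal sketch: tally first-name frequencies, most common first.
    public static List<Map.Entry<String, Integer>> nameFrequencies(List<String> firstNames) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String name : firstNames) {
            Integer c = counts.get(name);
            counts.put(name, c == null ? 1 : c + 1);
        }
        List<Map.Entry<String, Integer>> sorted =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue(); // descending by count
            }
        });
        return sorted;
    }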

most common last name

most common father's last name

3. Most common Last names
Patels have the majority, with 118 people carrying that last name, followed by Sharma, Kumar and Gupta. When we look for the most common father's last name, Singh takes the top spot. For a while this looked amusing, but a manual check against the data set reveals a big ambiguity in how we write names in India. In the north, people prefer first name, then middle, then last. In central and western India, the last name precedes the first name. In the south there is no common last name as such; it is your name followed by your father's or husband's name, and on some occasions a grandparent's name is added too. So there is no consensus on how to write a name. In this case we found that many people had written their last name before their first name in the father's-name field.

birthdays

4. Birthdays
Many people seem to share birthdays, with as many as 28 people sharing just 7 dates among them. Interestingly, 62 people have their birthday on 01-June, followed by 01-July and 01-January (across different years). If birth dates were drawn independently and identically from the underlying distribution, we would not expect this kind of clustering, which is what random sampling would imply. It is possible that the birth dates in these documents are not real but were assigned later, when official records were prepared. Looking at the plot, the first of each month appears more likely than any other day in that month, and a lot of people were born around June/July. One explanation could be that most schools open around this time; for many people this may have been the first occasion on which they needed to officially declare a date of birth, so the first of June becomes the most likely day.

5. Relations
We can also find relationships in this dataset. In some parts of India, women change their name after marriage and take their husband's first name as their middle name. In the south, people generally have a first name and a last name, where the last name is the father's or husband's first name. So in principle we could build a whole ancestry tree given all the names and dates of birth. We tried to use this rule to check for relations, and surprisingly all such relations appear adjacent in this list. We can also use the age difference to decide whether a relation is paternal, since children use their father's name as their middle name too. This rule is less applicable to North Indian names; in those cases we used adjacent last names together with age differences to see whether they represent a group. Our society also has an inherent caste system, and people usually marry within the same caste or religion, so we tried to find compatible last-name pairs that represent a common caste or group. Due to the ambiguity in how the father's name is written, we used both first and last names in the pair. From the results we can actually see such a pattern forming.
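
As a rough sketch of the name-matching heuristic described above (the Person type is hypothetical, and the 18-year age threshold is an assumption, not something taken from the data):

    // Hypothetical record for one entry in the published list.
    class Person {
        String firstName, middleName, lastName;
        int age;
        Person(String first, String middle, String last, int age) {
            this.firstName = first; this.middleName = middle;
            this.lastName = last; this.age = age;
        }
    }

    // Minimal sketch of the heuristic: if a's middle name equals b's first
    // name, treat a large age gap as paternal and a small one as marital.
    public static String guessRelation(Person a, Person b) {
        if (a.middleName != null && a.middleName.equals(b.firstName)) {
            int gap = b.age - a.age;
            if (gap >= 18) return "possible father/child"; // assumed threshold
            return "possible spouse";
        }
        return "no inferred relation";
    }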

social cluster after inference from data

Sadly, as per the latest news, batches 2-10 of the Mansarovar pilgrimage have been cancelled this year due to bad weather.

In case you want to do further analysis on this data set, you may follow this starter code.


Decision Trees

Have you played that guessing game where they ask you 20 questions and guess what you are thinking about? http://en.akinator.com/personnages/

In decision trees we use a criterion to split our data into parts and finally classify the sequence into a class. As in Akinator: if they ask 20 questions and each question has 2 options (yes or no), they can distinguish at best 2^20 people.

akinator

Questions are asked in a way that results in the best split. Say we have 10 people and we want the tree to have minimum depth; we would like to ask a question that splits the data into equal halves. In machine learning terminology we measure this with entropy and information gain: entropy is highest when both options are equally likely. There are different impurity functions that can be used as the split criterion, e.g. entropy, Gini impurity, or variance reduction. In layman's terms, this function just returns how well a question will split the data at a particular node.
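
A minimal sketch of binary entropy and the information gain of a split (class counts per branch are assumed to be precomputed, and both children are assumed non-empty):

    // Entropy of a binary outcome: highest (1 bit) when p = 0.5.
    public static double entropy(double p) {
        if (p <= 0 || p >= 1) return 0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // Information gain of a question: entropy of the parent node minus
    // the weighted entropy of the two children the question produces.
    public static double infoGain(int posLeft, int negLeft, int posRight, int negRight) {
        double left = posLeft + negLeft;
        double right = posRight + negRight;
        double total = left + right;
        double parent = entropy((posLeft + posRight) / total);
        double children = (left / total) * entropy(posLeft / left)
                        + (right / total) * entropy(posRight / right);
        return parent - children;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }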

Algorithms like CART and ID3 are popular for building such trees. The technique is especially useful when we have nominal data, i.e. values we cannot measure or relate in terms of distance; they are incomparable. A popular toy example asks, given weather conditions such as rain or sun, how likely one is to go out to play. We enumerate all possible cases and the outcomes associated with those combinations.

decision tree enumeration for toy example

Then we create a tree where the columns in the table become the questions we ask at each node. The sequence in which we ask the questions is learned from the training data.

Follow this tutorial for a computational example:
http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm

For working code in Java
https://github.com/saebyn/java-decision-tree.git

Plot a route on a map

Today I was working on plotting a set of geo-locations on a Google map. The lat/lon points were part of a GPX file recorded on a motorcycle trip (the logger records a point every fixed interval). Here is how I formulate the problem statement.

gpx route plot on google map

Problem:

Plot all geo-coordinates on a google map, centered and scaled for the selected route.

Solution:

  1. Parse the GPX file & extract the required coordinates (see the parsing sketch after this list).
  2. Use the Google Maps V3 JavaScript library & plot all the points.
  3. Step 2 does the plot; however, the map also needs to be centered. To find the center location, find the min/max of lat/lon.
  4. Compute the center as the midpoint of the bounding box:
    // Center of the route = midpoint of the bounding box of all points.
    public static Double[] getCenterLatLon(List<TrackPoint> trackPoints) {
        Double[] center = new Double[2];
        double minLat = Double.MAX_VALUE;
        double maxLat = -Double.MAX_VALUE;
        double minLon = Double.MAX_VALUE;
        double maxLon = -Double.MAX_VALUE;

        for (TrackPoint tp : trackPoints) {
            double lat = tp.getLatitude();
            double lon = tp.getLongitude();
            // Track min and max independently, so a single point can
            // update both extremes.
            minLat = Math.min(minLat, lat);
            maxLat = Math.max(maxLat, lat);
            minLon = Math.min(minLon, lon);
            maxLon = Math.max(maxLon, lon);
        }
        center[0] = (maxLat + minLat) / 2;
        center[1] = (maxLon + minLon) / 2;

        return center;
    }
  5. Now our map needs to be scaled & normalized to the size of our map canvas. This can be done by setting the zoom option to the required level; each zoom level scales the map by a factor of 2. Read here for more on how Google map zoom works. The above technique, however, requires us to find the radius of our plot first. Here is a detailed writeup on understanding the relation between the earth's curvature and lat/lons. We use the Haversine formula to compute our distance in miles.
    
    // Haversine formula: great-circle distance between two lat/lon
    // points, returned in miles (earthRadius is in miles).
    public static double distFrom(double lat1, double lng1, double lat2, double lng2) {
        double earthRadius = 3958.75; // miles
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double sindLat = Math.sin(dLat / 2);
        double sindLng = Math.sin(dLng / 2);
        // a = sin^2(dLat/2) + cos(lat1) * cos(lat2) * sin^2(dLng/2)
        double a = Math.pow(sindLat, 2) + Math.pow(sindLng, 2)
                    * Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2));
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        double dist = earthRadius * c;

        return dist;
    }
    
    

    The above snippet is taken from here.

  6. Once we have the radius, we can translate it into a zoom level on the map. The mapping I found is an approximation:
    
    // Each zoom level halves the visible span, hence the log base 2;
    // 16 is an empirically chosen offset.
    public static long googleRadiusToZoomLevel(Double radius) {
        return Math.round(16 - Math.log(radius) / Math.log(2));
    }
    
  7. We are all set. Additionally, you may want to scale the size of your markers as the zoom level changes. Here is one way of doing it.
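
As promised in step 1, here is a minimal parsing sketch using the JDK's built-in DOM parser. It assumes a standard GPX file whose trkpt elements carry lat/lon attributes, and a hypothetical TrackPoint(lat, lon) constructor matching the type used above:

    import java.io.File;
    import java.util.*;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;

    // Minimal sketch: extract lat/lon from every <trkpt> in a GPX file.
    public static List<TrackPoint> parseGpx(File gpxFile) throws Exception {
        List<TrackPoint> points = new ArrayList<TrackPoint>();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(gpxFile);
        NodeList trkpts = doc.getElementsByTagName("trkpt");
        for (int i = 0; i < trkpts.getLength(); i++) {
            Element e = (Element) trkpts.item(i);
            double lat = Double.parseDouble(e.getAttribute("lat"));
            double lon = Double.parseDouble(e.getAttribute("lon"));
            points.add(new TrackPoint(lat, lon)); // assumed constructor
        }
        return points;
    }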

Automate filters for candid shots

These days we see a lot of image-based apps in the market. In addition to improving image quality, they provide different filters that increase the overall appeal of an image.

I am currently facing a problem where we need to automate this process: find out which filter suits an image best. This is different from Google Picasa's "I'm Feeling Lucky" feature, where we try to fit a normal curve to the image exposure to get the best result. Here it is more of an artistic choice. If we have enough data on images and the filters applied to them, we can certainly build a model on top of it.

Can we find a setting that would appeal to most users? If we can, the more important step would be to personalize it: find the setting a particular user would love the most.

Why is it difficult for machines to comprehend images?

We have done considerable work with words. There are search engines that can find a match in billions of documents in a blink. The same is not true of images or videos, and I think the reason is inherent in the representation.

A language consists of characters, words and some basic rules. A finite number of characters are used to represent any language, and a dictionary contains most of its words. Each word has a meaning locally, and contextually when associated with other words. So in a way we have traced a written language in its structural form and given a similar encoding to machines, and the problem is deemed solved.

With images, however, there is no fixed set of rules at the macroscopic level; it is safe to say the set is infinite. At a broader level we are considering a few categories right now, like trees, people, cars and houses, and trying to label them. The question is why this is inherently difficult: in images, how do we tell whether two visual references refer to the same kind of object, e.g. two images of a room, each having a bed, a pillow, windows and a carpet?

bed pillow window
How does the machine label it as a room, a bed or a pillow? One way is to identify/recognize each object independently, using features encoded in the machine's model, say the shape of an object: a chair will have 4 legs and back support. A second way is to use information from the spatial domain. An object may be difficult to recognize in itself, yet identifiable in relation to other objects: a pillow is just a rectangular shape in a 2D image, but when we see a room and a pillow on the bed, our confidence that it is a pillow increases. Now say there is a book and a pillow on the bed; by further encoding the sharpness of edges we can distinguish the book from the pillow. Scale is another feature: a pillow will have a proportional size with respect to its surrounding objects.

For a moment, let us think about how we see things. As we grow, we create a prototype of the world as a visual reference in our head. We see cars, trucks and bikes when we go on the road; the next time we see a road, we know what to anticipate. If we go inside a house, there are a number of objects we can expect. So when we see an image, there is a limited set of objects to map or match against.

Essentially, our machine models should make use of this context. But to use this context we need a similar model in the machine, say a prototype of the world like a 3D computer game where each object has enough detail available.

A generalized representation of the world with enough details. Image from: http://www.3dcity-world.com/3dcity/

Sometimes objects are occluded or only partly visible, say a table and a chair.

With context, prediction is much easier. Say we see a room with a car in it. If the car is outside the window, on a road (with sufficient information available that it is outside), it is a real car. If it is inside the room, it must be a toy car (a miniature model), or the room is a garage; in that case the room will not have the things we see in a normal room, and our model should choose garage.

At a low level, digital images are still a group of pixels, each encoding some grayscale or color value. We can iterate through an image with two nested loops, one over the height and one over the width. Within those loops we have to do all our matching, recognition and detection to make machines understand our world and see things the way humans do.
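
For concreteness, here is what that low-level loop looks like with the JDK's BufferedImage (a minimal sketch; any per-pixel matching or detection would go inside the inner loop):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import javax.imageio.ImageIO;

    // Minimal sketch: visit every pixel and compute the average luminance.
    public static double averageLuminance(File imageFile) throws Exception {
        BufferedImage img = ImageIO.read(imageFile);
        double sum = 0;
        for (int y = 0; y < img.getHeight(); y++) {       // one loop for height
            for (int x = 0; x < img.getWidth(); x++) {    // one loop for width
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Standard luminance weighting for RGB.
                sum += 0.299 * r + 0.587 * g + 0.114 * b;
            }
        }
        return sum / (img.getWidth() * (double) img.getHeight());
    }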

Google Normalized Distance

Sometimes a statistical approach simply wins over attempts to model the real world scientifically. Here is an example: Normalized Google Distance finds the relatedness between two words/concepts, based on the number of search results returned when the two are searched together compared to when they are searched individually. http://en.wikipedia.org/wiki/Semantic_relatedness
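
The underlying formula (due to Cilibrasi and Vitanyi) is straightforward once you have the hit counts. A minimal sketch, where f(x), f(y) and f(x,y) are the page counts for the individual and joint queries and N is the total number of indexed pages:

    // Normalized Google Distance:
    // NGD(x,y) = (max(log f(x), log f(y)) - log f(x,y))
    //            / (log N - min(log f(x), log f(y)))
    public static double ngd(double fx, double fy, double fxy, double n) {
        double logFx = Math.log(fx);
        double logFy = Math.log(fy);
        double logFxy = Math.log(fxy);
        double logN = Math.log(n);
        return (Math.max(logFx, logFy) - logFxy)
             / (logN - Math.min(logFx, logFy));
    }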

If you prefer to think in Java, here is a good implementation of the same. Use this jar (http://www2.informatik.hu-berlin.de/~hakenber/publ/suppl/smbm06/WBI-TM.jar) to make your app find the semantic relatedness between two concepts.

A few results:

result for Agra & Taj Mahal: 0.37951525964462646
result for Agra & Delhi: 0.43014626260551725

The lower the score, the more semantically related the two concepts.

Note:

If you are working behind a firewall, you might need the following properties to be set before you go through.

System.setProperty("http.proxyHost", "yourproxy");
System.setProperty("http.proxyPort", "yourport");

WordSenseGoogler wordSense = new WordSenseGoogler();

System.out.println("result for Agra & Taj Mahal: " + wordSense.getNormalizedGoogleDistance("Agra", "TajMahal"));
System.out.println("result for Agra & Delhi: " + wordSense.getNormalizedGoogleDistance("Agra", "Delhi"));

SPARQL Query Generation from NL

I have been searching for this utility on Google & Delicious. But sometimes these crawlers & taggers fail to find information that a human being can, given the patience to read a paper someone else has written.

After a failed effort at writing a partial query converter on my own, today I found a paper that mentions a few good tools which can do this for me.

If you have a similar requirement, please visit the following links. AquaLog is for people who think in Java.

http://technologies.kmi.open.ac.uk/aqualog/

http://alumni.media.mit.edu/~mueller/papers/tt.html