The Kitchen Sink and Other Oddities

Atabey Kaygun

A Migration Analysis

Description of the data 

Readers of this blog may not be aware of this, but several years back there was a big data dump containing records on Turkish citizens, and it recently made the headlines. The data contained names, parents’ names, dates and places of birth, current addresses, and Turkish Citizenship ID numbers.

Description of the problem

Today, I will look into internal migration patterns in Turkey: what patterns are visible between the city where a citizen was born and the city where s/he currently lives?

I will split the data into 3 time frames:

  1. Those who were born before 1950
  2. Those who were born between 1950 and 1980
  3. Those who were born after 1980
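The three-way split can be sketched as a small awk filter. This is only an illustration, not the code used for this post: the function name `cohort` is hypothetical, the input is assumed to be one four-digit birth year per line, and treating both 1950 and 1980 as part of the middle group is my assumption, since the list above does not say which side the boundary years fall on.

```shell
# Hypothetical helper: label each birth year with one of the three time frames.
# Assumption: 1950 and 1980 both belong to the middle group.
cohort() {
  awk '{ if ($1 < 1950)       print $1 "\tbefore-1950";
         else if ($1 <= 1980) print $1 "\t1950-1980";
         else                 print $1 "\tafter-1980"; }'
}

# Boundary years, to show where each one lands.
printf '1949\n1950\n1980\n1981\n' | cohort
```

Moving a boundary year to a neighbouring group only requires changing the comparison operators.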

This time around, I will post neither the code nor the data.

Graphs

Methodology

The original data was a SQL database dump. Luckily, the dump was a well-structured text file, and all I had to do was delete the first few and last few lines. And voilà! I had a tab-separated data file.
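Trimming a dump like this can be done with a few lines of awk. The sketch below is an assumption about how one might do it, not the command actually used: the function name `trim_dump` is hypothetical, and the line counts 2 and 1 are placeholders, since the real number of lines to drop depends on the dump.

```shell
# Hypothetical helper: drop the first $1 and the last $2 lines of a stream.
trim_dump() {
  awk -v skip_head="$1" -v skip_tail="$2" '
    NR > skip_head { buf[NR] = $0 }       # buffer every line past the header
    NR > skip_head + skip_tail {          # a line is safe to print once
      print buf[NR - skip_tail]           # skip_tail newer lines have arrived
      delete buf[NR - skip_tail]
    }'
}

# Placeholder input: two header lines, two data rows, one footer line.
printf 'header 1\nheader 2\nrow A\nrow B\nfooter\n' | trim_dump 2 1
```

Only the last few lines are held in memory at any time, which matters for a file of this size.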

Next, I extracted the birth place, birth date and current city. I counted the results via a hash table:

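(let ((table (make-hash-table :test #'equal)))
   ;; Read tab-separated records from standard input until end of file.
   (do ((line (read-line *standard-input* nil)
              (read-line *standard-input* nil)))
       ((null line) nil)
       (let* ((entries (cl-ppcre:split (format nil "~c" #\tab) line))
              ;; Third "/"-separated part of the birth date; not used in this count.
              (date (third (cl-ppcre:split "/" (elt entries 8))))
              (from (get-city (asciify (elt entries 7))))
              (to   (asciify (elt entries 9))))
           (declare (ignorable date))
           ;; Increment the counter for this (birth city . current city) pair.
           (incf (gethash (cons from to) table 0))))
   ;; Print one tab-separated line per pair: from, to, count.
   (maphash (lambda (x y) (format t "~a~c~a~c~a~%" (car x) #\tab (cdr x) #\tab y))
            table))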
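To make the shape of the output concrete, here is a rough awk counterpart of the counting step run on a toy input. It is a sketch rather than the code used for this post: the function name `count_pairs` is hypothetical, the input is assumed to be already reduced to a birth city and a current city per line, and the three sample records are invented.

```shell
# Hypothetical awk counterpart of the hash-table count: tally (from, to) pairs.
# Assumption: each input line is "FROM<TAB>TO".
count_pairs() {
  awk -F'\t' '{ n[$1 "\t" $2]++ }
              END { for (k in n) print k "\t" n[k] }' | sort
}

# Invented records, for illustration only.
printf 'SIVAS\tISTANBUL\nSIVAS\tISTANBUL\nSIVAS\tSIVAS\n' | count_pairs
```

Each output line carries a birth city, a current city, and a count, separated by tabs. The trailing `sort` is there only because awk does not guarantee the order in which `for (k in n)` visits the keys.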

Unfortunately, the birth place field held the municipal arrondissement rather than the city proper, so I had to map each arrondissement to its city with a simple lookup. The get-city function does that. I got the list from Wikipedia.
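A lookup of this kind is easy to sketch in awk. The version below only illustrates the idea and is not the get-city function itself: the name `district_to_city`, the two-column lookup file format, and the choice to pass unknown names through unchanged are all assumptions.

```shell
# Hypothetical sketch of an arrondissement (district) -> city lookup.
# Assumptions: the lookup file has "DISTRICT<TAB>CITY" per line, and names
# missing from the table are printed unchanged.
district_to_city() {
  # $1: lookup file; stdin: one district name per line
  awk -F'\t' 'NR == FNR { city[$1] = $2; next }         # first file: load the table
              { print (($1 in city) ? city[$1] : $1) }' "$1" -
}

lookup=$(mktemp)
printf 'KADIKOY\tISTANBUL\nCANKAYA\tANKARA\n' > "$lookup"
printf 'KADIKOY\nCANKAYA\nIZMIR\n' | district_to_city "$lookup"
rm -f "$lookup"
```

Passing unknown names through is convenient when a record already holds a city name, but it also hides misspelled districts, so a stricter version might report them instead.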

Incidentally, the cl-unicode library does not correctly upcase Turkish letters. Besides, it is safer to ASCII-fy the letters anyway. For that, I used the following function:

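 ;; Upcase each character, then replace the accented capitals in FROM
 ;; with the ASCII letter at the same position in TO.
 (let ((from "ÂÇĞİÖŞÜ")
       (to "ACGIOSU"))
   (defun asciify (word)
     (map 'string
          (lambda (x)
            (let ((i (position x from)))
              (if i (elt to i) x)))
          (map 'string #'cl-unicode:uppercase-mapping word))))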

Once the counts were printed, I used a simple awk script to convert them into a Graphviz dot file:

 awk 'BEGIN { print "digraph cities {"; } \
      { if(($3 > 4000) && ($1 != $2)) { print $1" -> "$2" [penwidth="int($3/4000)"];"; } } \
      END { print "}" }'

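Here the threshold was 4000: a pair of cities gets an edge only if more than 4000 records link them, records whose birth city and current city coincide are dropped, and the pen width of each edge is the count divided by 4000, rounded down.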
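To see what the filter does, here it is wrapped in a function and run on three sample lines. The function name `to_dot` is hypothetical, and the counts are made up for illustration; they are not taken from the data.

```shell
# The awk filter from this post, wrapped in a function (the name is hypothetical).
to_dot() {
  awk 'BEGIN { print "digraph cities {"; }
       { if (($3 > 4000) && ($1 != $2)) { print $1" -> "$2" [penwidth="int($3/4000)"];"; } }
       END { print "}" }'
}

# Made-up counts for illustration only; they are NOT taken from the data.
printf 'ANKARA\tISTANBUL\t12000\nISTANBUL\tISTANBUL\t50000\nSIVAS\tISTANBUL\t3000\n' | to_dot
# Prints:
#   digraph cities {
#   ANKARA -> ISTANBUL [penwidth=3];
#   }
```

The second line is dropped because both cities are the same, and the third because its count is below the threshold. With Graphviz installed, a file in this format can be rendered with something like `dot -Tpng cities.dot -o cities.png`, where the file names are placeholders.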