I am not sure if readers of this blog are aware: there was a big data dump which contained data on Turkish citizens several years back, and recently it made the headlines. The data contained names, parents’ names, birth day and place along with the current addresses and the Turkish Citizenship ID numbers.
Today, I will look into internal migration patterns in Turkey: what are the visible patterns for a citizen’s birth city and the current city s/he lives in.
I will split the data into 3 time frames:
This time around, I will not post the code nor the
data.
The original data was a SQL database dump. Luckily, the dump was well-structured text file and all I had to do to delete first few and last few line. And voila! I had a tab separated data file.
Next, I extracted the birth place, birth date and current city. I counted the results via a hash table:
(let ((table (make-hash-table :test #'equal)))
(do ((line (read-line *standard-input* nil)
(read-line *standard-input* nil)))
((null line) nil)
(let* ((entries (cl-ppcre:split (format nil "~c" #\tab) line))
(date (third (cl-ppcre:split "/" (elt entries 8))))
(from (get-city (asciify (elt entries 7))))
(to (asciify (elt entries 9))))
(setf (gethash (cons from to) table)
(1+ (gethash (cons from to) table 0)))
(maphash (lambda (x y) (format t "~a~c~a~c~a~%" (car x) #\tab (cdr x) #\tab y))
table))
Unfortunately, the birth place wasn’t city proper but the municipal
arrondissement. I had to get the city using a simple lookup. The
get-city
function does that. I got the list from Wikipedia.
Incidentally, the cl-unicode
library does not correctly
upcase Turkish letters. Besides, it is safer to ascii-fy the letters.
For that, I used the following function.
(let ((from "ÂÇĞİÖŞÜ")
(to "ACGIOSU"))
(defun asciify (word)
(map 'string
(lambda (x)
(let ((i (position x from)))
(if i (elt to i) x)))
(map 'string #'cl-unicode:uppercase-mapping word))))
After the results are displayed, I used a simple awk script to convert the results into a graphviz dot file:
awk 'BEGIN { print "digraph cities {"; } \
{ if(($3 > 4000) && ($1 != $2)) { print $1" -> "$2" [penwidth="int($3/4000)"];"; } } \
END { print "}" }'
Here the threshold for the number of data entries was 4000.