Tutorial: Deploy the Mordecai geoparser on a Linux server

Author

Jeffrey W. Rozelle

1 Introduction

In this tutorial, we will guide you through setting up the Mordecai geoparser on a Linux server. You will create a droplet, install Docker and the geoparser, and learn how to run text through it. This tutorial requires a DigitalOcean account.

You will need to enter billing information, but at the time of writing the droplet used here costs only about $6.00 per month (or $0.009 per hour). If you destroy the droplet within 5 days, the total cost of this tutorial should be about $1.
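
If you want to estimate the charge for a different length of time, the arithmetic can be sketched in a few lines of Python. This is only a rough sketch: the function name droplet_cost is made up for illustration, the rates are the ones quoted above, and it assumes that hourly charges never exceed the flat monthly price.

```python
def droplet_cost(hours, hourly_rate=0.009, monthly_cap=6.00):
    """Estimate the charge in USD for keeping a droplet for `hours` hours."""
    # Assumption: hourly billing is capped at the flat monthly price.
    return round(min(hours * hourly_rate, monthly_cap), 2)


# Print an estimate for a few example durations.
for days in (1, 5, 30):
    print(f"{days:>2} days: ${droplet_cost(days * 24):.2f}")
```

Check DigitalOcean's current pricing page before relying on these numbers, since rates change over time.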

2 Create Ubuntu droplet

2.1 Create Project

Log into DigitalOcean.

You do not necessarily need to use DigitalOcean. Most web hosts will work well. Google Cloud even has a free-tier e2 instance that is likely powerful enough for this tutorial. I chose DigitalOcean because I have found its interface to be among the simplest and fastest to navigate. However, as long as you choose a server running Ubuntu 18.04, essentially all steps from Section 4 onward should be identical.

If this is your first time using DigitalOcean, you will need to create a new project. Select + New Project, name your project, select the purpose, and click Create Project. You can skip any subsequent steps the website asks you for.

Create Project

2.2 Create droplet

Next, create a droplet using the Create button at the top.

Create droplet

You can choose any region you like, but it probably makes sense to choose the region closest to your location. When asked to choose an image, select Ubuntu version 18.04 (LTS) x64.

Although newer versions of Ubuntu exist, 18.04 is a long-term support (LTS) release, and should still be working well when you work through this tutorial. Additionally, it comes pre-installed with Python 3.6 - newer versions of Python often have compatibility issues with the mordecai package.

Choose image

Under Choose size, you may choose any option you wish, but the most affordable one should work. For Droplet Type, choose Basic, and for CPU options, choose Regular with 1 GB / 1 CPU, 25 GB SSD, and 1000 GB transfer.

Choose size

Now you must select your authentication option. Although an SSH key is more secure, for this tutorial we will use simple password authentication. Under authentication method, select Password, and choose a password that you will remember and that meets the requirements.

Choose authentication method

You may also wish to give your droplet an identifying name, but edits to this section are optional. If you make no other changes, DigitalOcean will create a unique droplet name for you. Select the project you created in Section 2.1. Click Create Droplet, and you will be taken to a new page. There is a progress bar which, after a few moments, should be replaced by your new droplet. You have just created a Linux server!

3 Log into your server

First, note the IP address of your server. IPv4 addresses follow the format ###.###.###.##, though the number of digits in each section may vary. You do not necessarily need to use this information for the tutorial, unless you plan to access your server through a PC based client, but it is good to be aware of it. In the image below, the IP address has been blocked out in red.
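
If you copy the address into a script or client configuration, a quick sanity check can catch typos. The sketch below uses Python's standard ipaddress module; the helper name is_ipv4 and the address 203.0.113.7 are placeholders for illustration, not values from your droplet.

```python
import ipaddress


def is_ipv4(text):
    """Return True if `text` is a syntactically valid IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
        return True
    except ValueError:
        # Raised for missing sections, values above 255, stray characters, etc.
        return False


# 203.0.113.7 is a documentation-only address used here as a placeholder.
print(is_ipv4("203.0.113.7"))
print(is_ipv4("203.0.113"))
```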

DigitalOcean includes a web-based console for convenient access to the Linux terminal of your server. To access this, first select the menu button, then click Access Console.

Access Console

In the next screen, ensure root is the user selected, and click Launch Droplet Console.

Launch Console

This should bring up a new window with a black background. This is the command line interface. All subsequent commands will be entered here.

Although the browser-based console works well, and will be used for the tutorial, you may wish to use a dedicated client to access the command line interface. PuTTY is one reliable SSH client for this purpose. Using PuTTY, set the host name to root@<your.ip.address.xx>, set the port to 22, and you should be able to connect as long as you have your password.

All subsequent steps should be the same whether you access the console from your browser or via a dedicated client.
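
If a dedicated client refuses to connect, it can help to first confirm that port 22 is reachable from your own computer. The following is a minimal sketch using only Python's standard socket module; the function name port_is_open is hypothetical, and you would pass in your droplet's IP address yourself.

```python
import socket


def port_is_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (replace the placeholder with your droplet's address):
# print(port_is_open("<your.ip.address.xx>"))
```

A result of True only shows that something is listening on the port; it does not check your password.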

4 Server setup and installations

All the following code will be entered in the console.

4.1 Add new sudo user

First, it is not good practice to work in Linux as the root user. Thus, we will set up a ‘sudo’ or administrative user, and run all commands from this user. I will call my user john_denver, but you may create any username you wish by simply replacing john_denver with the username of your choice.

This section of the tutorial borrows material from DigitalOcean’s tutorial

adduser john_denver

Linux will then prompt you to create and verify a password for your new user:

Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

Next, Linux will ask you to fill in some info about the new user. This is optional.

Changing the user information for john_denver
Enter the new value, or press ENTER for the default
    Full Name []:
    Room Number []:
    Work Phone []:
    Home Phone []:
    Other []:
Is the information correct? [Y/n]

Then we will add this user to the sudo group, which allows the user to run commands with administrative privileges. Again, replace john_denver with the username you have created. Then, switch to your new user account.

usermod -aG sudo john_denver 
su - john_denver

Now, you should see the prompt starting with john_denver@ rather than root@. To run any command with superuser permissions, simply start it with sudo. The first time you use it, Linux will ask for your password.
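
If you would like to double-check that the group change took effect, one option is to read the group database from Python. This is a small sketch using the standard grp module (available on Linux); the helper name user_in_group is made up, and john_denver is the example username from this tutorial.

```python
import grp


def user_in_group(username, group="sudo"):
    """Return True if `username` is listed as a member of `group`."""
    try:
        # gr_mem lists users added to the group, e.g. via usermod -aG.
        return username in grp.getgrnam(group).gr_mem
    except KeyError:
        # The group does not exist on this machine.
        return False


print(user_in_group("john_denver"))
```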

4.2 Update & upgrade your Linux install

The first time you spin up your Linux machine, you should check for updates. Do this by simply typing:

sudo apt update

sudo apt upgrade -y

4.3 Add swap memory

Borrows from DigitalOcean’s excellent tutorial

The memory on the droplet we set up is insufficient to install and run all the components that mordecai needs. Fortunately, Linux has a workaround. You can dedicate hard drive space to be used as RAM. It is slow, but for the purpose of this tutorial - that is unimportant. We will dedicate an initial 8 gigabytes of hard drive space to be used as memory.

sudo fallocate -l 8G /swapfile

To assess whether this was successful, run the command:

ls -lh /swapfile
-rw-r--r-- 1 root root 8.0G Apr 25 11:14 /swapfile

We can see from the 8.0G that the appropriate sized swap-file has been created.

Next, enable the swap file. First, we want to change the security permissions of the swapfile:

sudo chmod 600 /swapfile

Now we can mark the file as swap space with the command:

sudo mkswap /swapfile

Finally, we can enable the swapfile by typing:

sudo swapon /swapfile

To verify that everything is working, type:

sudo swapon --show

You should see a small table that includes a /swapfile with size 8G. To make the swapfile permanent, we need to add a line to the end of our /etc/fstab file. Do this with the command:

echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
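
The same check can be done from Python by reading /proc/swaps, the file where Linux lists active swap areas. The sketch below is illustrative only: the function names parse_swaps and active_swap are hypothetical, and it assumes the usual column layout in which the third column is the size in KiB.

```python
def parse_swaps(text):
    """Return {path: size_in_GiB} from the contents of /proc/swaps."""
    sizes = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3:
            # Assumption: the Size column is reported in KiB.
            sizes[fields[0]] = int(fields[2]) / (1024 ** 2)
    return sizes


def active_swap(path="/proc/swaps"):
    """Read the swap table, returning an empty dict if it is unavailable."""
    try:
        with open(path) as handle:
            return parse_swaps(handle.read())
    except OSError:
        return {}


print(active_swap())
```

On the droplet from this tutorial, the result should contain an entry for /swapfile of roughly 8 GiB.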

4.4 Install docker

This section borrows from another DigitalOcean tutorial

First, we need to install prerequisite packages. You can do so using the command:

sudo apt install apt-transport-https ca-certificates curl software-properties-common

Next, you’ll need a GPG key to secure a download from the Docker repository:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Add the Docker repository to your APT sources (i.e. the listing where Linux checks for packages and updates):

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"

Now update the package database on your linux install by running this line again:

sudo apt update

At last, you are ready to install docker! Do so with the command:

sudo apt install docker-ce -y

To see if it worked, enter the command:

sudo systemctl status docker

You should see green text that reads active (running). If not, either check the steps above, or visit the DigitalOcean tutorial for a more detailed guide. To exit this view, simply press CTRL + C.
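
If you later want a script to confirm that Docker is running before it does anything else, a rough sketch is shown below. It calls systemctl is-active through Python's subprocess module; the helper name service_is_active is hypothetical.

```python
import subprocess


def service_is_active(name="docker"):
    """Return True if systemd reports the service as active."""
    try:
        result = subprocess.run(["systemctl", "is-active", "--quiet", name])
    except OSError:
        # systemctl is missing or cannot be executed on this machine.
        return False
    # systemctl is-active exits with status 0 only when the unit is active.
    return result.returncode == 0


print(service_is_active("docker"))
```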

4.5 Install mordecai

For detailed instructions on installing mordecai, you can visit the GitHub repo README. The instructions I provide below work at the time of this writing, and are slightly expanded to ensure necessary dependencies are installed.

First, the mordecai developers recommend running mordecai in a virtual environment. A virtual environment allows you to isolate the specific versions of the dependencies that mordecai needs. Since mordecai relies on some outdated packages, this is likely important if you intend to use the linux server for anything else.

To do this, you will need to get some package dependencies, starting with venv:

pip3 install --upgrade pip
sudo apt-get install python3-venv -y
sudo apt-get install cython -y

Now you can set up the virtual environment with:

python3 -m venv mordecai-env
source mordecai-env/bin/activate

Now that you have created the virtual environment, it will still be available if you exit the console. You can return to your virtual environment by logging in with your new user (in my case john_denver), and running: source mordecai-env/bin/activate.
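
If you are ever unsure whether the virtual environment is active, you can ask Python itself. The sketch below relies on the fact that, inside an environment created with venv, sys.prefix differs from sys.base_prefix; the function name in_virtualenv is made up for illustration.

```python
import sys


def in_virtualenv():
    """Return True when Python is running inside a venv environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix points at the system-wide installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```

Run it with python3 after activating mordecai-env; if it reports False, run the source command again.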

Now, install Cython inside the virtual environment:

pip3 install Cython

Lastly, install mordecai itself using the command:

pip3 install mordecai

Now download the required spaCy NLP model. This step requires a meaningful amount of RAM. If it fails with a memory error, go back and complete Section 4.3.

python3 -m spacy download en_core_web_lg

In order to work, Mordecai needs access to a Geonames gazetteer running in Elasticsearch. The easiest way to set it up is by running the following commands:

sudo docker pull elasticsearch:5.5.2
wget https://andrewhalterman.com/files/geonames_index.tar.gz --output-file=wget_log.txt
tar -xzf geonames_index.tar.gz
sudo docker run -d -p 127.0.0.1:9200:9200 -v $(pwd)/geonames_index/:/usr/share/elasticsearch/data elasticsearch:5.5.2
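
Elasticsearch can take a minute or so to start inside the container. To confirm that it is answering on port 9200, you could run a small check like the sketch below. It uses only the Python standard library; the function name elasticsearch_info is hypothetical, and the URL simply matches the port mapping in the docker run command above.

```python
import json
import urllib.request


def elasticsearch_info(url="http://127.0.0.1:9200", timeout=5):
    """Return the JSON document served at the root URL, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None
    return data if isinstance(data, dict) else None


info = elasticsearch_info()
if info is None:
    print("Elasticsearch is not reachable yet; wait a minute and retry.")
else:
    print("Elasticsearch version:", info.get("version", {}).get("number"))
```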

Congratulations! You have successfully set up mordecai on a linux server!

5 Running mordecai

Now, start Python by typing the command python3.

You will notice that some information about your Python installation appears, and the prompt switches from $ to >>>.

First, run:

from mordecai import Geoparser

You will likely see an error about problems loading CUDA. You may ignore this, and all CUDA related errors. These are only relevant if you are running with an Nvidia GPU and wish to run the language model on your GPU.

Next, create a Geoparser instance and assign it to the variable geo. This may take a few minutes because our server has relatively low specifications.

geo = Geoparser()

Now, to test whether the geoparser is working, type:

geo.geoparse("The Eiffel Tower is in Paris, France.")

If everything is working, it should return the following:

[{'word': 'Eiffel Tower', 'spans': [{'start': 4, 'end': 16}], 'country_predicted': 'CHN', 'country_conf': 0.8529637, 'geo': {'admin1': 'Guangdong', 'lat': '22.53697', 'lon': '113.96932', 'country_code3': 'CHN', 'geonameid': '8030083', 'place_name': 'Window of the World Eiffel Tower', 'feature_class': 'S', 'feature_code': 'TOWR'}}, {'word': 'Paris', 'spans': [{'start': 23, 'end': 28}], 'country_predicted': 'FRA', 'country_conf': 0.9634982, 'geo': {'admin1': 'Île-de-France', 'lat': '48.85339', 'lon': '2.34864', 'country_code3': 'FRA', 'geonameid': '2988506', 'place_name': 'Paris', 'feature_class': 'A', 'feature_code': 'ADM3'}}, {'word': 'France', 'spans': [{'start': 30, 'end': 36}], 'country_predicted': 'FRA', 'country_conf': 0.9516948, 'geo': {'admin1': 'NA', 'lat': '46', 'lon': '2', 'country_code3': 'FRA', 'geonameid': '3017382', 'place_name': 'Republic of France', 'feature_class': 'A', 'feature_code': 'PCLI'}}]
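
The result is a list of dictionaries, one per place name found in the text. To pull out just the matched name and coordinates, something like the sketch below may be useful. The helper name summarize is made up, the first sample entry is copied from the output above, and the second entry is a hypothetical one included on the assumption that an unresolved place name comes back without a 'geo' key.

```python
def summarize(results):
    """Return (word, place_name, lat, lon) tuples for resolved entries."""
    rows = []
    for entry in results:
        geo = entry.get("geo")
        if not geo:
            # Assumption: unresolved place names carry no 'geo' dictionary.
            continue
        rows.append((entry["word"], geo["place_name"],
                     float(geo["lat"]), float(geo["lon"])))
    return rows


sample = [
    # Copied from the geoparser output shown above.
    {"word": "Paris", "spans": [{"start": 23, "end": 28}],
     "country_predicted": "FRA", "country_conf": 0.9634982,
     "geo": {"admin1": "Île-de-France", "lat": "48.85339", "lon": "2.34864",
             "country_code3": "FRA", "geonameid": "2988506",
             "place_name": "Paris", "feature_class": "A",
             "feature_code": "ADM3"}},
    # Hypothetical entry without a resolved location.
    {"word": "Atlantis", "spans": [{"start": 40, "end": 48}],
     "country_predicted": "NA", "country_conf": 0.1},
]

for row in summarize(sample):
    print(row)
```

In a live session you would pass the return value of geo.geoparse(...) to summarize instead of the sample list.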

6 Shutting down the server

So long as you keep the server running, DigitalOcean will bill you. You may wish to destroy the instance to cease the accumulation of charges. To do so, log back into your account at DigitalOcean. Navigate to the project you created, click the menu button and you will find an option to destroy the instance. A few prompts will emerge asking you to confirm your choice to irreversibly destroy the droplet - successfully complete these and the droplet will be destroyed.

Warning

You must destroy the droplet to stop billing. Simply turning it off will not pause billing.

If you wish to make a copy of your droplet, but do not wish to spend $6 monthly to maintain a copy of it, you may be interested in creating a snapshot. Snapshots are essentially backup copies of the droplet at a specific point in time. You can create a new droplet from a snapshot and it will be a carbon copy of the server at the time of the snapshot.

Storage of snapshots is not free, but it costs only $0.06 per GB per month at the time of this writing. Since our droplet is 25 GB, monthly storage costs would be about $1.50. The example in this tutorial . Once the snapshot is complete, you can safely destroy your droplet with the knowledge that a clone can be created from your snapshot.

For more information, check out this tutorial

Destroy droplet