Github based URL shortening

17 Aug 2016  Posted under: linux , python , programming , dbrks

URL shorting is a mixed bag - there are a number reasons why it is a great idea, but also just as many reasons why it isn’t. But first, let’s talk about why people like it. To start, you can include a link such as http://weaselpipe.com/about in a paragraph of text and it still reads nicely, unlike links that include endless amounts of GET data such as in the example below

https://www.google.com/maps/place/Clover/@42.3611971,-71.1220626,12z/data=!4m8!1m2!2m1!1sfood!3m4!1s0x89e37a1241557ee1:0xad192345b4105ba0!8m2!3d42.3729277!4d-71.1178962  

Another benefit of URL shortening - try reading that last link out loud to someone over the phone… Finally, long URLs are difficult to remember. If we can shorten a long URL into something easier to remember, it will be easier to use. Of course, shortened URLs are not all sunshine and flowers - there are many downsides to using them as well. The most obvious problem with shortened URLs is that you have no idea where links, such as http://bit.ly/2aX1nha, are going to take you. It might just be a shortcut to some location in google maps, but it could just as easily be sending you to http://sketchylink4u.ru/supermalwareinjector. Another problem with shortening URLs is that it almost certainly is adding to the link-rot problem. Not only could the target page disappear in the future, but the URL shortening service might discard old or unused addresses (or even disappear itself). Finally, let us not forget that nothing in this world is free. If you can’t identify how a service is making money then the answer is you. In between when you enter that short URL into your browser and being taken (hopefully) to you destination, the server providing the short URL could be doing anything - including collecting information about you they will later sell to people.

However, it recently occured to me that many of the problems with URL shortening can be eliminated if the shortening service actually belongs to you. I had never considered that I could run my own URL shortener until a few days ago when I received an email from a friend in which he included several shortened URLs hosted by his own server. That’s when it occured to me that (1) my friend knows where each of these links goes because he made them, (2) he can control whether or not the short URL expires, and (3) he knows whether or not he is selling data about his friends to advertising companies. I also realized that I actually already understood how URL shortening services worked - or at least I knew how I could make one very easily. And that’s when I decided I wanted my own URL shortening service.

My strategy was very straight forward:

  1. Acquire a very short base URL
  2. Link it to a public web server (running Apache) that I have access to
  3. Use a .htaccess file to respond to HTTP requests with 302 redirect messages
  4. Profit

Setting up the short base URL

I already own several URLs, so purchasing another was not a big deal. If you have never done it before, it is very easy to buy your own domain name. I bought mine from GoDaddy - a decision I frequently regret (for reasons I won’t discuss in this post), but have not yet found the energy to switch over to a different provider. If all you want to do is make a URL shortening service, GoDaddy will work out fine.

I have access to a public facing web server that is run by the university I attend. I realize this is a luxary most people don’t have access to - in which case you should check out dreamhost. At any rate, the university’s webserver allows people to run websites out of their user’s home directory by putting files inside of a public_html directory. In fact, I was already running a website from this location, so my URL shortening service would have to live as a subdirectory of this existing site. Unfortunately, this meant that my brand new short URL, dbrks.co, would need to be pointed at a rather long URL (http://www.cs.uml.edu/~dbrooks/links). This is not really how DNS works. At all. As it turns out, for the first time ever my DNS registrar (GoDaddy) actually provided me with a useful feature - “URL forwarding”. Ironically, they implement this by pointing your domain name at one of their servers, and repling to requests to your domain with a 302 redirect reply (the same technique I planned to use for my URL shortening). So, now I could type in dbrks.co into my browser and it would land me in the location I would be hosting my URL shortener at http://www.cs.uml.edu/~dbrooks/links. The next step was to make that magical location redirect to arbitrary other websites.

.htaccess Redirecting

I wrote a basic htaccess file with the contents below

Options +FollowSymLinks
RewriteEngine on
RewriteRule ^.htaccess$ - [r=404,L]
RewriteRule ^test1$ http://www.google.com [r=302,NE,L]
RewriteRule ^test2$ http://www.yahoo.com [r=302,NE,L]

This file would redirect the url http://www.cs.uml.edu/~dbrooks/links/test1 , as well as dbrks.co/test1, to google and dbrks.co/test2 to yahoo. Additionally, it provides a tiny bit of “security” by blocking people access to viewing the actual htaccess file. The NE tells the server not to escape special characters (basically allowing you to type in the URL’s to redirect to verbatim). This worked perfectly just as I had planned and I was ready to begin the final phase of my plans - profit.

Wrong.

My original idea had been to simply ssh into the server and edit the .htaccess file whenever I wanted to shorten a URL. It was a sweet and simple implementation, with the only downsides being I would need to be able to SSH into the server to add URLs, and I would have to type in some extra syntax for each URL. I did not plan on shortening URLs on a regular basis, so this seemed like a reasonable tradeoff. I was correct about everything except the part about not shortening URLs on a regular basis, and that quickly turned the whole “I’ll just ssh in and edit the .htaccess file” into a problem.

A user interface

As it turns out, I quickly fell into the habbit of wanting to add new URLs all the time - it was just too convenient. Suddenly, I could handwrite URLs into my paper notebooks. I could skip the step of having to type in my zip code at weather.gov, and I will never again have to do a google search for the port number of the CUPS server (Yes, I linked it to dbrks.co/cups.

So it did not take very long for me to realize that the real power of a URL shortening service comes from having an easy to use web interface that can be quickly edited and easily managed. This was a problem, because the school’s web server did not allow me to respond to HTTP requests using PHP, Ruby, Perl, or Python, effectively preventing me from pushing information to the server via a web browser. What I was allowed to do on the server was run scripts and cron jobs.

Around the same time I was working on a project, I was looking at some code on github when I noticed a small icon I had not seen before - an edit button. It turns out that github allows you to make edits to files directly from their website and commit the changes to your repository. That gave me an idea - I could store my links on github and use the server’s cron job to check for and download updates to the htaccess file. To further improve the link management process, I created a simple configuration file that stores short names and the corresponding long urls that is easy to edit. I then wrote a short python script that

  1. checks to see if the github repository has been updated
  2. downloads the updates
  3. parses the configuration file
  4. generates a new htaccess file
  5. creates an index file for convenience

I then set this script to run every few minutes on the server. Finally, I used my new favorite toy (my URL shortener) to create a convenient way of editing this configuration file - I simply linked the configuration file’s “edit” page on github to dbrks.co/edit. Now when I want to change a URL, I simply type in dbrks.co/edit, sign into github (if I’m not already), type in my changes, and click save. A few minutes later, the server will run the cron job, update the htaccess file, and my link will be active.

Faster updates

If you have the ability to run scripts from your host’s webserver, such as perl, php, or python, updating could be much faster using a Github feature called “webhooks”. Webhooks work by having github send an http request to an address you provide whenever particular events occur in your repository. This is a very powerful feature, but in this case all you need to do is give github an address to go to, and tell it to do this whenever there is a “push” event.

As for what you would put in your server side script, you have two options. First, you could potentially make a direct call to the update.py script, adding the --dont-check-updates parameter to force the script to pull down the latest version of the repository. This would be the easiest and fastest thing to do, but requires that you can 1) execute commands from the script, and 2) write to the directorywhere the update.py file is located.

Alternatively, you could use an intermediate file, called something like update.txt. Inside the update.txt file you would write a “1” to indicate that an update is available, or a “0” to indicate there are no updates waiting. You could then use a bash script to check this file once every minute inside a cron job (similar to the way the update.py script works, but at a faster rate). This has the advantage that you only need to be able to write to a file somewhere on the server. An example bash script is located at http://dbrks.co/p/dbrks/bash.

You might also want to add some security. First, use https if your server supports it. Second, there are many stupid bots on the internet that might try to visit your update script. Using a simple HTTP GET parameter in the url (e.g. mysite.com/update.php?key=notsosecret) is an easy way to prevent lots of accidental triggering. However, the best practice is to examin the json data that github sends when it makes the request. This data can be used to verify the secret key you entered when you created the webhook, which would be the most secure method.

Future work

Of course, this system is not perfect. There is a short lag time between editing the configuration page and the link becoming active. The reason for this is a combination of two issues. The first is that github’s static content site http://raw.githubusercontent.com that my script uses to check for updates can take a few minutes to update. Second, my cron job can only fire at most once a minute. Experimentally, I’ve found there to be a diminishing return somewhere below the five minute mark. Thus, it can take up to five minutes for my URL to become active. I have an idea for improving this, but I won’t be able to run it off the school webserver. Github has a system called webhooks where the server will make an HTTP request to an address you point it at every time you commit a change to your code. If I could run a PHP or Python script from Apache, I could have it intercept these HTTP messages from github and immediately update the htaccess file.

Another problem is organization - right now all the links are in a single configuration file. At the moment, I have a sufficiently small number of links that it is still manageable. It helps that I can put comments in the file, and also that I can namespace URLs. For example, all my webcam links start with dbrks.co/cam/.... However, it is not difficult to imagine that in the future as the number of links grows this will become problemmatic. One possible solution would be to support multiple configuration files and divide the links across them (possibly by namespace). But this would make editing much less convenient.

For the moment though, this setup has been working quite well for me. You can see the code and try it out at the same time at http://dbrks.co/p/dbrks/example