My First Bot

By Neo Ellison

As someone who is pretty new to the whole 'hacking' thing, I wanted to build a thing that did a thing. Here is my experience building a Twitter bot.

So, like anyone who has spent a lot of time on theory and has yet to really build something on their own, I wanted to apply my newfound skills to something sexy. And what could be sexier than a pre-programmed process connecting to an API and performing automated data extraction? I know, I can’t think of anything either. So with my task decided, I started searching for a target for my masterpiece. There was just one small question I had to answer before I got started: How the hell does one even start building a bot?

The answer to this question comes fast, but understanding it is a slower process. For those of you less familiar with this topic, let’s start with the basics. A bot is really just a program, written in any computer language you like, which performs a function, and in this context that function happens on the interwebs. So for me, the exercise of building a bot was really about gaining an understanding of APIs and seeing what interesting things I could do with them. An API (application programming interface) is kind of like a database connection which a website owner sets up to allow other programs to access some of the site’s underlying data without the hassle of using a browser.

It’s worth mentioning that the Twitter bot was not actually my first bot. When I began this, I didn’t start with APIs, because I caught wind of a nifty little Ruby gem called Watir. The appeal is that you can use it to basically control a web browser. A program like this exists primarily for testing purposes, e.g. writing a script to have a web browser turn on, go to a website and test every link 10,000 times to make sure that they are all valid. You know, things that are so boring that they would drive a normal person to place all their shoes on the roof so they can be purified by the moonlight. And that is what most people use it for, although there are far more unscrupulous things one could do with it. What I ended up building was a nifty little bot which started Google Chrome, went to weather.com, checked out the 10-day forecast for Chicago, IL and returned the weather report for 7 days from today. Partly sunny, but that is not the bot this post is about, so let’s stay focused.

Now don’t get me wrong, that was a fun little bot, but there were too many options, and I find the key to creativity really lies in imposing enough structure that only an elegant solution can satisfy the criteria of the problem you are trying to solve. Anyway, that is when I started playing around with APIs, specifically Twitter’s. One nice thing about Twitter is that, besides absolutely overflowing with data, it has one of the best-documented APIs out there, which makes it an ideal playground for an aspiring data scientist.

I started small by making a program which sent a request to Twitter for the last 20 tweets from one of my throwaway Twitter accounts, @passactivism. Follow it if you want your news feed inundated with nonsense, but more on that in a minute. Twitter sends the results of this request in a format called JSON, which looks something like this:

{ created_at: 'Sat Nov 03 05:50:31 +0000 2012', id: 264605456318738430, id_str: '264605456318738433', text: 'Look at me I am tweeting', source: 'TRbuddy', truncated: false, in_reply_to_status_id: null, in_reply_to_status_id_str: null, in_reply_to_user_id: null, in_reply_to_user_id_str: null, in_reply_to_screen_name: null…}

So the first step, once I received it, was to parse it into a nice readable format. Once I had done that, I had a nifty little program which would go to Twitter and print my tweets to the command prompt - which is a nice accomplishment, but really not that interesting. So I kept tinkering with it.
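For the curious, the parsing step might look roughly like the sketch below. It is only an illustration, not the bot's actual code: it assumes Python, it works on a hard-coded sample string instead of a live response, and the function names format_tweet and format_timeline are made up. The field names use the underscore spelling from Twitter's API (created_at, id_str and so on).

```python
import json
from datetime import datetime

# Hypothetical sample payload: a list holding one tweet, trimmed to a few fields.
SAMPLE_RESPONSE = """
[
  {
    "created_at": "Sat Nov 03 05:50:31 +0000 2012",
    "id_str": "264605456318738433",
    "text": "Look at me I am tweeting",
    "source": "TRbuddy",
    "truncated": false,
    "in_reply_to_status_id": null
  }
]
"""


def format_tweet(tweet):
    """Turn one tweet dictionary into a single readable line."""
    # Timestamp layout as in the sample: 'Sat Nov 03 05:50:31 +0000 2012'
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return f"[{created:%Y-%m-%d %H:%M}] {tweet['text']} (via {tweet['source']})"


def format_timeline(raw_json):
    """Parse the raw JSON text and return one readable line per tweet."""
    return [format_tweet(tweet) for tweet in json.loads(raw_json)]


if __name__ == "__main__":
    for line in format_timeline(SAMPLE_RESPONSE):
        print(line)
```

Keeping the formatting separate from the parsing means the same tweet dictionaries can later be handed to something other than print.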

Observing the output, I noticed that every Twitter user has a pretty standard user_id (a distinct identifying number which can be used to pull the user’s information). So I tried using my screen name’s user_id to query the list of my tweets, and it worked perfectly. At this point I got to thinking: what would happen if I entered a random user_id? Would it still yield a profile? I won’t leave you in suspense: it does.

With this knowledge, I set up a little program which, instead of pulling a specified user, would pull a random one. This is very important because, when dealing with statistics, a sufficiently random sample can be used to make inferences about the whole population. I actually spent a fair amount of time just looking over the results. They were this strange window into the thoughts of someone I had never met and would never meet. And for those of you who do not know, Twitter is very international, so you can even see what people are thinking in a place you don’t know anything about (Google Translate is very helpful for this). This data was without a doubt a treasure trove. Of what, I have no idea, but there was so much of it, and I wanted more. My inner data nerd was screaming like a 10-year-old girl meeting Justin Bieber. There were so many possibilities - if I could only capture this data somehow.
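Here is a rough sketch of the random-pull idea, with everything that is a guess marked as one. It assumes Python. MAX_USER_ID is an invented upper bound, fetch_profile is a hypothetical stand-in for the real API call, and the code allows for the possibility that some numbers come back empty, just to be safe.

```python
import random

# Assumed upper bound for user IDs; the real range is not stated in this post.
MAX_USER_ID = 1_000_000_000


def sample_profiles(fetch_profile, how_many, rng=None, max_attempts=1000):
    """Draw random user IDs until `how_many` profiles have been collected.

    `fetch_profile` is a hypothetical callable that takes a user ID and
    returns a profile dict, or None if nothing comes back for that ID.
    """
    rng = rng or random.Random()
    profiles = []
    seen = set()
    for _ in range(max_attempts):
        if len(profiles) >= how_many:
            break
        user_id = rng.randint(1, MAX_USER_ID)
        if user_id in seen:
            continue  # never count the same user twice
        seen.add(user_id)
        profile = fetch_profile(user_id)
        if profile is not None:
            profiles.append(profile)
    return profiles


def fake_fetch(user_id):
    """Stand-in for a real API call: pretend only even IDs have a profile."""
    return {"user_id": user_id} if user_id % 2 == 0 else None


if __name__ == "__main__":
    print(sample_profiles(fake_fetch, 3, rng=random.Random(42)))
```

Drawing IDs uniformly and skipping repeats is what keeps the sample "sufficiently random". The attempt cap is there so that a bad run cannot loop forever.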

Luckily, I am a nerd, and I came up with quite an efficient solution. I set up a little program which sends out the random request to Twitter and, upon receiving the results, saves them to my computer in a SQL database (PostgreSQL, for the curious). This process, better known as scraping, is a legitimate hacker activity, possibly making me a legit hacker in a training wheels kind of way - tell your friends.
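To give a feel for the saving step, here is a minimal sketch. It assumes Python, and it swaps in SQLite (via the standard sqlite3 module) for PostgreSQL so it runs without a database server. The table layout, the column names and the helper names open_store and save_tweets are all invented for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) tweets table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               id_str     TEXT PRIMARY KEY,
               user_id    TEXT,
               created_at TEXT,
               text       TEXT
           )"""
    )
    return conn


def save_tweets(conn, tweets):
    """Insert tweets, skipping ones already stored; return how many were new."""
    before = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (id_str, user_id, created_at, text) "
        "VALUES (?, ?, ?, ?)",
        [(t["id_str"], t["user_id"], t["created_at"], t["text"]) for t in tweets],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    return after - before


if __name__ == "__main__":
    store = open_store()
    sample = [{"id_str": "264605456318738433", "user_id": "1",
               "created_at": "Sat Nov 03 05:50:31 +0000 2012",
               "text": "Look at me I am tweeting"}]
    print(save_tweets(store, sample), "new tweet(s) saved")
```

Using the tweet ID as the primary key means a tweet that turns up in two different runs is only stored once.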

The last step before I could call this little bot-building experience a success was to take me out of the equation. After all, what is the fun of having a bot if every couple of minutes I have to tell it to rampage? I want to cut it loose and hope no one gets hurt. For this I created a little batch file and scheduled it to run every 64 minutes. Why 64 minutes? Well, Twitter has an hourly request limit and I was definitely hitting it, so waiting a bit more than an hour between runs gives the limit time to reset. And with that my creation was alive, searching the Twitter jungle for tasty morsels.
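The same 64-minute rhythm can also be sketched inside a program. To be clear, this is a plain in-process loop and not the batch file and scheduler described above. It assumes Python, and run_forever and its parameters are invented names.

```python
import time

RUN_INTERVAL_SECONDS = 64 * 60  # the 64-minute gap between runs


def run_forever(job, interval=RUN_INTERVAL_SECONDS, sleep=time.sleep, max_runs=None):
    """Call `job` repeatedly, starting each run `interval` seconds after the last.

    `max_runs` and `sleep` exist so the loop can be stopped and tested;
    leave them at their defaults to keep going indefinitely.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        try:
            job()
        except Exception as error:
            # One bad run should not kill the bot; note it and carry on.
            print(f"run failed: {error}")
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval - elapsed))
    return runs


if __name__ == "__main__":
    # Tiny demo: two quick runs, one second apart.
    run_forever(lambda: print("scraping..."), interval=1, max_runs=2)
```

Subtracting the time the job itself took keeps the runs spaced by the interval instead of slowly drifting later.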

For good measure, I created a logging feature which posts a tweet each time the bot runs, telling me how the last run went. So if you do happen to check out the Twitter feed of @passactivism, that is what it will be populated with. Call me crazy, but there is something delicious about using the platform you are scraping to tell you how good a job you are doing.
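A status message like that could be put together along these lines. This sketch assumes Python, it assumes the 140-character tweet limit that applied in 2012, and the wording of the message is invented. It only builds the text; actually posting it would need an authenticated API call, which is left out here.

```python
from datetime import datetime, timezone

TWEET_LIMIT = 140  # assumed: the tweet length limit in 2012


def build_status(saved_count, error=None, now=None):
    """Build the text of a run report, trimmed to fit in a single tweet."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M UTC")
    if error is None:
        message = f"{stamp}: run OK, saved {saved_count} tweets"
    else:
        message = f"{stamp}: run FAILED after {saved_count} tweets ({error})"
    if len(message) > TWEET_LIMIT:
        # Trim and mark the cut so the report still fits.
        message = message[: TWEET_LIMIT - 3] + "..."
    return message


if __name__ == "__main__":
    print(build_status(20))
    print(build_status(3, error="rate limit reached"))
```

Putting a timestamp in every message also keeps consecutive reports from being identical.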