Collecting Twitter Data with Python and Storing It in MongoDB

Hi All.

For my first project I decided to collect tweets with the hashtags “#datascience” and “#datascientist” from Twitter’s timeline for a couple of days and store them in MongoDB for later use. First we need to install MongoDB, Python, and the libraries needed for streaming from Twitter and storing the data in MongoDB with Python. You need a non-root user with sudo privileges, which you can set up by following steps. Let’s start!

Installing MongoDB

I installed MongoDB as described on the MongoDB site; you can find it here. Why should I write it again :). I installed it on Linux CentOS 7 and used the Red Hat installation guide.

You can check whether MongoDB is running and listening on its TCP port (27017 by default).
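
One way to run this check from Python is to try opening a TCP connection to the port. This is only a small sketch: it assumes MongoDB listens on localhost at its default port 27017, and the function name is_port_open is made up for illustration.

```python
import socket


def is_port_open(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError if nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("MongoDB reachable:", is_port_open())
```

If this prints False, the mongod service is probably not running or is bound to a different address or port.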

Installing Python

Python is already installed with CentOS, but I want to use Anaconda Python. Anaconda is a completely free, enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing. You can find it here.

After installation you can test whether you are using Anaconda Python. If not, check your PATH settings and correct them.
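
A quick way to see which interpreter is active is to ask Python itself. The check below is a rough heuristic and an assumption on my part: Anaconda installs usually have "conda" somewhere in the interpreter path or the version banner, but a custom install location could fool it.

```python
import sys


def looks_like_anaconda(executable=sys.executable, version=sys.version):
    """Heuristic: look for 'conda' in the interpreter path or version banner."""
    return "conda" in (executable + " " + version).lower()


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Version:", sys.version)
    print("Looks like Anaconda:", looks_like_anaconda())
```

If the interpreter path points at the system Python instead, the Anaconda bin directory is probably missing from your PATH or comes too late in it.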

Installing necessary libraries for streaming Twitter and storing into MongoDB

If your path is correct, pip is also available. We need it to install the libraries tweepy and pymongo (pip install tweepy pymongo). Tweepy is an easy-to-use Python library for accessing the Twitter API. PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python.
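
To confirm that both libraries ended up in the interpreter you are actually using, you can ask Python whether it can find them. This is a small sketch; the helper name missing_packages is hypothetical.

```python
import importlib.util


def missing_packages(names=("tweepy", "pymongo")):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("tweepy and pymongo are both available.")
```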

Create a Twitter Application

First we need to create a Twitter application. I will explain how to create one and how to get your API access keys and tokens, which we need to authenticate the Python client application with Twitter.

Visit https://apps.twitter.com/ then log in using your Twitter account credentials. Once logged in, click the button labeled Create New App.

You will be redirected to the application creation page. Fill out the required form information and accept the Developer Agreement at the bottom of the page, then click the button labeled Create your Twitter application.

Don’t forget to click the checkbox that says Yes, I agree underneath the Developer Agreement.

Now that you have set up your Twitter application, you’re ready to write your stream listener script in Python to collect tweets and save them into MongoDB.

Creating the script

Create a file twitter_stream_to_mongodb.py and start by importing the libraries.

After that, we set up the Twitter application variables (consumer key and secret, access token and secret) that will be used by the stream listener. These are required for OAuth authentication.
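
One option, sketched here, is to read the four values from environment variables rather than typing them into the script, so the secrets never end up in version control. The variable names below are my own choice, not something Twitter or Tweepy prescribes.

```python
import os

# Hypothetical environment variable names; use whatever fits your setup.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four OAuth values, or raise KeyError listing what is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing Twitter credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Hard-coding the values as plain variables works just as well for a quick experiment; just don’t publish the file with real keys in it.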

The next step is creating an OAuthHandler instance. Into this we pass the consumer token and secret that were given to us in the previous section.
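
In code this step could look roughly like the sketch below. It assumes Tweepy 3.x; the wrapper function build_auth is hypothetical and only there to keep the example self-contained, and besides the consumer pair it also sets the access token pair on the handler.

```python
def build_auth(consumer_key, consumer_secret, access_token, access_token_secret):
    """Create a Tweepy OAuth handler from the four application values."""
    import tweepy  # imported here so the function can be defined without Tweepy loaded

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return auth
```

The returned object is what the stream later uses to sign its requests.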

Tweepy provides a class to access the Twitter Streaming API: StreamListener. We just need to create our own StreamListener class that inherits from tweepy.StreamListener and overrides some methods so that the data is written into MongoDB.
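
A minimal sketch of such a listener follows. It assumes Tweepy 3.x (where tweepy.StreamListener exists) and a PyMongo collection passed in from outside. The names store_tweet, make_listener and MongoStreamListener are made up for this example, and skipping messages without a "text" field is my own choice for filtering out non-tweet notices.

```python
import json


def store_tweet(raw_data, collection):
    """Parse one raw JSON message from the stream and insert it.

    Returns the inserted id, or None for messages that are not tweets
    (for example delete or limit notices, which carry no "text" field).
    """
    doc = json.loads(raw_data)
    if "text" not in doc:
        return None
    return collection.insert_one(doc).inserted_id


def make_listener(collection):
    """Build a listener that writes every tweet into the given collection."""
    import tweepy  # imported lazily; assumes Tweepy 3.x

    class MongoStreamListener(tweepy.StreamListener):
        def on_connect(self):
            print("Connected to the Twitter streaming API.")

        def on_data(self, raw_data):
            store_tweet(raw_data, collection)
            return True  # keep the stream open

        def on_error(self, status_code):
            print("Stream error, status code:", status_code)
            # Returning False disconnects; do so on 420 (rate limited).
            return status_code != 420

    return MongoStreamListener()
```

Keeping the insert logic in a plain function makes it easy to try out with a single saved tweet before connecting to the live stream.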

After the class is created, we can create the stream object and start the listener.

Now that the script is ready, we can start it and let it run for a while.
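
Starting the stream could be sketched like this, again assuming Tweepy 3.x. The function start_stream and the TRACK_TERMS constant are hypothetical names; the two hashtags are the ones this project collects.

```python
TRACK_TERMS = ["#datascience", "#datascientist"]


def start_stream(auth, listener, terms=TRACK_TERMS):
    """Open the filtered stream; this call blocks while tweets arrive."""
    import tweepy  # imported lazily; assumes Tweepy 3.x

    stream = tweepy.Stream(auth=auth, listener=listener)
    stream.filter(track=terms)
    return stream
```

Because filter blocks until the connection ends, stop the script with Ctrl+C once you have collected enough tweets.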

There is no need to create the twitter database and the datascience collection yourself. If they don’t exist, MongoDB creates them when the first tweet is inserted through PyMongo. After a short time we have streamed some tweets. Let’s look into the MongoDB console to explore some of them.
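
Getting a handle on the collection from Python can be sketched as below. The connection URI assumes a local MongoDB on the default port, and get_collection is a hypothetical helper name. Nothing is created on the server at this point; that only happens on the first insert.

```python
def get_collection(uri="mongodb://localhost:27017/",
                   db_name="twitter", coll_name="datascience"):
    """Return a handle to the collection; MongoDB creates it on first insert."""
    from pymongo import MongoClient  # imported lazily

    client = MongoClient(uri)
    return client[db_name][coll_name]
```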

Look into MongoDB

Start the MongoDB console and switch to our database, twitter.

Check that the collection datascience exists and count the number of tweets we have gathered.

Let’s look at a tweet.
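
If you prefer to stay in Python instead of the MongoDB console, the same three checks (does the collection exist, how many tweets are in it, what does one look like) can be sketched with PyMongo. The function name summarize_collection is hypothetical, and db is expected to be a PyMongo database object.

```python
def summarize_collection(db, name="datascience"):
    """Report whether the collection exists, its size, and one sample document."""
    exists = name in db.list_collection_names()
    coll = db[name]
    return {
        "exists": exists,
        "count": coll.count_documents({}) if exists else 0,
        "sample": coll.find_one() if exists else None,
    }
```

With a local server you would pass something like MongoClient("mongodb://localhost:27017/")["twitter"] as db and print the result.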

That looks nice. Now we can do some analysis with this data, but that will be another project.

You can find the Python code on my GitHub. If you have questions, follow me on Twitter.