Setting up a DataScience Server

After installing a lot of software and servers on my laptop, it was overloaded with different tools and running services. Whenever I get a new laptop, or the current one crashes, I have to start over and install everything again, and I had the same tools and servers on my iMac at home. So I decided to set up a DataScience Server with the necessary software and servers.

I know there are easier and faster ways to set up a datascience server, but I want a single server with all the necessary software that I can connect to from my home network. Besides that, it’s a lot of fun to do 😉

In this post I will install a minimal CentOS 7 server with the following software and servers to get started with datascience:

  • Anaconda Python
  • The Jupyter Notebook
  • R and Rstudio Server
  • MongoDB
  • Splunk®

Minimal install CentOS 7 for setting up a datascience server

I will not explain this in detail. If you are not familiar with a CentOS installation, there are plenty of manuals to be found.
Get a fresh “Minimal ISO” copy of the CentOS 7 image from https://www.centos.org/download/.
Burn it with your favorite software, or mount it in your new virtual machine, and boot it. I changed a few things during the installation, such as the root password, timezone and disk layout.

Once the minimal installation has finished, we need to install some additional packages.

Configure NTPD

You can edit your configuration and servers with vi /etc/ntp.conf; the default is good enough for me.
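A minimal sketch of what this step typically looks like on CentOS 7, assuming you are logged in as root and are happy with the stock ntp package and its default pool servers:

```shell
# Run as root (or prefix each command with sudo).
yum install -y ntp       # install the NTP daemon
systemctl enable ntpd    # start it on every boot
systemctl start ntpd     # start it right now
ntpq -p                  # list the time servers ntpd is talking to
```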

Installing Anaconda Python

Anaconda is the leading open datascience platform powered by Python. The open source version of Anaconda is a high performance distribution of Python and R and includes over 100 of the most popular Python, R and Scala packages for datascience. (source: https://www.continuum.io)

Get the latest Linux version from https://www.continuum.io/downloads
This package has a total size of 392M

Follow the instructions. I changed the install location:
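As a rough sketch, running the installer looks something like this. ANACONDA_INSTALLER is a placeholder for whatever file you downloaded, and the bzip2 line is an assumption: some installer versions need it to unpack, and a minimal CentOS may not have it.

```shell
# Run as root if you want to install outside your home directory.
# ANACONDA_INSTALLER is a placeholder: use the name of the file you downloaded.
ANACONDA_INSTALLER="<downloaded-anaconda-installer>.sh"

# Assumption: the installer needs bzip2 to unpack; harmless if already present.
yum install -y bzip2

# The installer is interactive: it shows the license and then asks for the
# install location, which is where you can type a different path.
bash "$ANACONDA_INSTALLER"
```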

And I updated my .bashrc.

After the installation has completed, check your path and reinitialise it.

This is the Python installed by default with CentOS; we need the Anaconda Python to be the default.
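One way to make the Anaconda Python the default is to put its bin directory at the front of PATH. This is only a sketch: /opt/anaconda3 is a hypothetical install location, so replace it with the path you chose during installation.

```shell
# Assumption: Anaconda lives in /opt/anaconda3 (hypothetical path, use your own).
ANACONDA_HOME="/opt/anaconda3"

# Persist the change for future logins.
echo "export PATH=\"$ANACONDA_HOME/bin:\$PATH\"" >> "$HOME/.bashrc"

# Apply it to the current shell too and forget cached command locations.
export PATH="$ANACONDA_HOME/bin:$PATH"
hash -r

# Shows which python is found first on the PATH now.
command -v python || echo "python not found on PATH"
```

If the last command still prints /usr/bin/python, the Anaconda path is probably not the one used above.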

Installing The Jupyter Notebook

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more for doing datascience. (source: http://jupyter.org/)

By default the notebook runs on localhost only and you have to start it by hand. I’ve created a systemd unit file that starts it automatically and runs it as a separate user.

First create the user:
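For example, assuming the account is simply called jupyter (a name picked here for illustration):

```shell
# Run as root. "jupyter" is a hypothetical account name.
# --create-home gives the user a home directory to keep notebooks in.
useradd --create-home --shell /bin/bash jupyter
```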

Create the systemd unit file so the notebook starts automatically on boot.

In vi, press i to enter insert mode and paste the text below; when you are done, press <esc> and type :wq
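A unit file along these lines should do the job. Treat it as a sketch under assumptions rather than the exact file: the jupyter user, the /opt/anaconda3 path, the jupyter.service name and the option values are all choices made here for illustration. Instead of pasting into vi, this version writes a draft to the current directory so you can review it first.

```shell
# Write a draft unit file to the current directory.
# Assumptions: user "jupyter", Anaconda in /opt/anaconda3 (both hypothetical).
cat > jupyter.service <<'EOF'
[Unit]
Description=Jupyter Notebook
After=network.target

[Service]
Type=simple
User=jupyter
WorkingDirectory=/home/jupyter
ExecStart=/opt/anaconda3/bin/jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Review the draft, then install it as root:
#   install -m 644 jupyter.service /etc/systemd/system/jupyter.service
```

Listening on 0.0.0.0 makes the notebook reachable from other machines on the network, so it is worth setting up authentication before relying on it.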

Now that we have created the unit file, we only need to reload systemd and enable the unit file for the Jupyter Notebook.

Once the daemon has started, you can connect to http://your-server-name-here:8888, which shows your home screen.
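Assuming the unit file was saved as /etc/systemd/system/jupyter.service (the file name is a choice made for this sketch), the commands look like this:

```shell
# Run as root.
systemctl daemon-reload                       # pick up the new unit file
systemctl enable jupyter.service              # start on every boot
systemctl start jupyter.service               # start it now
systemctl status jupyter.service --no-pager   # check that it is running
```

If firewalld is active on the server, the port may also need to be opened, for example with firewall-cmd --permanent --add-port=8888/tcp followed by firewall-cmd --reload.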


For detailed configuration, such as encryption and authentication, check the official Jupyter documentation at http://jupyter-notebook.readthedocs.io/en/latest/

Installing R and Rstudio

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. (source: https://www.r-project.org/)

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. (source: https://www.rstudio.com/)

To install R we first need to install the Extra Packages for Enterprise Linux (EPEL) repo.

Refresh the repo

Now we can install R
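The three steps above come down to roughly the following, assuming you run them as root and use the package names as they appear in the CentOS 7 and EPEL repositories:

```shell
# Run as root.
yum install -y epel-release   # add the EPEL repository
yum makecache                 # refresh the repository metadata
yum install -y R              # install R and its dependencies
```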

This will install about 390 packages, so get a cup of coffee 🙂

Once R is installed we can install RStudio Server.
I used the instructions from https://www.rstudio.com/products/rstudio/download-server-2/ and the package has a total size of 280M.
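In outline the installation looks like this. RSTUDIO_RPM_URL is a placeholder: the real link changes with every release, so copy the current RedHat/CentOS RPM link from the download page.

```shell
# Run as root.
# RSTUDIO_RPM_URL is a placeholder: paste the RPM link from the download page.
RSTUDIO_RPM_URL="<paste-the-rpm-url-here>"

yum install -y wget                                # the minimal install has no wget
wget "$RSTUDIO_RPM_URL"                            # download the package
yum install -y --nogpgcheck rstudio-server-*.rpm   # install it with its dependencies
systemctl status rstudio-server --no-pager         # check that the service is up
```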

If everything went fine you can connect to your server with the following URL, and you will see a sign-in screen.

http://your-server-name-here:8787


See the official Getting Started document for information on configuring and managing the server.

Installing MongoDB

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. (source: https://www.mongodb.com/)

I installed MongoDB as described in the installation guide on the MongoDB site; why should I write it all out again :). I installed CentOS 7, so I used the Red Hat installation guide. To determine which platform you are running on CentOS, use the following command.

As you can see we are running on a 64-bit platform. That’s fine, because the installation guide only supports 64-bit systems 😀
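One common way to check this is uname, shown here as an example rather than as the exact command from the guide:

```shell
# Print the machine hardware name; a 64-bit Intel/AMD system reports x86_64.
uname -m
```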

Disable SELinux by setting the SELINUX setting to disabled in /etc/selinux/config.
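If you prefer not to edit the file by hand, a sed one-liner can make the change. This is a sketch that assumes the stock /etc/selinux/config layout and keeps a backup copy first:

```shell
# Run as root.
cp /etc/selinux/config /etc/selinux/config.bak                  # keep a backup
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config    # change the setting
grep '^SELINUX=' /etc/selinux/config                            # show the new value

# The change takes effect after a reboot; afterwards check it with: getenforce
```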

You can check whether MongoDB is running and listening on its TCP port.
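For example, assuming the default service name mongod and the default port 27017:

```shell
# Is the service active?
systemctl status mongod --no-pager

# Is something listening on MongoDB's default TCP port?
ss -tln | grep 27017
```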

Installing Splunk®

You see servers and devices, apps and logs, traffic and clouds. We see data—everywhere. Splunk®offers the leading platform for Operational Intelligence. It enables the curious to look closely at what others ignore—machine data—and find what others never see: insights that can help make your company more productive, profitable, competitive and secure. What can you do with Splunk?
Just ask.

To download Splunk® you need to create an account on www.splunk.com.

You can get your free Splunk® Enterprise here: https://www.splunk.com/en_us/download/splunk-enterprise.html

Choose Linux, then the 64-bit package. Stop the browser download, because we are going to download it with the wget command instead. On the right side you will find “Got wget?”; click it and copy the URL into your Linux console to download the rpm package.


Once Splunk® has been downloaded, install it with rpm.
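A sketch of the install command. The glob stands in for the exact file name, which depends on the version you downloaded:

```shell
# Run as root, from the directory that holds the downloaded package.
# splunk-*.rpm is a stand-in for the real, version-specific file name.
rpm -i splunk-*.rpm
```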

After the install, go to the directory where Splunk® is installed and start it. We accept the license directly as part of the start command.
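With the rpm package, Splunk ends up in /opt/splunk by default, so assuming that location the start looks like this:

```shell
# Run as root. /opt/splunk is the default install location of the rpm package.
cd /opt/splunk/bin
./splunk start --accept-license   # start Splunk and accept the license in one go
```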

You can now connect to http://your-server-name-here:8000


Configure Splunk® to start automatically
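Splunk ships with a command for this. Assuming the default /opt/splunk install location:

```shell
# Run as root. Registers Splunk so that it starts when the server boots.
/opt/splunk/bin/splunk enable boot-start
```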

Well, that’s it. We can now start to gather some data and do some datascience.

If you have any questions, follow me on Twitter or mail me; you can find my contact information in the footer.

Good luck!
