
Big Data Cluster for teaching and research

The center for High Performance Computing at the Vienna University of Technology (TU Wien) offers a small Hadoop cluster named LBD (short for Little Big Data), composed of 20 nodes (20 servers: 2x Xeon E5-2650v4 / 20 cores / 256 GB / 10 Gbps). The cluster has been operational since December 2017 and is currently used for teaching, seminars, and science.

The aim of the Little Big Data Cluster is to offer researchers and lecturers at the Vienna University of Technology a stable working environment.

Hardware

LBD has the following hardware setup. The namenode c100 is also called lbd and is reachable from within the tuwien domain at lbd.zserv.tuwien.ac.at. Each of the 20 nodes (c100-c118, h1) has

Apart from two extra Ethernet devices for external connections on h1 and on c100, all nodes have the same hardware configuration. All Ethernet connections (external and inter-node) support a speed of 10 Gb/s.

HDFS configuration

Available software

Name | Comment | Status
CentOS 7 | Operating system | OK
XCAT | Deployment environment | OK
Jupyterlab | Web-based user interface running at https://lbd.zserv.tuwien.ac.at:8000 | OK
Cloudera Manager | Big Data deployment | OK
Cloudera HDFS | Hadoop Distributed File System | OK
Cloudera Accumulo | Key/value store | OK
Cloudera HBase | Database on top of HDFS | OK
Cloudera Hive | Data warehouse using SQL | OK
Cloudera Hue | Hadoop user experience, web GUI, SQL analytics workbench | OK
Cloudera Impala | SQL query engine, used by Hue | OK
Oozie | Workflow scheduler system to manage Apache Hadoop jobs, used by Hue | OK
Cloudera Solr | Open source enterprise search platform, used by Hue and by the Key-Value Store Indexer | OK
Cloudera Key-Value Store Indexer | Uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables. Indexing allows you to query data stored in HBase with the Solr service. | OK
Cloudera Spark (Spark 2) | Cluster-computing framework with Scala 2.10 (2.11) | OK
Cloudera YARN (MR2 Included) | Yet Another Resource Negotiator (cluster management) | OK
Cloudera ZooKeeper | Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services | OK
Java 1.8 | Programming language | OK
Python 3.6.3 (python3.6), Python 3.4.5 (python3.4), Python 2.7.5 (python2) | Programming language | OK
Anaconda Python (python) | export PATH=/home/anaconda3/bin/:$PATH | OK
Jupyter Notebook | Requires Anaconda | OK
Singularity | Performs root-less containerization | OK
PostgreSQL | Relational database management system | OK
MongoDB | Requires disk space, not on all nodes | Beta testing
Kafka | Data stream processing | Beta testing
Cassandra | Requires disk space, not on all nodes | TODO
Storm | Spark Streaming instead? | On further request
Drill | | -
Flume | | -
Kudu | | -
Zeppelin | | -
Giraph | | TODO

Get access to the cluster

To get a new user account, send a request to hadoop-support@tuwien.ac.at.

Course instructors can request accounts for TU students by providing the following information:

Check this LVA checklist for details.

Technical support

If you need support with the Hadoop cluster, please contact hadoop-support@tuwien.ac.at. For general enquiries, contact the dataLAB team at hadoop@tuwien.ac.at.

Jupyter Notebook

Jupyter is reachable from the TU Wien network only. To use a Jupyter notebook, connect to https://lbd.zserv.tuwien.ac.at:8000 and log in with your user credentials. Then start a new notebook (e.g. Python3 or PySpark3) or a terminal. A short example in PySpark3:
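import random

import pyspark

# Create a Spark context for this notebook session.
sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000

def inside(_):
    # Draw a random point in the unit square; keep it if it lies inside the quarter unit circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# Count, in parallel, how many samples fall inside the quarter circle.
count = sc.parallelize(range(num_samples)).filter(inside).count()
# The fraction of hits approximates pi / 4.
pi = 4 * count / num_samples
print(pi)
# Release the resources held by this context.
sc.stop()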
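
To check what the PySpark example computes without using the cluster, the same Monte Carlo estimate can be sketched in plain Python. This is only an illustrative local version, not part of the LBD setup: the function name estimate_pi and the seed parameter are assumptions introduced here so that runs are repeatable.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Monte Carlo estimate of pi in plain Python (no Spark required).

    estimate_pi and seed are hypothetical names used for illustration only.
    """
    rng = random.Random(seed)  # private generator so a fixed seed gives repeatable runs
    count = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The point lies inside the quarter unit circle when x^2 + y^2 < 1.
        if x * x + y * y < 1:
            count += 1
    # (quarter circle area) / (unit square area) = pi / 4
    return 4 * count / num_samples


if __name__ == "__main__":
    print(estimate_pi(10000, seed=42))
```

With 10000 samples the result is typically accurate to about two decimal places. The Spark version distributes the same sampling loop across the cluster through parallelize and filter, which pays off only for much larger sample counts.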

Announcements

Documents and presentations related to LBD

Participation in the PRACE Autumn School 2020
Introduction to Hadoop by Giovanna Roda
Bachelor thesis: mapreduce-join-algorithms
The Little Big Data tech stack
The technologies we use for our daily work at-a-glance
Big Data on the Vienna Scientific Cluster
Extended abstract ASHPC21 First Austrian-Slovenian HPC Meeting 2021
Apache Spark is here to stay
Extended abstract Austrian HPC Meeting 2020
Apache Spark is here to stay (slides)
Presentation at the Austrian HPC Meeting 2020
Big Data at TU-Wien: current deployment status and outlook
Presentation of the Little Big Data cluster at the Austrian HPC Meeting 2019
Hadoop @TU.it
Presentation for the rector
ssh tunnel for Yarn logs
ssh tunnel for Yarn logs (in German)
ssh with putty (in German)
Hue (TU Wien network only)
Hue is a web-based interactive query editor that enables you to interact with data warehouses.