Big Data Cluster for teaching and research

The Center for High Performance Computing at Vienna's Technical University offers a small Hadoop cluster named LBD (short for Little Big Data) composed of 20 nodes (20 servers: 2x XeonE5-2650v4, 20 cores, 256GB, 10Gbps). The cluster has been operational since December 2017 and is currently used for teaching, seminars, and research.

The aim of the Little Big Data Cluster is to offer researchers and lecturers at the Vienna University of Technology a stable working environment.

Hardware

LBD has the following hardware setup:

2 namenodes (primary on c100, secondary on c101)
18 datanodes c101-c118
an administrative server h1, serving as
- Cloudera Manager server
- backup of administrative data
a ZFS file server lbdnfs01 for /home with 300TB of storage space

The namenode c100 is also called lbd and is reachable from within the tuwien domain at lbd.zserv.tuwien.ac.at. Each of the 20 nodes (c100-c118, h1) has

two XeonE5-2650v4 CPUs with 24 cores each (total of 48 cores per node, 864 total worker cores)
256GB RAM (total of 4.5TB memory available to worker nodes of the whole cluster)
four hard disks, each with a capacity of 4TB (total of 16TB per node, 288TB total for worker nodes)
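
The cluster-wide totals above are the per-node values multiplied by the 18 worker nodes. A minimal sketch of that arithmetic follows. The variable names are illustrative, and 1TB is taken as 1024GB, which is an assumption chosen because it reproduces the 4.5TB figure.

```python
# Per-node figures as listed above; names are illustrative only.
WORKER_NODES = 18        # datanodes c101-c118
CORES_PER_NODE = 48
RAM_GB_PER_NODE = 256
DISKS_PER_NODE = 4
DISK_TB = 4

total_cores = WORKER_NODES * CORES_PER_NODE
total_ram_tb = WORKER_NODES * RAM_GB_PER_NODE / 1024   # assumes 1TB = 1024GB
total_disk_tb = WORKER_NODES * DISKS_PER_NODE * DISK_TB

print(total_cores, total_ram_tb, total_disk_tb)   # 864 4.5 288
```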

Apart from two extra Ethernet devices for external connections on h1 and on c100, all nodes have the same hardware configuration. All Ethernet connections (external and inter-node) support a speed of 10Gb/s.

HDFS configuration

current version: Hadoop 3
block size: 128 MiB
default replication factor: 3
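
To illustrate what these two settings mean for storage, the following sketch estimates how many blocks a file is split into and how much raw disk space it occupies once replicated. The function name `hdfs_footprint` is hypothetical, and the calculation ignores metadata and any per-file overrides of the defaults.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # 128 MiB, as configured above
REPLICATION = 3              # default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # The last block is not padded, so raw usage is size times replication.
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1024**3)   # a 1 GiB file
print(blocks, raw / 1024**3)            # 8 blocks, 3.0 GiB raw
```

By the same reasoning, the 288TB of raw disk space on the worker nodes would hold at most about 96TB of data at replication factor 3, assuming all of it were assigned to HDFS, which this page does not state.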

Available software

Name	Description	Status
CentOS 7	Operating system	OK
XCAT	Deployment environment	OK
Jupyterlab	Web-based user interface running at: https://lbd.zserv.tuwien.ac.at:8000	OK
Cloudera Manager	Big Data Deployment	OK
Cloudera HDFS	Hadoop Distributed File System	OK
Cloudera Accumulo	Key/value store	OK
Cloudera HBase	Database on top of HDFS	OK
Cloudera Hive	Data warehouse using SQL	OK
Cloudera Hue	Hadoop user experience, web gui, SQL analytics workbench	OK
Cloudera Impala	SQL query engine, used by Hue	OK
Oozie	Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Used by Hue	OK
Cloudera Solr	open source enterprise search platform, used by Hue, used by Key-Value Store Indexer	OK
Cloudera Key-Value Store Indexer	The Key-Value Store Indexer service uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables. Indexing allows you to query data stored in HBase with the Solr service.	OK
Cloudera Spark (Spark 2)	cluster-computing framework with Scala 2.10 (2.11)	OK
Cloudera YARN (MR2 Included)	Yet Another Resource Negotiator (cluster management)	OK
Cloudera ZooKeeper	ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.	OK
Java 1.8	Programming language	OK
Python 3.6.3 (python3.6), Python 3.4.5 (python3.4), Python 2.7.5 (python2)	Programming language	OK
Anaconda Python (python)	export PATH=/home/anaconda3/bin/:$PATH	OK
Jupyter	Notebook, requires Anaconda	OK
Singularity	performs root-less containerization	OK
PostgreSQL	relational database management system	OK
MongoDB	requires disk space, not on all nodes	Beta testing
Kafka	Data stream processing	Beta testing
Cassandra	requires disk space, not on all nodes	TODO
Storm	possibly Spark Streaming instead?	on further request
Drill		-
Flume		-
Kudu		-
Zeppelin		-
Giraph		TODO

Get access to the cluster

To get a new user account, send a request to hadoop-support@tuwien.ac.at.

Course instructors can request accounts for TU students by providing the following information:

a CSV file in the TISS format:

"0123456";"Lastname";"Firstname";"e0123456@student.tuwien.ac.at"

a unique name for their class (such as ADBS19)
an expiry date for the accounts
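
Before sending the file, it can help to check that every row has the four semicolon-separated, double-quoted fields shown in the sample line. The sketch below does this with Python's standard `csv` module. The field names and the function `read_tiss_rows` are assumptions made for illustration, not part of any official TISS tooling, and the e-mail check is deliberately loose.

```python
import csv
import io
import re

# Column meaning inferred from the sample row; the names are assumptions.
FIELDS = ["student_id", "last_name", "first_name", "email"]

def read_tiss_rows(text):
    """Parse semicolon-separated, double-quoted rows into dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
    rows = []
    for lineno, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(FIELDS):
            raise ValueError(
                f"line {lineno}: expected {len(FIELDS)} fields, got {len(row)}")
        record = dict(zip(FIELDS, (value.strip() for value in row)))
        # Loose sanity check only: something@something without whitespace.
        if not re.fullmatch(r"[^@\s]+@[^@\s]+", record["email"]):
            raise ValueError(f"line {lineno}: invalid e-mail {record['email']!r}")
        rows.append(record)
    return rows

sample = '"0123456";"Lastname";"Firstname";"e0123456@student.tuwien.ac.at"\n'
students = read_tiss_rows(sample)
print(students[0]["student_id"], students[0]["email"])
```

A row with a missing field or a malformed address raises a `ValueError` that names the offending line, so problems can be fixed before the request is submitted.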

See the LVA (course) checklist for details.

Technical support

If you need support with the Hadoop cluster, please contact hadoop-support@tuwien.ac.at. For general enquiries, contact the dataLAB team at hadoop@tuwien.ac.at.

Jupyter Notebook

To use a Jupyter notebook (TU Wien network only), connect to https://lbd.zserv.tuwien.ac.at:8000 and log in with your user credentials. Then start a new notebook (e.g. Python3 or PySpark3) or a terminal. A short example in PySpark3:

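import random

import pyspark

# Estimate pi by Monte Carlo sampling, distributed over the cluster.
sc = pyspark.SparkContext(appName="Pi")
num_samples = 10000

def inside(_):
    # Draw a random point in the unit square and test whether it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()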
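
The example estimates pi as four times the fraction of random points in the unit square that fall inside the quarter circle. To check the logic without a cluster, the same estimator can be sketched in plain Python. The function name `estimate_pi` and the `seed` parameter are illustrative and not part of the original example.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of pi on a single machine (no Spark)."""
    rng = random.Random(seed)   # seeded so repeated runs give the same result
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            hits += 1
    return 4 * hits / num_samples

for n in (1_000, 100_000):
    print(n, estimate_pi(n))
```

The statistical error of this estimator shrinks roughly with the inverse square root of the number of samples, so 100 times more samples give about one more correct digit.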

Announcements

Documents and presentations related to LBD

Participation in the PRACE Autumn School 2020: Introduction to Hadoop by Giovanna Roda
Bachelor thesis: mapreduce-join-algorithms
The Little Big Data tech stack: The technologies we use for our daily work at-a-glance
Big Data on the Vienna Scientific Cluster: Extended abstract, ASHPC21, First Austrian-Slovenian HPC Meeting 2021
Apache Spark is here to stay: Extended abstract Austrian HPC Meeting 2020
Apache Spark is here to stay (slides): Presentation at the Austrian HPC Meeting 2020
Big Data at TU-Wien: current deployment status and outlook: Presentation of the Little Big Data cluster at the Austrian HPC Meeting 2019
Hadoop @TU.it: presentation for the rector
ssh tunnel for Yarn logs
ssh tunnel for Yarn logs (in German)
ssh with putty (in German)
Hue (TU Wien network only): Hue is a web-based interactive query editor that enables you to interact with data warehouses.