This is the term project for the course Big Data Systems at Pusan National University, South Korea. Our project scrapes Steam reviews, processes them using a Hadoop cluster, and stores the results in a MariaDB database. The project is divided into three parts:
- bds-seoul-mariadb: This repository contains the code for the MariaDB database that stores the processed data.
- bds-seoul-hadoop: This repository contains the code for the Hadoop cluster that processes the data.
- bds-seoul-client: This repository contains the code for the UI and API that interact with the MariaDB database and the Hadoop cluster.
Each part is deployed on a Raspberry Pi, and the entire system is designed to work together seamlessly.
The frontend is built with React and the backend with Python. This combination allowed us to display the results of the MapReduce job in a user-friendly way.
Kafka is used to send messages between nodes in the cluster, and WebSockets are used to give the user real-time updates on their scraper job. We also implemented a form of caching to reduce the processing time needed for each request.
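As a rough sketch of this messaging pattern (the topic name and payload shape are assumptions for illustration, not the project's actual code), the server-side cache check could be produced like this:

# Sketch of how the server publishes a cache-check message over Kafka.
# Topic and payload names are hypothetical; see the repositories above for
# the real producer and consumer implementations.
import json
import os

from confluent_kafka import Producer

producer = Producer(
    {"bootstrap.servers": os.getenv("KAFKA_BOOTSTRAP_SERVERS", "host.docker.internal")}
)


def request_last_scraped_date(game_id: int) -> None:
    # Ask the database side when this game was last scraped, so cached
    # results can be reused instead of re-running the whole pipeline.
    payload = json.dumps({"game_id": game_id}).encode("utf-8")
    producer.produce("last_scraped_date", value=payload)
    producer.flush()  # block until the broker acknowledges the message


request_last_scraped_date(730)  # the Steam game ID used in the log excerpts below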
Note
Make sure bds-seoul-mariadb and bds-seoul-hadoop are up and running before starting the UI and API. This repository is the third and final step in the startup process: start bds-seoul-mariadb first, then bds-seoul-hadoop, and finally bds-seoul-client.
- Clone the repository:
  git clone https://github.com/jathavaan/bds-seoul-client.git
- Navigate to the project directory:
  cd bds-seoul-client
There are two different setups for this project: one for local development and one for running on Raspberry Pis.
- Navigate to the server directory:
  cd server
- Add a .env file in the server directory with the following content:
  KAFKA_BOOTSTRAP_SERVERS=host.docker.internal
  SEQ_SERVER=host.docker.internal
  SEQ_PORT=5341
- Navigate to the client directory:
  cd ../client
- Add a .env file in the client directory with the following content:
  VITE_BE_IP=localhost:5000
Once this is done, you can check out the Starting the Services section to start the services.
Note
Setting up on the Raspberry Pis takes a long time, depending on your internet connection. It is therefore recommended to run this project locally.
The Raspberry Pi setup is more complex and requires multiple steps. The first step is to identify the IP addresses of the Raspberry Pis. Open a terminal on your computer and SSH into the Raspberry Pi:
ssh seoul-1@<ip-address-of-raspberry-pi-1>
Replace <ip-address-of-raspberry-pi-1> with the actual IP address of the Raspberry Pi, and enter the password seoul-1.
Then change directory into the bds-seoul-client directory:
cd bds-seoul-client
We use envsubst to inject the correct IP addresses when building the Docker images. The IP addresses are set in ~/.zshrc. To set them, open the file with nano ~/.zshrc and add the following lines:
export SEOUL_1_IP=<ip-address-of-seoul-1-raspberry-pi>
export SEOUL_2_IP=<ip-address-of-seoul-2-raspberry-pi>
export SEOUL_3_IP=<ip-address-of-seoul-3-raspberry-pi>
export SEOUL_4_IP=<ip-address-of-seoul-4-raspberry-pi>
Press CTRL + X, then Y, then Enter to save the file. After that, run
source ~/.zshrc
to apply the changes. The IP addresses for the Raspberry Pis are now set, and you only need to repeat this step if the IP address of any Raspberry Pi changes.
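To confirm that the variables are set, you can print one of them:
echo $SEOUL_1_IP
This should output the IP address you just added.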
Using envsubst, you can now generate the correct .env files. First change directory to the server directory with cd server, then run:
envsubst < .env.template > .env
This creates a .env file with the correct IP addresses for the Raspberry Pis. Run
cat .env
to verify that the environment variables are set correctly. Now change directory to the client directory with
cd ../client
and repeat the process:
envsubst < .env.template > .env
Again, run
cat .env
to verify the result.
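To illustrate what envsubst does here: it replaces every ${VAR} reference in the template with the value exported in ~/.zshrc. The mapping below is an assumption for illustration only; check the actual .env.template files in each directory for the real contents.

# Hypothetical server/.env.template (illustrative; the real template may differ)
KAFKA_BOOTSTRAP_SERVERS=${SEOUL_1_IP}
SEQ_SERVER=${SEOUL_1_IP}
SEQ_PORT=5341

After running envsubst < .env.template > .env, the generated .env contains the literal IP addresses, for example KAFKA_BOOTSTRAP_SERVERS=192.168.1.101.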
Note
If you are doing this on a Raspberry Pi, use sudo docker compose instead of docker-compose.
The next step is to create the containers by running the following command in the root of the bds-seoul-client directory:
docker-compose up -d
The UI can be accessed at http://localhost:5173 and the API at http://localhost:5000.
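To quickly verify that the API is reachable, send it a request; any HTTP response, even an error page, shows that the server is listening on the port:
curl -i http://localhost:5000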
There are some known issues with this system. Many problems can be solved by taking down the problematic container and rebuilding it like this:
docker-compose down -v
docker-compose build --no-cache
docker-compose up -d
Info
Repository: bds-seoul-mariadb
Container(s): database-script
Problem
This error occurs in the database-script container and is almost always caused by the database-script container being started too early. The error looks something like this:
Traceback (most recent call last):
2025-06-09T23:50:48.644765226Z File "/app/main.py", line 9, in <module>
2025-06-09T23:50:48.651105905Z container.kafka_service()
2025-06-09T23:50:48.651235681Z File "src/dependency_injector/providers.pyx", line 256, in dependency_injector.providers.Provider.__call__
2025-06-09T23:50:48.656903697Z File "src/dependency_injector/providers.pyx", line 3052, in dependency_injector.providers.Singleton._provide
2025-06-09T23:50:48.659812683Z File "src/dependency_injector/providers.pxd", line 650, in dependency_injector.providers.__factory_call
2025-06-09T23:50:48.662689848Z File "src/dependency_injector/providers.pxd", line 608, in dependency_injector.providers.__call
2025-06-09T23:50:48.667693058Z File "/app/src/application/services/kafka_service/kafka_service.py", line 19, in __init__
2025-06-09T23:50:48.684920334Z existing_topics = self.__admin_client.list_topics(timeout=10).topics
2025-06-09T23:50:48.685069094Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-09T23:50:48.685102151Z File "/usr/local/lib/python3.11/site-packages/confluent_kafka/admin/__init__.py", line 639, in list_topics
2025-06-09T23:50:48.688931730Z return super(AdminClient, self).list_topics(*args, **kwargs)
2025-06-09T23:50:48.688974157Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-09T23:50:48.688980664Z cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
Solution
Let the Kafka container run for 30-60 seconds before starting the database-script container.
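A more robust alternative is to make the script retry the failing call until the broker is reachable. Below is a minimal sketch, not the project's actual code; it only assumes confluent-kafka, which the traceback shows is in use, and the KAFKA_BOOTSTRAP_SERVERS variable from the setup guide.

# Sketch: retry the Kafka metadata fetch until the broker is up, instead of
# crashing on the first attempt like the traceback above.
import os
import time

from confluent_kafka import KafkaException
from confluent_kafka.admin import AdminClient


def wait_for_kafka(retries: int = 12, delay_seconds: float = 5.0) -> AdminClient:
    admin_client = AdminClient(
        {"bootstrap.servers": os.getenv("KAFKA_BOOTSTRAP_SERVERS", "host.docker.internal")}
    )
    for attempt in range(1, retries + 1):
        try:
            # The same call that fails in the traceback; it raises a
            # KafkaException with _TRANSPORT while the broker is starting up.
            admin_client.list_topics(timeout=10)
            return admin_client
        except KafkaException:
            print(f"Kafka not reachable yet (attempt {attempt}/{retries}), retrying...")
            time.sleep(delay_seconds)
    raise RuntimeError("Kafka broker did not become reachable in time")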
Info
Repository: bds-seoul-client, bds-seoul-mariadb
Container(s): server, kafka, zookeeper
Problem
The cache check is the first message sent with Kafka. This first message often takes up to 5 minutes to complete. If it is stuck, action needs to be taken. The logs will be stuck on this message:
[INFO] 2025-06-09 23:59:03 application.kafka.producers.last_scraped_date_producer:20 Requested last scraped date for Steam game ID 730
Solution
First, stop all containers in bds-seoul-client and bds-seoul-hadoop by running the following command in each repository's root directory:
docker-compose stop
Then turn off the database-script, kafka, and zookeeper containers in the bds-seoul-mariadb repository. Use the following command to do so:
docker-compose stop kafka database-script zookeeper
Then take down both the zookeeper and kafka containers:
docker-compose down -v zookeeper
docker-compose down -v kafka
Start Zookeeper and Kafka with 30-60 seconds between each startup:
docker-compose up -d zookeeper
docker-compose up -d kafka
After Kafka has been running for 30-60 seconds, start the database-script container:
docker-compose up -d database-script
Then start the containers in the bds-seoul-hadoop and bds-seoul-client repositories as instructed in the setup guide. Try to request data for the same game again; the server logs should look something like this:
[INFO] 2025-06-10 01:33:19 application.kafka.consumers.last_scraped_date_consumer:45 Response received from last scraped date topic
[INFO] 2025-06-10 01:33:19 application.services.scraper_service.scraper_service:44 Scraping Steam game ID 730 until 500 reviews have been scraped or review date is 2025-06-10 00:00:00 [https://steamcommunity.com/app/730/reviews/?browsefilter=mostrecent&snr=1_5_100010_&p=1&filterLanguage=all]
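For convenience, the whole recovery sequence can be scripted with the waits built in. This is a sketch: the repository paths are assumptions and must be adjusted to wherever you cloned the three repositories.

# Stop dependent services first
(cd ~/bds-seoul-client && docker-compose stop)
(cd ~/bds-seoul-hadoop && docker-compose stop)
# Reset Kafka and Zookeeper in bds-seoul-mariadb
cd ~/bds-seoul-mariadb
docker-compose stop kafka database-script zookeeper
docker-compose down -v zookeeper
docker-compose down -v kafka
docker-compose up -d zookeeper
sleep 45  # give Zookeeper 30-60 seconds before starting Kafka
docker-compose up -d kafka
sleep 45  # give Kafka 30-60 seconds before starting database-script
docker-compose up -d database-script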
Info
Repository: bds-seoul-hadoop
Container(s): namenode
Problem
This mostly occurs when there are messages from a previous job left in the queue.
Solution
Stop the containers in the bds-seoul-client and bds-seoul-hadoop repositories and restart the database-script container in the bds-seoul-mariadb repository. This will clear out all messages. You can then restart the other containers.
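For example, from the root of the bds-seoul-mariadb repository:
docker-compose restart database-script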

