Crawling, Indexing, and Storage System for United Nations Member States' Press Statements in Real Time.
This repository provides much of the structure and parsing code needed to crawl and scrape the content of various countries' Ministry of Foreign Affairs (MFA) web pages with very little effort. When the Diplomatic Pulse Scrapy crawlers are launched, Diplomatic Pulse clones our back-end data: the XPaths describing each country's website layout, which are used to crawl and scrape the HTML content.
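As a minimal sketch of the per-country XPath idea, the example below applies a small table of XPaths to a page and pulls out a title and body. The country key `exampleland`, the field names, the XPaths, and the sample markup are all hypothetical, and the standard library's `xml.etree.ElementTree` stands in for Scrapy's selectors so that the snippet runs without extra dependencies. It is not the project's actual configuration or parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-country configuration: the country key, field names and
# XPaths are illustrative assumptions, not the project's real back-end data.
COUNTRY_XPATHS = {
    "exampleland": {
        "title": ".//h1[@class='title']",
        "body": ".//div[@class='content']/p",
    },
}

# Well-formed markup standing in for a downloaded MFA page (assumed sample).
SAMPLE_PAGE = """
<html><body>
  <h1 class="title">Statement on regional cooperation</h1>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def extract_statement(page, xpaths):
    """Apply one country's XPaths to a page and return the extracted fields."""
    root = ET.fromstring(page.strip())
    title_node = root.find(xpaths["title"])
    title = "".join(title_node.itertext()).strip() if title_node is not None else ""
    paragraphs = ["".join(p.itertext()).strip() for p in root.findall(xpaths["body"])]
    return {"title": title, "body": "\n".join(paragraphs)}


statement = extract_statement(SAMPLE_PAGE, COUNTRY_XPATHS["exampleland"])
print(statement["title"])
```

Keeping the XPaths in data rather than in code means that a layout change on one ministry's site only requires updating that country's entry.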
- Python 3.6+
- Docker version >=19.03.12
A docker-compose.yml file is provided. Once Docker is installed and running, clone the repository, change to the project directory, and start the containers:
git clone git@github.com:qcri/DiplomaticPulse.git
cd DiplomaticPulse
docker-compose up
See the install section in the documentation at https://diplomaticpulse.qcri.org/docs/installation.html for more details.
For each visited country web page, the content is extracted and indexed into the Elasticsearch index dppa.st.
The indexed data can be browsed using Dejavu, a free and open-source web UI for Elasticsearch.
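To illustrate what indexing into dppa.st can look like at the request level, the sketch below builds a request body in the newline-delimited JSON format accepted by Elasticsearch's `_bulk` endpoint. The document fields (`country`, `title`, `url`) and the helper name `build_bulk_body` are assumptions made for illustration. The project's real document schema and indexing code may differ.

```python
import json

INDEX_NAME = "dppa.st"  # index name taken from this README


def build_bulk_body(statements):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint.

    Each document is preceded by an action line naming the target index,
    and the whole body must end with a newline.
    """
    lines = []
    for doc in statements:
        lines.append(json.dumps({"index": {"_index": INDEX_NAME}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Hypothetical document: the field names and values are illustrative only.
body = build_bulk_body([
    {
        "country": "Exampleland",
        "title": "Statement on regional cooperation",
        "url": "https://mfa.example/statements/1",
    },
])
print(body)
```

The resulting string would be sent with a POST to the `_bulk` endpoint using the `application/x-ndjson` content type.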
Assuming all containers are running on your local machine, go to:
- http://localhost:1358
- connect to index dppa.st with http://localhost:9200
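As an alternative to the web UI, a quick check that documents have reached the index can be made against Elasticsearch's `_count` endpoint. The following is a minimal sketch that assumes the containers are running locally at the address above. The helper names `count_url` and `fetch_count` are hypothetical and not part of the project.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # Elasticsearch address used in this README
INDEX_NAME = "dppa.st"


def count_url(base_url, index):
    """Build the URL of the Elasticsearch _count endpoint for an index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def fetch_count(base_url=ES_URL, index=INDEX_NAME, timeout=5):
    """Return the number of documents in the index.

    Requires the Elasticsearch container to be reachable; raises
    urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(count_url(base_url, index), timeout=timeout) as response:
        return json.load(response)["count"]
```

Calling `fetch_count()` once the crawlers have run should return a non-zero number if statements were indexed.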
Diplomatic Pulse uses the Scrapy UI to observe each crawler's job history and status.
You can access the UI at:
using the username and password (SCRAPY_WEB_USERNAME, SCRAPY_WEB_PASSWORD) set in the .env file.
Documentation is available online at https://diplomaticpulse.qcri.org/docs and in the docs directory.
See https://diplomaticpulse.qcri.org/docs/contributing.html for details.
