Diplomatic pulse

A system for crawling, indexing, and storing press statements from United Nations Member States in real time.

Overview

This repository provides much of the structure and parsing code needed to crawl and scrape the contents of countries' Ministry of Foreign Affairs (MFA) web pages with very little effort. When the Diplomatic pulse Scrapy crawlers are launched, they load our back-end data, namely the XPaths that describe each country's website layout, and use them to crawl and scrape the HTML content.
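The idea of driving extraction from per-country XPaths can be sketched as follows. This is a minimal illustration, not the project's code: the country name, the XPath rules, the sample page, and the `extract_statement` helper are all hypothetical, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath and needs well-formed markup) stands in for Scrapy selectors to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout rules for one country; the real project keeps its own
# XPaths in its back-end data.
COUNTRY_XPATHS = {
    "exampleland": {
        "title": ".//h1[@class='title']",
        "body": ".//div[@class='content']/p",
    },
}

# Hypothetical, well-formed sample page used only for this sketch.
SAMPLE_PAGE = """
<html><body>
  <h1 class="title">Statement on regional cooperation</h1>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def extract_statement(country, page_source):
    """Apply the XPaths registered for `country` to one page."""
    rules = COUNTRY_XPATHS[country]
    root = ET.fromstring(page_source.strip())

    # The title rule is expected to match a single element.
    title_node = root.find(rules["title"])
    title = "".join(title_node.itertext()).strip() if title_node is not None else ""

    # The body rule may match several paragraphs; join them with newlines.
    paragraphs = ["".join(p.itertext()).strip() for p in root.findall(rules["body"])]

    return {"country": country, "title": title, "body": "\n".join(paragraphs)}


if __name__ == "__main__":
    print(extract_statement("exampleland", SAMPLE_PAGE))
```

Keeping the layout rules in data rather than in code means a new country can be supported by adding an entry instead of writing a new parser.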

Requirements

Python 3.6+
Docker version >=19.03.12

Installing and executing

A Docker Compose file (docker-compose.yaml) is provided. Once Docker is installed and running, clone the repository, change to the project directory, and start the containers:

git clone git@github.com:qcri/DiplomaticPulse.git
cd DiplomaticPulse
docker-compose up

For more details, see the installation section of the documentation: https://diplomaticpulse.qcri.org/docs/installation.html

Browsing extracted indexed data

The content of each visited country's web page is extracted and indexed into the Elasticsearch index dppa.st. The indexed data can be browsed with Dejavu, a free and open-source web UI for Elasticsearch.

Assuming all containers are running on your local machine, go to:

http://localhost:1358
Then connect to the index dppa.st using the Elasticsearch URL http://localhost:9200
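The same index can also be queried directly over the Elasticsearch REST API. The sketch below builds a `_search` request against the index and endpoint named above, using only the Python standard library. The field name `content` is an assumption made for illustration; check the actual index mapping for the real field names. The `search` helper only works while the containers are running.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # Elasticsearch endpoint named above
INDEX = "dppa.st"                 # index name named above


def build_search_request(text, size=5, field="content"):
    """Build a full-text search request.

    `field` is a hypothetical field name; replace it with one that exists
    in the index mapping.
    """
    body = {"size": size, "query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text, size=5):
    """Send the request and return the list of matching hits."""
    request = build_search_request(text, size)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["hits"]["hits"]
```

For example, `search("climate")` would return up to five matching documents, assuming the hypothetical field exists.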

Monitoring of Diplomatic pulse

Diplomatic pulse uses a Scrapy web UI, which shows the history and status of each crawler job.

You can access the UI at:

http://localhost:5000/1/jobs/

Log in with the username and password set as SCRAPY_WEB_USERNAME and SCRAPY_WEB_PASSWORD in the .env file.
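If you want to read these credentials from a script rather than by opening the file, a small parser is enough. The sketch below assumes the .env file uses plain `KEY=value` lines; the two variable names come from the text above, while the helper names and the sample values are hypothetical placeholders.

```python
def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def scrapy_web_credentials(env_text):
    """Return (username, password) for the web UI from .env contents."""
    env = parse_env(env_text)
    return env["SCRAPY_WEB_USERNAME"], env["SCRAPY_WEB_PASSWORD"]


# Placeholder contents for illustration only; not the project's real values.
SAMPLE_ENV = """
# web UI login
SCRAPY_WEB_USERNAME=example_user
SCRAPY_WEB_PASSWORD="example_password"
"""

if __name__ == "__main__":
    print(scrapy_web_credentials(SAMPLE_ENV))
```

To use it on the real file, pass the file's contents, for example `scrapy_web_credentials(open(".env").read())` from the project directory.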

Full Documentation

Documentation is available online at https://diplomaticpulse.qcri.org/docs and in the docs directory.

Contributing

See https://diplomaticpulse.qcri.org/docs/contributing.html for details.
