PubMed XML Processor

A Python script that processes journal article XML files from Open Journal Systems (OJS) and transforms them into PubMed-compliant XML format for submission to the PubMed database.

Features

Batch Processing: Processes multiple XML files from an input directory
Title Translation: Retrieves English titles from the OJS API by looking up each article's Dutch vernacular title
Abstract Enhancement: Fetches English abstracts and reorganizes multilingual abstracts
Keyword Extraction: Extracts DC.Subject metadata as keywords
XML Restructuring: Reorganizes XML elements to comply with PubMed DTD requirements
Language Standardization: Converts language codes (e.g., 'dut' to 'NL')
Metadata Enrichment: Adds publication types, article IDs, and other required elements

Requirements

Dependencies

pip install beautifulsoup4 requests pandas python-dotenv lxml

Environment Setup

Create a .env file in the root directory with the following variables:

api_key=your_ojs_api_key
journal_title=your_journal_name
journal_abbreviation=Journal_Abbreviation
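
As a rough illustration of how these three variables might be read and validated, here is a minimal sketch. The helper name load_settings and the REQUIRED_KEYS tuple are hypothetical and not taken from main.py; the only library call used is load_dotenv from python-dotenv, with a fallback to plain environment variables if the package is not installed.

```python
import os

try:
    # python-dotenv is listed in the dependencies above
    from dotenv import load_dotenv
except ImportError:  # fall back to variables already set in the environment
    load_dotenv = None

# Variable names as documented in this README
REQUIRED_KEYS = ("api_key", "journal_title", "journal_abbreviation")


def load_settings(env=None):
    """Return the three settings as a dict, failing early if one is missing.

    Hypothetical helper: `env` can be any mapping, which makes it easy to test.
    """
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # reads .env from the working directory if present
        env = os.environ
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing settings in .env: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Failing early with the names of the missing variables tends to be easier to debug than an API error later in the run.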

Directory Structure

project/
├── input/              # Place XML files to be processed here
├── output/             # Output directory (created automatically)
├── articleset.xml      # Final combined output file
├── json.txt           # API response cache file
├── .env               # Environment variables
├── main.py            # Main script
└── README.md          # This file

Usage

Basic Usage

Place your XML files in the input/ directory
Run the script:

python main.py

The processed XML will be saved as articleset.xml

Advanced Usage

You can also use individual functions programmatically:

from main import rewrite_xml, read_xml_file

# Process a single XML file
xml_content = read_xml_file('path/to/file.xml')
processed_xml = rewrite_xml(xml_content, 'journal_name', 'api_key')

Key Functions

Core Processing Functions

rewrite_xml(): Main function that orchestrates the entire transformation process
process_all_xml_files(): Batch processes multiple XML files and combines them
reorganize_article_xml(): Reorders XML elements according to PubMed DTD requirements

Data Retrieval Functions

retrieve_json_info(): Fetches article information from the OJS API
get_english_abstract(): Extracts English abstracts from article pages
get_dc_subjects(): Retrieves keyword metadata from article pages
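
To make the keyword step more concrete, the sketch below pulls DC.Subject values out of an article page's meta tags. It is not the script's own get_dc_subjects() implementation: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the names DCSubjectParser and extract_dc_subjects are hypothetical. It assumes the page exposes keywords as meta elements with name="DC.Subject".

```python
from html.parser import HTMLParser


class DCSubjectParser(HTMLParser):
    """Collects the content of every <meta name="DC.Subject"> tag."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = (attributes.get("content") or "").strip()
        if name.lower() == "dc.subject" and content:
            self.subjects.append(content)


def extract_dc_subjects(html_text):
    """Return the DC.Subject keywords in document order."""
    parser = DCSubjectParser()
    parser.feed(html_text)
    return parser.subjects
```

A page with two DC.Subject meta tags yields a two-item list; pages without such tags yield an empty list, which corresponds to the "missing metadata" case.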

XML Manipulation Functions

add_article_title(): Adds English article titles
replace_vernacular_title(): Updates vernacular titles with subtitles
refurbish_abstracts(): Reorganizes abstracts by language
insert_keywords_after_abstract(): Adds keyword metadata
add_article_id_list(): Adds DOI information
add_publication_type(): Adds publication type metadata
replace_language_tag(): Standardizes language codes
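
The language step can be pictured with the following sketch, which rewrites the text of Language elements using a lookup table. Only the 'dut' to 'NL' pair comes from this README; the 'eng' entry, the LANGUAGE_MAP name, and the standardize_language function are illustrative assumptions, and the real replace_language_tag() may work differently.

```python
import xml.etree.ElementTree as ET

# 'dut' -> 'NL' is the example given in this README; 'eng' -> 'EN' is an
# assumed extra entry for illustration only.
LANGUAGE_MAP = {"dut": "NL", "eng": "EN"}


def standardize_language(article_xml, mapping=LANGUAGE_MAP):
    """Replace known language codes inside <Language> elements."""
    root = ET.fromstring(article_xml)
    for lang in root.iter("Language"):
        code = (lang.text or "").strip().lower()
        if code in mapping:
            lang.text = mapping[code]
        # Unknown codes are left untouched so nothing is silently lost
    return ET.tostring(root, encoding="unicode")
```

Leaving unknown codes unchanged is a deliberate choice in this sketch: an unexpected value then shows up during validation rather than being replaced by a guess.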

XML Transformation Process

The script performs the following transformations:

Title Processing: Retrieves English titles from the OJS API
Abstract Enhancement: Fetches English abstracts and creates multilingual structure
Metadata Addition: Adds keywords, publication types, and article IDs
Language Standardization: Updates language codes for PubMed compatibility
Structure Reorganization: Reorders elements according to PubMed DTD
DOCTYPE Addition: Adds proper DOCTYPE declaration for PubMed submission

Output Format

The script generates a combined articleset.xml file containing all processed articles with:

PubMed-compliant DOCTYPE declaration
Properly ordered XML elements
English and vernacular titles
Multilingual abstracts
Keyword metadata
Publication type information
DOI identifiers
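
How several processed articles could end up in one articleset.xml can be sketched as follows. This is not the code of process_all_xml_files(); combine_articles is a hypothetical name, and the DOCTYPE string is the PubMed 2.8 identifier as best known here, so it should be checked against the current PubMed documentation before submission.

```python
import xml.etree.ElementTree as ET

# Assumed DOCTYPE for the PubMed DTD; verify against PubMed's current docs.
PUBMED_DOCTYPE = (
    '<!DOCTYPE ArticleSet PUBLIC "-//NLM//DTD PubMed 2.8//EN" '
    '"https://dtd.nlm.nih.gov/ncbi/pubmed/in/PubMed.dtd">'
)


def combine_articles(article_xml_strings):
    """Wrap all <Article> elements in a single <ArticleSet> document."""
    article_set = ET.Element("ArticleSet")
    for xml_text in article_xml_strings:
        root = ET.fromstring(xml_text)
        # Accept a bare <Article> or a file that already wraps its articles
        articles = [root] if root.tag == "Article" else root.findall("Article")
        article_set.extend(articles)
    body = ET.tostring(article_set, encoding="unicode")
    # ElementTree does not write DOCTYPE declarations, so prepend it by hand
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + PUBMED_DOCTYPE + "\n" + body
```

The returned string can be written to articleset.xml with UTF-8 encoding.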

Error Handling

The script includes error handling for:

Network connectivity issues
API response errors
XML parsing errors
Missing metadata
File I/O operations

Troubleshooting

Common Issues

API Connection Errors: Verify your API key and journal title in .env
XML Parsing Errors: Check that input XML files are well-formed
Missing Abstracts: Some articles may not have English abstracts available
Network Timeouts: Retry processing if network issues occur
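
For the network timeout case, retrying can be automated instead of rerunning the whole script. The helper below is a generic sketch and is not part of main.py; with_retries and its parameters are hypothetical names. It retries on OSError by default, which also covers the exceptions raised by requests, since requests.exceptions.RequestException derives from IOError.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))
```

A call could look like with_retries(lambda: requests.get(url, timeout=10)), where url stands for whatever OJS endpoint is being queried.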

Debugging

The script outputs detailed information during processing:

API responses are cached in json.txt
Progress messages are printed to console
Error messages indicate specific issues
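
When inspecting the cache during debugging, it can help to read it back in Python. The sketch below assumes that json.txt holds a single JSON document, which this README does not state explicitly; save_cache and load_cache are hypothetical helpers, not functions from main.py.

```python
import json
from pathlib import Path


def save_cache(data, path="json.txt"):
    """Write an API response to the cache file as pretty-printed JSON."""
    text = json.dumps(data, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def load_cache(path="json.txt"):
    """Return the cached response, or None if no cache file exists yet."""
    cache_file = Path(path)
    if not cache_file.exists():
        return None
    return json.loads(cache_file.read_text(encoding="utf-8"))
```

Using ensure_ascii=False keeps Dutch titles with accented characters readable in the file.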

Configuration

Journal Settings

Update the .env file with your specific journal information:

journal_title: The OJS journal identifier
journal_abbreviation: PubMed-approved journal abbreviation

Processing Options

Modify the element_order list in reorganize_article_xml() to change element ordering if needed.
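
The idea behind an ordering list can be sketched as below. The ELEMENT_ORDER values shown are an illustrative subset chosen for this example, not the list used in reorganize_article_xml(), and reorder_children is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; the script's own element_order list may differ.
ELEMENT_ORDER = [
    "Journal", "ArticleTitle", "VernacularTitle", "Language", "AuthorList",
    "PublicationType", "ArticleIdList", "Abstract", "OtherAbstract", "ObjectList",
]


def reorder_children(article, element_order=ELEMENT_ORDER):
    """Sort the direct children of <Article> by their position in element_order."""
    rank = {tag: index for index, tag in enumerate(element_order)}
    children = list(article)
    # list.sort() is stable: tags not in the list keep their relative order
    # and are moved to the end
    children.sort(key=lambda child: rank.get(child.tag, len(rank)))
    article[:] = children
    return article
```

Because the sort is stable, repeated elements such as multiple Language entries stay in their original relative order.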

Support

For issues or questions:

Check the troubleshooting section above
Review the error messages in the console output
Verify your .env configuration
Ensure input XML files are properly formatted
