A Python script that processes journal article XML files from Open Journal Systems (OJS) and transforms them into PubMed-compliant XML format for submission to the PubMed database.
- Batch Processing: Processes multiple XML files from an input directory
- Title Translation: Retrieves English titles from OJS API using Dutch vernacular titles
- Abstract Enhancement: Fetches English abstracts and reorganizes multilingual abstracts
- Keyword Extraction: Extracts DC.Subject metadata as keywords
- XML Restructuring: Reorganizes XML elements to comply with PubMed DTD requirements
- Language Standardization: Converts language codes (e.g., 'dut' to 'NL')
- Metadata Enrichment: Adds publication types, article IDs, and other required elements
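The language standardization step can be sketched as follows. This is a minimal illustration, not the script's actual `replace_language_tag()` implementation: the function name `standardize_language`, the `LANGUAGE_MAP` table, and the `eng` entry are assumptions; only the `dut` to `NL` mapping comes from the feature list above.

```python
import xml.etree.ElementTree as ET

# 'dut' -> 'NL' is taken from the feature list; 'eng' -> 'EN' is an
# assumed extra entry added only for illustration.
LANGUAGE_MAP = {"dut": "NL", "eng": "EN"}


def standardize_language(article_xml):
    """Return the XML with every <Language> value mapped to its short code."""
    root = ET.fromstring(article_xml)
    for lang in root.iter("Language"):
        code = (lang.text or "").strip().lower()
        # Unknown codes are left untouched rather than guessed.
        if code in LANGUAGE_MAP:
            lang.text = LANGUAGE_MAP[code]
    return ET.tostring(root, encoding="unicode")


sample = "<Article><Language>dut</Language></Article>"
print(standardize_language(sample))
```

Leaving unknown codes unchanged makes it easy to spot languages that still need a mapping entry.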
pip install beautifulsoup4 requests pandas python-dotenv lxml
Create a .env file in the root directory with the following variables:
api_key=your_ojs_api_key
journal_title=your_journal_name
journal_abbreviation=Journal_Abbreviation
project/
├── input/ # Place XML files to be processed here
├── output/ # Output directory (created automatically)
├── articleset.xml # Final combined output file
├── json.txt # API response cache file
├── .env # Environment variables
├── main.py # Main script
└── README.md # This file
- Place your XML files in the input/ directory
- Run the script: python main.py
- The processed XML will be saved as articleset.xml
You can also use individual functions programmatically:
from main import rewrite_xml, read_xml_file
# Process a single XML file
xml_content = read_xml_file('path/to/file.xml')
processed_xml = rewrite_xml(xml_content, 'journal_name', 'api_key')

- rewrite_xml(): Main function that orchestrates the entire transformation process
- process_all_xml_files(): Batch processes multiple XML files and combines them
- reorganize_article_xml(): Reorders XML elements according to PubMed DTD requirements

- retrieve_json_info(): Fetches article information from OJS API
- get_english_abstract(): Extracts English abstracts from article pages
- get_dc_subjects(): Retrieves keyword metadata from article pages

- add_article_title(): Adds English article titles
- replace_vernacular_title(): Updates vernacular titles with subtitles
- refurbish_abstracts(): Reorganizes abstracts by language
- insert_keywords_after_abstract(): Adds keyword metadata
- add_article_id_list(): Adds DOI information
- add_publication_type(): Adds publication type metadata
- replace_language_tag(): Standardizes language codes
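To illustrate the kind of work `get_dc_subjects()` does, here is a rough sketch of pulling `DC.Subject` meta tags out of an article page. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; the class `DCSubjectParser`, the function `extract_dc_subjects`, and the sample page are hypothetical and not part of main.py.

```python
from html.parser import HTMLParser


class DCSubjectParser(HTMLParser):
    """Collect the content attribute of every <meta name="DC.Subject"> tag."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Compare the name case-insensitively and skip empty content values.
        if (attr.get("name") or "").lower() == "dc.subject" and attr.get("content"):
            self.subjects.append(attr["content"].strip())


def extract_dc_subjects(html):
    """Return the DC.Subject keywords found in an HTML page, in page order."""
    parser = DCSubjectParser()
    parser.feed(html)
    return parser.subjects


# Made-up sample page for demonstration.
page = """<html><head>
<meta name="DC.Subject" xml:lang="nl" content="huisartsgeneeskunde"/>
<meta name="DC.Subject" xml:lang="nl" content="diabetes"/>
<meta name="DC.Title" content="Voorbeeld"/>
</head></html>"""
print(extract_dc_subjects(page))
```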
The script performs the following transformations:
- Title Processing: Retrieves English titles from OJS API
- Abstract Enhancement: Fetches English abstracts and creates multilingual structure
- Metadata Addition: Adds keywords, publication types, and article IDs
- Language Standardization: Updates language codes for PubMed compatibility
- Structure Reorganization: Reorders elements according to PubMed DTD
- DOCTYPE Addition: Adds proper DOCTYPE declaration for PubMed submission
The script generates a combined articleset.xml file containing all processed articles with:
- PubMed-compliant DOCTYPE declaration
- Properly ordered XML elements
- English and vernacular titles
- Multilingual abstracts
- Keyword metadata
- Publication type information
- DOI identifiers
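A combined file of this shape could be assembled roughly as shown here. This is a sketch under assumptions: `build_article_set` is a hypothetical helper, and the DOCTYPE string is the one commonly shown for PubMed submissions, so it should be checked against the current NLM documentation before use.

```python
import xml.etree.ElementTree as ET

# Assumed DOCTYPE declaration; verify the identifier and URL with NLM.
PUBMED_DOCTYPE = (
    '<!DOCTYPE ArticleSet PUBLIC "-//NLM//DTD PubMed 2.8//EN" '
    '"https://dtd.nlm.nih.gov/ncbi/pubmed/in/PubMed.dtd">'
)


def build_article_set(articles):
    """Wrap <Article> XML strings in a single <ArticleSet> document."""
    root = ET.Element("ArticleSet")
    for xml_text in articles:
        root.append(ET.fromstring(xml_text))
    body = ET.tostring(root, encoding="unicode")
    # ElementTree does not emit a DOCTYPE, so it is prepended by hand.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + PUBMED_DOCTYPE + "\n" + body


combined = build_article_set([
    "<Article><ArticleTitle>First</ArticleTitle></Article>",
    "<Article><ArticleTitle>Second</ArticleTitle></Article>",
])
print(combined)
```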
The script includes error handling for:
- Network connectivity issues
- API response errors
- XML parsing errors
- Missing metadata
- File I/O operations
- API Connection Errors: Verify your API key and journal title in .env
- XML Parsing Errors: Check that input XML files are well-formed
- Missing Abstracts: Some articles may not have English abstracts available
- Network Timeouts: Retry processing if network issues occur
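For transient network failures, retrying can be automated instead of rerunning the whole script. The helper below is a hypothetical sketch, not something main.py is known to contain; it retries any callable with exponential backoff. It catches `OSError` by default, which covers Python's built-in `ConnectionError` and `TimeoutError` and, as far as I know, the exceptions raised by `requests` as well.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, retry_on=(OSError,)):
    """Call fetch() until it succeeds or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            # Re-raise the last failure so the caller sees the real error.
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x ... base_delay seconds between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `with_retries(lambda: retrieve_json_info(...))` would wrap one API request; the arguments are left out here because the function's signature is not documented in this README.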
The script outputs detailed information during processing:
- API responses are cached in json.txt
- Progress messages are printed to console
- Error messages indicate specific issues
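The exact layout of json.txt is not described here, so the following is only one possible way to cache API responses in a file: a JSON object keyed by a lookup string. The functions `load_cache` and `cached_lookup`, the key `article-1`, and the stand-in `fake_fetch` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_cache(path):
    """Read the cache file, returning an empty dict if it is missing or invalid."""
    p = Path(path)
    if not p.exists():
        return {}
    try:
        data = json.loads(p.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def cached_lookup(key, fetch, path="json.txt"):
    """Return the cached value for key, calling fetch(key) only on a miss."""
    cache = load_cache(path)
    if key not in cache:
        cache[key] = fetch(key)
        Path(path).write_text(
            json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8"
        )
    return cache[key]


# Demo with a temporary file and a stand-in for the real API call.
calls = []


def fake_fetch(key):
    calls.append(key)
    return {"title": "Example title for " + key}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "json.txt"
    first = cached_lookup("article-1", fake_fetch, cache_file)
    second = cached_lookup("article-1", fake_fetch, cache_file)

print(first == second, len(calls))
```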
Update the .env file with your specific journal information:
- journal_title: The OJS journal identifier
- journal_abbreviation: PubMed-approved journal abbreviation
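A quick check that the required settings are present can catch configuration mistakes before any API call is made. The sketch below assumes the values have already been loaded into the environment (for example by python-dotenv's `load_dotenv()`); `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names, while the three keys are the ones listed for the .env file.

```python
import os

# The three variables expected in the .env file.
REQUIRED_SETTINGS = ("api_key", "journal_title", "journal_abbreviation")


def missing_settings(env=None):
    """Return the required setting names that are absent or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not (env.get(name) or "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_settings({"api_key": "secret", "journal_title": " "}))
```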
Modify the element_order list in reorganize_article_xml() to change element ordering if needed.
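How an order list can drive the reordering is sketched below. This is not the code in `reorganize_article_xml()`: the helper `reorder_children` is hypothetical, and `ELEMENT_ORDER` is an illustrative subset, so the authoritative sequence should be taken from the PubMed DTD and the script itself.

```python
import xml.etree.ElementTree as ET

# Illustrative order only; the real list lives in reorganize_article_xml().
ELEMENT_ORDER = [
    "Journal", "ArticleTitle", "VernacularTitle", "FirstPage", "LastPage",
    "Language", "AuthorList", "PublicationType", "ArticleIdList",
    "Abstract", "OtherAbstract",
]


def reorder_children(article, element_order):
    """Sort the direct children of an element by their position in element_order."""
    rank = {tag: i for i, tag in enumerate(element_order)}
    children = list(article)
    # The sort is stable: repeated tags keep their relative order, and
    # tags missing from element_order move to the end in original order.
    children.sort(key=lambda child: rank.get(child.tag, len(rank)))
    article[:] = children
    return article


article = ET.fromstring(
    "<Article><Abstract/><Language>NL</Language>"
    "<ArticleTitle>T</ArticleTitle><Extra/><Journal/></Article>"
)
reorder_children(article, ELEMENT_ORDER)
print([child.tag for child in article])
```

Moving a tag within the list changes where that element lands in the output without touching the sorting logic.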
For issues or questions:
- Check the troubleshooting section above
- Review the error messages in the console output
- Verify your .env configuration
- Ensure input XML files are properly formatted