I NEED HELP WITH A WEB SCRAPING PROJECT #157430
Replies: 2 comments 3 replies
-
- **Selenium/Playwright:** For JavaScript-heavy websites. These tools mimic browser behavior and handle dynamic content.
- **Beautiful Soup/Scrapy:** For static HTML parsing. Use regex to match URLs with patterns like YYYY-MM or keywords (e.g., fund_name_2023_10.pdf); a sketch follows this list.
- **Headless Browsers:** Puppeteer (Node.js) or Headless Chrome to render pages and extract links programmatically.
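A minimal sketch of that regex matching, assuming the month and year are embedded in the PDF filenames (the example URLs and patterns are made up and will differ per site):

```python
import re

# Hypothetical candidate hrefs scraped from a fund page
candidate_urls = [
    "/docs/fund_alpha_2023_10.pdf",
    "/docs/fund_alpha_2023_09.pdf",
    "/docs/annual_report_2022.pdf",
]

# Match "2023_10" / "2023-10" style dates or an explicit "october 2023" keyword
pattern = re.compile(r"2023[-_]?10|october[-_ ]?2023", re.IGNORECASE)

matches = [url for url in candidate_urls
           if url.lower().endswith(".pdf") and pattern.search(url)]
print(matches)  # ['/docs/fund_alpha_2023_10.pdf']
```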
- Use spaCy or BERT to analyze page text and identify PDF links semantically (e.g., "Technical Sheet for October 2023"). Example workflow (sketched after this list): extract all text and links from the page, use NLP to rank links by similarity to phrases like "October 2023 technical sheet", and select the highest-confidence link.
- **Vision-based Tools:** Diffbot or Browse.ai, AI-powered scrapers that interpret page layouts visually, even if HTML structures vary.
- **Custom LLM Prompts:** Use GPT-4 or Claude 2 to parse HTML and infer PDF links. Example prompt: "Analyze this HTML content and return the URL of the PDF technical sheet for October 2023."
- Verify downloaded files are valid PDFs using libraries like PyPDF2 or pdfplumber. Check file size (e.g., reject files under 10 KB, which might be error pages).
- Log failures (e.g., "No PDF found for [Fund Name] - [Month/Year]") for manual review.
- **Fallback Workflow:** If no PDF is found, flag the row in Excel and send an email alert using Python's smtplib.
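A minimal sketch of that ranking workflow with spaCy, assuming the en_core_web_md model is installed and the link texts have already been scraped (the example links are made up):

```python
import spacy

# Requires the medium English model for usable word vectors:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

target = nlp("October 2023 technical sheet")

# Hypothetical (link text, URL) pairs extracted from a fund page
links = [
    ("Technical Sheet October 2023", "/docs/ts_2023_10.pdf"),
    ("Technical Sheet September 2023", "/docs/ts_2023_09.pdf"),
    ("Annual Report 2022", "/docs/annual_2022.pdf"),
]

# Rank links by semantic similarity to the target phrase
ranked = sorted(links, key=lambda item: target.similarity(nlp(item[0])), reverse=True)
best_text, best_url = ranked[0]
print(best_text, best_url)
```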
- Deploy on AWS Lambda or Google Cloud Functions to run monthly, or use Apify for pre-built scraping workflows.
- **Excel Integration:** Use Python pandas or OpenPyXL to read the Excel data and iterate through the rows; a sketch follows this list.
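A minimal sketch of that Excel iteration with pandas, assuming the column names from the question and a funds_info.xlsx file in the working directory:

```python
import pandas as pd

# Read the fund list; openpyxl is used as the engine for .xlsx files
df = pd.read_excel("funds_info.xlsx", engine="openpyxl")

for _, row in df.iterrows():
    admin = row["ADMIN"]
    fund_name = row["FUND NAME"]
    link = row["TECHNICAL SHEETS LINK"]
    print(f"Would scrape {link} for {fund_name} ({admin})")
```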
- If websites offer APIs (e.g., REST endpoints for documents), use Postman or Python Requests to fetch data directly; a sketch follows this list.
- **RPA Tools:** UiPath or Automation Anywhere for repetitive, no-code workflows across sites.
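Where such an API exists, a direct fetch is usually simpler than scraping. A minimal sketch with Python Requests, using a hypothetical endpoint and response shape:

```python
import requests

# Hypothetical endpoint; real APIs will differ per administrator
api_url = "https://example-fund-admin.com/api/documents"
params = {"type": "technical_sheet", "year": 2023, "month": 10}

response = requests.get(api_url, params=params, timeout=30)
response.raise_for_status()

# Assume the API returns JSON containing a direct link to the PDF
pdf_url = response.json()["pdf_url"]
pdf = requests.get(pdf_url, timeout=30)
with open("technical_sheet_2023_10.pdf", "wb") as f:
    f.write(pdf.content)
```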
-
To streamline your process of downloading PDFs for investment fund technical sheets, here are some potential improvements to address the challenges you're facing:
1. Enhanced Web Scraping with Robust Parsing
Since each website is different, you can use a more flexible and dynamic approach for scraping and locating the correct PDF links. Here's how you can enhance your scraping strategy:
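One way to keep the link discovery flexible is to collect every anchor on the page and score it against month, year, and fund-name keywords instead of relying on a fixed selector. A minimal sketch (the URL and keyword list are placeholders):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "https://example-fund-admin.com/funds/alpha"   # hypothetical
keywords = ["2023", "october", "10", "technical sheet", "alpha"]

html = requests.get(page_url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

best_link, best_score = None, 0
for a in soup.find_all("a", href=True):
    if not a["href"].lower().endswith(".pdf"):
        continue
    # Score each candidate by how many keywords appear in its text or URL
    haystack = (a.get_text(" ", strip=True) + " " + a["href"]).lower()
    score = sum(1 for kw in keywords if kw in haystack)
    if score > best_score:
        best_link, best_score = urljoin(page_url, a["href"]), score

print(best_link, best_score)
```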
2. Handling Failures & Error Detection
When scraping fails or when the correct link is not found, it's important to build a system that can identify failure points. Consider implementing the following:
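For example, each downloaded file can be checked before it is trusted. A minimal sketch using PyPDF2, a size threshold, and the logging module (the threshold, file names, and log message format are assumptions):

```python
import logging
import os
from PyPDF2 import PdfReader

logging.basicConfig(filename="scrape_failures.log", level=logging.INFO)

def is_valid_pdf(path, min_bytes=10_000):
    """Return True if the file parses as a PDF and is not suspiciously small."""
    if not os.path.exists(path) or os.path.getsize(path) < min_bytes:
        return False
    try:
        PdfReader(path)  # raises if the file is not a readable PDF
        return True
    except Exception:
        return False

if not is_valid_pdf("Fund Alpha_Admin X_technical_sheet.pdf"):
    logging.info("No PDF found for Fund Alpha - 10/2023")
```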
3. Automation & Workflow Optimization
4. Integrating AI for Enhanced Search and Verification
5. Using Structured Data for Consistency
Example Python Script for Automation
Here's an example of a Python script you can use to download PDFs from the links in your Excel sheet:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Load the data from Excel
df = pd.read_excel('funds_info.xlsx')

# Function to download PDFs
def download_pdf(pdf_url, filename):
    response = requests.get(pdf_url)
    if response.status_code == 200:
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded {filename}")
    else:
        print(f"Failed to download {filename}")

# Iterate over each fund and download the PDF
for index, row in df.iterrows():
    admin = row['ADMIN']
    fund_name = row['FUND NAME']
    link = row['TECHNICAL SHEETS LINK']

    # Request the page and parse it
    response = requests.get(link)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Look for the PDF link (adjust according to the structure of the website)
        pdf_link = soup.find('a', string='Technical Sheet')  # This is just an example
        if pdf_link:
            # Resolve relative hrefs against the page URL before downloading
            pdf_url = urljoin(link, pdf_link['href'])
            filename = f"{fund_name}_{admin}_technical_sheet.pdf"
            download_pdf(pdf_url, filename)
        else:
            print(f"PDF not found for {fund_name} - {admin}")
    else:
        print(f"Failed to access {fund_name} page")
```

6. Final Thoughts
Happy Learning 😎✌️😊
-
Body
I'm currently working on a project where I need to download a large number of PDFs from different websites. These PDFs are technical sheets for various investment funds.
The PDFs are uploaded monthly, so I need a way to download the file for a specific month and year.
I have an Excel sheet that contains information for each fund, including the administrator, fund name, and the link to the page where the technical sheets are published, formatted like this:
ADMIN, FUND NAME, TECHNICAL SHEETS LINK
Using this data, I just need to visit each link and download the corresponding PDF for the required month.
The main challenge is that each website is different, so there’s no single solution that works for all of them.
My current approach is to fetch the HTML content of each page and search for the correct PDF link using the following keywords:
- Year
- Month
- Fund name
- Phrases like "Technical sheet" or variations of it
However, I'm running into a lot of issues with this method. It doesn't work consistently across all websites, and when the correct PDF isn't found, there's no reliable way to detect that it failed.
If you know of any ideas or tools—especially AI-based ones—that could help make this process more efficient, I'd really appreciate your input.