Searches for a given scholar's recent publications, personal homepage, and blog posts, then summarizes them.
"""One‑shot intelligence report generator for academic researchers.
Given a researcher's full name, this script will:
- Locate their Google Scholar profile (via `scholarly`).
- Compile their top‑10 most‑cited papers plus all papers published within the last 3 calendar years.
- Attempt to fetch PDFs (preferring arXiv/open‑access links). If a clean text layer is unavailable, fall back to OCR with Tesseract.
- Summarise each paper, then derive a holistic view of the researcher’s interests, themes, and stated goals using the Gemini LLM.
- Discover the personal homepage / blog via DuckDuckGo and extract readable content + recent posts/tweets (via `snscrape`) for a glimpse of personal thoughts.
- Emit a Markdown report to STDOUT and save it as `<slugified‑name>.md`.
Usage:
    $ export GEMINI_API_KEY="sk-..."
    $ python researcher_intel.py "Ada Lovelace"
Python dependencies:
    scholarly duckduckgo_search readability-lxml newspaper3k snscrape
    git+https://github.com/google-generativeai/python
    pdfminer.six pdf2image pytesseract python-dateutil tqdm beautifulsoup4 requests
External requirements:
- Tesseract OCR: Install the system package and ensure `tesseract` is on PATH.
- Poppler (for `pdf2image`): Needed on Linux/macOS for PDF → image conversion.
Notes:
- Google Scholar scraping is fragile. For heavy use, swap out `scholarly` for SerpAPI or Publish‑or‑Perish.
- Only publicly available/CC‑licensed PDFs are downloaded to avoid copyright issues.
- Gemini usage is billed under your key. The script budgets tokens conservatively, but large corpora still cost money.
"""
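The paper-selection rule and the report filename described in the docstring could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: `select_papers`, `slugify`, and `report_filename` are hypothetical names, each paper is assumed to be a plain dict with `citations` and `year` keys, "last 3 calendar years" is assumed to mean the current year plus the two before it, and the exact slug format is a guess.

```python
import re
import unicodedata
from datetime import date


def select_papers(papers, today=None, top_n=10, recent_years=3):
    # Assumption: each paper is a dict with "citations" (int) and "year" (int or None).
    today = today or date.today()
    # Assumption: "last 3 calendar years" = the current year and the two before it.
    first_recent_year = today.year - (recent_years - 1)
    # Rank paper indices by citation count, highest first (missing counts as 0).
    ranked = sorted(
        range(len(papers)),
        key=lambda i: papers[i].get("citations") or 0,
        reverse=True,
    )
    keep = set(ranked[:top_n])
    # Add every recent paper; the set removes overlap with the top-N group.
    for i, paper in enumerate(papers):
        year = paper.get("year")
        if isinstance(year, int) and year >= first_recent_year:
            keep.add(i)
    # Return the union, ordered by citation count.
    return [papers[i] for i in ranked if i in keep]


def slugify(name):
    # Strip accents, lowercase, and collapse non-alphanumeric runs into hyphens.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    # Hypothetical fallback for names with no ASCII letters or digits.
    return slug or "researcher"


def report_filename(name):
    # Mirrors the "<slugified-name>.md" convention from the docstring.
    return slugify(name) + ".md"


if __name__ == "__main__":
    print(report_filename("Ada Lovelace"))  # ada-lovelace.md
```

Selecting by index rather than by title avoids dropping distinct papers that happen to share a title, and the union keeps a recent paper even when it has no citations yet.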