tadeasf/nest-scraping-api

NestJS Scraping API

A NestJS application for scraping news articles from popular Czech news websites. It is built with Winston-based logging, unit and end-to-end tests, and an automated CI/CD pipeline.

🚀 Features

  • News Scraping: Automated scraping from major Czech news websites
    • iDnes.cz
    • Hospodářské noviny (HN.cz)
    • Aktuálně.cz
    • Novinky.cz
    • Blesk.cz
  • Enterprise Logging: Winston-based logging with file and console outputs
  • Database: SQLite with TypeORM for data persistence
  • API Documentation: Scalar UI for interactive API documentation
  • Background Jobs: Scheduled scraping every hour using NestJS Schedule
  • Duplicate Prevention: Content-based deduplication using SHA-256 hashing
  • Comprehensive Testing: Unit tests, e2e tests, and code coverage
  • CI/CD Pipeline: GitHub Actions with security scanning and code quality checks
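
The content-based deduplication mentioned above could look roughly like the sketch below. It is an illustration only: the `ScrapedArticle` shape, the choice of hashed fields, the whitespace normalization, and the in-memory `seen` set are all assumptions, and the actual service may instead rely on a unique hash column in the database.

```typescript
import { createHash } from "node:crypto";

// Hypothetical article shape; the real entity is defined in
// src/entities/article.entity.ts and may have different fields.
interface ScrapedArticle {
  title: string;
  url: string;
  content: string;
}

// Build a SHA-256 hex digest from the title and content.
// Assumption: whitespace is collapsed first so trivially different
// copies of the same text produce the same hash.
function contentHash(article: ScrapedArticle): string {
  const normalized = `${article.title}\n${article.content}`
    .replace(/\s+/g, " ")
    .trim();
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Keep only the first article seen for each hash.
// `seen` stands in for whatever persistent lookup the application uses.
function dedupe(
  articles: ScrapedArticle[],
  seen: Set<string> = new Set(),
): ScrapedArticle[] {
  const fresh: ScrapedArticle[] = [];
  for (const article of articles) {
    const hash = contentHash(article);
    if (seen.has(hash)) continue;
    seen.add(hash);
    fresh.push(article);
  }
  return fresh;
}
```

Hashing the content rather than the URL means the same story republished under a different address is still treated as a duplicate.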

📋 Prerequisites

  • Bun (v1.2.17 or higher)
  • Node.js (v20 or higher)

πŸ› οΈ Installation

  1. Clone the repository:

    git clone <repository-url>
    cd nest-scraping-api
  2. Install dependencies:

    bun install
  3. Create logs directory:

    mkdir -p logs

πŸƒβ€β™‚οΈ Running the Application

Development Mode

bun run start:dev

Production Mode

bun run build
bun run start:prod

Debug Mode

bun run start:debug

🧪 Testing

Run All Tests

bun run test

Run Tests with Coverage

bun run test:cov

Run Tests in Watch Mode

bun run test:watch

Run End-to-End Tests

bun run test:e2e

Run Tests for CI

bun run test:ci

πŸ” Code Quality

Linting

# Check and fix linting issues
bun run lint

# Check linting issues only (no auto-fix)
bun run lint:check

Type Checking

bun run type-check

Code Formatting

bun run format

Security Audit

bun run audit

📊 Code Coverage

The project maintains a minimum code coverage threshold of 80% for:

  • Branches
  • Functions
  • Lines
  • Statements

Current Coverage: 83.33%

Coverage reports are generated in multiple formats:

  • HTML: coverage/index.html
  • LCOV: coverage/lcov.info
  • Console output
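
For orientation, the threshold and report formats described above map onto Jest options along the following lines. This is a sketch rather than the project's actual configuration: where the settings live (for example `package.json` or a `jest.config.ts`) is an assumption, and `coverageConfig` is a name chosen here for illustration.

```typescript
// Sketch of the coverage-related Jest options.
// Assumption: the real project may keep these under "jest" in package.json.
const coverageConfig = {
  collectCoverage: true,
  // Reports are written below this directory.
  coverageDirectory: "coverage",
  // "html" writes coverage/index.html, "lcov" writes coverage/lcov.info,
  // and "text" prints the summary table to the console.
  coverageReporters: ["html", "lcov", "text"],
  // Jest fails the run if any global metric drops below its minimum.
  coverageThreshold: {
    global: {
      branches: 80,
      functions: 80,
      lines: 80,
      statements: 80,
    },
  },
};
```

With a `global` threshold, Jest checks each of the four metrics independently, so a single metric below 80% is enough to fail the run.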

Coverage Breakdown

  • Statements: 91.02%
  • Branches: 77.09%
  • Functions: 87.5%
  • Lines: 91.07%

The project uses Codecov for continuous coverage monitoring and reporting.

🔧 Configuration

Environment Variables

Create a .env file in the root directory:

# Application
PORT=3000
NODE_ENV=development

# Logging
LOG_LEVEL=info

# Database
DB_TYPE=sqlite
DB_DATABASE=db.sqlite3
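
As an illustration of how these variables might be consumed, the sketch below reads them with the defaults shown above. The `AppConfig` shape and the `loadConfig` helper are hypothetical and not taken from the repository; the sketch also assumes the variables are already present in the process environment (for example via a dotenv-style loader).

```typescript
// Hypothetical config shape mirroring the variables in the .env example.
interface AppConfig {
  port: number;
  nodeEnv: string;
  logLevel: string;
  dbType: string;
  dbDatabase: string;
}

type Env = Record<string, string | undefined>;

// Read settings from an environment map, falling back to the
// defaults from the .env example when a variable is missing.
function loadConfig(env: Env): AppConfig {
  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env["PORT"]}`);
  }
  return {
    port,
    nodeEnv: env["NODE_ENV"] ?? "development",
    logLevel: env["LOG_LEVEL"] ?? "info",
    dbType: env["DB_TYPE"] ?? "sqlite",
    dbDatabase: env["DB_DATABASE"] ?? "db.sqlite3",
  };
}
```

In the application this would be called as `loadConfig(process.env)`; passing the environment in as an argument keeps the helper easy to test.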

Logging Configuration

Logs are stored in the logs/ directory:

  • logs/combined.log: All log levels
  • logs/error.log: Error level only

📚 API Documentation

Once the application is running, the interactive API documentation (Scalar UI) is available at /reference.

The API documentation includes:

  • Interactive API explorer
  • Request/response examples
  • Authentication details
  • Schema definitions

πŸ—οΈ Project Structure

src/
β”œβ”€β”€ config/
β”‚   └── logging.config.ts      # Winston logging configuration
β”œβ”€β”€ entities/
β”‚   └── article.entity.ts      # Article database entity
β”œβ”€β”€ scraping/
β”‚   β”œβ”€β”€ scraping.module.ts     # Scraping module
β”‚   β”œβ”€β”€ scraping.service.ts    # Core scraping logic
β”‚   └── scraping.service.spec.ts # Unit tests
β”œβ”€β”€ app.controller.ts          # Main controller
β”œβ”€β”€ app.module.ts             # Root module
β”œβ”€β”€ app.service.ts            # App service
└── main.ts                   # Application entry point

test/
β”œβ”€β”€ setup.ts                  # Test environment setup
└── scraping.e2e-spec.ts      # End-to-end tests

logs/                         # Application logs
coverage/                     # Test coverage reports

🔄 CI/CD Pipeline

The project includes a comprehensive GitHub Actions workflow that runs on every push and pull request:

Jobs

  1. Test: Runs linting, type checking, and tests with coverage
  2. Security: Performs security audits and vulnerability scanning
  3. Build: Creates production build artifacts (main branch only)

Features

  • Automated testing with Jest
  • Code coverage reporting to Codecov
  • Security scanning with Snyk
  • Dependency vulnerability checks
  • Automated builds and deployments

πŸ› Troubleshooting

Common Issues

  1. Port already in use:

    # Change the port in .env file
    PORT=3001
  2. Database issues:

    # Remove existing database and restart
    rm db.sqlite3
    bun run start:dev
  3. Logging issues:

    # Ensure logs directory exists
    mkdir -p logs

Debug Mode

Run the application in debug mode to get detailed logs:

bun run start:debug

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests and linting
  5. Submit a pull request

Development Workflow

# Create feature branch
git checkout -b feature/your-feature

# Make changes and test
bun run test
bun run lint
bun run type-check

# Commit changes
git commit -m "feat: add your feature"

# Push and create PR
git push origin feature/your-feature

📄 License

This project is licensed under the MIT License.

🆘 Support

For support and questions:

  • Create an issue in the repository
  • Check the API documentation at /reference
  • Review the test files for usage examples
