@NathanSavageKaimai (Contributor)
Overview

This pull request adds OpenTelemetry support to the BasicCrawler, HttpCrawler, and BrowserCrawler classes. The functionality is entirely opt-in via the enableTelemetry constructor parameter.

Spans are created through the withSpan function of the BasicCrawler class:

    protected async withSpan<T>(name: string, options: SpanOptions, fn: () => Promise<T>): Promise<T> {
        if (!this.telemetry) {
            return fn();
        }

        return this.tracer.startActiveSpan(name, options, async (span) => {
            try {
                return await fn();
            } finally {
                span.end();
            }
        });
    }

This function checks whether tracing is enabled; if it is, it wraps its callback in a span context, and if not, the callback is invoked directly.
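To illustrate this enable/disable branching in isolation, here is a self-contained sketch. The MockTracer, MockSpan, and DemoCrawler names below are illustrative stand-ins, not crawlee or @opentelemetry/api types; the real implementation uses the Tracer from @opentelemetry/api.

```typescript
// Minimal stand-in for the OTel tracer surface that withSpan relies on.
interface MockSpan { name: string; ended: boolean }

class MockTracer {
    spans: MockSpan[] = [];

    async startActiveSpan<T>(name: string, _options: unknown, fn: (span: { end(): void }) => Promise<T>): Promise<T> {
        const record: MockSpan = { name, ended: false };
        this.spans.push(record);
        return fn({ end: () => { record.ended = true; } });
    }
}

class DemoCrawler {
    constructor(private telemetry: boolean, readonly tracer = new MockTracer()) {}

    // Same shape as BasicCrawler.withSpan: bypass tracing entirely when disabled.
    protected async withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
        if (!this.telemetry) return fn(); // tracing disabled: call straight through
        return this.tracer.startActiveSpan(name, {}, async (span) => {
            try {
                return await fn();
            } finally {
                span.end(); // the span always ends, even if fn throws
            }
        });
    }

    async run(): Promise<string> {
        return this.withSpan('crawler.run', async () => 'done');
    }
}

async function demo(): Promise<void> {
    const enabled = new DemoCrawler(true);
    console.log(await enabled.run(), enabled.tracer.spans.length); // → done 1

    const disabled = new DemoCrawler(false);
    console.log(await disabled.run(), disabled.tracer.spans.length); // → done 0
}

void demo();
```

The try/finally is what guarantees spans are closed even when the request handler throws, which matters for the error-handling paths further down.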

Another new function in BasicCrawler is wrapLogWithTracing:

    private wrapLogWithTracing(LogCtor: typeof Log): void {
        const WRAPPED = Symbol.for('otel.log.internal.patched');

        const proto = LogCtor.prototype as any;

        if (proto[WRAPPED]) {
            return;
        }

        const originalInternal = proto.internal;

        proto.internal = function (this: Log, level: LogLevel, message: string, data?: any, exception?: any): void {
            if (level <= this.getLevel()) {
                const span = trace.getSpan(context.active());
                if (span && span.isRecording()) {
                    if (exception) {
                        span.recordException(exception);
                    } else {
                        span.addEvent(message, {
                            'crawlee.log.level': level,
                            'crawlee.log.data': toOtelAttributeValue(data),
                        });
                    }
                }
            }

            return originalInternal.call(this, level, message, data, exception);
        };

        Object.defineProperty(proto, WRAPPED, {
            value: true,
            enumerable: false,
        });
    }

This function is called in the BasicCrawler constructor and overrides the logger's internal function so that it also emits span events. It works for the base Apify logger as well as any custom logger, so long as the logger still routes messages through internal in the same manner. The wrapper is enabled via the new collectLogs constructor parameter.

Fixes

issue: #2955

Example Usage

The following snippet requires crawlee, @opentelemetry/api, and the @opentelemetry SDK packages imported below (@opentelemetry/sdk-trace-node, @opentelemetry/sdk-trace-base, @opentelemetry/resources, @opentelemetry/exporter-trace-otlp-http, and @opentelemetry/semantic-conventions).

import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { resourceFromAttributes } from '@opentelemetry/resources';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
import { context, diag, DiagConsoleLogger, DiagLogLevel, trace } from '@opentelemetry/api';
import { randomInt } from 'node:crypto';
import { PlaywrightCrawler } from 'crawlee';

// URLs to crawl for the demo
const DEMO_URLS = ['https://crawlee.dev/', 'https://crawlee.dev/docs/introduction'];

async function main() {
   console.log('🚀 Starting Crawlee Telemetry Demo\n');
   diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

   // Start telemetry with Jaeger OTLP endpoint
   console.log('📡 Initializing OpenTelemetry...');

   const resource = resourceFromAttributes({
       [ATTR_SERVICE_NAME]: 'crawlee',
       [ATTR_SERVICE_VERSION]: 'dev',
   });

   // Create the tracer provider
   const otlpExporter = new OTLPTraceExporter({
       url: 'http://localhost:4318/v1/traces',
       headers: {},
   });

   const provider = new NodeTracerProvider({
       resource,
       spanProcessors: [new BatchSpanProcessor(otlpExporter)],
   });

   // Register the provider
   provider.register();

   console.log('✅ Telemetry initialized\n');

   // Create a Playwright crawler
   const crawler = new PlaywrightCrawler({
       // Limit for demo purposes
       maxRequestsPerCrawl: 50,
       maxConcurrency: 2,

       async requestHandler({ request, page, enqueueLinks, log }) {
           const span = trace.getSpan(context.active());

           const title = await page.title();
           const headings = (await page.$$('h1, h2')).length;
           const links = (await page.$$('a')).length;

           if (span) {
               span.setAttribute('page.title', title);
               span.setAttribute('page.headings_count', headings);
               span.setAttribute('page.links_count', links);
           }

           log.info(`📄 ${title}`, {
               url: request.url,
               headings,
               links,
           });

           if (randomInt(0, 100) < 50) {
               throw new Error('Random error');
           }

           // Enqueue more links (limited by maxRequestsPerCrawl)
           await enqueueLinks({
               globs: ['https://crawlee.dev/**'],
           });
       },
       preNavigationHooks: [
           async ({ log }) => {
               log.info('🔍 Pre-navigation hook');
           },
       ],
       postNavigationHooks: [
           async ({ log }) => {
               log.info('🔍 Post-navigation hook');
           },
       ],

       errorHandler({ request, log }, error) {
           log.error(`❌ Request failed and will be retried: ${request.url}`, { error: error.message });
       },

       failedRequestHandler({ request, log }, error) {
           log.error(`❌ Request failed and reached maximum retries: ${request.url}`, { error: error.message });
       },

       enableTelemetry: true,
       collectLogs: true,
   });

   console.log('🕷️  Starting crawler...\n');
   console.log('─'.repeat(60));

   // Run the crawler
   const stats = await crawler.run(DEMO_URLS);

   console.log('─'.repeat(60));
   console.log('\n📊 Crawl Statistics:');
   console.log(`   ✅ Requests finished: ${stats.requestsFinished}`);
   console.log(`   ❌ Requests failed: ${stats.requestsFailed}`);
   console.log(`   ⏱️  Total time: ${(stats.crawlerRuntimeMillis / 1000).toFixed(2)}s`);
   console.log(`   📈 Requests/minute: ${stats.requestsFinishedPerMinute}`);

   // Shutdown telemetry to flush all spans
   console.log('\n📤 Flushing telemetry data...');
   await provider.shutdown();

   console.log('✅ Done!\n');
   console.log('🔍 View traces in Jaeger UI: http://localhost:16686');
   console.log('   Select service "crawlee" to see the traces.\n');
}

main().catch((error) => {
   console.error('Fatal error:', error);
   process.exit(1);
});

This crawler produced the following OTel traces:

otel-traces.json

To view these easily, use an OpenTelemetry trace viewer such as Jaeger and upload the traces; Jaeger is available as a Docker image.
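For reference, a typical way to run Jaeger locally, matching the OTLP endpoint and UI URL used in the snippet above (command based on Jaeger's getting-started docs; adjust the image tag as needed):

```shell
# Jaeger all-in-one: OTLP HTTP receiver on :4318, UI on :16686.
# COLLECTOR_OTLP_ENABLED is the default in recent versions but is
# kept explicit here for older tags.
docker run --rm \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
```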

[Screenshot: Jaeger UI showing the crawler's traces]

As the screenshot above shows, OpenTelemetry can provide far more data and context about what an application is doing than can be gathered from logs alone.

Also, once this is merged, we can apply to be listed as an adopter of native OpenTelemetry, alongside Next.js and SvelteKit, on the OpenTelemetry website!

@NathanSavageKaimai NathanSavageKaimai changed the title Opentelemetry tracing for Basic Crawler, Http Crawler and Browser Crawler Dec 30, 2025
@NathanSavageKaimai (Contributor, Author)

One note: the automatic log collection in the current implementation doesn't send the logs as "logs" to OpenTelemetry; they are sent as span events. These are two separate but related primitives in OpenTelemetry, and the difference largely traces back to OpenTelemetry's predecessor, OpenTracing.

The reason I have done this is that the OpenTelemetry API package for logs is still considered pre-release; see the npm page. Having said that, it is widely used (16 million downloads a week), including by the CNCF's own auto-instrumentation packages such as winston-instrumentation. It does appear to be actively worked on, though: there is a milestone for its GA release.

I'd be interested to hear your thoughts on how to approach this. Both approaches would capture logs. Sending them as proper logs would be better for observability, since they would be independently searchable and not strictly tied to spans, but it would pull in the alpha package. Alternatively, since there's an overarching span for run anyway, we can keep them as span events and move to the Logs API once it is stable.
