Parquet Data Ingestion in Druid Error in Timestamp parsing using Joda

Question

Context:

I can submit a MapReduce job from the Druid overlord to an EMR cluster. My data source is in S3 in Parquet format. The timestamp field has values in the format "2017-09-01 21:14:11:552 IST".

The error occurs while parsing the timestamp.

The stack trace is:

2018-01-18T19:31:52,509 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1516108443547_0022_m_000068_0, Status : FAILED
Error: io.druid.java.util.common.RE: Failure on row[{"t": "2017-09-01 21:14:11:552 IST"}]
    at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:91)
    at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:288)
    ..

Caused by: java.lang.IllegalArgumentException: Invalid format: "2017-09-01 21:14:11:552 IST" is malformed at "IST"
    at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)
    at io.druid.java.util.common.parsers.TimestampParser.lambda$createTimestampParser$4(TimestampParser.java:93)
    at io.druid.java.util.common.parsers.TimestampParser.lambda$createObjectTimestampParser$8(TimestampParser.java:129)
    . .

I have tried several format patterns but could not find one that the Joda library can parse. However, the timestamp is readable with java.text.SimpleDateFormat; see the following code:

Sample Java program to parse Date

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

String text = "2017-09-01 21:14:11:552 IST";
SimpleDateFormat sdf =  new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS zzz");
TimeZone gmt = TimeZone.getTimeZone("GMT");
sdf.setTimeZone(gmt);
sdf.setLenient(false);

try {
    Date date = sdf.parse(text);
    System.out.println(date);
    System.out.println(sdf.format(date));
} catch (Exception e) {
    e.printStackTrace();
}

Output

Fri Sep 01 21:14:11 IST 2017
2017-09-01 21:14:11:552 IST

Environment:

Druid version: 0.11
EMR version : emr-5.11.0
Hadoop version: Amazon 2.7.3

Druid input json

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://s3_path"
      }
    },
    "dataSchema": {
      "dataSource": "parquet_test1",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2017-08-01T00:00:00:000Z/2017-08-02T00:00:00:000Z"]
      },
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "t",
            "format": "yyyy-MM-dd HH:mm:ss:SSS zzz"            
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1","dim2","dim3"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      },{
          "type" : "count",
          "name" : "pid",
          "fieldName" : "pid"
        }]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {
        "mapreduce.job.user.classpath.first": "true",
        "fs.s3.awsAccessKeyId" : "KEYID",
        "fs.s3.awsSecretAccessKey" : "AccessKey",
        "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId" : "KEYID",
        "fs.s3n.awsSecretAccessKey" : "AccessKey",
        "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      },
      "leaveIntermediate": true
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3", "com.hadoop.gplcompression:hadoop-lzo:0.4.20"]
}

Possible solutions

 1. How can "2017-09-01 21:14:11:552 IST" be parsed with a Joda format pattern?

 2. Is there any config to use SimpleDateFormat for parsing the date in timestampSpec, given that the Joda library is used by default?

I tried loading data with a proper timestamp record in Parquet and got another exception: java.lang.IllegalArgumentException: INT96 not yet implemented. at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279) at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264) at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223)
– Shiva Achari, Commented Jan 19, 2018 at 18:36
I came across a discussion of the same INT96 timestamp issue.
– Shiva Achari, Commented Jan 19, 2018 at 18:38

Meno Hochschild · Accepted Answer · 2018-01-19 12:53:28Z

The parser failed on the time zone abbreviation "IST". Such abbreviations are often ambiguous.

In this case, "IST" can stand for: "Europe/Dublin" (Irish Summer Time), "Asia/Jerusalem" (Israel Standard Time), "Asia/Kolkata" (India Standard Time). Looking at your name, I strongly assume that you want India Time.
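To make the ambiguity concrete, here is a minimal sketch using only java.time. It interprets the same wall-clock time in each of the three candidate regions and prints the resulting instants. The class name IstAmbiguity and the method candidateInstants are hypothetical names chosen for illustration. The sketch applies each region's full rules for that date, so a parser that reads "IST" as a fixed standard-time offset could produce yet another value.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.LinkedHashMap;
import java.util.Map;

class IstAmbiguity {
    // Interprets one wall-clock time in each region that "IST" may refer to.
    static Map<String, Instant> candidateInstants(LocalDateTime local) {
        Map<String, Instant> result = new LinkedHashMap<>();
        for (String id : new String[] {"Europe/Dublin", "Asia/Jerusalem", "Asia/Kolkata"}) {
            result.put(id, local.atZone(ZoneId.of(id)).toInstant());
        }
        return result;
    }

    public static void main(String[] args) {
        // Wall-clock part of "2017-09-01 21:14:11:552 IST"
        LocalDateTime local = LocalDateTime.of(2017, 9, 1, 21, 14, 11, 552_000_000);
        candidateInstants(local)
            .forEach((id, instant) -> System.out.println(id + " -> " + instant));
        // Europe/Dublin -> 2017-09-01T20:14:11.552Z
        // Asia/Jerusalem -> 2017-09-01T18:14:11.552Z
        // Asia/Kolkata -> 2017-09-01T15:44:11.552Z
    }
}
```

The three results are hours apart, which is why a parser has to be told which region is meant.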

Now I discuss several possible solutions and their advantages and drawbacks. A time library can use different strategies to resolve zone name ambiguities. Either it allows users to specify explicitly what zone they want (user-preference), or the region/country-information inside the current/associated locale might be used for resolving.

Joda-Time

The only solution is the following code:

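import java.util.Collections;
import java.util.Map;
import org.joda.time.DateTime;
import org.joda.time.DateTimeUtils; // setDefaultTimeZoneNames requires Joda-Time 2.2 or later
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;

String s = "2017-09-01 21:14:11:552 IST";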

Map<String, DateTimeZone> preferredJodaZones =
    Collections.singletonMap("IST", DateTimeZone.forID("Asia/Kolkata"));
DateTimeUtils.setDefaultTimeZoneNames(preferredJodaZones); // attention: static (global)
org.joda.time.format.DateTimeFormatter formatter =
    DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss:SSS zzz");
DateTime joda = formatter.parseDateTime(s);
System.out.println(joda);
// 2017-09-01T21:14:11.552+05:30

This approach, based on an explicit user preference, will probably satisfy your requirements because you do not need to change your dependency or preferred library. Still, I do not consider it ideal, for two reasons:

 1. It uses a static method to set the user preference (potentially vulnerable in a multi-threaded environment).
 2. It requires explicit knowledge of which zone abbreviations have to be resolved.

I recommend setting the user preference only once, during program initialization. After that you can probably work with Joda.

Old SimpleDateFormat-class

Yes, that works for you, but not for me, because the locale on my machine is not India. I get the timestamp/instant for Israel instead (3.5 hours off from India). This shows that the old class resolves the name ambiguity in the background using the region info of the associated locale, not the explicitly set time zone GMT (via sdf.setTimeZone(gmt);).

System.out.println(sdf.format(date)); // 2017-09-01 22:14:11:552 IDT

So please be very cautious where your code is running.
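One way to check which reading of "IST" a given runtime actually picked is to compare the parsed instant against the instant you expect for India time. The following is a minimal sketch under that assumption. The class IstCheck and the method parsedAsIndiaTime are hypothetical names, and what main prints depends on the machine it runs on, so no expected output is shown.

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.Date;

class IstCheck {
    // The instant the sample timestamp denotes if "IST" means India Standard Time.
    static final Instant EXPECTED_INDIA =
        LocalDateTime.of(2017, 9, 1, 21, 14, 11, 552_000_000)
            .atZone(ZoneId.of("Asia/Kolkata"))
            .toInstant();

    // True only if the parsed date is the India-time instant.
    static boolean parsedAsIndiaTime(Date parsed) {
        return parsed.toInstant().equals(EXPECTED_INDIA);
    }

    public static void main(String[] args) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS zzz");
        sdf.setLenient(false);
        Date date = sdf.parse("2017-09-01 21:14:11:552 IST");
        // Printing the instant in UTC avoids any dependence on the default zone.
        System.out.println("Parsed instant (UTC): " + date.toInstant());
        System.out.println("Read as India time:   " + parsedAsIndiaTime(date));
    }
}
```

Running this on each machine that will parse the data (for example a developer laptop and a cluster node) shows quickly whether they agree.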

java.time (Java-8 or later)

DateTimeFormatter threeten =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss:SSS zzz", new Locale("en", "IN"));
ZonedDateTime jdt = ZonedDateTime.parse(s, threeten);
System.out.println(jdt);
// 2017-09-01T21:14:11.552+03:00[Asia/Jerusalem] 
// (on my machine! - might work on your machine but is unreliable)

This experiment reveals that the locale information for resolving the tz-ambiguity is unfortunately not used. But it is possible to specify the user-preference via a builder-based approach:

Set<ZoneId> preferredZones = Collections.singleton(ZoneId.of("Asia/Kolkata"));
DateTimeFormatter threeten2 =
    new DateTimeFormatterBuilder()
    .appendPattern("yyyy-MM-dd HH:mm:ss:SSS ")
    .appendZoneText(TextStyle.SHORT, preferredZones)
    .toFormatter();
ZonedDateTime jdt2 = ZonedDateTime.parse(s, threeten2);
System.out.println(jdt2);
// 2017-09-01T21:14:11.552+05:30[Asia/Kolkata]

Here, the user-preference can be given as local parameter to the parser and does not suffer from any multi-thread-problem (better than Joda).
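If you prefer not to depend on the zone name data shipped with the JDK at all, the same user-preference idea can be sketched with an explicit lookup table. This is a different technique from appendZoneText: it splits off the trailing abbreviation, resolves it through a map you supply, and parses only the local part with a pattern. The class PreferredZoneParser and its parse method are hypothetical names, and the sketch assumes the abbreviation is always the last space-separated token.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Collections;
import java.util.Map;

class PreferredZoneParser {
    // Pattern for everything before the zone abbreviation.
    private static final DateTimeFormatter LOCAL_PART =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss:SSS");

    // Resolves the trailing abbreviation through the caller's map (no global state).
    static ZonedDateTime parse(String text, Map<String, ZoneId> preferredZones) {
        int lastSpace = text.lastIndexOf(' ');
        if (lastSpace < 0) {
            throw new IllegalArgumentException("No zone abbreviation in: " + text);
        }
        String abbreviation = text.substring(lastSpace + 1);
        ZoneId zone = preferredZones.get(abbreviation);
        if (zone == null) {
            throw new IllegalArgumentException("No preferred zone for: " + abbreviation);
        }
        LocalDateTime local = LocalDateTime.parse(text.substring(0, lastSpace), LOCAL_PART);
        return local.atZone(zone);
    }

    public static void main(String[] args) {
        Map<String, ZoneId> preferred =
            Collections.singletonMap("IST", ZoneId.of("Asia/Kolkata"));
        ZonedDateTime zdt = parse("2017-09-01 21:14:11:552 IST", preferred);
        System.out.println(zdt);             // 2017-09-01T21:14:11.552+05:30[Asia/Kolkata]
        System.out.println(zdt.toInstant()); // 2017-09-01T15:44:11.552Z
    }
}
```

Unknown abbreviations fail loudly instead of being guessed, which is usually what you want for ingestion data.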

Time4J (my lib)

It can use a builder approach similar to Java-8 to set the user-preference (not shown here), or it can deploy a non-fixed-offset parameter in constructing the formatter or use the locale information parameter (for greatest flexibility).

ChronoFormatter<Moment> time4j =
    ChronoFormatter.ofMomentPattern(
        "yyyy-MM-dd HH:mm:ss:SSS zzz",
        PatternType.CLDR,
        new Locale("en", "IN"), // uses India for resolving tz-ambiguity
        ZonalOffset.UTC 
        // using ASIA.KOLKATA would have higher ranking than locale information
    );
ZonalDateTime zdt = ZonalDateTime.parse(s, time4j); 
// convertible to java.time.ZonedDateTime (zdt.toTemporalAccessor())
System.out.println(zdt);
// 2017-09-01T21:14:11,552+05:30[Asia/Kolkata]
