Each run of the same Hadoop SequenceFile creation routine creates a file with a different CRC. Is that OK?
I have simple code that creates a Hadoop SequenceFile. Each run of the code leaves two files in the working directory:
mysequencefile.txt .mysequencefile.txt.crc
After each run the sizes of both files remain the same, but the contents of the .crc file are different!
Is this a bug or expected behaviour?
This is confusing, but it is expected behaviour.
According to the SequenceFile format, each SequenceFile has a sync block (also called a sync marker) that is 16 bytes long. The sync block is repeated once per block in block-compressed SequenceFiles, and after several records or one long record in uncompressed or record-compressed SequenceFiles.
The thing is, the sync block is a sort of random value. It is written in the header, which is how the reader recognizes it. It stays the same within one SequenceFile, but it can be (and is) different from one SequenceFile to another.
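To make the "sort of random value" concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes the sync block is produced by hashing a per-writer unique string with MD5, which yields exactly 16 bytes. The class SyncMarkerSketch and the helper newSyncMarker are hypothetical names for illustration, not Hadoop APIs, and the real writer may derive its value differently.

```java
import java.nio.charset.StandardCharsets;
import java.rmi.server.UID;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class SyncMarkerSketch {
    // Hypothetical helper: hash a string that is unique per call into 16 bytes.
    // Assumption: MD5 over "unique id @ timestamp"; not taken from the post.
    static byte[] newSyncMarker() {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            String unique = new UID() + "@" + System.currentTimeMillis();
            return md5.digest(unique.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] firstRun = newSyncMarker();
        byte[] secondRun = newSyncMarker();
        System.out.println("length in bytes: " + firstRun.length);
        System.out.println("same marker in both runs: " + Arrays.equals(firstRun, secondRun));
    }
}
```

Because the hashed input changes on every call, two runs get two different 16-byte values even though everything else about the file is identical.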
So the files are logically the same but differ at the binary level. The CRC is a checksum over the binary content, so it differs between the two files too.
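The effect can be reproduced without Hadoop. The sketch below builds two byte arrays with the same payload but different random 16-byte sync blocks and compares their sizes and CRC32 values. The layout in toyFile is a toy stand-in, not the real SequenceFile format, and a single CRC32 over the whole array stands in for the .crc file, whose actual layout is not reproduced here. All class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.zip.CRC32;

final class CrcSketch {
    static final int SYNC_SIZE = 16; // sync block length in bytes

    // Fresh random bytes standing in for a per-file sync block.
    static byte[] randomSync() {
        byte[] sync = new byte[SYNC_SIZE];
        new SecureRandom().nextBytes(sync);
        return sync;
    }

    // Toy layout (NOT the real SequenceFile format): magic, sync block, payload.
    static byte[] toyFile(byte[] sync, String payload) {
        byte[] magic = "SEQ".getBytes(StandardCharsets.US_ASCII);
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[magic.length + sync.length + body.length];
        System.arraycopy(magic, 0, out, 0, magic.length);
        System.arraycopy(sync, 0, out, magic.length, sync.length);
        System.arraycopy(body, 0, out, magic.length + sync.length, body.length);
        return out;
    }

    // CRC32 over the whole byte array.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        String records = "key1=value1;key2=value2";
        byte[] run1 = toyFile(randomSync(), records);
        byte[] run2 = toyFile(randomSync(), records);
        System.out.println("sizes: " + run1.length + " / " + run2.length);
        System.out.println("crc32: " + Long.toHexString(crc32(run1))
                + " / " + Long.toHexString(crc32(run2)));
    }
}
```

The sizes always match because the sync block has a fixed length, while the checksums will almost certainly differ from run to run. That is the same pattern as in the question: same sizes, different crc.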
I haven't found a way to set the sync block manually. If you find one, please write here.