Get the same hash value for a Pandas DataFrame each time
My goal is to get a unique hash value for a DataFrame that I obtain out of a .csv file. The whole point is to get the same hash each time I call hash() on it.
My idea was to create the function

```python
def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False  # make the buffer immutable so it is hashable
    hash_ = hash(arr_hashable.data)
    return hash_
```
It takes the underlying numpy array, sets it to an immutable state, and gets the hash of the buffer.
Inline update: as of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use

```python
hash(df.values.tobytes())
```

See the comments on Most efficient property to hash for numpy array. End of inline update.
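The tobytes() variant can be sketched as follows (the small frame is just an illustration); note that because Python salts the built-in hash() for bytes per interpreter process (see PYTHONHASHSEED), the value is only guaranteed to be stable within a single run:

```python
import pandas as pd

df = pd.DataFrame({'a': [0], 'b': [1]})

# tobytes() copies the raw array buffer into an immutable bytes object,
# so there is no need to flip the writeable flag before hashing.
h1 = hash(df.values.tobytes())
h2 = hash(df.values.tobytes())
assert h1 == h2  # stable within one interpreter session
```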
It works for a regular pandas DataFrame:
```python
In [12]: data = pd.DataFrame({'a': [0], 'b': [1]})

In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165

In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
```
But when I try to apply it to a DataFrame obtained from a .csv file:
```python
In [15]: fpath = 'foo/bar.csv'

In [16]: data_from_file = pd.read_csv(fpath)

In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085

In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
```
Can somebody explain to me how that's possible?
I can create a new DataFrame out of it, like

```python
new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)
```
and it works again:

```python
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241

In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
```
But my goal is to preserve the same hash value for the DataFrame across application launches, in order to retrieve a value from a cache.
I had a similar problem: check whether a DataFrame has changed. I solved it by hashing the msgpack serialization string. This seems stable among different reloadings of the same data.
```python
import pandas as pd
import hashlib

data_file = 'data.json'

data1 = pd.read_json(data_file)
data2 = pd.read_json(data_file)

assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
```
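Note that to_msgpack was later deprecated (pandas 0.25) and removed (pandas 1.0). On current pandas versions, a comparable approach is pd.util.hash_pandas_object, which produces a deterministic uint64 per row; collapsing those with hashlib yields one stable digest. A sketch under that assumption:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({'a': [0], 'b': [1]})

# One deterministic uint64 per row (index included), independent of the
# process's hash salt, and safe for object-dtype columns too.
row_hashes = pd.util.hash_pandas_object(df, index=True)

# Collapse the per-row hashes into a single stable digest.
digest = hashlib.md5(row_hashes.values.tobytes()).hexdigest()
print(digest)
```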