Get the same hash value for a Pandas DataFrame each time


My goal is to get a unique hash value for a DataFrame that I obtain from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was to create this function:

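    def _get_array_hash(arr):
        # Take the NumPy array that backs the DataFrame
        arr_hashable = arr.values
        # Mark it read-only so that its buffer can be hashed
        arr_hashable.flags.writeable = False
        hash_ = hash(arr_hashable.data)
        return hash_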

That is, it takes the underlying NumPy array, sets it to an immutable state, and gets the hash of its buffer.

Inline update.

As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use

    hash(df.values.tobytes())

See the comments on "Most efficient property to hash for numpy array".

End of inline update.
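One caveat for anyone who needs the value to survive a restart: in Python 3 the built-in hash() of a bytes object is salted per process unless PYTHONHASHSEED is fixed, so it can differ between application launches even for identical bytes. A minimal sketch of a digest-based variant follows. The helper name `frame_digest` is made up for illustration, and it assumes a purely numeric frame. It also ignores the index and the column names.

```python
import hashlib

import numpy as np
import pandas as pd


def frame_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: SHA-256 hex digest of the frame's raw values."""
    # Assumption: all columns are numeric. Object columns hold references
    # to Python objects, so their raw bytes are not a stable fingerprint.
    values = np.ascontiguousarray(df.values)
    return hashlib.sha256(values.tobytes()).hexdigest()


df = pd.DataFrame({'a': [0], 'b': [1]})
digest = frame_digest(df)
print(digest)
```

Unlike hash(), a hashlib digest depends only on the bytes it is fed, which makes it a better fit for a cache key that is stored between runs.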

It works for a regular pandas DataFrame:

    In [12]: data = pd.DataFrame({'a': [0], 'b': [1]})

    In [13]: _get_array_hash(data)
    Out[13]: -5522125492475424165

    In [14]: _get_array_hash(data)
    Out[14]: -5522125492475424165

But then I try to apply it to a DataFrame obtained from a .csv file:

    In [15]: fpath = 'foo/bar.csv'

    In [16]: data_from_file = pd.read_csv(fpath)

    In [17]: _get_array_hash(data_from_file)
    Out[17]: 6997017925422497085

    In [18]: _get_array_hash(data_from_file)
    Out[18]: -7524466731745902730

Can somebody explain to me how that's possible?

I can create a new DataFrame out of it, like

    new_data = pd.DataFrame(data=data_from_file.values,
                            columns=data_from_file.columns,
                            index=data_from_file.index)

and it works again:

    In [25]: _get_array_hash(new_data)
    Out[25]: -3546154109803008241

    In [26]: _get_array_hash(new_data)
    Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a DataFrame across application launches, in order to retrieve some value from a cache.
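One plausible cause of the changing hashes, offered as a guess rather than a diagnosis of the file above, is mixed column types. When a CSV has both text and numeric columns, `.values` has to merge them into a single object array, and the buffer of an object array holds references to Python objects rather than the data itself. The CSV text below is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'foo/bar.csv': one text column, one numeric column.
csv_text = "name,score\nalice,1\nbob,2\n"
data_from_file = pd.read_csv(io.StringIO(csv_text))

# The per-column dtypes differ...
print(data_from_file.dtypes)

# ...so the combined array falls back to dtype object.
values = data_from_file.values
print(values.dtype)
```

If the dtype printed for your own file is object, hashing the raw buffer means hashing memory addresses, which are not expected to be reproducible.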

I had a similar problem: checking whether a DataFrame has changed. I solved it by hashing the msgpack serialization string. This seems stable across different reloads of the same data.

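    import hashlib

    import pandas as pd

    data_file = 'data.json'

    data1 = pd.read_json(data_file)
    data2 = pd.read_json(data_file)

    # Note: to_msgpack() was removed in pandas 1.0, so this needs an older pandas.
    # The msgpack serialization is identical for both loads...
    assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
    # ...while the raw value buffers are not.
    assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()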
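On pandas versions without to_msgpack(), a similar content-based hash can be sketched with `pd.util.hash_pandas_object`, which computes one uint64 per row from the values (and optionally the index) instead of from the memory layout. This is an alternative technique, not the method used above. The helper name `stable_frame_hash` and the CSV text are made up for illustration, and the row hashes do not cover the column names.

```python
import hashlib
import io

import pandas as pd


def stable_frame_hash(df: pd.DataFrame) -> str:
    """Hypothetical helper: digest built from per-row content hashes."""
    # One uint64 per row, derived from the row's values and its index label.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    # Condense the per-row hashes into a single hex string.
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


# Made-up data standing in for a file that is loaded twice.
csv_text = "name,score\nalice,1\nbob,2\n"
data1 = pd.read_csv(io.StringIO(csv_text))
data2 = pd.read_csv(io.StringIO(csv_text))

assert stable_frame_hash(data1) == stable_frame_hash(data2)
```

If column names matter for the cache key, one option is to feed `','.join(df.columns)` into the same hashlib object before taking the digest.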
