python - Decrypting a file to a stream and reading the stream into pandas (hdf or stata)


Overview of what I'm trying to do: I have encrypted versions of files that I need to read into pandas. For a couple of reasons it is much better to decrypt to a stream rather than to a file, so that's my interest below, although I also attempt to decrypt to a file as an intermediate step (but that isn't working either).

I'm able to get this working for a csv, but not for either hdf or stata (I'd accept an answer that works for either hdf or stata, though the answer might be the same for both, which is why I'm combining them in one question).

The code for encrypting/decrypting files is taken from another Stack Overflow question (which I can't find at the moment).

    import pandas as pd
    import io
    from Crypto import Random
    from Crypto.Cipher import AES

    def pad(s):
        return s + b"\0" * (AES.block_size - len(s) % AES.block_size)

    def encrypt(message, key, key_size=256):
        message = pad(message)
        iv = Random.new().read(AES.block_size)
        cipher = AES.new(key, AES.MODE_CBC, iv)
        return iv + cipher.encrypt(message)

    def decrypt(ciphertext, key):
        iv = ciphertext[:AES.block_size]
        cipher = AES.new(key, AES.MODE_CBC, iv)
        plaintext = cipher.decrypt(ciphertext[AES.block_size:])
        return plaintext.rstrip(b"\0")

    def encrypt_file(file_name, key):
        with open(file_name, 'rb') as fo:
            plaintext = fo.read()
        enc = encrypt(plaintext, key)
        with open(file_name + ".enc", 'wb') as fo:
            fo.write(enc)

    def decrypt_file(file_name, key):
        with open(file_name, 'rb') as fo:
            ciphertext = fo.read()
        dec = decrypt(ciphertext, key)
        with open(file_name[:-4], 'wb') as fo:
            fo.write(dec)

And here's my attempt to extend the code to decrypt to a stream rather than to a file.

    def decrypt_stream(file_name, key):
        with open(file_name, 'rb') as fo:
            ciphertext = fo.read()
        dec = decrypt(ciphertext, key)
        cipherbyte = io.BytesIO()
        cipherbyte.write(dec)
        cipherbyte.seek(0)
        return cipherbyte

Finally, here's the sample program with sample data, attempting to make this work:

    key = 'this example key'[:16]
    df = pd.DataFrame({ 'x':[1,2], 'y':[3,4] })

    df.to_csv('test.csv',index=False)
    df.to_hdf('test.h5','test',mode='w')
    df.to_stata('test.dta')

    encrypt_file('test.csv',key)
    encrypt_file('test.h5',key)
    encrypt_file('test.dta',key)

    decrypt_file('test.csv.enc',key)
    decrypt_file('test.h5.enc',key)
    decrypt_file('test.dta.enc',key)

    # csv works here but hdf and stata don't
    # I'm less interested in this part but include it for completeness
    df_from_file = pd.read_csv('test.csv')
    df_from_file = pd.read_hdf('test.h5','test')
    df_from_file = pd.read_stata('test.dta')

    # csv works here but hdf and stata don't
    # the hdf and stata lines below are what I really need to get working
    df_from_stream = pd.read_csv( decrypt_stream('test.csv.enc',key) )
    df_from_stream = pd.read_hdf( decrypt_stream('test.h5.enc',key), 'test' )
    df_from_stream = pd.read_stata( decrypt_stream('test.dta.enc',key) )

Unfortunately I don't think I can shrink this code any more and still have a complete example.

Again, my hope is to have all 4 non-working lines above working (file and stream for hdf and stata), but I'm happy to accept an answer that works for either the hdf stream alone or the stata stream alone.

Also, I'm open to other encryption alternatives; I just used some existing pycrypto-based code that I found here on SO. My work explicitly requires 256-bit AES, but beyond that I'm open, so the solution needn't be based on the pycrypto library or the specific code example above.

Info on my setup:

    python: 3.4.3
    pandas: 0.17.0 (anaconda 2.3.0 distribution)
    mac os: 10.11.3

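The biggest issue is the padding/unpadding method. It assumes that the null character can't be part of the actual content. Since stata/hdf files are binary, they can legitimately end in null bytes, which `rstrip(b"\0")` would strip along with the padding. It's safer to pad using the number of extra bytes added, encoded as a character. This number is then used during unpadding to remove exactly that many bytes.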
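To make the failure concrete, here is a minimal Python 3 sketch that compares the two padding schemes on a payload ending in null bytes. It does not use pycrypto at all: `BLOCK_SIZE` is a stand-in for `AES.block_size`, and the function names `pad_nulls`, `unpad_nulls`, `pad_count` and `unpad_count` are made up for this illustration.

```python
BLOCK_SIZE = 16  # assumed stand-in for AES.block_size (16 bytes for AES)

def pad_nulls(data):
    # Scheme from the question: fill the last block with null bytes.
    return data + b"\0" * (BLOCK_SIZE - len(data) % BLOCK_SIZE)

def unpad_nulls(data):
    # Strips every trailing null byte, including ones that belong to the data.
    return data.rstrip(b"\0")

def pad_count(data):
    # Count-based scheme: append n bytes, each with the value n.
    n = BLOCK_SIZE - len(data) % BLOCK_SIZE
    return data + bytes([n]) * n

def unpad_count(data):
    # The last byte says how many padding bytes to remove.
    return data[:-data[-1]]

# A binary payload whose real content ends in two null bytes.
payload = b"\x01\x02\x03\x00\x00"

null_roundtrip = unpad_nulls(pad_nulls(payload))
count_roundtrip = unpad_count(pad_count(payload))

print(null_roundtrip == payload)   # False: the two trailing nulls were lost
print(count_roundtrip == payload)  # True
```

A csv file is text and essentially never ends in a null byte, which is presumably why the csv round trip worked while the binary formats came back truncated.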

Also, for the time being, `read_hdf` doesn't support reading from a file-like object, even if the API documentation claims so. If we restrict ourselves to the stata format, the following code will do what you need:

    import pandas as pd
    import io
    from Crypto import Random
    from Crypto.Cipher import AES

    # Python 2 versions of pad/unpad (str is a byte string there)
    def pad(s):
        n = AES.block_size - len(s) % AES.block_size
        return s + n * chr(n)

    def unpad(s):
        return s[:-ord(s[-1])]

    def encrypt(message, key, key_size=256):
        message = pad(message)
        iv = Random.new().read(AES.block_size)
        cipher = AES.new(key, AES.MODE_CBC, iv)
        return iv + cipher.encrypt(message)

    def decrypt(ciphertext, key):
        iv = ciphertext[:AES.block_size]
        cipher = AES.new(key, AES.MODE_CBC, iv)
        plaintext = cipher.decrypt(ciphertext[AES.block_size:])
        return unpad(plaintext)

    def encrypt_file(file_name, key):
        with open(file_name, 'rb') as fo:
            plaintext = fo.read()
        enc = encrypt(plaintext, key)
        with open(file_name + ".enc", 'wb') as fo:
            fo.write(enc)

    def decrypt_stream(file_name, key):
        with open(file_name, 'rb') as fo:
            ciphertext = fo.read()
        dec = decrypt(ciphertext, key)
        cipherbyte = io.BytesIO()
        cipherbyte.write(dec)
        cipherbyte.seek(0)
        return cipherbyte

    key = 'this example key'[:16]

    df = pd.DataFrame({
        'x': [1,2],
        'y': [3,4]
    })

    df.to_stata('test.dta')

    encrypt_file('test.dta', key)

    print(pd.read_stata(decrypt_stream('test.dta.enc', key)))

Output:

       index  x  y
    0      0  1  3
    1      1  2  4
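One point the example does not address is the 256-bit requirement from the question. With AES the key length selects the variant, so the 16-character key used above gives AES-128, and the `key_size=256` argument of `encrypt` is never used. A 32-byte key is needed for AES-256. As a rough sketch, assuming the key is to be derived from a passphrase, the standard library's `hashlib.pbkdf2_hmac` can produce one; the helper name `derive_key`, the salt handling and the iteration count below are illustrative choices, not part of the original code.

```python
import hashlib
import os

def derive_key(passphrase, salt, iterations=100000):
    # Hypothetical helper: stretch a passphrase into a 32-byte (256-bit) key.
    # The iteration count is an assumed example value.
    return hashlib.pbkdf2_hmac("sha256", passphrase, salt, iterations, dklen=32)

# The salt is not secret, but the same salt is needed again to re-derive
# the key, so it would have to be stored alongside the ciphertext.
salt = os.urandom(16)
key = derive_key(b"this example key", salt)

print(len(key))  # 32
```

The resulting `key` could then be passed to `encrypt_file` and `decrypt_stream` in place of the 16-character string.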

In Python 3, where file contents are read as `bytes`, you can use the following versions of `pad` and `unpad` instead:

    def pad(s):
        n = AES.block_size - len(s) % AES.block_size
        return s + bytearray([n] * n)

    def unpad(s):
        return s[:-s[-1]]
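Note that `unpad` trusts the last byte unconditionally, so decrypting with the wrong key or a damaged file silently yields garbage of arbitrary length. A possible stricter variant, sketched here for Python 3, checks the padding before removing it. `BLOCK_SIZE` again stands in for `AES.block_size`, and `unpad_checked` is a hypothetical name, not something pycrypto provides.

```python
BLOCK_SIZE = 16  # assumed stand-in for AES.block_size

def pad(s):
    # Same count-based padding as above: n bytes, each with the value n.
    n = BLOCK_SIZE - len(s) % BLOCK_SIZE
    return s + bytearray([n] * n)

def unpad_checked(s):
    # Hypothetical stricter unpad: verify the padding before stripping it.
    if not s or len(s) % BLOCK_SIZE != 0:
        raise ValueError("length is not a positive multiple of the block size")
    n = s[-1]
    if n < 1 or n > BLOCK_SIZE or s[-n:] != bytes([n]) * n:
        raise ValueError("invalid padding")
    return s[:-n]

# Input that is already a whole block still gets a full block of padding,
# which is what keeps unpadding unambiguous.
exact = b"A" * BLOCK_SIZE
padded = pad(exact)

print(len(padded))                     # 32
print(unpad_checked(padded) == exact)  # True
```

With this variant, most failed decryptions surface as a `ValueError` rather than as an unreadable stata file further down in pandas. It is not a substitute for real integrity checking, since random garbage can still happen to end in valid-looking padding.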
