Efficient way to convert string to ctypes.c_ubyte array in Python


I have a string of 20 bytes, and I would like to convert it to a ctypes.c_ubyte array for bit field manipulation purposes.

    import ctypes

    str_bytes = '01234567890123456789'
    byte_arr = bytearray(str_bytes)
    raw_bytes = (ctypes.c_ubyte*20)(*(byte_arr))

Is there a way to avoid a deep copy from str to bytearray for the sake of the cast?

Alternatively, is it possible to convert a string to a bytearray without a deep copy? (With techniques like memoryview?)

I am using Python 2.7.

Performance results:

Using eryksun's and Brian Larsen's suggestions, here are the benchmarks under a VirtualBox VM running Ubuntu 12.04 and Python 2.7.

  • method1 uses the approach from my original post
  • method2 uses ctypes from_buffer_copy
  • method3 uses ctypes cast/POINTER
  • method4 uses numpy

Results (1,000,000 iterations each):

  • method1 takes 3.87 sec
  • method2 takes 0.42 sec
  • method3 takes 1.44 sec
  • method4 takes 8.79 sec

Code:

    import ctypes
    import time
    import numpy

    str_bytes = '01234567890123456789'

    def method1():
        # Original approach: copy into a bytearray, then unpack as star-args.
        result = ''
        t0 = time.clock()
        for x in xrange(0, 1000000):
            byte_arr = bytearray(str_bytes)
            result = (ctypes.c_ubyte*20)(*(byte_arr))
        t1 = time.clock()
        print(t1-t0)
        return result

    def method2():
        # Single copy straight from the string's buffer.
        result = ''
        t0 = time.clock()
        for x in xrange(0, 1000000):
            result = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)
        t1 = time.clock()
        print(t1-t0)
        return result

    def method3():
        # No copy: view the string's memory through a pointer cast.
        result = ''
        t0 = time.clock()
        for x in xrange(0, 1000000):
            result = ctypes.cast(str_bytes, ctypes.POINTER(ctypes.c_ubyte * 20))[0]
        t1 = time.clock()
        print(t1-t0)
        return result

    def method4():
        # Go through a numpy array and its ctypes interface.
        result = ''
        t0 = time.clock()
        for x in xrange(0, 1000000):
            arr = numpy.asarray(str_bytes)
            result = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte*len(str_bytes)))
        t1 = time.clock()
        print(t1-t0)
        return result

    print(method1())
    print(method2())
    print(method3())
    print(method4())

I don't think that's working how you think. bytearray creates a copy of the string. Then the interpreter unpacks the bytearray sequence into a starargs tuple and merges this into a new tuple that has the other args (even though there are none in this case). Finally, the c_ubyte array initializer loops over the args tuple to set the elements of the c_ubyte array. That's a lot of work, and a lot of copying, to go through to initialize the array.

Instead you can use the from_buffer_copy method, assuming the string is a bytestring with the buffer interface (not unicode):

    import ctypes

    str_bytes = '01234567890123456789'
    raw_bytes = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)
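For readers on Python 3, here is a minimal sketch of the same call. It assumes the input is a bytes object (the Python 3 counterpart of the 2.7 str used here), and the `arr_t` name is only for illustration. It also shows that the resulting array owns its own memory, so writing to it leaves the source untouched.

```python
import ctypes

# Assumption: Python 3, where the 20-byte input is a bytes object.
str_bytes = b'01234567890123456789'

# Array type of 20 unsigned bytes (arr_t is an illustrative name).
arr_t = ctypes.c_ubyte * len(str_bytes)

# One bulk copy from the bytes object's buffer into a new ctypes array.
raw_bytes = arr_t.from_buffer_copy(str_bytes)

# The array has its own storage, so this write does not affect str_bytes.
raw_bytes[0] = ord('a')

print(bytes(raw_bytes))  # contents of the ctypes array
print(str_bytes)         # the original bytes object, unchanged
```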

That still has to copy the string, but it's only done once, and much more efficiently. As was stated in the comments, a Python string is immutable and could be interned or used as a dict key. Its immutability should be respected, even if ctypes lets you violate this in practice:

    >>> from ctypes import *
    >>> s = '01234567890123456789'
    >>> b = cast(s, POINTER(c_ubyte * 20))[0]
    >>> b[0] = 97
    >>> s
    'a1234567890123456789'

Edit

I need to emphasize that I am not recommending using ctypes to modify an immutable CPython string. If you have to, then at the very least check sys.getrefcount beforehand to ensure that the reference count is 2 or less (the call adds 1). Otherwise, you will eventually be surprised by string interning for names (e.g. "sys") and code object constants. Python is free to reuse immutable objects as it sees fit. If you step outside of the language to mutate an 'immutable' object, you've broken the contract.

For example, if you modify an already-hashed string, the cached hash is no longer correct for the contents. That breaks it for use as a dict key. Neither another string with the new contents nor one with the original contents will match the key in the dict. The former has a different hash, and the latter has a different value. Then the only way to get at the dict item is by using the mutated string that has the incorrect hash. Continuing from the previous example:

    >>> s
    'a1234567890123456789'
    >>> d = {s: 1}
    >>> d[s]
    1

    >>> d['a1234567890123456789']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'a1234567890123456789'

    >>> d['01234567890123456789']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: '01234567890123456789'

Now consider the mess if the key is an interned string that's reused in dozens of places.
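If the goal is a ctypes view that can be modified without any copy, one way to stay within the language is to start from a mutable buffer. The sketch below is not part of the benchmarks above. It assumes the data can live in a bytearray from the outset, and uses from_buffer, which shares memory with a writable buffer instead of copying it.

```python
import ctypes

# Assumption: the data starts out in a mutable bytearray, not an immutable str.
buf = bytearray(b'01234567890123456789')

arr_t = ctypes.c_ubyte * len(buf)

# from_buffer shares memory with the writable buffer; no copy is made.
shared = arr_t.from_buffer(buf)

shared[0] = ord('a')  # write through the ctypes array
buf[1] = ord('b')     # write through the bytearray

# Both writes are visible from either side because the memory is shared.
print(buf)
print(bytes(shared))
```

While `shared` is alive the bytearray cannot be resized, since the array holds an export of its buffer.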


For performance analysis it's typical to use the timeit module. Prior to 3.3, timeit.default_timer varies by platform. On POSIX systems it's time.time, and on Windows it's time.clock.

    import timeit

    setup = r'''
    import ctypes, numpy
    str_bytes = '01234567890123456789'
    arr_t = ctypes.c_ubyte * 20
    '''

    methods = [
      'arr_t(*bytearray(str_bytes))',
      'arr_t.from_buffer_copy(str_bytes)',
      'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',
      'numpy.asarray(str_bytes).ctypes.data_as('
          'ctypes.POINTER(arr_t))[0]',
    ]

    # Best of the repeated runs for each statement.
    test = lambda m: min(timeit.repeat(m, setup))

    >>> tabs = [test(m) for m in methods]
    >>> trel = [t / tabs[0] for t in tabs]
    >>> trel
    [1.0, 0.060573711879182784, 0.261847116395079, 1.5389279092185282]
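A rough Python 3 adaptation of the same comparison is sketched below. It assumes a bytes input, leaves out the numpy variant, and uses a much smaller iteration count so that it finishes quickly. The `best_time` helper is an illustrative name, and the ratios it prints will differ by machine and interpreter version.

```python
import timeit

# Assumption: Python 3, so the input is bytes; the numpy method is omitted.
setup = r'''
import ctypes
str_bytes = b'01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
    'arr_t(*bytearray(str_bytes))',
    'arr_t.from_buffer_copy(str_bytes)',
    'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',
]

def best_time(stmt, number=20000):
    # Take the minimum of several repeats to reduce scheduling noise.
    return min(timeit.repeat(stmt, setup, repeat=3, number=number))

tabs = [best_time(m) for m in methods]
trel = [t / tabs[0] for t in tabs]  # times relative to the first method

for stmt, rel in zip(methods, trel):
    print('%-50s %.3f' % (stmt, rel))
```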
