python - Apache Spark: How to create a matrix from a DataFrame?
I have a DataFrame in Apache Spark containing arrays of integers, the source being a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from the arrays. How do I create a matrix from this RDD?
> imageRDD = traindf.map(lambda row: map(float, row.image))
> mat = DenseMatrix(numRows=206456, numCols=10, values=imageRDD)

Traceback (most recent call last):
  File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
    mat = DenseMatrix(numRows=206456, numCols=10, values=imageRDD)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
    values = self._convert_to_array(values, np.float64)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
    return np.asarray(array_like, dtype=dtype)
  File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
I'm getting the same error with every possible arrangement I can think of:
imageRDD = traindf.map(lambda row: Vectors.dense(row.image))
imageRDD = traindf.map(lambda row: row.image)
imageRDD = traindf.map(lambda row: np.array(row.image))
If I try
> imageDF = traindf.select("image")
> mat = DenseMatrix(numRows=206456, numCols=10, values=imageDF)
Traceback (most recent call last):
  File "<ipython-input-26-a8cbdad10291>", line 2, in <module>
    mat = DenseMatrix(numRows=206456, numCols=10, values=imageDF)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
    values = self._convert_to_array(values, np.float64)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
    return np.asarray(array_like, dtype=dtype)
  File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Since you didn't provide example input, I'll assume it looks more or less like this, where id is a row number and image contains the values:
traindf = sqlContext.createDataFrame([
    (1, [1, 2, 3]),
    (2, [4, 5, 6]),
    (3, (7, 8, 9))
], ("id", "image"))
The first thing you have to understand is that DenseMatrix is a local data structure. To be precise, it is a wrapper around numpy.ndarray. As of now (Spark 1.4.1), there are no distributed equivalents in PySpark MLlib.
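For illustration (a minimal sketch of my own, not part of the original answer), a DenseMatrix can be built entirely from local values, and its entries are read in column-major order:

from pyspark.mllib.linalg import DenseMatrix

# 2x2 matrix from a flat local list; values fill column by column
dm_local = DenseMatrix(2, 2, [1.0, 2.0, 3.0, 4.0])
print(dm_local.toArray())
# [[ 1.  3.]
#  [ 2.  4.]]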
DenseMatrix takes three mandatory arguments: numRows, numCols, and values, where values is a local data structure (a flat sequence of entries in column-major order). In your case you have to collect first:
values = (traindf.
    rdd.
    map(lambda r: (r.id, r.image)).   # Extract row id and data
    sortByKey().                      # Sort by row id
    flatMap(lambda (id, image): image).
    collect())

ncol = len(traindf.rdd.map(lambda r: r.image).first())
nrow = traindf.count()

dm = DenseMatrix(nrow, ncol, values)
Finally:
> print dm.toArray()
[[ 1.  4.  7.]
 [ 2.  5.  8.]
 [ 3.  6.  9.]]
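Since dm is local at this point, you could also run the PCA itself with plain NumPy. A sketch of my own (assuming the data fits in driver memory), using the SVD of the column-centered matrix:

import numpy as np

arr = dm.toArray().T                   # one image (sample) per row
centered = arr - arr.mean(axis=0)      # center each feature column
u, s, vt = np.linalg.svd(centered, full_matrices=False)
components = vt.T                      # columns are principal components
projected = centered.dot(components[:, :2])  # project onto the top 2 components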
Edit:

In Spark 1.5+ you can use mllib.linalg.distributed as follows:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(traindf.map(lambda row: IndexedRow(*row)))

mat.numRows()
## 4
mat.numCols()
## 3
Although at this point the API is still too limited to be useful in practice.
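If the end goal is PCA (as in the question), Spark 1.5+ also provides pyspark.ml.feature.PCA, which operates on a DataFrame with a vector column. A hedged sketch (the column names here are my own choice):

from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors

# Convert the image arrays into a vector column first
vecdf = traindf.map(
    lambda r: (r.id, Vectors.dense(r.image))
).toDF(["id", "features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(vecdf)
model.transform(vecdf).select("id", "pca_features").show()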