python - Apache Spark: How to create a matrix from a DataFrame?
I have a DataFrame in Apache Spark containing arrays of integers, the source being a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from the arrays. How do I create a matrix from this RDD?
> imageRDD = traindf.map(lambda row: map(float, row.image))
> mat = DenseMatrix(numRows=206456, numCols=10, values=imageRDD)

Traceback (most recent call last):
  File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
    mat = DenseMatrix(numRows=206456, numCols=10, values=imageRDD)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
    values = self._convert_to_array(values, np.float64)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
    return np.asarray(array_like, dtype=dtype)
  File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
I'm getting the same error with every possible arrangement I can think of:
imageRDD = traindf.map(lambda row: Vectors.dense(row.image))
imageRDD = traindf.map(lambda row: row.image)
imageRDD = traindf.map(lambda row: np.array(row.image))
If I try
> imageDF = traindf.select("image")
> mat = DenseMatrix(numRows=206456, numCols=10, values=imageDF)
Traceback (most recent call last):
  File "<ipython-input-26-a8cbdad10291>", line 2, in <module>
    mat = DenseMatrix(numRows=206456, numCols=10, values=imageDF)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
    values = self._convert_to_array(values, np.float64)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
    return np.asarray(array_like, dtype=dtype)
  File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Since you didn't provide example input, I'll assume it looks more or less like this, where id is a row number and image contains the values:
traindf = sqlContext.createDataFrame([
    (1, [1, 2, 3]),
    (2, [4, 5, 6]),
    (3, (7, 8, 9))
], ("id", "image"))
The first thing you have to understand is that DenseMatrix is a local data structure. To be precise, it is a wrapper around numpy.ndarray. As of now (Spark 1.4.1), there are no distributed equivalents in PySpark MLlib.
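For illustration (a minimal sketch of my own, not part of the original answer), a DenseMatrix can be built entirely from local values, and its entries are read in column-major order:

from pyspark.mllib.linalg import DenseMatrix

# 2x2 matrix from a flat local list; values fill column by column
dm_local = DenseMatrix(2, 2, [1.0, 2.0, 3.0, 4.0])
print(dm_local.toArray())
# [[ 1.  3.]
#  [ 2.  4.]]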
DenseMatrix takes three mandatory arguments: numRows, numCols, and values, where values is a local data structure (a flat sequence of entries in column-major order). In your case you have to collect first:
values = (traindf.
    rdd.
    map(lambda r: (r.id, r.image)).   # Extract row id and data
    sortByKey().                      # Sort by row id
    flatMap(lambda (id, image): image).
    collect())

ncol = len(traindf.rdd.map(lambda r: r.image).first())
nrow = traindf.count()

dm = DenseMatrix(nrow, ncol, values)
Finally:
> print dm.toArray()
[[ 1.  4.  7.]
 [ 2.  5.  8.]
 [ 3.  6.  9.]]
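Since dm is local at this point, you could also run the PCA itself with plain NumPy. A sketch of my own (assuming the data fits in driver memory), using the SVD of the column-centered matrix:

import numpy as np

arr = dm.toArray().T                   # one image (sample) per row
centered = arr - arr.mean(axis=0)      # center each feature column
u, s, vt = np.linalg.svd(centered, full_matrices=False)
components = vt.T                      # columns are principal components
projected = centered.dot(components[:, :2])  # project onto the top 2 components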
Edit:

In Spark 1.5+ you can use mllib.linalg.distributed as follows:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(traindf.map(lambda row: IndexedRow(*row)))

mat.numRows()
## 4
mat.numCols()
## 3
Although at this point the API is still too limited to be useful in practice.
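If the end goal is PCA (as in the question), Spark 1.5+ also provides pyspark.ml.feature.PCA, which operates on a DataFrame with a vector column. A hedged sketch (the column names here are my own choice):

from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors

# Convert the image arrays into a vector column first
vecdf = traindf.map(
    lambda r: (r.id, Vectors.dense(r.image))
).toDF(["id", "features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(vecdf)
model.transform(vecdf).select("id", "pca_features").show()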