Bug #2963: Add data structure for tabular data and associated metadata - Kepler - Ecoinformatics Redmine

Bug #2963

open

Add data structure for tabular data and associated metadata

Added by ben leinfelder over 17 years ago. Updated over 15 years ago.

Status:

New

Priority:

Normal

Assignee:

ben leinfelder

Category:

data access

Target version:

3.X.Y

Start date:

09/11/2007

Due date:

% Done:

Estimated time:

Bugzilla-Id:

2963

Description

Essentially, add support for an R dataframe type that can be passed around the workflow.
It would most likely be used for EMLDatasource -> RExpression transfer, but the ability to pass it among many different actors would be ideal.

Updated by Dan Higgins over 17 years ago

There is already support for handing an R dataframe from one RExpression to another. If RExpression has an output port assigned to a dataframe, it writes the dataframe to a file and passes a string starting with 'dataframe:' followed by the file path. Another RExpression actor can access this as an input.
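
As a rough illustration of the handoff described above, the sketch below splits such a string into its prefix and file path. It is written in Python for brevity and is not taken from the RExpression source; the function name `parse_dataframe_reference` and the example path are hypothetical, and only the 'dataframe:' prefix comes from the description above.

```python
# Prefix described in the comment above; everything else here is illustrative.
DATAFRAME_PREFIX = "dataframe:"


def parse_dataframe_reference(token_value):
    """Return the file path if the string is a dataframe reference, else None."""
    if token_value.startswith(DATAFRAME_PREFIX):
        # Everything after the prefix is treated as the path of the saved dataframe.
        return token_value[len(DATAFRAME_PREFIX):]
    return None


# Hypothetical path, for illustration only.
path = parse_dataframe_reference("dataframe:/tmp/example_df.dat")
```

A receiving actor would presumably fall back to treating the value as an ordinary string when the prefix is absent, which is what the `None` return stands in for here.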

The rough Kepler equivalent of a dataframe is a Kepler/PTII record. Each record field has a name (the column name) and an array of values (the column vector). See $KEPLER/demos/R/emlToRecord_R.xml for an example of creating a record/dataframe and importing it into the RExpression actor.

There is one shortcoming with representing a dataframe as a record of arrays: the Kepler record is not ordered, i.e. record fields are always referenced by name, not by position. Thus, the nth item added when creating a record may not be in the nth position when accessing it. R dataframe columns, by contrast, can be referenced by name or by index.
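
The ordering problem can be shown with a small Python sketch. It does not use the Kepler or Ptolemy classes; as an assumption for illustration, the unordered record is modeled as a mapping whose fields come back sorted by name, and the column names and values are made up.

```python
# Columns in the order they were created (made-up example data).
created = [("site", ["A", "B"]), ("count", [3, 5]), ("depth", [1.5, 2.0])]


def to_unordered_record(columns):
    """Model a record that is keyed by name only.

    Assumption: the fields come back sorted by name, not in creation order.
    """
    return dict(sorted(columns, key=lambda column: column[0]))


record = to_unordered_record(created)

# Access by name still works as expected.
count_values = record["count"]

# Access by position does not: the second field is no longer "count".
second_created = created[1][0]
second_in_record = list(record)[1]
```

Here `second_created` is "count" while `second_in_record` is "depth", which is the mismatch an index-based consumer such as an R script would run into.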

I suggest that we consider an extension of the R dataframe concept. This would include not only the list of named column arrays of the dataframe, but also additional associated metadata (i.e. all the EML attribute info) and possibly semantic information. (Dan Higgins)
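
One possible shape for such an extended structure is sketched below in Python. Nothing here reflects an existing Kepler class: the names `ColumnMetadata` and `AnnotatedTable`, and the particular metadata fields (unit, definition, semantic type), are hypothetical placeholders for whatever EML attribute information would actually be carried.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    """Hypothetical per-column metadata; the fields are placeholders."""
    unit: str = ""
    definition: str = ""
    semantic_type: str = ""


@dataclass
class AnnotatedTable:
    """Ordered, named columns plus per-column metadata."""
    columns: list = field(default_factory=list)   # ordered (name, values) pairs
    metadata: dict = field(default_factory=dict)  # column name -> ColumnMetadata

    def add_column(self, name, values, meta=None):
        self.columns.append((name, list(values)))
        self.metadata[name] = meta if meta is not None else ColumnMetadata()

    def by_index(self, index):
        """Return the values of the column at the given position."""
        return self.columns[index][1]

    def by_name(self, name):
        """Return the values of the column with the given name."""
        for column_name, values in self.columns:
            if column_name == name:
                return values
        raise KeyError(name)


table = AnnotatedTable()
table.add_column("depth", [1.5, 2.0], ColumnMetadata(unit="meter"))
table.add_column("count", [3, 5])
```

Keeping the columns in a list of pairs lets the structure answer both name and index lookups, while the metadata travels alongside the values without changing how the columns themselves are stored.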

Updated by ben leinfelder over 15 years ago

I've introduced an OrderedRecordToken to Ptolemy; it preserves the original order of the record fields.
Unfortunately, this order can be lost when you serialize the token as a string, e.g. {a={1,2,3}, b={4,5,6}}, and then deserialize it back into a RecordToken.
In the meantime, the 'wrp' module overrides the Ptolemy RecordToken to use a storage class that preserves the original order. I'm not sure how we can have both implementations without changing the serialization syntax for the ordered version. Plausible?
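
The round-trip issue can be sketched in Python as follows. This is not the Ptolemy parser: the record syntax is handled with a simple regular expression that only covers flat records of integer arrays, and the unordered reader is assumed, for illustration, to return fields sorted by name. All function names are hypothetical.

```python
import re

# Matches one "name={v1,v2,...}" field inside a flat record string.
FIELD_PATTERN = re.compile(r"(\w+)=\{([^{}]*)\}")


def serialize(columns):
    """Write ordered (name, values) pairs in a record-like string syntax."""
    fields = [name + "={" + ",".join(str(v) for v in values) + "}"
              for name, values in columns]
    return "{" + ", ".join(fields) + "}"


def parse_fields(text):
    """Read the fields back in the order they appear in the string."""
    return [(name, [int(v) for v in body.split(",")])
            for name, body in FIELD_PATTERN.findall(text)]


def deserialize_ordered(text):
    """Reader that keeps the order found in the string."""
    return parse_fields(text)


def deserialize_unordered(text):
    """Reader assumed to store fields by name, modeled as name-sorted."""
    return sorted(parse_fields(text), key=lambda item: item[0])


original = [("b", [4, 5, 6]), ("a", [1, 2, 3])]
text = serialize(original)
ordered_names = [name for name, _ in deserialize_ordered(text)]
unordered_names = [name for name, _ in deserialize_unordered(text)]
```

Both readers consume the same string, so nothing in the syntax itself says which behavior the writer intended. That is the ambiguity behind the question of whether the ordered version needs its own serialization syntax.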

Or just change RecordToken in Ptolemy to keep its order and give people a method to sort the fields if they want that behavior.
