Drupal RDFme Plugin

The RDF Import feature of RDFme makes it possible to populate Drupal with data extracted from local or remote RDF datasets. However, this feature is still in beta. The obvious difference between this functionality and the Node import module is the input data format. Node import is a mature and efficient module; on the other hand, RDF/XML combined with the RDFme mappings system offers a lot of new flexibility.

If configured properly, RDFme should be able to import even very large datasets quickly and painlessly. Nevertheless, some issues and simplifications remain. The following guide is a set of hints on how to prepare your mappings so that RDF imports run without major problems.

Limitations and specific datatypes settings:

  1. [Mappings] In the current version (1.4 and earlier) there is no separate set of mappings for export and import. Therefore, save your mappings to export_mappings.xml and then adjust them to fit the imported data.
  2. [Post Authors] During import, the users (i.e. the post authors) referenced in the imported data are ignored; by default, the author of all imported nodes/comments is set to Anonymous.
  3. [CCK fields] All CCK fields are handled normally, but non-CCK fields added by popular modules receive special treatment (e.g. the Comment module adds a “map_comment” field to every node). The handled special non-CCK fields are listed below (by module):
    • [Taxonomy] If the nodes reference Drupal taxonomies, the vocabulary and terms have to be created manually in Drupal beforehand.
    • [Comment] If there is a mapping for the comment id (cid), the importer will first try to update the existing comment; if the update fails (e.g. invalid id), the comment is saved to the database under a new id. (Warning: in RDFme 1.3 and earlier, comments are not saved at all when the update fails; please update to RDFme v1.4.)
    • [Workflow] As of RDFme v1.4, workflow states are imported correctly only if the ‘workflow’ module is present and the given workflow state name exactly matches the name stored in the database (workflow_states.state). Adjust the state names accordingly before importing.
    • [Voting API] If 0 < value < 1, the rating is treated as a percentage; otherwise, as points. The Voting API is fairly simple, so the integration should work without many problems. Be advised, however, that voting widgets use the API in different ways (e.g. Fivestar recognises only percentages, not points).
    • [OPAL] All OPAL fields are imported normally; however, their values will be recalculated anyway on the first visit to each node.
  4. [Date] Date is a special CCK field whose values must be in the proper format in order to be imported correctly.
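The Voting API rule above (values strictly between 0 and 1 are treated as percentages, everything else as points) can be sketched as follows. Note that classify_rating is a hypothetical helper name used for illustration only; it is not part of RDFme or the Voting API.

```python
def classify_rating(value: float) -> str:
    """Illustrative sketch of the rule described above:
    a value strictly between 0 and 1 is treated as a percentage,
    any other value as points."""
    if 0 < value < 1:
        return "percent"
    return "points"

# Example: a fractional rating is interpreted as a percentage,
# while whole-number scores are interpreted as points.
print(classify_rating(0.75))  # percent
print(classify_rating(5))     # points
```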

Importing large datasets:

There are a number of limitations and platform-specific settings that you should take care of before starting to work with large datasets.

[Php.ini] Execution time settings (more info):

  • max_execution_time
  • max_input_time

[Php.ini] Memory size settings (more info):

  • memory_limit
  • upload_max_filesize
  • post_max_size
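As a starting point, the directives above might be raised along these lines in php.ini. The values below are illustrative assumptions, not recommendations; tune them to the size of your datasets and the capacity of your server.

```ini
; Illustrative values only - adjust to your environment.
max_execution_time = 600      ; seconds a script may run
max_input_time = 600          ; seconds allowed for parsing request input
memory_limit = 512M           ; per-script memory ceiling
upload_max_filesize = 256M    ; largest single uploaded RDF file
post_max_size = 300M          ; should be at least upload_max_filesize
```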

Performance and statistics:

The performance of the import feature depends heavily on the characteristics of the data (primarily the size of the RDF/XML file, but also more specific metrics, e.g. how much text the ideas contain, how many comments there are, how many triples, how many attributes per idea, etc.).

Furthermore, the results depend very much on hardware capabilities. The table below is only meant to give a rough idea of how fast the plugin can handle different datasets.

Source             # Ideas   # Comments   # of Triples   Time (total)     Time per idea (avg / max / min)
ETSIT Ideas        16        6            1,237          24.17s           1.508s / 2.354s / 1.022s
Dell IdeaStorm     9,851     65,222       520,330        12h 37m 20.39s   3.194s / 35.837s / 1.947s
myStarbucks Ideas  10,949    21,870       194,086        6h 2m 23.022s    1.859s / 4.389s / 1.109s
Cisco I-Prize      826       7,728        133,413        8m 3.99s         0.341s / 0.787s / 0.239s
Acrobat Ideas      579       767          17,859         1m 1.362s        0.097s / 0.286s / 0.064s
*The above tests were run on a desktop computer with a 2 GHz Core 2 Duo and 2 GB RAM.