Add XML based Storage and query backend

Motivation and a bit of theory

Even though FW is not an OAIS-conformant archiving tool (OAIS, not to be confused with OASIS), we already have, should have, or will have several kinds of metadata, such as:
  • Descriptive metadata - these go into the web page's head section for improved SEO
    • title of the topic - <title> (in addition to the current reversed breadcrumb path)
    • author - <meta name="author"
    • abstract (not used in FW yet - would be nice for SEO; the abstract should go into <meta name="description")
    • tags, keywords - <meta name="keywords"
      btw, TagsPlugin should be in the core because of SEO - unfortunately the new TagsPlugin needs SQL, which is too bad
  • Administrative / Technical / Representational metadata, like:
    • encoding (not really needed once all topics are UTF-8)
    • information about the topic-content markup (e.g. legacy FW, DITA, TEI, OpenDocument, XHTML... - not needed now)
    • maybe in the future, a flag for hidden topics (like dotted files in the shell), for topics that should not be displayed in the WebIndex
    • ACL
    • creation date
    • our modification history metadata (like rlog)
    • maybe sometime in the future, data for GPG-signed topics (public keys, fingerprints, SHA1 checksums...)
    • etc...
  • Structural metadata: these are twofold
    • first, the part that is not about the "content structure" but about relations to other content, like:
      • view-template
      • topic-parent
      • list of attached files
      • list of referenced (linked) topics (known at save time - caching them means faster access and can allow extended functionality)
      • list of included topics
      • attached form name, and the attached form fields and their content
    • second, the part that is about the structure of the topic itself (which I'm not talking about here), like:
      • section definitions
      • annotations - in-content comments like in the XWiki (what can be hidden or no)
      • footnotes
      • cross-referenced sections (e.g. some extended version of ExplicitNumberingPlugin)
      • etc - anything what is about the content of the topic
To deal with the above metadata (remember, I'm not talking about the topic content), we need to consider two things:
  • how to exchange them
  • how to store them

Serialization

Serialization means converting any internal representation of data into a well-known and accepted format for the purpose of data exchange between different parts of a system or between different systems.

We can have several serialization plugins for different data-exchange needs, e.g. JSON for browsers and/or Mongo and several other currently popular applications; or the serialization can be done in XML with a well-defined schema. Creating serialization plugins is not very hard when the internal data structure has enough granularity.

But serialization is about data exchange, and the major work will still be done through FW. (That is perfectly OK, but sometimes it is much easier to read the files directly.)
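
To make the serialization idea concrete, here is a minimal sketch of what a JSON serializer might do with topic metadata. The hash layout and the names serialize_topic / deserialize_topic are assumptions for illustration only; they are not Foswiki's actual Meta structure or API. It uses only JSON::PP, which ships with Perl.

```perl
use strict;
use warnings;
use JSON::PP;    # core module since Perl 5.14

# Hypothetical in-memory representation of a topic's metadata.
# The key names are illustrative, not Foswiki's real Meta layout.
my %topic = (
    web    => 'System',
    topic  => 'TopicName',
    parent => 'System.SomeTopic',
    author => 'ProjectContributor',
    date   => 1233365367,
    fields => [
        { name => 'Summary', value => 'Short abstract of the topic' },
    ],
);

# Turn the data structure into JSON text.
# canonical() sorts hash keys so the output is stable between runs.
sub serialize_topic {
    my ($data) = @_;
    return JSON::PP->new->canonical->pretty->encode($data);
}

# The inverse: JSON text back to a Perl data structure.
sub deserialize_topic {
    my ($json) = @_;
    return JSON::PP->new->decode($json);
}

my $json = serialize_topic( \%topic );
print $json;

# Round trip: the nested form field survives unchanged.
my $copy = deserialize_topic($json);
print "round trip ok\n" if $copy->{fields}[0]{name} eq 'Summary';
```

The point of the sketch is only that once the internal structure is a plain nested hash, each exchange format is a thin layer on top of it.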

Storage level format

FW currently (and I really hope in the future too!) stores topics in filesystem files. The storage file format (remember, I'm not talking about macros and TML) is a self-invented format that was created before XML was even defined. So we have:

%META:TOPICINFO{author="ProjectContributor" date="1233365367" format="1.1" version="1"}%

and so on...

It's wonderful to see how FW invented, and still uses, all the things needed for a modern application. It is like an old grandfather: experienced and still working, but not very good-looking, and knowing nothing about common modern tools like GPS...
  • The current "homemade" storage file format isn't standardized in any way - FW could surely gain popularity by using a reasonably standardized file format. Don't underestimate this; in several meetings I have had dialogues like:
    Qst: How is the data stored?
    Ans: In plain text files. And you don't need to access them directly...
    Qst: Plain text? Structured somehow?
    Ans: Of course! In a very efficiently parseable format.
    Qst: You mean XML?
    Ans: ... no, in its own format.
    ... got the idea?
  • A standardized and extensible storage structure can make extending FW much faster and easier
  • dealing with XML from Perl is extremely easy (also at the compiled "XS" level, which can probably parse faster than our own FW topic parsing)
  • XML structures are easy to map into Perl hashes and JSON, and map easily to NoSQL DBs (like MongoDB)
  • converting current topics into XML is easy with a few-line script
Nowadays, XML is one of the most common formats for structured information. A topic could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<fosdoc>
<version>1.0</version>
<format>1.1</format>
<web>System</web>
<topic>TopicName</topic>
<parent>System.SomeTopic</parent>
<createdBy>ProjectContributor</createdBy>
<creationDate>1233365367</creationDate>
<lastModified>1233383000</lastModified>
<title>Topic title goes here</title>
<viewTemplate></viewTemplate>
<content>---+ Heading
   * body
   * with %MACROS / *TML* comes
   * here
</content>
</fosdoc>

Although we can continue inventing our own %META:SOMETHINGINTHEFUTURE, I'm perfectly sure that with XML we
  • will get more possible features
  • can easily read, write and create FW's files directly with a wide range of already existing XML tools, without needing the FW library or developing our own external parsers
  • will get a more extensible format for plugins (the best example is NatSkin - a plugin inventing its own parseable structures instead of simply extending one well-defined XML format)
  • easy implementation of some advanced metadata without the need to invent %META:SOMETHINGINTHEFUTURE
  • easy transformations (XSLT)
  • similar handling of meta elements and content, e.g. $topic->{field}->{name} and $topic->{content}
  • easy integration of external systems - it is enough to know how to read and write XML
  • extremely easy form data integration with external systems
  • less code to maintain (the whole XML read/write stuff is on CPAN)
Drawbacks:
  • as someone mentioned on IRC, the current format is more grep / sed friendly than XML
  • legacy and backward compatibility
So,
  • we may lose sed compatibility, but can gain much more...
  • we may lose backward compatibility (only at the file level), but the conversion tool from the current format into XML is probably a few-line script.
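
To give a feeling for the size of such a conversion script, here is a rough sketch using only core Perl. The element names follow the <fosdoc> example above; the sub names, the mapping of TOPICINFO attributes to elements, and the decision to handle only TOPICINFO and TOPICPARENT are my assumptions. Real topics also carry other META types (form fields, attachments) and may encode some characters inside attribute values, none of which this sketch handles.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Parse key="value" pairs from the inside of a %META:TYPE{...}% line.
sub parse_attrs {
    my ($attrs) = @_;
    my %h;
    $h{$1} = $2 while $attrs =~ /(\w+)="([^"]*)"/g;
    return \%h;
}

# Convert legacy topic text into the hypothetical <fosdoc> XML.
sub topic_to_xml {
    my ( $web, $topic, $text ) = @_;
    my ( %meta, @body );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^%META:(\w+)\{(.*)\}%$/ ) {
            $meta{$1} = parse_attrs($2);    # keeps only the last line per type
        }
        else {
            push @body, $line;              # everything else is topic content
        }
    }
    my $info  = $meta{TOPICINFO} || {};
    my @elems = (
        [ format       => $info->{format} ],
        [ web          => $web ],
        [ topic        => $topic ],
        [ parent       => $meta{TOPICPARENT}{name} ],
        [ createdBy    => $info->{author} ],
        [ creationDate => $info->{date} ],
    );
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<fosdoc>\n};
    for my $e (@elems) {
        my ( $name, $value ) = @$e;
        next unless defined $value;         # skip metadata the topic lacks
        $xml .= "<$name>" . xml_escape($value) . "</$name>\n";
    }
    $xml .= '<content>' . xml_escape( join "\n", @body ) . "</content>\n</fosdoc>\n";
    return $xml;
}

my $legacy = <<'END';
%META:TOPICINFO{author="ProjectContributor" date="1233365367" format="1.1" version="1"}%
%META:TOPICPARENT{name="System.SomeTopic"}%
---+ Heading
   * body with <b>markup</b> & TML
END
print topic_to_xml( 'System', 'TopicName', $legacy );
```

Note that the content has to be escaped (or wrapped in CDATA), because TML may contain < and & characters.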

XML schema definition

Of course, we can "invent" our own XML schema as above.

But, imo, it would be best to adopt one of the already standardized XML schemas. There are several. I recommend METS XML. METS is primarily an exchange format, but it is also usable (and in archival systems often used) as a data storage format.

Benefits as a storage format - the same as above for XML.

Developing something like a MetsExportPlugin (as a serialization format) while keeping the old storage file structure is probably only worthwhile when we want to address how to easily exchange topics between different FW installations. It could help inter-Foswiki communication enormously, and we would not need to invent a "self-made" exchange format.

Benefits of using METS as (one of) the serialization formats:
  • Easy full export/import of topics from/to other Foswikis. METS allows embedding any (base64-encoded) objects directly inside the METS file (e.g. attached images can easily be embedded into one XML file), which means easy exchange of the whole topic without "inventing" our own topic-exchange serialization format.
  • METS is XML
  • Other metadata standards can be inserted into METS - I recommend using Dublin Core for descriptive metadata like DC:CREATOR, DC:TITLE (or MODS)
  • no need to invent "attachments", because "structure linking" is integrated directly into METS
  • Marketing buzz (Foswiki knows METS)
Even though using METS as a serialization format can be helpful, the main point is still to use XML as the raw storage format.
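
As an illustration of the embedding idea, the following sketch wraps one topic's descriptive metadata (as Dublin Core) and one base64-encoded attachment into a minimal METS document. The sub name topic_to_mets, the argument names and the ID values are made up for the example, and the output has not been validated against the METS schema; a real exporter would need considerably more (administrative metadata, multiple files, the topic text itself).

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);    # core module

# Escape characters that are special in XML text and attribute values.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Build a minimal METS wrapper for one topic and one attachment.
# Expected arguments: title, creator, filename, mimetype, bytes.
sub topic_to_mets {
    my (%arg) = @_;
    my $title    = xml_escape( $arg{title} );
    my $creator  = xml_escape( $arg{creator} );
    my $filename = xml_escape( $arg{filename} );
    my $mimetype = xml_escape( $arg{mimetype} );
    my $payload  = encode_base64( $arg{bytes}, '' );    # '' = no line breaks

    return <<"END";
<?xml version="1.0" encoding="UTF-8"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dmdSec ID="DMD1">
    <mdWrap MDTYPE="DC">
      <xmlData>
        <dc:title>$title</dc:title>
        <dc:creator>$creator</dc:creator>
      </xmlData>
    </mdWrap>
  </dmdSec>
  <fileSec>
    <fileGrp>
      <file ID="FILE1" MIMETYPE="$mimetype">
        <FContent>
          <binData>$payload</binData>
        </FContent>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div LABEL="$filename" DMDID="DMD1">
      <fptr FILEID="FILE1"/>
    </div>
  </structMap>
</mets>
END
}

print topic_to_mets(
    title    => 'Topic title goes here',
    creator  => 'ProjectContributor',
    filename => 'logo.png',
    mimetype => 'image/png',
    bytes    => "not really a PNG, just sample bytes",
);
```

The structMap is what ties the embedded file to the topic, which is the "structure linking" mentioned in the list above.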

Impact:

  • Store/Store2
  • Plugins? (Are there plugins that deal directly with files and don't use Store::*?)
  • current FW installations - can be converted 1:1 with a "conversion script" without any problems.
Parts of Meta.pm and several other parts of the source could be replaced with a few lines of code, like:

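use strict;
use warnings;
use XML::Simple qw(XMLin);    # CPAN module

my $topicfile = shift @ARGV;  # path to a topic stored as <fosdoc> XML
my $topic = XMLin( $topicfile );

my $content = $topic->{content};
my $author  = $topic->{createdBy};    # element name from the <fosdoc> example
my $form    = $topic->{form};         # undef unless the file has a <form> element
# do_something() stands for whatever the caller does with the form name
do_something( $form->{name} ) if $form && exists $form->{name};
# and so on...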

PS: When it comes to the topic-content structure, there are many alternatives too.
  • The already mentioned DITA, which is interesting because it slices the document into parts (like sections); but for other needs there are other alternatives
  • TEI
  • NLM (PubMed)
  • or good old DocBook...
  • OpenDocument
  • etc.
But this is another question (and hard to decide) - all of the above is about a well-defined storage format.

-- JozefMojzis - 22 Dec 2011

We discussed this on IRC, and I appreciate anybody interested in Foswiki XML.

I'm involved with several Foswiki <-> XML information exchange efforts at work, and I think I'm getting two things tangled up: the problem of exchanging information for quite narrow, domain-specific applications (e.g. exchanging the structured data we have in formfields in a way that can be absorbed into some external database), versus what Jozef is talking about - using XML where the value is simply in standardised metadata (authorship), while the content mappings (or rather, the infrastructure/capability for content mappings) for exchanging content are left unsolved.

SvenDowideit has done tremendous work making the serialisation of Foswiki pluggable, with Foswiki::Serialise. RestPlugin is an example Foswiki::Serialise module that can translate to/from JSON.

But making serialisation pluggable is much easier than allowing the content to have different representations (ontology mappings, etc).

An XML-isisation effort should consider CmisPlugin and SupportDITA

-- PaulHarvey - 18 Dec 2011

After an initial failure on my part to understand where Jozef was coming from, I finally understood. Writing a new (de)serialiser - even in the current codebase - is well do-able, and I think is all that would be needed - though some tweaks might be required in the VC store, which might read TOPICINFO for fast-reading purposes. Maybe a day's work? The questions in my mind are: Who wants this? Why do they want it? Why haven't they done it themselves? (and the perennial Is anyone prepared to pay me (or anyone else) to do it? Gotta eat)

-- CrawfordCurrie - 19 Dec 2011

Once the Foswiki storage format is all XML, why should I stop there and not continue and introduce XQUERY instead of the YASSE (yet another selfmade search engine) we have now (warning rhetorical question / devil's advocate)? As you see: once you enter XML it soon becomes quite a game changer in which direction Foswiki could evolve. This is possible in theory, best demonstrated by http://www.marklogic.com.

-- MichaelDaum - 19 Dec 2011

I cleaned up and extended my writing a bit, in the hope of clearing up some misunderstandings caused by my limited ability to express myself...

-- JozefMojzis - 22 Dec 2011

I've made this into a feature request, as it fits well into the post-store2 world.

Personally, I'm more interested in using the JSON (de-)serialisation that I've already begun coding in my store2 branch on github, but one of the points of what I'm doing is to enable admins to choose what mix of storage backends, query engines and serialisations is appropriate for their needs.

There's lots of work to do - lots more unit tests to write to ensure that we break nothing, and to allow the existing TML-encoded text format to continue to work, as it is good for small, simple installations. A formalised integration speed benchmark would also be good - something that has a variety of web/topic/query densities and complexities so we can show users the pros & cons of the different implementations.

Note also that the text file format is only a small part of the store2 pluggability - database and NoSQL is another (given that Crawford, Paul and I are working on both SQL and MongoDB, and I will probably do a Hadoop query some time in 2012 - assuming that I have time).

-- SvenDowideit - 23 Dec 2011

Four years since last comment. No committed developer. Changing to a parked proposal.

-- Main.GeorgeClark - 13 Feb 2016 - 19:17
Topic revision: r9 - 13 Feb 2016, GeorgeClark