Feature Proposal: Refactor the Store to allow multiple pluggable backends

Motivation

To provide the infrastructure for DatabaseStore.

TWiki topics will continue to be distributed in Rcs store form - to get them into the Database, the two stores will need to be accessible at the same time.

Description and Documentation

In the current test implementation of DatabaseStore, I have two stores loaded simultaneously, and am able to use ManagingWebs to create a database store using an rcs store - though the UI will obviously need work to make that a useful thing.

MultiStoreRefactor means that when a web is created, the admin must choose which Store the web will reside in. (Initially, pub files will still be stored in the current rcs form.)

An interesting side effect is that instead of softlinking several TWikis' webs into a data directory, you may be able to specify where the data lives for each web.

Examples
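
A hypothetical configuration fragment (illustrative only - this syntax is not part of any committed spec) showing what a per-web store choice might look like:

# purely illustrative - no such setting exists yet
$TWiki::cfg{Store}{WebImpl} = {
    Main        => 'TWiki::Store::RcsWrap',
    Engineering => 'TWiki::Store::Database',
};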

Impact

Refactoring

Implementation

At the moment, I'm largely working on fixing unit tests and places in the core that were hard-coded to assume they can look at the file system to find out about topics.

-- Contributors: SvenDowideit - 11 Oct 2007

Discussion

This proposal seems to have jumped straight to UnderImplementation. Please try to keep the process so we can track and decide on the committed proposals.

For a proposal to go through the approval cycle it starts at UnderInvestigation. And it requires a CommittedDeveloper AND a DateOfCommitment to show up on the TWikiFeature04x02.

This proposal lacks a DateOfCommitment. It also lacks a spec to decide on. It is straight in line with the roadmap decided in Rome, so the motivation is right :-)

I could add today's date, but I think it is better to wait until Sven adds a spec to this topic. This is part of what will probably be the most important work in GeorgeTownRelease.

-- KennethLavrsen - 17 Dec 2007

Sven, great idea. Not sure it needs any core changes. You could simply plug in a kind of TwinStore which is a thin layer to delegate any operation to two independent stores. Being fault-tolerant is then the tricky part. One might even think of plugging a first full-featured store into the TwinStore, i.e. the RcsWrap store, that is used as a rock-solid foundation, and start implementing a second database store prototype, keeping it running along the way without too much of a problem, and implementing each store feature step by step. The TwinStore layer could then be made in a way to tolerate different capabilities. Just my 2 cents.
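
A minimal sketch of such a delegating layer (hypothetical package and method handling; the fault tolerance that makes this tricky is deliberately omitted):

package TWiki::Store::TwinStore;

sub new {
  my ($class, @backends) = @_;
  return bless { backends => [@backends] }, $class;
}

# delegate every store operation to all backends, trusting the
# first (rock-solid) backend for the return value
sub AUTOLOAD {
  my $this = shift;
  our $AUTOLOAD;
  (my $method = $AUTOLOAD) =~ s/.*:://;
  return if $method eq 'DESTROY';
  my ($first, @rest) = @{ $this->{backends} };
  my @result = $first->$method(@_);
  $_->$method(@_) for @rest;   # mirror the call; error handling omitted
  return wantarray ? @result : $result[0];
}

1;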

-- MichaelDaum - 17 Dec 2007

yeah, I started with a twin store, but ended up finding all it did was create a facade, like the one in the Plugins handling code (that Crawford's working to remove).

The code changes to make TWiki::Store handle multiple stores directly amount to considerably less code, and in essence remove that extra layer of performance hit and developer confusion.

(Unlike the user code, where the different mappings occupy the same namespace, in the store it's a single hash lookup, based on Web.)

-- SvenDowideit - 17 Dec 2007

I ran some long-running tests on FoswikiStandAloneFeature, and one of my conclusions is that filesystem access is one big bottleneck nowadays: it's responsible for a very large standard deviation in response times. This proposal and DatabaseStore are very important to improve overall TWiki performance.

I only disagree with the simultaneous use of two store mechanisms, unless the database were used for most access and RCS only for data safety.

-- GilmarSantosJr - 17 Dec 2007

Gilmar, right. The TwinStore layer, a pure facade, would be taken out when the DatabaseStore is mature enough. An interim TwinStore layer would only be of use during development. It would impose its own performance burden that we don't want to pay in the long run. Any multi store setup will be as slow as its slowest backend. So there's definitely no performance argument in here, only one about helping to develop additional store backends. Once mature, you'd switch back to a single store again.

The biggest advantage of a DatabaseStore will be:
  1. scalability,
  2. its implicit caching and indexing facilities and
  3. advanced querying in TWikiApplications, by bringing either SQL to TWiki, or even XQuery+XUpdate when using a native XML database store.

-- MichaelDaum - 17 Dec 2007

The reason for an actual MultiStore is more long term than that. Because of the way I use TWiki to integrate with many backend systems, I want to be able to connect other backends as though they were twiki webs. That way there could be a BugzillaStore, an SvnStore, a LegacyManagementSystemStore, and a TWikiStore (DB or whatever).

'Any multi store setup will be as slow as its slowest backend.' is something I am grappling with - I'm not quite sure why it's necessary for TWiki operations to access all the Webs on all transactions - but I have seen that it does.

It is likely that you will be surprised how little code is changed to make MultiStore a reality (basically a hash of Web->storeClassName).
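
Roughly the shape of that dispatch (a sketch only; the names are illustrative, not the actual TWiki::Store code):

# hypothetical per-web mapping, e.g. populated from configure
my %storeClassOfWeb = (
  Main    => 'TWiki::Store::RcsWrap',
  Sandbox => 'TWiki::Store::RcsLite',
  Bugs    => 'TWiki::Store::Database',
);

sub _storeClassFor {
  my ($web) = @_;
  return $storeClassOfWeb{$web} || 'TWiki::Store::' . $TWiki::cfg{StoreImpl};
}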

The hard work is replacing the code that assumes that it can just look directly at the file system, and fixing the Unit tests that do the same.

-- SvenDowideit - 17 Dec 2007

''It is likely that you will be surprised how little code is changed to make MultiStore a reality (basically a hash of Web->storeClassName).''

How did you handle backend errors that might lead to stores not being in sync anymore?

-- MichaelDaum - 18 Dec 2007

Aaaaahhh, no. This is not for having the same topics in more than one store at a time. This is to allow some webs to be in rcs form, some in the database, some in XML, etc.

-- SvenDowideit - 18 Dec 2007

If I got your point, Sven, I think this will be a little complex. The hardest problem is to map TWiki semantics (view, edit, save, forms, etc.) onto these other stores (like Bugzilla or SVN). But it would be great to have this possibility!!

-- GilmarSantosJr - 19 Dec 2007

Okay got it. Sorry, Sven, my fault that I misunderstood your proposal.

-- MichaelDaum - 19 Dec 2007

FYI

There are currently three layers of store abstraction, two of which are firm and one of which is widely abused. These are:
  1. TWiki::Meta objects (what I call the "TOM layer"). Meta objects should support all of the methods necessary to manipulate the contents and history of a topic in an abstract way. The current TWiki::Meta implementation depends on TWiki::Store. This is the interface that is widely abused, as calls that should go to the meta object go instead to:
  2. TWiki::Store, which is the "traditional" facade for the TWiki store engine. It is used widely in other core modules. I have been working quietly but steadily to hide TWiki::Store behind TWiki::Meta for some time now, but because of the legacy APIs it's a long, slow process.
  3. TWiki::Store::RcsFile is a relatively simple RCS-style API that is used by TWiki::Store. It is meant to be hidden entirely within TWiki::Store, but different implementations of it are possible e.g. TWiki::Store::RcsWrap, TWiki::Store::RcsLite and TWiki::Store::Subversive (an inactive subversion layer).
With Sven's help, over time I have been:
  1. pushing TWiki::Store into the role of a simple facade, by pushing implementation dependencies (such as searching) down into TWiki::Store::RcsFile, and pulling abstractions into TWiki::Meta.
  2. promoting TWiki::Meta to the role of a TOM, by deprecating the use of $topic, $web etc in function parameters.
Some time ago I added the ability for different webs to have different store implementations. This was a five minute hack that usefully demonstrated that we are already at the point of a viable TwinStore as described above. The hack couldn't work long term because there are still severe weaknesses in the abstraction (to do with moving, renaming and deleting topics and webs).
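
To illustrate the abuse described above (call shapes simplified; the real signatures carry more arguments):

# intended: manipulate the topic through the TOM layer
my $meta = new TWiki::Meta($session, $web, $topic);
$meta->putKeyed('FIELD', { name => 'Priority', value => 'High' });

# widely abused: bypass TWiki::Meta and poke the store facade directly
my ($m, $text) = $session->{store}->readTopic($user, $web, $topic, undef);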

-- CrawfordCurrie - 21 Dec 2007

There is still no commitment date on this so I am not sure how to process it.

From a customer advocate point of view this is more a code refactoring than something affecting the end user - until someone actually implements something that uses this new refactoring. And I see no possible harm in this proposal.

So it seems to me that the key core developers simply need to agree on this one and announce when they have reached a consensus.

The best I can do now is to add Michael and Crawford to the concern field. Then simply remove yourself if you have no concern.

-- KennethLavrsen - 25 Dec 2007

god, if you're just missing a date...

-- SvenDowideit - 26 Dec 2007

I thought it could be a signal that you were not done with the proposal description. If it is clear that people forgot, I usually just add the date. ;-)

-- KennethLavrsen - 26 Dec 2007

Good move here. I suggest we KISS on the audit trail, that is, keep ACLs and other meta data in topics but cache meta data in a database for fast retrieval.

-- PeterThoeny - 02 Jan 2008

Peter, that's not the point here. Sven wants to configure which store to use per web. And only one at a time. TwinStore is something different, meaning to piggyback one store on top of another, actually mirroring it, like DBCacheContrib already does, but built into the core.

I have no concerns about MultiStore on that level.

-- MichaelDaum - 02 Jan 2008

One of the roadmap subjects was the scalability of TWiki and working on a spec that enables us to place information such as ACLs, form data etc in a standard indexed storage format.

The reason for this is obviously performance. As a TWiki grows we have to parse through more and more flat text files.

On the other hand the flat text file format including the meta data gives the audit trail which would be much harder to get in a database unless it is designed for it.

So we discussed in Rome that instead of arguing religiously about flatfiles versus database storage, the answer may be to do both: having form fields and settings, incl. ACLs, stored in meta with the topic for the audit trail, and having the data for the current version in an additional indexed store. And also allowing the current version of the topic itself to be in a database if required.
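
A sketch of how such a dual write could look (hypothetical wiring - TWiki has no such index table, and REPLACE INTO is MySQL-flavoured SQL):

sub saveTopic {
  my ($this, $user, $web, $topic, $text, $meta) = @_;

  # authoritative copy: meta data embedded in the topic file,
  # preserving the full audit trail
  $this->{rcsStore}->save($user, $web, $topic, $text, $meta);

  # denormalised copy of the *current* revision for fast, indexed queries
  for my $field ($meta->find('FIELD')) {
    $this->{dbh}->do(
      'REPLACE INTO topic_index (web, topic, name, value) VALUES (?,?,?,?)',
      undef, $web, $topic, $field->{name}, $field->{value}
    );
  }
}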

So how does this relate to Sven's proposal? In principle not at all. But it does indirectly: both touch the design of how data is stored.

I think it will be important that a change for multiple storage backends is specified and thought through in the context of the roadmap goal because, no matter what, one influences the other.

I think we can all agree that we do not want to implement something now that makes it harder later to make the changes for the generic storage design.

I would really prefer if the core developers would have completed and agreed on the initial work on the principle design for the road map part of the storage concept before this proposal gets implemented.

I think that once we have agreed on the TWiki 5.0 storage concepts it will be easy to define the spec for this proposal, and maybe even get some synergy. I have added my name in the concern field in this context. I am not at all against Sven's proposal from a feature point of view.

-- KennethLavrsen - 02 Jan 2008

Thanks Kenneth for highlighting the indirect relationship of my comment. I am with you, I'd like to see a clearly defined spec of the storage backend in Codev topics before the actual implementation.

-- PeterThoeny - 03 Jan 2008

Yes, Kenneth, I agree. We know TWiki won't scale without a real database backend. But that really is not the scope of this proposal. Maybe this is just YAPWAMT (yet another proposal with a misleading topic name).

Refactoring the store is a pending issue (...been following the development of dbxml for quite some time now).

As far as I understand Sven's proposal, his changes are quite trivial. Maybe best would be to see some code first before accepting or rejecting this proposal.

-- MichaelDaum - 03 Jan 2008

This proposal is indeed relatively trivial in scope - it is a refactoring to remove the remaining file assumptions from the non-store code, and to finish the changes for having more than one store backend active at a time (important for moving webs from the distributed rcs files into a database).

It has little to do with the DatabaseStore feature, except that it is a pre-requisite for that work.

-- SvenDowideit - 03 Jan 2008

It is this pre-requisite factor I am concerned about.

If this is a pre-requisite, then it is a pre-requisite to something we have not defined.

So why don't we start defining it at a very high level?

-- KennethLavrsen - 04 Jan 2008

We have. That's what the DatabaseStore topic exists for. I do not, however, want to distract myself from the work needed to release 4.2 at this point.

-- SvenDowideit - 06 Jan 2008

Note that I didn't raise any concern over this, just provided a point of information, so I'm removing my name from the ConcernRaisedBy field. -- CrawfordCurrie - 26 May 2008

I would still like to see this proposal specified a little further for ONE reason only.

The proposal itself is quite OK. But since this is the heart of the core code and since we all want to see more contributors, it is essential that major rewrites of code are planned just a little bit to
  • Give others a chance to participate
  • Give others a chance to understand what is happening before, during and after it is implemented.

Code refactorings that do not change any specs or create any compatibility issues do not need a community decision. If this is a pure refactoring then go ahead. But please consider my input to add 10-20 lines of documentation at the top of this topic before anyone starts coding.

I have changed proposal to Accepted.

-- KennethLavrsen - 27 May 2008

What is the status of this?

-- RafaelAlvarez - 04 Aug 2008

I have to port the work to trunk, and last time I looked, trunk had unit test failures in the SEARCH code - but in the next week or so I'll be able to plan out some work in this area.

-- SvenDowideit - 05 Aug 2008

Necessary precursor: Tasks.Item4795

-- CrawfordCurrie - 09 Mar 2009

With the large number of changes that have happened to foswiki (FSA) and the trunk's total rewrite of the Store API, this work has been left for later. Improving SEARCH's API is happening first.

-- SvenDowideit - 06 Mar 2010

To what extent is this proposal related to, or superseded by, SimplifyTheStoreMetaSemantics (aka store2)?

-- MichaelDaum - 31 May 2012

SimplifyTheStoreMetaSemantics is the API detail; this is the broader functional intention. That said, during development of this and other store related services, it's become clear to me that a store backend per web is not enough.

The Store 2.0 will/does cater for caching, both local in-memory and dbcache/mongodb style caches, and for importing, tracking and syncing external datasets.

It also caters for the simpler setup described initially in this proposal - but it's much more capable than that:

#somewhere in configure...
my @stores = qw(memcache fastquery sharedhoststore slowbackupstore remotedata);
....
#somewhere during a request for a resource
foreach my $store (@stores) {
  my $item = $store->get($address);
  return $item if $item;
}

i.e., stupid fast unless the user asks a poorly implemented remote store for something it hasn't got yet.

and I mean poorly implemented, as the slower stores should return a client side lazyload reference, but I won't presume good code.

-- SvenDowideit - 31 May 2012

I expected something like that. That's not to say that I don't have a lot more questions, just not the time to properly express them. When VDBI is complete I'll throw a question bomb at you smile

But I'll ask one now:

#somewhere in configure...
my @stores = qw(memcache fastquery sharedhoststore slowbackupstore remotedata);
....
#somewhere during a query

EITHER
foreach my $store (@stores) {
  $store->{result_set} = $store->query($query);
}
return mergeResultSets(\@stores);

OR
#somewhere during a query
foreach my $store (@stores) {
  my @result_set = $store->query($query);
  return @result_set if @result_set;
}

Or something completely different?

-- JulianLevens - 31 May 2012

yup, that's the hard one. It's very likely that the first iteration will be the terrible mergeResultSets style, and then we'll need to add some kind of 'completeness' short-circuit - i.e., if the result set is stored in the memcache, then we're done, or if the fastquery store claims a sufficiently recent complete cache of all the subsequent stores, then we can stop there.
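
In code, that short-circuit might look something like this (a sketch; is_complete is a hypothetical capability flag):

my @result_sets;
foreach my $store (@stores) {
  my $results = $store->query($query);
  push @result_sets, $results;
  # stop early if this store claims a sufficiently recent,
  # complete view of everything the remaining stores hold
  last if $results->is_complete;
}
return mergeResultSets(@result_sets);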

on the other hand, if during coding a better solution appears, we'll run with that :-)

-- SvenDowideit - 31 May 2012

The above loops are problematic for several reasons.

First, they don't lend themselves to parallelization and fault robustness. Not that perl is particularly good at that, but the problem is that when one $store->query() takes particularly long (say the slowbackupstore needs to instruct the robot to insert the tapes again), then all potential speed of a cache backend goes over the fence. I can't see how you are going to prevent this with code like the above.

Better would be an event based approach.

Second, the stores aren't independent from each other, at least not the ones you listed in the example. Some stores don't need to be queried live as they are only there for backup reasons (slowbackupstore). Some stores serve as a fast caching frontend to the others (if stuff is found in a memcached store, why ask the remotedata store again, let alone merge redundant results?).

In the fragments of a store2 concept outlined so far, all stores are treated equal. That's not adequate, is it? It does not leverage the potential benefits each one has got. Those example "stores" are all of very different types and characteristics. They can't be treated the same.

Third, different content in foswiki has differing requirements on the store it lives in. For example:

  • databases for relational metadata,
  • linked data stored in a semantic network,
  • a filebased store for static data served quickly by the http server,
  • office documents accessible via webdav or cmis,
  • videos stored in a streaming server,

... just to name a few that I can think of now.

Fourth, stores all have quite different query capabilities. Asking a nosql database to do heavy relational operations or semantic inferences is not adequate as it is not the right tool for the purpose of the query.

What I am missing is a classification of the role a store component plays in the overall system that

  • maps types of content to the store that best suits it,
  • defines capabilities and
  • configures the role of a store component when putting, querying and fetching items from the complete system (lookup the cache, fall back to remotedata, make a backup in some shadow store) - see the sketch below.
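
One way to express such a classification might be a configuration along these lines (purely illustrative - none of these settings or store classes exist):

$TWiki::cfg{Store}{Components} = {
  'TWiki::Store::Memcached' => { role => 'cache',   query => 'none' },
  'TWiki::Store::Database'  => { role => 'primary', query => 'sql'  },
  'TWiki::Store::RcsWrap'   => { role => 'shadow',  query => 'grep' },
};
$TWiki::cfg{Store}{ContentMap} = {
  topics      => 'TWiki::Store::Database',
  attachments => 'TWiki::Store::RcsWrap',
  video       => 'TWiki::Store::StreamingServer',
};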

-- MichaelDaum - 01 Jun 2012

Event driven - exactly - that's why I wrote:

''as the slower stores should return a client side lazyload reference, but I won't presume good code''

The important reason why I'm doing it this way is that the store itself is the best place to ask for the mapping from address to store - the general API can never be that clever without crippling itself.

and thus, you've basically re-iterated why the store2 api is heading where it is.

''all stores are treated equally'' - interesting observation - yes, and it's up to the store implementation to tell the core to look elsewhere.

the heuristics you talk about are the responsibility of the store implementation.

-- SvenDowideit - 01 Jun 2012

This 5-year-old proposal should never have been accepted. It lacks a clear concept. There isn't actually any concept at all, at most some vague fragments and ideas that something called multi-store would be a cool thing to have. There isn't a coherent concept published on foswiki.org as part of this proposal. There is no material that could be reviewed by others making sure that this all could actually work out. Instead some implementation seems to have started in some git branch.

This situation is so severe and affects such an important part of the foswiki engine that I have to rewind this proposal completely and add my concerns until I see some form of concept work as part of this proposal.

Yes, I know it is a rare situation that a proposal is rejected again after having been accepted a long time ago.

We have seen other proposals being rejected for far lesser reasons and affecting far less important parts of the system.

-- MichaelDaum - 04 Jun 2012

I have similar concerns to Michael, albeit tempered by the thought that for an idea of a high-level nature it is very hard to create a full-fledged coherent proposal upfront.

I have recently been reading about start-ups and the successful ones often pivot, sometimes more than once. That is to say, they change direction, technically and/or commercially. The same sort of thing applies here. My non-core (hence no proposal) VDBI design has certainly pivoted a few times for the better. If I had to follow a strict proposal process I would arguably have needed to raise the proposal a number of times.

Conversely, the initial proposal cannot be nebulous.

I think the initial proposal was sufficiently clear at a high level. However the Foswiki world has changed (pivoted) over the last 5 years as has this proposal (or at least if Sven wrote it again it would not be the same).

One significant change is the widespread adoption of GitHub which might allow a new proposal state of AcceptedExperiment or somesuch. Once the developer is happy with the final design, then the proposal would need rework to reflect it. This would then be open to final acceptance or rejection. It's common practice for devs to share code in their various git repositories and therefore cross check ideas.

From my very specific VDBI point of view I am frustrated that I must cut and paste code from RcsLite/Wrap rather than re-use them (or maybe I can - albeit in a non-approved way).

It's not clear to me at the moment how an attachment store (aka vault) would fit into this scheme; indeed I feel, as does Michael, that the requirements are quite distinct between a topic store and a vault. Specifically, I only need the vault to physically store the attachment. The store still needs to handle all the meta about the attachments (fast querying et al, including multiple versions of meta, one for each attachment store). It therefore feels very natural for a store to talk directly to a vault.

That's not to say it could not work via a generic store manager directing the vault to store the attachment and then asking the store to keep the meta data. Although even here there is a sense of abstracting out the two types of store, with each type letting the store manager know what it can do.

I also find myself thinking about the config of multiple stores:

Separating out stores and vaults:
@stores = qw(RcsLiteStore RcsLiteVault)

@stores = qw(VdbiStore RcsLiteVault)

Would instead need to be:

@stores = qw(RcsLiteStore)

@stores = qw(VdbiStore RcsLiteStore)

The problem with the last one is how to add extra options to configure RcsLiteStore to only handle attachments (also meaning significant chunks of its code would be loaded unnecessarily). The first two are pretty obvious and only load the code actually required.
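
A possible way out (illustrative only - the 'only' option below is invented) would be to allow options hashes instead of bare class names:

@stores = (
  { impl => 'VdbiStore' },
  { impl => 'RcsLiteStore', only => 'attachments' },
);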

There is general value in splitting up vaults and stores because of their distinct responsibilities, which has nothing to do with whether we venture down the store manager idea or not. Overall I find myself coming back to the idea that separating out into vaults would be a good idea. A Vault can still be a store from the point of view of this proposal; it would just be a different class of store.

-- JulianLevens - 04 Jun 2012

 