We currently have two IsisDb classes in Python: one to access isis/wxis databases, and another to access malete databases. We'd like a single IsisDb class that lets us work at a higher (more abstract) level, regardless of the underlying database.
This table shows a sort of “equivalence” between methods in both IsisDb classes. Parameters are indicated in brackets.
wxis (xispy.py) | malete (malete.py) | preferred |
---|---|---|
mfnrange/doList[from, to, count] | read[mfn] | read |
search[query, from, to, count, totals] | query[expr, recs] (from, to, count?) | ? |
keyrange/index[from, to, count] | terms[from, to] (count?) | terms |
update/write | write[rec] (+index[idxrec]) | write |
update/edit | read[mfn] + [lock?] | ? |
update/delete | write [empty rec] + [lock?] | delete |
fullinv/invert | index for each record? | fullindex, invert, fullinv ? |
status | write (long write tests existence and writeability of db) | status |
unlock | ? | ? |
create | ? | create |
extract | ? (the same method used to obtain idxrec) | ? |
For an example of how to implement backend abstraction, see __init__.py in the django.db package, and explore the django.db folder.
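A minimal sketch of that kind of abstraction, assuming one adapter class per backend behind a shared interface. All names here (BaseBackend, InMemoryBackend) are hypothetical, and the in-memory adapter is only a toy stand-in for real wxis or malete adapters:

```python
class BaseBackend(object):
    """Interface every backend adapter is expected to implement (hypothetical)."""

    def read(self, mfn):
        raise NotImplementedError

    def write(self, mfn, fields):
        raise NotImplementedError


class InMemoryBackend(BaseBackend):
    """Toy adapter used only to illustrate the idea; not a real backend."""

    def __init__(self):
        self._records = {}  # mfn -> list of (tag, value) fields

    def read(self, mfn):
        # Return None when the record does not exist.
        return self._records.get(mfn)

    def write(self, mfn, fields):
        self._records[mfn] = list(fields)


class IsisDb(object):
    """Front-end class: every call is delegated to the configured adapter."""

    def __init__(self, backend):
        self._backend = backend

    def read(self, mfn):
        return self._backend.read(mfn)

    def write(self, mfn, fields):
        return self._backend.write(mfn, fields)
```

With this layout the `if backend == ...` tests disappear from the individual methods; adding a backend means writing one more adapter class.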
Idea for importing one of a number of options:
    if option == 'A':
        import A as db
    elif option == 'B':
        import B as db
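The same idea can be written without a growing if/elif chain by looking up the module name in a table and importing it by name. This is only a sketch: the mapping and the function name load_backend are assumptions, and xispy and malete are simply the module names used in the table above.

```python
import importlib

# Hypothetical mapping from a backend option to the module implementing it.
BACKEND_MODULES = {'wxis': 'xispy', 'malete': 'malete'}


def load_backend(option, modules=BACKEND_MODULES):
    """Import and return the module registered for the given option."""
    try:
        module_name = modules[option]
    except KeyError:
        raise ValueError("Unknown database backend: %r" % (option,))
    return importlib.import_module(module_name)
```

Usage would then look like `db = load_backend(DATABASE_BACKEND)`, with the rest of the code unaware of which module was loaded.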
How does concurrent access work in Malete? That is, what's the mechanism to avoid conflicting updates?
See:
Something to read about database abstraction layers:
See also:
Malete offers some extra functionality, and a common interface would have to sacrifice some or all of this extra stuff… or not?
Where are the main differences?
For some of those Malete features not readily available in wxis, we could attempt some emulation (e.g. filters).
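One candidate for such emulation is reading an arbitrary list of MFNs: Malete accepts any sequence of MFNs, whereas wxis only handles ranges of adjacent MFNs. A possible approach, sketched here under assumptions, is to split the list into ranges of adjacent MFNs and call wxis once per range. The function name adjacent_intervals is hypothetical, and sorting the MFNs first is an assumption (the caller's original order is lost).

```python
def adjacent_intervals(mfns):
    """Group MFNs into (start, to) pairs covering runs of consecutive numbers."""
    intervals = []
    for mfn in sorted(set(mfns)):
        if intervals and mfn == intervals[-1][1] + 1:
            # Extend the current run by one MFN.
            intervals[-1] = (intervals[-1][0], mfn)
        else:
            # Start a new run.
            intervals.append((mfn, mfn))
    return intervals
```

For example, the MFNs 1, 2, 3, 7, 9, 10 would need three wxis calls: one for 1 to 3, one for 7, and one for 9 to 10.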
What will these methods return? As in xispy, the methods should return JSON/dictionary objects. xispy already does this; malete.py must be modified to do the same. I see no reason to keep, e.g., the TAB as a separator in a response from Malete, but I'm not completely sure yet.
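As a rough sketch of what that change could look like for index terms, the function below turns TAB-separated lines into the same kind of dictionaries xispy returns. It assumes each line has the form term, TAB, postings (as noted elsewhere on this page) and that postings is an integer; the name parse_terms is hypothetical. The optional count argument shows how a "count" limit could be emulated on the client side.

```python
from itertools import islice


def parse_terms(lines, count=None):
    """Turn 'term<TAB>postings' lines into a list of dictionaries.

    If count is given, stop after that many terms.
    """
    result = []
    for line in islice(lines, count):
        term, postings = line.rstrip('\r\n').split('\t', 1)
        # Assumption: the postings value is an integer count.
        result.append({'key': term, 'postings': int(postings)})
    return result
```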
Using the common API, code would be written like this:
    # ** CONFIG (uncomment the desired option)
    DATABASE_BACKEND = 'wxis'
    #DATABASE_BACKEND = 'malete'
    # ** END CONFIG

    # create a database object
    db = IsisDb('/path/to/db')

    # create a raw query
    q = 'water and plants'

    # get the results
    res = db.search(query=q)  # list of records or list of mfns? total number (Isis_Total)?

    # process the results
    if res:
        print("Found: %d results" % len(res))  # FIXME: len(res) may be less than total results
        for r in sorted(res):  # sorted by what? To sort the result set, we need the *whole* set
            pass  # what is r? an IsisRec? Just a dictionary?
    else:
        print("No records found.")
    def read(self, **params):
        """
        params (keyword arguments): start, to, count, list

        If list is present, the other params are ignored.

        Difference: Malete accepts an arbitrary sequence of mfns; wxis only
        adjacent mfns. So if mfns are not adjacent, wxis must be called
        several times, once for each interval of adjacent mfns.
        We have two options here: modify list.xis so that it also accepts
        arbitrary lists, or keep list.xis as is and call it repeatedly
        (less efficient!).
        """
        if backend == 'malete':
            return self.read(mfn)
        elif backend == 'wxis':
            return self.do_list(start, to, count)

    def search(self, query, **params):  # params: start, to, count, total
        """
        TODO: Check compatibility of query syntax.
        """
        if backend == 'malete':
            return self.query(expr=query, recs=)  # returns mfns or records
        elif backend == 'wxis':
            return self.search(query=, start=, to=, count=)  # returns mfns or records

    def terms(self, **params):  # params: start, to, count
        """
        Malete does not accept a "count" parameter. But we can simulate it,
        requesting terms until the count limit is reached.
        """
        if backend == 'malete':
            return self.terms(start=start, to=to)
            # returns [] or [t1*TAB*p1 [, t2*TAB*p2 [, ...]] => modify the terms() method
        elif backend == 'wxis':
            return self.index(start=start, to=to, count=)
            # returns [] or [{'key':t1, 'postings': p1} [, {'key':t2, 'postings': p2} [, ...]]

    def write(self, rec):  # or a single param 'rec' (which is an IsisRec instance, and knows its mfn)
        """
        """
        if backend == 'malete':
            return self.write(rec)
        elif backend == 'wxis':
            mfn = rec.mfn
            content = rec.???
            return self.write(mfn=mfn, content=content, lockid=??)

    def delete(self, rec):
        """
        Malete does not have special support for deleting records;
        writing empty records has the same effect.
        TODO: define a delete() method in malete.py? (Or perhaps malete.py
        must follow Malete's protocol very closely, without adding this
        kind of convenience function?)
        """
        if backend == 'malete':
            empty_rec = rec.mfn + EMPTY REC
            return self.write(empty_rec)
        elif backend == 'wxis':
            mfn = rec.mfn
            return self.delete(mfn=mfn, lockid=??)
What are the simplest and most “natural” data structures for Isis data in Python? That is, just for storing the data, not for adding extra functionality (which requires custom objects). By “Isis data” I mean data sent from the database server (lists of records, single records, lists of index terms, status codes, error messages), or data sent to the database server (record content, parameters). The database server can be either Wxis or Malete.
A list of index terms may be stored as a tuple of 2-tuples, e.g.
    >>> terms = (('A', 10), ('AB', 3), ('AG', 4))
    >>> for term, count in terms:
    ...     print('%3d -- %s' % (count, term))
    ...
     10 -- A
      3 -- AB
      4 -- AG
A tuple or list of dictionaries is also possible, but such verbosity seems unnecessary — unless we want to store extra pieces of information on each term, such as detailed posting information:
    >>> terms = ({"term": "A", "count": 10}, {"term": "AB", "count": 3}, {"term": "AG", "count": 4})
    >>> for t in terms:
    ...     print("%3d -- %s" % (t['count'], t['term']))
    ...
     10 -- A
      3 -- AB
      4 -- AG
A record is basically a list of fields (including an optional leader), plus a record id or MFN. Depending on the context, a record may be considered mutable or immutable, so the list of fields may be respectively stored as a list or a tuple. Each field is a 2-tuple of the form (tag, value). Some alternatives are:
    >>> r = [(0, 'Leader'), (10, 'An author'), (20, 'Some title'), (30, 'Pub. date')]
    >>> r = ('002345', [(0, 'Leader'), (10, 'An author'), (20, 'Some title'), (30, 'Pub. date')])
    >>> r = {"record_id": "002345", "fields": [(0, 'Leader'), (10, 'An author'), (20, 'Some title'), (30, 'Pub. date')]}
    >>> r = {"record_id": "002345", "leader": "foobar", "fields": [(10, 'An author'), (20, 'Some title'), (30, 'Pub. date')]}
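One practical argument for keeping the fields as a list of (tag, value) pairs rather than a dictionary keyed by tag is that ISIS fields can be repeated within a record, and a list preserves both the repetitions and their order. A small helper for reading such a list might look like this (the name field_values is hypothetical):

```python
def field_values(fields, tag):
    """Return every value stored under the given tag, in record order."""
    return [value for field_tag, value in fields if field_tag == tag]
```

For a record with two occurrences of tag 10, `field_values(fields, 10)` returns both values; for a tag that is absent it returns an empty list.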
A list of records can be stored as a tuple, or as a list in case we need to modify it (e.g. change its order):
    >>> results = (rec_1, rec_2, ...., rec_n)
    >>> results = [rec_1, rec_2, ...., rec_n]
Another useful object is a query, which would be more abstract, since it encapsulates both the query expression used to retrieve a set of records, and the list of retrieved records themselves. (See Django's QuerySet class.)
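A very rough sketch of such a query object, assuming it is given the query expression plus some callable that performs the actual search. The class name Query and the search_func argument are hypothetical; the lazy, cached evaluation is only loosely inspired by how QuerySet behaves and is not a copy of Django's API.

```python
class Query(object):
    """Holds a query expression and, once evaluated, the retrieved records."""

    def __init__(self, expr, search_func):
        self.expr = expr
        self._search = search_func  # callable taking the expression
        self._results = None        # filled in on first use

    def _fetch(self):
        # Hit the database only once; later calls reuse the cached list.
        if self._results is None:
            self._results = list(self._search(self.expr))
        return self._results

    def __iter__(self):
        return iter(self._fetch())

    def __len__(self):
        return len(self._fetch())
```

Something like `Query('water and plants', db.search)` would then carry both the expression and its results, and nothing is sent to the server until the results are actually needed.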
See http://search.cpan.org/~dpavlin/Biblio-Isis-0.24/lib/Biblio/Isis.pm
Biblio::Isis - Read CDS/ISIS, WinISIS and IsisMarc database

This module will read ISIS databases created by DOS CDS/ISIS, WinIsis or IsisMarc. It can be used as perl-only alternative to OpenIsis module which seems to deprecate its old XS bindings for perl.

It can create hash values from data in ISIS database (using to_hash), ASCII dump (using to_ascii) or just hash with field names and packed values (like ^asomething^belse).

Unique feature of this module is ability to include_deleted records. It will also skip zero sized fields (OpenIsis has a bug in XS bindings, so fields which are zero sized will be filled with random junk from memory).

It also has support for identifiers (only if ISIS database is created by IsisMarc), see to_hash.

This module will always be slower than OpenIsis module which use C library. However, since it's written in perl, it's platform independent (so you don't need C compiler), and can be easily modified. I hope that it creates data structures which are easier to use than ones created by OpenIsis, so reduced time in other parts of the code should compensate for slower performance of this module (speed of reading ISIS database is rarely an issue).