python_isisdb

Towards a common interface to WXIS and Malete

We currently have two IsisDb classes in Python: one to access isis/wxis databases, and another to access Malete databases. We'd like a single IsisDb class that lets us work at a higher (more abstract) level, regardless of the underlying database.

IsisDb methods

This table shows a sort of “equivalence” between methods in both IsisDb classes. Parameters are indicated in brackets.

wxis (xispy.py)                        | malete (malete.py)                                        | preferred
mfnrange/doList[from, to, count]       | read[mfn]                                                 | read
search[query, from, to, count, totals] | query[expr, recs] (from, to, count?)                      | ?
keyrange/index[from, to, count]        | terms[from, to] (count?)                                  | terms
update/write                           | write[rec] (+index[idxrec])                               | write
update/edit                            | read[mfn] + [lock?]                                       | ?
update/delete                          | write [empty rec] + [lock?]                               | delete
fullinv/invert                         | index for each record?                                    | fullindex, invert, fullinv ?
status                                 | write (long write tests existence and writeability of db) | status
unlock                                 | ?                                                         | ?
create                                 | ?                                                         | create
extract                                | ? (the same method used to obtain idxrec)                 | ?

Technique

For an example of how to implement backend abstraction, see __init__.py in package django.db, and explore the django.db folder.

Idea for importing one of a number of options:

if option == 'A':
    import A as db
elif option == 'B':
    import B as db
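
The same idea can be written without an if/elif chain by looking the module name up in a table and importing it with importlib. The sketch below is only an illustration: the option names and the mapping are assumptions, and the standard-library modules json and pickle stand in for the real backend modules (xispy, malete) so that the example runs anywhere.

```python
import importlib

# Hypothetical mapping from a backend option to the module implementing it.
# Stand-ins: 'json' and 'pickle' take the place of xispy and malete here.
BACKEND_MODULES = {
    'A': 'json',
    'B': 'pickle',
}

def load_backend(option):
    """Import and return the module registered for `option`."""
    try:
        module_name = BACKEND_MODULES[option]
    except KeyError:
        raise ValueError('unknown backend option: %r' % (option,))
    return importlib.import_module(module_name)

db = load_backend('A')
print(db.__name__)   # prints json
```

Adding a third backend then only means adding one entry to the table, and a misspelled option fails immediately with a clear error instead of leaving db undefined.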

Locking

How does concurrent access work in Malete? That is, what's the mechanism to avoid conflicting updates?

See:

Database abstraction layers

Python examples (for relational databases)

  • Storm, an object-relational mapper (ORM) for Python developed at Canonical.

See also:

  • Core Python Programming: Chapter 21, Database Programming

Miscellaneous notes

Malete offers some extra functionality, and a common interface would have to sacrifice some or all of this extra stuff… or not?

Where are the main differences?

  • Query language
  • Indexing mechanism: FST replacement, char tables
  • Index terms (length limit): probably not a problem

For some of those Malete features not readily available in wxis, we could attempt some emulation (e.g. filters).

What will these methods return? They should return JSON-like dictionary objects. xispy already does this; malete.py must be modified to do the same. I think there's no reason to keep, e.g., the TAB as a separator in a response from Malete, but I'm not completely sure yet.
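
As an illustration of what that change to malete.py might involve, here is a minimal sketch that turns TAB-separated term lines into the list-of-dictionaries shape that the wxis index call is described as returning further down this page. The function name is hypothetical, and the line layout (term, TAB, postings count) is an assumption taken from the notes on this page, not from the Malete protocol specification.

```python
def parse_term_lines(lines):
    """Convert 'term<TAB>postings' strings into a list of dictionaries."""
    result = []
    for line in lines:
        # Assumed layout: the term first, then a TAB, then the postings count.
        key, _, postings = line.partition('\t')
        result.append({'key': key, 'postings': int(postings)})
    return result

raw = ['A\t10', 'AB\t3', 'AG\t4']   # made-up sample response
print(parse_term_lines(raw))
```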

Using the common API, code would be written like this:

    # ** CONFIG (uncomment the desired option)
    DATABASE_BACKEND = 'wxis'
    #DATABASE_BACKEND = 'malete'
    # ** END CONFIG
 
    # create a database object
    db = IsisDb('/path/to/db')
    # create a raw query
    q = 'water and plants'
    # get the results
    res = db.search(query=q)  # list of records or list of mfns? total number (Isis_Total)?
    # process the results
    if res:
        print("Found: %d results" % len(res))  # FIXME: len(res) may be less than total results
        for r in sorted(res):  # sorted by what? To sort the result set, we need the *whole* set
            # what is r? an IsisRec? Just a dictionary?
            print(r)
    else:
        print("No records found.")
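
The methods of the common class could dispatch to the backend along these lines (self.conn is a hypothetical attribute holding the backend-specific xispy or malete object):

def read(self, **params):
    """
    params (keyword arguments): start, to, count, list
    If list is present, the other params are ignored.
    Difference: Malete accepts an arbitrary sequence of mfns; wxis only accepts ranges of adjacent mfns.
    So if the mfns are not adjacent, wxis must be called several times, once for each interval of adjacent mfns.
    We have two options here: modify list.xis so that it also accepts arbitrary lists, or keep list.xis as is and
    call it repeatedly (less efficient!).
    """
    # Delegate to self.conn; calling self.read() here would recurse forever.
    if backend == 'malete':
        return self.conn.read(params['list'])
    elif backend == 'wxis':
        return self.conn.do_list(params.get('start'), params.get('to'), params.get('count'))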
 
def search(self, query, **params):   # params: start, to, count, total
    """
    TODO: Check compatibility of query syntax.
    """
    # self.conn: the backend-specific object (hypothetical attribute)
    if backend == 'malete':
        return self.conn.query(expr=query)   # TODO: recs=?  Returns mfns or records
    elif backend == 'wxis':
        return self.conn.search(query=query, start=params.get('start'), to=params.get('to'),
                                count=params.get('count'))   # returns mfns or records
 
def terms(self, **params):   # params: start, to, count
    """
    Malete does not accept a "count" parameter, but we can simulate it by requesting terms until the count limit is reached.
    """
    start, to, count = params.get('start'), params.get('to'), params.get('count')
    # self.conn: the backend-specific object (hypothetical attribute)
    if backend == 'malete':
        # returns [] or [t1*TAB*p1 [, t2*TAB*p2 [, ...]]]  => modify the terms() method
        return self.conn.terms(start=start, to=to)
    elif backend == 'wxis':
        # returns [] or [{'key': t1, 'postings': p1} [, {'key': t2, 'postings': p2} [, ...]]]
        return self.conn.index(start=start, to=to, count=count)
 
 
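def write(self, rec):  # a single param 'rec' (an IsisRec instance, which knows its mfn)
    """
    Write a record to the database.
    """
    # self.conn: the backend-specific object (hypothetical attribute)
    if backend == 'malete':
        return self.conn.write(rec)
    elif backend == 'wxis':
        mfn = rec.mfn
        content = None   # TODO: rec.??? (how the content is obtained from rec is still undecided)
        return self.conn.write(mfn=mfn, content=content, lockid=None)   # TODO: lockid=??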
 
def delete(self, rec):
    """
    Malete has no special support for deleting records; writing an empty record has the same effect.
    TODO: define a delete() method in malete.py? (Or should malete.py follow Malete's protocol very closely, without adding
    this kind of convenience function?)
    """
    # self.conn: the backend-specific object (hypothetical attribute)
    if backend == 'malete':
        empty_rec = make_empty_rec(rec.mfn)   # hypothetical helper: a record with this mfn and no fields
        return self.conn.write(empty_rec)
    elif backend == 'wxis':
        mfn = rec.mfn
        return self.conn.delete(mfn=mfn, lockid=None)   # TODO: lockid=??
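
The read() docstring above mentions calling wxis once per interval of adjacent MFNs. A minimal sketch of how such intervals might be computed from an arbitrary list of MFNs follows. The function name is hypothetical, and how each (first, last) pair maps onto the from/to/count parameters of list.xis is left open.

```python
def adjacent_ranges(mfns):
    """Group MFNs into (first, last) pairs of consecutive numbers."""
    ranges = []
    for mfn in sorted(set(mfns)):
        if ranges and mfn == ranges[-1][1] + 1:
            # Extends the current interval by one.
            ranges[-1] = (ranges[-1][0], mfn)
        else:
            # Starts a new interval.
            ranges.append((mfn, mfn))
    return ranges

print(adjacent_ranges([7, 1, 2, 3, 9, 8, 20]))
# prints [(1, 3), (7, 9), (20, 20)]
```

Note that this sorts and de-duplicates the input, so the caller's original order is lost; if records must come back in the requested order, they would have to be reordered after the wxis calls.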

Python data structures for Isis data

What are the simplest and most “natural” data structures for Isis data in Python? That is, structures just for storing the data, not for adding extra functionality (which requires custom objects). By “Isis data” I mean data sent from the database server (lists of records, single records, lists of index terms, status codes, error messages) or data sent to the database server (record content, parameters). The database server can be either wxis or Malete.

A list of index terms may be stored as a tuple of 2-tuples, e.g.

>>> terms = (('A', 10), ('AB', 3), ('AG', 4))
>>> for term, count in terms:
...   print('%3d -- %s' % (count, term))
... 
 10 -- A
  3 -- AB
  4 -- AG

A tuple or list of dictionaries is also possible, but such verbosity seems unnecessary — unless we want to store extra pieces of information on each term, such as detailed posting information:

>>> terms = ({"term": "A", "count": 10}, {"term": "AB", "count": 3}, {"term": "AG", "count": 4})
>>> for t in terms:
...   print("%3d -- %s" % (t['count'], t['term']))
... 
 10 -- A
  3 -- AB
  4 -- AG

A record is basically a list of fields (including an optional leader), plus a record id or MFN. Depending on the context, a record may be considered mutable or immutable, so the list of fields may be stored as a list or as a tuple, respectively. Each field is a 2-tuple of the form (tag, value). Some alternatives are:

>>> r = [(0, 'Leader'), (10, 'An author'), (20, 'Some title'), (30, 'Pub. date')]

>>> r = ('002345', [(0, 'Leader'), (10, 'An author'), (20, 'Some title'), (30, 'Pub. date')])

>>> r = {"record_id": "002345", "fields": [(0, 'Leader'), (10, 'An author'), (20, 'Some title'), (30, 'Pub. date')]}

>>> r = {"record_id": "002345", "leader": "foobar", "fields": [(10, 'An author'), (20, 'Some title'), (30, 'Pub. date')]}
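
With fields stored as (tag, value) pairs, looking up a field means scanning the list, and a tag may occur more than once because Isis fields are repeatable. A small helper along these lines could hide that detail. The function name is hypothetical, and the record layout is the last alternative shown above.

```python
def field_values(record, tag):
    """Return every value stored under `tag`, in record order."""
    return [value for (t, value) in record['fields'] if t == tag]

r = {"record_id": "002345", "leader": "foobar",
     "fields": [(10, 'An author'), (10, 'Another author'), (20, 'Some title')]}

print(field_values(r, 10))   # prints ['An author', 'Another author']
print(field_values(r, 99))   # prints []
```

Returning a list even for a single occurrence keeps the calling code the same for repeatable and non-repeatable fields.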

A list of records can be stored as a tuple, or as a list in case we need to modify it (e.g. change its order):

>>> results = (rec_1, rec_2, ...., rec_n)
>>> results = [rec_1, rec_2, ...., rec_n]

Another useful object is a query, which would be more abstract, since it encapsulates both the query expression used to retrieve a set of records, and the list of retrieved records themselves. (See Django's QuerySet class.)
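
A rough sketch of such an object is given below. Everything in it is an assumption made for illustration: the class name, the expectation that the backend's search() returns a dictionary with 'total' and 'records' keys, and the FakeDb stand-in used to make the example runnable. The point is only that the expression, the lazily fetched records and the total count travel together.

```python
class IsisQuery(object):
    """Holds a query expression and fetches its results on first use."""

    def __init__(self, db, expr):
        self.db = db
        self.expr = expr
        self._result = None          # filled in by the first access

    def _fetch(self):
        if self._result is None:
            self._result = self.db.search(query=self.expr)
        return self._result

    @property
    def total(self):
        # Number of matching records, which may exceed the number fetched.
        return self._fetch()['total']

    def __iter__(self):
        return iter(self._fetch()['records'])


class FakeDb(object):
    """Stand-in backend returning canned data (not a real IsisDb)."""
    calls = 0

    def search(self, query):
        self.calls += 1
        return {'total': 25,
                'records': [{'record_id': '000001'}, {'record_id': '000002'}]}


q = IsisQuery(FakeDb(), 'water and plants')
print(q.total)
print([rec['record_id'] for rec in q])
```

Because the result is cached after the first access, asking for the total and then iterating over the records costs a single call to the backend.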

Perl and CDS/ISIS

See http://search.cpan.org/~dpavlin/Biblio-Isis-0.24/lib/Biblio/Isis.pm

Biblio::Isis - Read CDS/ISIS, WinISIS and IsisMarc database
  
This module will read ISIS databases created by DOS CDS/ISIS, WinIsis or IsisMarc. It can be used as a perl-only
alternative to the OpenIsis module, which seems to deprecate its old XS bindings for perl.

It can create hash values from data in ISIS database (using to_hash), ASCII dump (using to_ascii) or just hash 
with field names and packed values (like ^asomething^belse).

Unique feature of this module is ability to include_deleted records. It will also skip zero sized fields (OpenIsis 
has a bug in XS bindings, so fields which are zero sized will be filled with random junk from memory).

It also has support for identifiers (only if ISIS database is created by IsisMarc), see to_hash.

This module will always be slower than OpenIsis module which use C library. However, since it's written in perl, 
it's platform independent (so you don't need C compiler), and can be easily modified. I hope that it creates data 
structures which are easier to use than ones created by OpenIsis, so reduced time in other parts of the code 
should compensate for slower performance of this module (speed of reading ISIS database is rarely an issue).
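
For comparison on the Python side, the packed subfield notation mentioned above (^asomething^belse) can be unpacked with a few lines of code. This is a minimal sketch with a hypothetical function name. It assumes single-character subfield codes introduced by '^', stores any text before the first '^' under the key '_', and keeps only the last value if a code is repeated.

```python
def unpack_subfields(value):
    """Split a packed field like '^asomething^belse' into a dictionary."""
    head, _, rest = value.partition('^')
    subfields = {}
    if head:
        subfields['_'] = head        # text before the first subfield marker
    if rest:
        for chunk in rest.split('^'):
            if chunk:                # skip empty chunks produced by '^^'
                subfields[chunk[0]] = chunk[1:]
    return subfields

print(unpack_subfields('^asomething^belse'))
```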
python_isisdb.txt · Last modified: 29/08/2009 00:00 (external edit)