I've stumbled across a real irritation with the way the Google App Engine data model works. One among many, frankly, but I'll restrict myself to just this one for now.

The issue stems from the fact that db.run_in_transaction(func) insists that func is a function with no side effects, since it may be run repeatedly in an attempt to get the transaction to go through (if optimistic locking fails). Fair enough, but that means it has to freshly fetch any model objects that it wants to modify; otherwise it would have side effects on objects outside its scope. But consider this situation, in which we have an increment() function on our model object that must use a transaction because it also modifies other related objects at the same time and requires atomic behaviour:

from google.appengine.ext import db

class Person(db.Model):
  count = db.IntegerProperty(default=0, required=True)

  def increment(self):
    def tx():
      # Mess with some other related objects in the data store.
      # <omitted for brevity>
      # Must fetch a separate copy of self to avoid side effects.
      person = db.get(self.key())
      person.count += 1
      person.put()
    db.run_in_transaction(tx)

The problem here is that self hasn't actually been modified at all and is now out of date with respect to the data store (where the count is one bigger, assuming the transaction succeeded). This is a pain for the caller, who has a Person object, calls increment() on it, and naturally expects the object's count to be one higher. But their object is untouched – only the data store has changed, via the freshly fetched person. In case it's not obvious, we can't simply change the code above to use self instead of fetching the new person object, since db.run_in_transaction(tx) may run our tx() function multiple times until it completes without an optimistic locking failure. If it did run multiple times, self's count would increment by one for each failed attempt, so the final successful attempt could end up adding more than one to the count. And if the transaction eventually failed outright, self's count would still have been modified even though the data store had not been touched.
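
To make this concrete, here's a minimal sketch of the tempting but broken version – illustrative only, since it commits exactly the sin just described:

def increment(self):
  def tx():
    # WRONG: this mutates self on every attempt. A retry after an
    # optimistic locking failure bumps the in-memory count again, and
    # an outright failure leaves self changed while the store is not.
    self.count += 1
    self.put()
  db.run_in_transaction(tx)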

So the only solutions I can see are:
  • Put code after the run_in_transaction() call that synchronises self with the data store. There isn't a sync() or refresh() method on Model objects, so you have to do this painstakingly: get another fresh person with db.get(self.key()) and copy across just the fields you know might have changed (sketched just after this list).
  • Insist that the caller is aware that certain methods on the model objects won't modify the object itself, so they need to get a fresh one. This completely wrecks the idea of an object model and encapsulation though. You might as well just have a purely functional interface to the data store.
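
To illustrate the first option: a minimal sketch of that manual synchronisation, assuming (as in the example above) that count is the only field this transaction changes:

def increment(self):
  def tx():
    # Mess with some other related objects in the data store.
    person = db.get(self.key())
    person.count += 1
    person.put()
  db.run_in_transaction(tx)
  # No sync()/refresh() exists, so re-fetch and copy across the one
  # field we know the transaction changed.
  fresh = db.get(self.key())
  self.count = fresh.count
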
It all seems like madness to me, and it defeats the point of trying to have a neat, simple data storage object model. As usual, I can only hope that I've missed some crucial point and that in fact the problem is easily and elegantly solved. I shall look out for that solution, unless some kind reader can enlighten me!

4 Comments

  1. Use a piece of this to define sync() for yourself, and you should be good to go. Just make sure to do sync() after the transaction finishes (using a closure, or something):
    class MyModel(db.Model):
      foo = db.IntegerProperty(default=0)

      def add_one(self):
        python_scoping_is_annoying = [None]
        def txn():
          new = python_scoping_is_annoying[0] = db.get(self.key())
          new.foo += 1
          new.put()
        db.run_in_transaction(txn)
        new = python_scoping_is_annoying[0]
        # Copy all new values over to this instance.
        for k in new.properties().keys() + new.dynamic_properties():
          setattr(self, k, getattr(new, k))

    a = MyModel()
    a.put()
    print a.key(), a.foo
    a.add_one()
    print a.key(), a.foo
    a.add_one()
    print a.key(), a.foo
    Output:
    agVzaGVsbHIPCxIHTXlNb2RlbBjKqwUM 0
    agVzaGVsbHIPCxIHTXlNb2RlbBjKqwUM 1
    agVzaGVsbHIPCxIHTXlNb2RlbBjKqwUM 2

  2. Thanks for the neat property sync code – it should definitely be useful to make the general approach more robust. It’s still a long way from pretty though 🙂

  3. A shorter but less generic way to do this is just to add a one-line property sync after the put() call:
    def tx():
      # Mess with some other related objects in data store.
      # Must fetch a separate copy of self to avoid side effects.
      person = db.get(self.key())
      person.count += 1
      person.put()
      self.count = person.count

  4. Pete – whether that works depends on how Google has implemented the transactionality. If it stacks up the datastore operations and then attempts to commit after the whole of tx() has run, but the commit fails, you'll already have modified self, and that modification will live on even though the datastore wasn't modified. This is precisely why tx() isn't supposed to have side effects.
    Now I'm not quite sure how Google has implemented transactionality. If in fact it fails on the execution of put(), if it's going to fail at all, then your approach is quite reasonable.
