Merge into trunk : schema-sharding : Code : Storm

Status:	Merged
Merged at revision:	469
Proposed branch:	lp:~free.ekanayaka/storm/schema-sharding
Merge into:	lp:storm
Prerequisite:	lp:~free.ekanayaka/storm/schema-advance
Diff against target:	0 lines
To merge this branch:	bzr merge lp:~free.ekanayaka/storm/schema-sharding
Related bugs:	Link a bug report

Reviewer	Review Type	Date Requested	Status
Björn Tillenius (community)		2014-11-18	Approve on 2014-12-18
Review via email: mp+242133@code.launchpad.net

lp:~free.ekanayaka/storm/schema-sharding updated on 2014-11-18

471. By Free Ekanayaka on 2014-11-18: Add docstrings

Revision history for this message

Björn Tillenius (bjornt) wrote on 2014-12-10:

#

Before reviewing the code itself, I'd like to have a discussion the approach it self.

Looking at the code, it looks like everytime I want to patch a single store, I would have to add an empty patch file for all the other stores? That seems cumbersome.

I'd like to have a single patch file that can patch all the stores. That will make it much easier to track what's going on when you have multiple-store patches. With different files it's hard to get an overview and you still have the problem that you have to know the order the stores get patched in.

For example, let's say that you have two stores, A and B. You want to patch both A and B, and migrate data using data from both A and B. If you select from tables you patched in A and B, you have to consider whether the table have been patched or not. It'd be much easier if you had everything in a single file, e.g.:

  def patch(stores):
     patch_something(stores["a"])
     migrate_data(stores["a"], stores["b"])
     patch_something_else(stores["b])

What would be the downsides of having a single file for all the stores?

review: Needs Information

Revision history for this message

Free Ekanayaka (free.ekanayaka) wrote on 2014-12-11:

#

> Before reviewing the code itself, I'd like to have a discussion the approach
> it self.
>
> Looking at the code, it looks like everytime I want to patch a single store, I
> would have to add an empty patch file for all the other stores? That seems
> cumbersome.
>
> I'd like to have a single patch file that can patch all the stores. That will
> make it much easier to track what's going on when you have multiple-store
> patches. With different files it's hard to get an overview and you still have
> the problem that you have to know the order the stores get patched in.
>
> For example, let's say that you have two stores, A and B. You want to patch
> both A and B, and migrate data using data from both A and B. If you select
> from tables you patched in A and B, you have to consider whether the table
> have been patched or not. It'd be much easier if you had everything in a
> single file, e.g.:
>
> def patch(stores):
> patch_something(stores["a"])
> migrate_data(stores["a"], stores["b"])
> patch_something_else(stores["b])
>
> What would be the downsides of having a single file for all the stores?

The downside I see is that you might some stores where you want to apply the same schema changes (think account-1 and account-2) in Landscape.

In that case you have to manually loop over those stores (whose number might be dynamic). For example let's say you have "b-1" and "b-2" instead of "b":

   def patch(stores):
      patch_something(stores["a"])
      b_stores = [store for name, store in stores.items() if name.startswith("b-")]
      for b_store in b_stores:
          migrate_data(stores["a"], b_store)
          patch_something_else(b_store)

At least for Landscape the use case of cross-store data migration is much more infrequent than the use case of an isolated schema change against a certain store or set of stores.

Also, migrating from the current "vertical" patch system to the one you propose is more involved than migrating to the system proposed in the branch.

Regarding downsides of the system proposed in the branch, yes, you have to know the order, but it doesn't seem too create big headaches in real life. As for code living in different files, depending on how you see it the other flip of the coin is that could be also be a plus, because you know more easily what happens in a certain store (without the "noise" of changes in other stores), but I think either way the differences are minor.

> Before reviewing the code itself, I'd like to have a discussion the approach
> it self.
> 
> Looking at the code, it looks like everytime I want to patch a single store, I
> would have to add an empty patch file for all the other stores? That seems
> cumbersome.
> 
> I'd like to have a single patch file that can patch all the stores. That will
> make it much easier to track what's going on when you have multiple-store
> patches. With different files it's hard to get an overview and you still have
> the problem that you have to know the order the stores get patched in.
> 
> For example, let's say that you have two stores, A and B. You want to patch
> both A and B, and migrate data using data from both A and B. If you select
> from tables you patched in A and B, you have to consider whether the table
> have been patched or not. It'd be much easier if you had everything in a
> single file, e.g.:
> 
>   def patch(stores):
>      patch_something(stores["a"])
>      migrate_data(stores["a"], stores["b"])
>      patch_something_else(stores["b])
> 
> What would be the downsides of having a single file for all the stores?

The downside I see is that you might some stores where you want to apply the same schema changes (think account-1 and account-2) in Landscape.

In that case you have to manually loop over those stores (whose number might be dynamic). For example let's say you have "b-1" and "b-2" instead of "b":

def patch(stores):
      patch_something(stores["a"])
      b_stores = [store for name, store in stores.items() if name.startswith("b-")]
      for b_store in b_stores:
          migrate_data(stores["a"], b_store)
          patch_something_else(b_store)

At least for Landscape the use case of cross-store data migration is much more infrequent than the use case of an isolated schema change against a certain store or set of stores.

Also, migrating from the current "vertical" patch system to the one you propose is more involved than migrating to the system proposed in the branch.

Regarding downsides of the system proposed in the branch, yes, you have to know the order, but it doesn't seem too create big headaches in real life. As for code living in different files, depending on how you see it the other flip of the coin is that could be also be a plus, because you know more easily what happens in a certain store (without the "noise" of changes in other stores), but I think either way the differences are minor.

Revision history for this message

Free Ekanayaka (free.ekanayaka) wrote on 2014-12-11:

#

> > Before reviewing the code itself, I'd like to have a discussion the approach
> > it self.
> >
> > Looking at the code, it looks like everytime I want to patch a single store,
> I
> > would have to add an empty patch file for all the other stores? That seems
> > cumbersome.
> >
> > I'd like to have a single patch file that can patch all the stores. That
> will
> > make it much easier to track what's going on when you have multiple-store
> > patches. With different files it's hard to get an overview and you still
> have
> > the problem that you have to know the order the stores get patched in.
> >
> > For example, let's say that you have two stores, A and B. You want to patch
> > both A and B, and migrate data using data from both A and B. If you select
> > from tables you patched in A and B, you have to consider whether the table
> > have been patched or not. It'd be much easier if you had everything in a
> > single file, e.g.:
> >
> > def patch(stores):
> > patch_something(stores["a"])
> > migrate_data(stores["a"], stores["b"])
> > patch_something_else(stores["b])
> >
> > What would be the downsides of having a single file for all the stores?
>
> The downside I see is that you might some stores where you want to apply the
> same schema changes (think account-1 and account-2) in Landscape.
>
> In that case you have to manually loop over those stores (whose number might
> be dynamic). For example let's say you have "b-1" and "b-2" instead of "b":
>
> def patch(stores):
> patch_something(stores["a"])
> b_stores = [store for name, store in stores.items() if
> name.startswith("b-")]
> for b_store in b_stores:
> migrate_data(stores["a"], b_store)
> patch_something_else(b_store)
>
> At least for Landscape the use case of cross-store data migration is much more
> infrequent than the use case of an isolated schema change against a certain
> store or set of stores.
>
> Also, migrating from the current "vertical" patch system to the one you
> propose is more involved than migrating to the system proposed in the branch.
>
> Regarding downsides of the system proposed in the branch, yes, you have to
> know the order, but it doesn't seem too create big headaches in real life. As
> for code living in different files, depending on how you see it the other flip
> of the coin is that could be also be a plus, because you know more easily what
> happens in a certain store (without the "noise" of changes in other stores),
> but I think either way the differences are minor.

Another downside of having a single patch file/function is that when there's a failure you don't immediately know which store failed, so it can be a tad harder to debug the problem.

> > Before reviewing the code itself, I'd like to have a discussion the approach
> > it self.
> >
> > Looking at the code, it looks like everytime I want to patch a single store,
> I
> > would have to add an empty patch file for all the other stores? That seems
> > cumbersome.
> >
> > I'd like to have a single patch file that can patch all the stores. That
> will
> > make it much easier to track what's going on when you have multiple-store
> > patches. With different files it's hard to get an overview and you still
> have
> > the problem that you have to know the order the stores get patched in.
> >
> > For example, let's say that you have two stores, A and B. You want to patch
> > both A and B, and migrate data using data from both A and B. If you select
> > from tables you patched in A and B, you have to consider whether the table
> > have been patched or not. It'd be much easier if you had everything in a
> > single file, e.g.:
> >
> >   def patch(stores):
> >      patch_something(stores["a"])
> >      migrate_data(stores["a"], stores["b"])
> >      patch_something_else(stores["b])
> >
> > What would be the downsides of having a single file for all the stores?
> 
> The downside I see is that you might some stores where you want to apply the
> same schema changes (think account-1 and account-2) in Landscape.
> 
> In that case you have to manually loop over those stores (whose number might
> be dynamic). For example let's say you have "b-1" and "b-2" instead of "b":
> 
>    def patch(stores):
>       patch_something(stores["a"])
>       b_stores = [store for name, store in stores.items() if
> name.startswith("b-")]
>       for b_store in b_stores:
>           migrate_data(stores["a"], b_store)
>           patch_something_else(b_store)
> 
> At least for Landscape the use case of cross-store data migration is much more
> infrequent than the use case of an isolated schema change against a certain
> store or set of stores.
> 
> Also, migrating from the current "vertical" patch system to the one you
> propose is more involved than migrating to the system proposed in the branch.
> 
> Regarding downsides of the system proposed in the branch, yes, you have to
> know the order, but it doesn't seem too create big headaches in real life. As
> for code living in different files, depending on how you see it the other flip
> of the coin is that could be also be a plus, because you know more easily what
> happens in a certain store (without the "noise" of changes in other stores),
> but I think either way the differences are minor.

Another downside of having a single patch file/function is that when there's a failure you don't immediately know which store failed, so it can be a tad harder to debug the problem.

lp:~free.ekanayaka/storm/schema-sharding updated on 2014-12-12

472. By Free Ekanayaka on 2014-12-12: Merge from schema-advance
473. By Free Ekanayaka on 2014-12-12: Merge from schema-advance

Revision history for this message

Free Ekanayaka (free.ekanayaka) wrote on 2014-12-12:

#

I've changed the code to support a directory structure like this one:

myschema/
    __init__.py
    patch_1/
        __init__.py
        a.py
    patch_2/
        __init__.py
        b.py
    patch_3/
        __init__.py
        a.py
        b.py

This solves the issue of needing empty patch files (if you don't put a file for a certain store/schema in the patch directory, it will be automatically considered an empty patch for that store/schema), and makes it a bit easier all patch files for a certain patch level.

It still assumes a fixed stores order when applying patches, but I think the additional complexity of per-patch ordering isn't worth the effort.

The only branch that actually needed to be changed is the other one, schema-advance, this one is mostly untouched.

Revision history for this message

Björn Tillenius (bjornt) wrote on 2014-12-18:

#

Looks good now. I added a comment with a suggestion of a slightly alternative approach, but I still think this branch can be landed as it is.

review: Approve

Revision history for this message

Free Ekanayaka (free.ekanayaka) on 2014-12-18:

#

lp:~free.ekanayaka/storm/schema-sharding updated on 2014-12-28

474. By Free Ekanayaka on 2014-12-18: Address review comments
475. By Free Ekanayaka on 2014-12-28: Improve documentation

Storm

Merge lp:~free.ekanayaka/storm/schema-sharding into lp:storm

Commit message

Description of the change

Preview Diff

Subscribers