Gregari is meant to show an enterprise application that has SaaS (Software As A Service) capabilities. In creating Gregari’s database strategy I actually created several databases, i.e.
- Metadata – System configuration (metadata specific to provider functionality or global metadata which would be common to all tenents) as well as tenent specific metadata.. resides within this database. Generally speaking nothing in here is confidential and if it is those database columns can be encrypted (i.e. user passwords if stored in a single database rather than LDAP would be a good example).
- Audit – Any auditing information also goes to a single database. Only the system is creating these objects/rows, i.e. user sessions, security authentication/authorization attempts, data access at the row attempts, debug statements from logs, Web site requests, Web Service requests, scheduler job audits, etc. While there’s information about the system and the tenants/users who use it… this is more about audit information, i.e. system resource, tenant, user, timestamps, success/failure, etc.
- Instance – Any tenant specific information that is specific to the SAAS application is stored here. An example might be … if the SaaS application in question is a CRM application… then stored in the Instance databases would be any leads, opportunities, quotes, orders, etc. The Metadata database might have the tenant has subscribed to the CRM application’s Marketing, Sales and Support modules as well as what employees have subscriptions but that should be the extent of what’s stored in the Metadata application.
- 3rd Party Products – Components like Quartz and JBOSS Jbpm both can leverage a database. These database objects are stored separately since they’re separate projects/products.
- Reporting – Any reports or data feeds pull their data from this database. A single ETL job pulls data from the metadata/instance/auditing databases as needed. This way the metadata/instance/audit/etc databases can be optimized for OLTP operations and the reporting database can be optimized for … reports/bulk queries.
Since Gregari is using EJB3 Entities via JBoss Hibernate’s Entity Manager… this produces an interesting issue, i.e. EJB3 Entities are associated with a “persistence context”. The persistence context is associated with one and only database (if I understand the specification correctly). From a packaging perspective there can only be one persistence context per JAR file that contains an EJB-JAR.XML and PERSISTENCE.XML file. As such any EJB3 Entities need to be packaged in the correct JAR file… the JAR file that’s associated with the corresponding database/persistence context.
The other major concept to consider is foreign key relationships… Normally in a database designed for SaaS… you have the tenant table associated with each applicable object/table via a foreign key. But if the tenant object/table is in the Metadata database and the CRM opportunity object/table is in the Instance database… there’s an issue of foreign keys. Sure there might be some database vendor specific way to handle it. But Gregari attempts to do this in a database vendor neutral manner by creating some “hidden” tables to mimic the tenent table in each of the other databases. As an example:
- Metadata database has the Tenant table. All tables within the Metadata database reference (where necessary) this Tenant table via foreign key relationships.
- Audit database has a AuditTenant table. All tables within the Audit database reference (where necessary) this AuditTenant table via foreign key relationships.
- Instance database has a InstanceTenant table. All tables within the Instance database reference (where necessary) this InstanceTenant table via foreign key relationships.
Whenever a Tenant object/row is created a corresponding AuditTenant as well as InstanceTenant object/row are created and inserted (via the Service layer generating/observing events) and the reverse is done if the Tenant is deleted from the Metadata database. The AuditTenant and InstanceTenant objects/tables are pretty dumb, i.e. primary key column, tenantId, created by/on, updated by/on… and that’s it. This concept is extended to any other duplicate tables between databases, i.e. User/AuditUser/InstanceUser. Apart from inserts/deletes these tables are never touched, i.e. there’s no other data to keep in sync. As mentioned earlier this is all handled in the Service layer via Seam Events/Observers allowing decoupling between business objects. The event generation/observers design pattern was picked to allow an agnostic approach to the database (i.e. don’t use any database specific capabilities to join two databases). This allows portability between database vendors. It’s also the basis of event driven systems allowing decoupled development.
Other concepts include multi-tenancy, i.e. basically there’s a Tenant object/table and all objects/tables that have tanent-specific data … reference that Tenant object/table in a foreign-key relationship. Pretty simple. So as long as your queries reference the tenant ID… everything is grand. If you don’t … you’ll have issues. So development and testing are critical to ensure this.
Still other concepts include the notion of shards… for some tenants having all of their data co-existing with other tenants is OK as long as there’s some way to keep it all segregated. Other tenants will insist that the data is kept completely separate, i.e. another database. For JEE this can be be difficult as… there’s a single database connection configuration, i.e. driver, database name, system account username/password. It’s very difficult to have separate databases per tenant without a configuration/coding nightmare. Ideally there’s some magically way to support both single-tenent databases and multi-tenant databases using one set of code… Hibernate Shards is meant to do this but currently doesn’t have support for JPA (Entity Manager, etc). Other options are Oracle VPD (which I just learned about). So this concept of tenant specific databases is a bit more problematic but… possible with some thought/some code. Key things to consider:
- Design for provisioning – Ideally any provisioning is relatively instantaneous. Waiting up to 24 hours for a dead period in the operating cycle isn’t ideal.. (or waiting for a weekend)
- Design for operations – Ideally the system doesn’t have to come down in order to add the single tenant database to the overall system. Bringing a system down for anything but a major upgrade isn’t ideal. Opportunity for a FUBAR situation. Especially if there’s a large cluster of servers without automated configuration automation.
- Database connection pooling – Ideally there’s no impacts to any connection pooling.
In subsequent blog’s I’ll detail how object history, user defined fields, etc. is/was designed.
[...] individual mapping to databases is explained below while the database strategy is explained in the Database Strategy blog [...]
Pingback by Persistence Layer « Trap The Spark… — December 28, 2008 @ 7:52 pm