ActiveRecord database anonymization using views
When encountering problems in a production environment it is sometimes hard to recreate the issue locally or in a test system. Some issues are actually depending on very specific runtime conditions while others arise from the combination of entries in the database.
Generating the right test data is not always easy given that you do not already know what is wrong and just have to confirm it.
Testing against a production system usually is not an option (and should not be considered unless you are truly desperate).
We have a staging setup which is basically the same as our production system. This allows us to create very similar conditions in terms of runtime behaviour. That leaves the issue of a comparable database state.
The easiest way to achieve a comparable state certainly would be to just copy the database into staging, but the database might contain sensitive information (API keys, credit card or bank account information, etc.) which you should not even provide to your trusted developers if you value data protection.
One option to solve this is to copy the database and to run some queries to purge sensitive information. It solves the issue of protecting the data, but only to a certain degree. Theoretically it is possible to access the data in between copying and anonymizing it. Also if something goes wrong while purging sensitive information the data is freely available to all the developers with access to the staging system. Therefore this is an option we did not seriously consider.
The sensitive data should never leave your production system. It should simply not be possible to view any data that might be sensitive unless your are amongst the select few who can access the production database directly.
Fortunately database views provide a solution for this issue.
Using views you can obscure the data before it actually leaves the database using the functions provided by the database, such as overlay
and translate
in postgresql (depending on the specific database you use). You can also simply replace column values e.g. by NULL::unknown
in your view.
The content of the view can then be used to fill the corresponding table in the environment you are using to recreate the issue. You just have to make sure that every column is also part of your view.
This approach works rather well for us. Unfortunately you always have to adjust the views whenever you make a change to the database schema. Even worse: someone with access to the database has to do it.
I suppose this is not an issue if you make the changes to the database manually anyways. However, to avoid mistakes, to have recreatable environments, and because you are lazy you will always want to automate as much as possible. It also happened more than once for us that the view was not created again and thus automatically shipping over the data did not work.
Frameworks such as Ruby on Rails provide with ActiveRecord a very nice solution out of the box: database migrations which are executed in order delivering the same result every time.
This allows any developer to make the necessary changes and to keep the code in sync with the database. You can and should use this to create and adjust the views accordingly.
Unfortunately ActiveRecord does not provide a nice way to create such views by default. Sure, you can always just manually execute the queries in a database migration together with the change you were making anyways. It will always stay in sync that way and you can be certain of recreatable result.
What happens though if you decide to change the underlying database system, e.g. moving from postgresql to mysql? You would have to manually adapt the database migrations because not all functions might be supported by the new database.
While looking around for a nice solution to solve this, I came across a couple of Ruby gems - none of which really satisfied me or was currently under active development.
I set out to build my own solution. I wanted a something which prevents the developer from having to write database specific queries, nicely integrates with ActiveRecord and on top of that allows the developer to see which columns have been anonymized in which way without having to open up the database and actually taking a look at the view definition. Also being lazy most of this should happen automatically.
The solution I came up with resulted in a gem named tidus.
Tidus automatically hooks into the db:migrate
and db:rollback
rake tasks and clears the views belonging to all models beforehand and generates them again with the current state afterwards. This happens every time either of those tasks is executed - it does not matter if any migration was actually executed.
The different anonymizations for each column can be added in the model class. Similar to validations using validates
you can use anonymizes
followed by the column name(s), followed by the anonymization strategy and the other options depending on the strategy, e.g.:
anonymizes :column_name, strategy: :overlay, start: 15, length: 10
Specifiying the strategy in the model makes it easily adaptable by the developers and it provides a nice overview about which columns are anonymized and how. Any column which is not specifically anonymized will just return the normal value for each dataset.
In Rails the integration is extremely easy. Add the gem to your Gemfile, specify the anonymization and your are done.
In other environments such as sinatra which do not automatically require all gems this is what you will have to do on your own. Just make sure it is done after you required activerecord
.
A more specific description of the currently implemented anonymization strategies can be found in the README of the gem.
So far I only implemented strategies for postgresql since that is what we are using - it can be easily extended though.
Edit: While renaming the gem from its working name "activerecord_anonymize" to its release name I missed the const_get. I just released v1.0.1 to fix the issue.
You can still get it to work with v1.0.0 of the gem by providing the full class name of the anonymizer strategy you want to use, e.g. strategy: Tidus::OverlayAnonymizer
instead of strategy: :overlay
. I am sorry for any inconvenience.
Edit2: Also published on the Barzahlen Tech blog