We have been noticing an increase in the number of ‘prospective partners’ asking whether StoreGrid supports the synthetic full backup feature. StoreGrid does not yet support it, as we have always given this feature a low priority in the past. But now that it is being asked for frequently, we have started implementing it and hope to have it ready in the next few months. Though we would always like to give our partners as much choice and flexibility as possible while using StoreGrid for their online backup services business, this particular feature has been on my mind for some time. I feel synthetic full backup is a double-edged sword; it may come back to haunt you when things go wrong. In fact, some of our partners have told us they would not use this feature at all because of the additional risks it introduces! Let me clarify some of these viewpoints and try to put all the pros and cons of the synthetic full backup feature on the table.
What is Synthetic Full Backup anyway?
Synthetic full backup is a way to create a new full backup without actually performing one. It is done by combining a previous full backup with the subsequent differential/incremental backups to “synthesize” a new full backup. Note that all of this happens at the backup server, so it does not involve any actual transfer of data from the clients to the backup server. Here is a definition of synthetic full backup on the web.
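To make the idea concrete, here is a minimal sketch in Python of what “synthesizing” means (illustrative only, not StoreGrid code): the server replays the changed blocks recorded in the incremental/differential backups on top of the last full backup, and the result behaves like a brand-new full backup.

```python
# Conceptual sketch: a synthetic full backup is built entirely on the backup
# server by applying the changed blocks from later backups to the last full.

def synthesize_full(full_backup: dict, deltas: list) -> dict:
    """full_backup maps block offsets to block data; each delta maps the
    offsets that changed since the full backup to their new data."""
    synthetic = dict(full_backup)      # start from the previous full
    for delta in deltas:               # apply changes in backup order
        synthetic.update(delta)        # changed blocks overwrite old ones
    return synthetic                   # behaves like a brand-new full backup

# Example: blocks 0 and 2 changed after the full backup was taken.
full = {0: b"AAAA", 1: b"BBBB", 2: b"CCCC"}
new_full = synthesize_full(full, [{0: b"aaaa"}, {2: b"cccc"}])
assert new_full == {0: b"aaaa", 1: b"BBBB", 2: b"cccc"}
```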
The advantage of a synthetic full backup is that the client systems (the production servers and the user desktops/laptops) do not have to do a complete full backup periodically. This reduces the load on the client systems and the time taken for periodic full backups quite significantly. It is especially attractive in the online backup world because synthetic full backups eliminate the need to transfer the large amount of data involved in a full backup over the internet every time one needs to be done. So far so good! So why not implement this right away, considering that the advantages are so obvious? Hold your horses…
During a synthetic full backup, the process of “synthesizing” a full backup happens at the backup server end. In order to “combine” a previous full backup with subsequent incremental/differential backups, the backup server needs access to the encryption key used to encrypt the backup data. Note that in the online backup world the encryption is done at the client end (the production servers and the users’ desktops/laptops). One of the most debated topics in online backup is the security of the backed up data – will the service providers have access to their customers’ backed up data? Almost all online backup solutions, including StoreGrid, encrypt the data before it is sent over the internet to the service provider’s storage cloud. And during restores the encrypted data is first restored to the client and then decrypted at the client end. So unless the backup server is given access to the encryption password, at least temporarily, synthesizing a full backup from a previous full backup and subsequent incremental/differential backups would not be possible.
But there are workarounds that avoid the need to decrypt the encrypted data on the backup server in order to synthesize a new full backup. Let me describe the workaround we are planning to implement and the additional risks it introduces…
Firstly, for every file, StoreGrid does a full backup and then subsequently does differential backups (the block-level differences between the current file and the content of the original file that was backed up during the full backup). We do this because if we were always to do incremental backups instead (the block-level differences between the current file and its content the last time it was backed up, either incrementally or fully), it would be very difficult to implement versioning.
This is because, to restore the latest version of a file, we would need to keep the full backup and every incremental backup that was ever done. With block-level differential backups, the latest file can be restored using only the full backup and the latest differential backup.
So versioning is easier, as we can delete the differential backups that are no longer required. This is illustrated in Figure 1 below.
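For the curious, here is a rough sketch of the difference (again purely illustrative, not StoreGrid code): with incrementals, every backup in the chain is needed for a restore and none can be pruned; with differentials, only the full and the latest differential are needed.

```python
def restore_with_incrementals(full, incrementals):
    # Each incremental is relative to the previous backup, so every
    # incremental ever taken is needed and none of them can be deleted.
    state = dict(full)
    for inc in incrementals:
        state.update(inc)
    return state

def restore_with_differentials(full, differentials):
    # Each differential is relative to the original full, so only the full
    # and the latest differential are needed; older ones can be pruned.
    state = dict(full)
    if differentials:
        state.update(differentials[-1])
    return state

full = {0: b"v1", 1: b"v1"}
print(restore_with_incrementals(full, [{0: b"v2"}, {1: b"v3"}]))            # needs both incrementals
print(restore_with_differentials(full, [{0: b"v2"}, {0: b"v2", 1: b"v3"}])) # needs only the last differential
```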
Considering the way we do full backups and differential backups, we plan to implement synthetic full backup without physically combining a previous full backup and a subsequent differential backup. Instead, as illustrated in Figure 2 below, we would simply create a reference in the database for a synthetic full backup, with information about which previous full backup and which differential backup make up the synthetic full backup in question.
This information would only be used during restores. Thus, by just keeping references to the full backup and differential backup that make up a new synthetic full backup, we eliminate the need for the backup server to decrypt the data in order to combine backups into a synthesized full.
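A minimal sketch of this reference-only approach, with made-up identifiers purely for illustration, would look something like this:

```python
# A synthetic full stores no data of its own, only pointers to the backups
# that make it up, so nothing ever has to be decrypted on the server.
synthetic_fulls = {
    # synthetic_full_id: (full_backup_id, differential_backup_id)
    "synth-002": ("full-001", "diff-007"),
}

def backups_needed_for_restore(synthetic_id):
    """Resolve a synthetic full into the real backups the restore must read."""
    full_id, diff_id = synthetic_fulls[synthetic_id]
    return [full_id, diff_id]   # restored to the client, then decrypted there

print(backups_needed_for_restore("synth-002"))   # ['full-001', 'diff-007']
```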
What are the risks introduced by the above process?
If we follow the above approach forever (keeping just references in the database without ever physically combining the different backups) and only do periodic synthetic full backups (to avoid a normal full backup), then, as illustrated in Figure 3 below, restores can become more complex and time-consuming.
During a restore of the latest version of a file, the first full backup and every subsequent synthetic full backup have to be restored along with the latest differential backup for that file. If this involves tens or hundreds of synthetic full backups, the restore process will surely become quite inefficient. Besides, a simple restore of the latest file could mean restoring data that was stored months or years before. This introduces additional risk: if even one intermediate block of data from a synthetic full done months earlier is corrupted for some reason, then all the backups done after it are invalidated and cannot be restored. This is a serious risk. It can be eliminated either by physically synthesizing a full backup (decrypting the data when the synthetic backup is done) or by actually doing periodic full backups without relying on the synthetic full backup feature. The former option means the backup server needs at least temporary access to the encryption key, which introduces a security risk; the latter brings back the heavy periodic full backups that synthetic full backup was meant to avoid. And relying only on reference-based synthetic fulls leaves the restore process inefficient and increases the risk of losing data because of a small corruption in a block of data stored months before.
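To illustrate the corruption risk, here is a small sketch (assumed checksum bookkeeping, not StoreGrid's actual mechanism) of how one bad link in a long reference chain makes every later backup unrecoverable:

```python
import hashlib

def verify_chain(chain):
    """chain: list of (backup_id, data, expected_sha256), ordered oldest first.
    Returns the ids that are still restorable."""
    restorable = []
    for backup_id, data, expected in chain:
        if hashlib.sha256(data).hexdigest() != expected:
            # Everything that was built on top of this backup is now lost.
            print(f"{backup_id} is corrupt; later backups cannot be restored")
            break
        restorable.append(backup_id)
    return restorable

ok = hashlib.sha256(b"good").hexdigest()
chain = [
    ("full-2008-01",  b"good", ok),
    ("synth-2008-06", b"good", "0" * 64),   # a corrupt intermediate link
    ("synth-2009-01", b"good", ok),
]
print(verify_chain(chain))   # ['full-2008-01'] -- the newer backups are unrecoverable
```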
What is our take?
We strongly believe that the fundamental philosophy behind a robust and foolproof backup strategy is to have as much redundancy in the data as possible. Any backup strategy that sacrifices redundancy for storage efficiency or for reducing backup time should be avoided wherever feasible. Hence, though StoreGrid will have support for the synthetic full backup feature in a few months' time, we would strongly advise our partners to thoroughly analyze and understand the implications before using it. Our recommended approach will always be to do periodic full backups of all the data. Perhaps one can reduce the frequency of complete full backups by doing frequent synthetic full backups in combination with less frequent complete full backups. We would certainly not recommend doing away with a normal full backup altogether.
This was exactly the sentiment expressed by some of our partners when we spoke to them about this feature. Like in many other spheres of life, ‘natural’ is better than ‘synthetic’, I guess!
Sekar,
Then you surely have not heard about “all backups as full backups” :)
The only advantage of incrementals is that they save storage, right?
What if the “full backups” consume less time and storage than “incrementals”? Sounds impossible? Please take a look at new-generation products like EMC Avamar.
Jaspreet
Jaspreet,
By mentioning EMC Avamar I believe you are talking about deduplication as a strategy to eliminate redundancy and hence save storage and network bandwidth. Avamar, I believe, does the deduplication at the client end. Come to think of it, incremental backup is a subset of deduplication. What I mean by that is that incremental backup eliminates redundancy within a single file, whereas deduplication eliminates redundancy across all files within a client system, or across all files across multiple systems in an organization. Avamar claims to eliminate redundancy (by deduping) across all client systems in a single site, with the dedupe being done at each individual client system before data leaves it.
My point in the blog post is to ask whether eliminating redundancy for backups is a good thing and whether it is feasible in an online backup scenario. BTW, my next post is about deduplication, wherein I ask the same question – how feasible is it in the online backup scenario, and how important or desirable is it at all in the name of reducing storage and network bandwidth?
Please do read my next post which will come out early next week.
Sekar.
Sekar,
Thanks for the answer.
I mentioned Avamar not because of deduplication but because it treats every backup as a full backup. Because of the underlying technology, the time taken is much less (the same as an incremental) and restores are simple full restores.
As you are aware, we also try to use the same technology (all full backups). It takes the same time as an incremental, but restores are less painful.
I fully agree that deduplication may not be required for online backups. Two reasons –
1. The users may not have any common data.
2. It may cost more than it saves.
Jaspreet
Jaspreet,
I actually do not see the point of “all full backups”. I do not think there is any magic anyone can do. Either you upload the full file (which is a full backup), or you upload only the differences, which is an incremental/differential backup. And in a “synthetic full backup”, a full backup is reconstructed from the original full and the subsequent incrementals/differentials.
I suppose you are probably referring to the file system storage at the backup server, where all files are stored as blocks with timestamp information for each block. During a restore you restore a file to a point in time by restoring the blocks associated with the file at that particular time. So I would say the technology you are talking about is more about the file system at the backup server's storage than about the way backups are done. In our case we just use the native file system to store the full data and the incremental/differential data and reconstruct the file during restores. I do not think there is any significant difference in restore performance between what we do and what you probably do, because at some level both are reconstructing the file during restores.
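If I understand that model correctly, a minimal sketch of it (purely my assumption of how such a timestamped-block store would behave, not your actual implementation) might look like this:

```python
def restore_point_in_time(block_versions, restore_time):
    """block_versions: offset -> list of (timestamp, data), oldest first.
    For each block, pick the newest version at or before the restore time."""
    file_blocks = {}
    for offset, versions in block_versions.items():
        candidates = [data for ts, data in versions if ts <= restore_time]
        if candidates:
            file_blocks[offset] = candidates[-1]   # newest block at that time
    return file_blocks

blocks = {0: [(1, b"AAAA"), (5, b"aaaa")], 1: [(1, b"BBBB")]}
print(restore_point_in_time(blocks, restore_time=3))   # {0: b'AAAA', 1: b'BBBB'}
```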
Please correct me if I am missing something.
Sekar
Sekar,
Speaking of deduplication, what are your thoughts, and are there any plans to include this functionality in upcoming releases?
-Chris
Sekar,
I guess we are both referring to the same point. Sorry, I probably made a mistake in not reading the article carefully enough.
But instead of re-creating a full backup on the server, we store the block information in a database, so the re-creation can happen on the fly and takes less storage.
Jaspreet
Chris,
Yes, we will be working on deduplication in the future. There are some challenges here because encrypted data cannot be deduplicated, and encryption is mandatory in the online backup world. My next blog post will talk about this and the options in front of us. Hopefully I will complete it and post it by tomorrow. Once you read it I would be happy to hear your thoughts.
Sekar.
Jaspreet,
Thanks for your clarification. I understand it better now. In our case too, we store all the meta-data in a relational database. Our synthetic full backup, when we implement it, will also be similar to what you do, except that our storage model is different from yours. But our meta-data in the relational database is similar – it is used while constructing a file for restore.
Sekar.
Sounds good. I will check back for your next post. I am looking forward to it.
-Chris
Hi Sekar,
I have to say this is disappointing. I got quite excited by your product as an alternative to the Ahsay OBM we currently use. It seems the main argument against a proper synthetic full backup is exposing the encryption keys. Surely this can be solved reasonably easily using a public/private key pair? The argument then is that the server decrypts the data at the point the merge occurs, but surely it is sufficient for this to happen only during the merge, with the private key otherwise not being generally available?
The argument that block-level corruption could affect a merge, resulting in a corrupt “synthetic” file, doesn't really hold water if the original and delta files are actually merged. If a merged file becomes corrupt, it's far better to find this out as soon as possible rather than only when one attempts to restore the file. Then corrective action can be taken by taking a new full backup. Validating the synthetic and real files against recorded checksums should not be too much of an issue.
The problems that “incremental forever” solves (in my opinion!) far outweigh the slight risk to security. For those businesses concerned by even this slight risk, just don’t use synthetic backups!
The problems solved are not to be sniffed at. In a world where we all had unlimited bandwidth and all storage cost the same, the purist approach to backup could be taken. But in the real world, storage has to be paid for and offsite bandwidth is a major constraining factor, especially here in the UK where our PM likes to think we're at the forefront of the technological revolution – pull the other one!
“Incremental forever” allows us to work within known constraints and tolerances so that offsite storage requirements are both predictable and justifiable, and backup schedules can be defined with some degree of predictability.
Adam, Regis IT Ltd.
Adam,
Thanks for your detailed comment. I agree with all the points you are making. The blog post is just to highlight the challenges involved in implementing synthetic full backup. In fact, my next post, which I just published, talks about deduplication in the same light. Though I tend to write with what comes across as “strong opinions”, I am actually quite pragmatic and realize that giving customers the options and letting them choose what is best for them is far better than being dogmatic about some theoretical argument. But at the same time I want to express my opinion, highlighting the pros and cons, so that our customers are better informed when they make their choices.
So please rest assured we are taking your suggestions seriously and will indeed implement these features in StoreGrid in the near future.
Once again, thanks for reading the blog and for your well-thought-out comment.
Sekar.
sekar,
I only know what I've read here in the past few minutes. If bandwidth is an expensive concern, would it be practical for the largest accounts to have everything backed up to their own server every 24 hours or so, and then have a bonded courier switch the drives and transport the full one to a central metro location for download? A small fleet of motorcycle riders could do this in route fashion.
Hi,
What you describe is possible, but I am not sure how economical it would be. To support something like what you have described, we have a feature in StoreGrid called “Local to Remote Server Migration (L2R)” which helps service providers do exactly that: take a local backup first, move it physically to the remote backup server, and then let the subsequent incrementals happen over the internet to the remote backup server.
Sekar.
Sekar,
I totally understand the worries about a synthetic backup, but I have often wondered if it was possible to do a modified synthetic backup.
Most of my clients do a full backup once a month. For some of them, that full backup can take 3-4 days. Would it be possible for StoreGrid to recognize which files have changed in the last month and only transfer those over the network, while the unchanged files are simply copied from one backup set to another on the backup server?
I would guess that using some type of checksum handshake, the server and client could determine whether a file needs to be transferred again.
I will also say that if any type of synthetic backup is used, a natural full backup needs to be done at least once a year, and maybe even every quarter.
Thanks,
Westley
Hi Westley,
Thanks for sharing your thoughts. Quite coincidentally, we have been discussing a similar idea to avoid doing a complete full backup. Initially we are planning a feature where StoreGrid will automatically do a full backup of a file, or a small set of files, based on how much the file has changed since its previous full backup. This way there will probably never be a need for a complete full backup of all files in one go. We can extend this to what you are suggesting too, to make it even better. We are already working on the synthetic full backup feature and we will incorporate these ideas as soon as possible. Once again, thank you for your suggestions.
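To give a rough idea of the kind of rule we are considering, here is a small sketch (the names and the 40% threshold are placeholders I am using for illustration, not the final StoreGrid design):

```python
CHANGE_THRESHOLD = 0.4   # placeholder: redo a file's full backup at 40% churn

def next_backup_type(changed_bytes_since_full: int, file_size: int) -> str:
    """Decide, per file, whether the next backup should be a fresh full
    or another differential, based on how much the file has changed."""
    if file_size == 0:
        return "full"
    churn = changed_bytes_since_full / file_size
    return "full" if churn >= CHANGE_THRESHOLD else "differential"

print(next_backup_type(changed_bytes_since_full=50_000_000, file_size=100_000_000))
# 'full' -- this file has churned 50%, so take a fresh full of just this file
```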
Sekar
Hi,
I have one point of confusion about the backups.
1. We are taking a full backup of the database every day from Friday to Monday, then from Monday to Wednesday I am running differential backups. Now one of our employees has asked me to take a full backup on Wednesday, when he is going to make a lot of changes to the database. I want this full backup to run without breaking the differential backup chain. Please let me know the options. Send me a mail if possible.
Thanks,
Neel
Hi Sekar,
Could you please provide an update on this – have your opinions changed since this article was written, now that you have had a chance to actively work on it?
Cheers,
Ryan
Ryan,
Thanks for checking.
The development of synthetic full backup as an option in StoreGrid has been completed. We are now testing it. It should be available in one of the production releases soon. We need to plan and decide which release we want to include it in. Most likely it will be available in the next release of the StoreGrid SP edition.
In terms of any change of opinion, I do not think so. I still maintain that, from a theoretical point of view, backup is about data redundancy. So I still believe it is better not to optimize away the redundancy in the backup data, if that is practically feasible in terms of cost, time window, etc.
The simple logic is that if the latest backed up data can be restored only with the help of data stored long ago (say, data stored a year back as opposed to one or three months back), then I feel uncomfortable about it. That is why I believe we always have to take a complete full backup every once in a while (every one, three, or maybe six months). If instead we rely on incremental-forever or synthetic-full-forever backups for years without ever taking a complete full, the pessimist in me does not allow me to trust such a process completely.
Maybe it's just me.
Sekar.