I was at Pixar when this happened, but I didn't hear all of the gory details, as I was in the Tools group, not Production. My memory of a conversation I had with the main System Administrator as to why the backup was not complete was that they were using a 32-bit version of tar and some of the filesystems being backed up were larger than 2GB. The script doing the backup did not catch the error. That may seem sloppy, but this sort of thing happens in the Real World all the time. At the risk of spilling secrets, I'll tell a story about the animation system, which I worked on (in the 1996-97 time frame).
The Pixar animation system at the time was written in K&R C and one of my tasks was to migrate it to ANSI C. As I did that I learned that there were aspects of this code that felt like a school assignment that had escaped from the lab. While searching for a bug, I noticed that the write() call that saved the animation data for a shot wasn't checked for errors. This seemed like a bad idea, since at the time the animation workstations were SGI systems with relatively small SCSI disks that could fill up easily. When this happened, the animation system usually would crash and work would be lost. So, I added an error check, and also code that would save the animation data to an NFS volume if the write to the local disk failed. Finally, it printed a message assuring the animator that her files were safe and it emailed a support address so they could come help. The animators loved it! I had left Pixar by the time the big crunch came in 1999 to remake TS2 in just 9 months, so I didn't see that madness first hand. But I'd like to think that TS2 is just a little, tiny bit prettier thanks to my emergency backup code that kept the animators and lighting TDs from having to redo shots they barely had time to do right the first time.
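For anyone curious what that looks like, here is a minimal sketch of the same pattern - in Python rather than the original C, with the paths and exact behaviour invented for illustration:

    import os

    LOCAL_PATH = "/usr/tmp/shot.anim"        # hypothetical local save path
    FALLBACK_PATH = "/net/backup/shot.anim"  # hypothetical NFS fallback

    def save_animation(data: bytes) -> str:
        """Try the local disk first; fall back to the NFS volume if the write fails."""
        try:
            with open(LOCAL_PATH, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())         # make sure it actually reached the disk
            return LOCAL_PATH
        except OSError as err:               # e.g. ENOSPC when the small SCSI disk fills up
            with open(FALLBACK_PATH, "wb") as f:
                f.write(data)
            print(f"Local save failed ({err}); your animation is safe at {FALLBACK_PATH}.")
            # the real system also emailed a support address at this point
            return FALLBACK_PATH

The whole trick is just refusing to ignore the result of the write and having somewhere else to put the data.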
The point is that one would like to think that a place like Pixar is a model of Software Engineering Excellence, but the truth is more complex. Under the pressures of Production deadlines, sometimes you just have to get it to work and hope you can clean it up later. I see the same thing at NASA, where, for the most part, only Flight Software gets the full-on Software Engineering treatment.
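A footnote on the unchecked tar error at the top of this comment: the boring fix is simply to treat a non-zero exit status, or an implausibly small archive, as a failure that someone actually gets told about. A rough sketch, with invented paths and thresholds:

    import os, subprocess, sys

    SOURCE = "/filesystems/production"    # hypothetical tree being backed up
    ARCHIVE = "/backups/production.tar"   # hypothetical destination

    result = subprocess.run(["tar", "-cf", ARCHIVE, SOURCE],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # a backup everyone believes in but that silently failed is worse than no backup
        sys.exit(f"backup FAILED (tar exited {result.returncode}): {result.stderr.strip()}")
    if os.path.getsize(ARCHIVE) < 1024:
        sys.exit("backup FAILED: archive is implausibly small")
    print("backup OK:", ARCHIVE)

Whether a 32-bit tar of that era would even report the 2GB failure cleanly is another question, which is exactly why you also sanity-check the output.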
Right on the money with the "Real World" anecdote.
We do penetration tests for a wide range of clients across many industries. I would say that the bigger the company, the more childish the flaws we find. For sure, complexity, scale, and multiple systems do not help towards having a good security posture, but never assume that because you are auditing a SWIFT backend you will not find anything that can lead to direct compromise.
Maybe not surprisingly, most startups that we work with have a better security posture than F500 companies. They tend to use the latest frameworks that do a good job of protecting against the standard issues, and their relatively small attack surface doesn't leave you with much to play with.
Of course there are exceptions.
Would love to have a chat about your view on security posture between smaller and bigger companies, but couldn't find your email in your HN profile. Mine is in my profile so if you have the time, please send me a message.
One of the really interesting artifacts from the NASA flight software programs is that they put an honest, ground-truth upper bound on the level of effort it takes to produce "perfect" software. Everything else we do is an approximation to some level of fidelity. The only thing even reasonably close is maybe SQLite, and most people think the testing code for it is about 10x overkill.
It makes one start to contemplate how little we really understand about software and how nascent the field really is. We're basically stacking rocks in a modern age where other engineering disciplines are building half-km tall buildings and mile-spanning bridges.
Fast forward 2500 years and the software building techniques of the future must be as unrecognizable to us as rocket ships are to people who build mud huts.
We're stacking transistors measured in nm into worldwide communications systems, compelling simulations of reality, and systems that learn.
The scale is immense, so everything is built in multiple layers, each flawed and built upon a flawed foundation, each constantly changing, and we wouldn't achieve the heights we do if perfection, rather than satisfaction, was the goal.
Perhaps at some point the ground will stop shifting.
What the comment below is saying: the scale just can't be compared. The orders of magnitude of complexity and the number of variables in computer systems are far bigger than in any other engineering discipline.
> The script doing the backup did not catch the error. That may seem sloppy, but this sort of thing happens in the Real World all the time.
I disagree. I mean, I agree those things happen, but the system administrator's job is to anticipate those Real World risks and manage them with tools like quality assurance, redundancy, plain old focus and effort, and many others.
The fundamental rule of backups is to test restoring them, which would have caught the problem described. It's so basic that skipping it is a well-known rookie error and a source of jokes like, 'My backup was perfect; it was the restore that failed.' What is a backup that can't be restored?
Also, in managing those Real World risks, the system administrator has to prioritize by the value of the data. The company's internal newsletter gets one level of care, HR and payroll another. The company's most valuable asset and work product, worth hundreds of millions of dollars? That's a personal mission; no mistakes are permitted: check and recheck, hire someone from the outside, create redundant systems, etc. It's also a failure of the CIO, who should have been absolutely sure of the data's safety even if he/she had to personally test the restore, and of the CEO too.
I don't know or recall the details well enough to be sure, but it's possible that they were, in fact, testing the backups but had never before exceeded the 2GB limit. Knowing that your test cases cover all possible circumstances, including ones that haven't actually occurred in the real world yet, is non-trivial.
Your post is valid from a technical and idealistic standpoint; however, when you realize the size of the data sets turned over in the film / TV world on a daily basis, restoring, hashing and verifying files during production schedules is akin to painting the Forth Bridge - only the bridge has doubled in size by the time you get halfway through, and the river keeps rising...
There are lots of companies doing very well in this industry with targeted data management solutions to help alleviate these problems (I'm not sure that IT 'solutions' exist), however these backups aren't your typical database and document dumps. In today's UHD/HDR space you are looking at potentially petabytes of data for a single production - just getting the data to tape for archive is a full-time job for many in the industry, let alone administration of the systems themselves, which often need overhauling and reconfiguring between projects.
Please don't take this as me trying to detract from your post in any way - I agree with you on a great number of points, and we should all strive for ideals in day to day operations as it makes all our respective industries better. As a fairly crude analogy however, the tactician's view of the battlefield is often very different to that of the man in the trenches, and I've been on both sides of the coin. The film and TV space is incredibly dynamic, both in terms of hardware and software evolution, to the point where standardization is having a very hard time keeping up. It's this dynamism which keeps me coming back to work every day, but also contributes quite significantly to my rapidly receding hairline!
This particular case is one that's hard to test - you'd restore the backup, look at it, and it would look fine; all the files are there, almost all of them have perfect content, and even the broken files are "mostly" ok.
As the linked article states, they restored the backup seemingly successfully, and it took two days of normal work until someone noticed that the restored backup was actually not complete. How would you notice that in backup testing that (presumably) shouldn't take thousands of man-hours to do?
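One cheap-ish answer is to record a manifest (paths plus sizes or checksums) at backup time and diff it against the restored tree, instead of eyeballing anything. A sketch of the idea with hypothetical paths - for a multi-terabyte show you'd probably stick to sizes or sample a subset rather than hash everything:

    import hashlib, os

    def manifest(root):
        """Map relative path -> (size, sha256) for every file under root."""
        out = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                out[os.path.relpath(path, root)] = (os.path.getsize(path), h.hexdigest())
        return out

    before = manifest("/projects/toystory2")     # recorded when the backup is taken
    after = manifest("/restore-test/toystory2")  # computed after a test restore
    missing = set(before) - set(after)
    changed = {p for p in before if p in after and before[p] != after[p]}
    print(f"{len(missing)} files missing, {len(changed)} files corrupted")

Of course this only helps if the manifest was written down before the disaster, but it turns "does the restore look fine?" into a number instead of a feeling.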
Cool story :) You are bang on point when it comes to software engineering at what are thought to be "top tier" development houses. In the ideal world, sure, they will build the very best software, but the real world has [unrealistic] deadlines, and when you have deadlines it means corners get cut. Not always, but very often. This leads to the whole "does it do exactly what is required?" and if it does then you are moved on to the next thing, often with the "promise" that you will be able to go back and "fix things" at a later date. Of course we all know that promise is never kept.
"Backups aren't backups until they've been tested."
They really are Schrödinger's backups until a test restore takes place. This is one area where people cut corners a lot because no one cares about backups until they need them. But it's worth the effort to do them right, including occasional, scheduled manual testing. If you can't restore the data/system you're going to be the one working insane hours to get things working when a failure occurs.
And then there's the aftermath. Unless you are lucky enough to work for a blame-free organization, major data loss in a critical app due to a failure of the backup system (or lack thereof) could be a resume-generating event. If you're ordered to prioritize other things over backups, make sure you get that in writing. Backups are something everyone agrees is "critical" but no one wants to invest time in.
I've heard about NASA's Flight Software teams being very strict on 9-5 work hours, with lots of code review and tests. I was under the impression this wasn't as strict with the competition from SpaceX and Blue Origin, now that we (the USA) aren't sending people to space on our own rockets. Is my impression incorrect?
SpaceX (or rather, Elon Musk) is famous for pushing their developers hard. Elon sent his team to live on a remote island in the Pacific [1] where they were asked to stay until they could (literally!) launch.
[1] https://www.bloomberg.com/graphics/2015-elon-musk-spacex/
I don't work in the manned space part of NASA and the software I deal with isn't critical, so I can't say. Most of what I know about Flight software development comes from co-workers who've done that work. They speak of getting written permission to swap two lines of code. That sort of thing. I think it would be cool to have code I wrote running on Mars, but I don't know if I could cope with that development environment.
I have some friends on the mechanical engineering side of things at SpaceX and can definitely say that 9-5 work hours don't even seem to be a suggestion. It likely varies team to team though.
My recollection of the details is lacking, but this jibes with what I remember from a talk I attended by a Pixar sysadmin. I think there were only a couple of slides about it, since it was just one part of a "journey from there to here" presentation about how they managed and versioned large binary assets with Perforce.
There are other anecdotes online about this catastrophic data loss and backup failure but I think it was, funny enough, the propensity for some end users of Perforce to mirror the entire repository that saved their bacon. I say funny because this is something a Perforce administrator will generally bark at you about, since your sync of this enormous monolithic repo would be accompanied by an associated server-side process that would run as long as your sync took to finish, and thanks to some weird quirks of the Perforce SCM, long-running processes were bad and would/(could?) fuck up everyone else's day. In fact I think a recommendation from Pixar was to automatically kill long-running processes server side and encourage smaller workspaces. Anyway, I digress. They were able to piece it together using local copies from certain workstations that contained all or most of the repo. Bad practices ended up saving the day.
Was that menv? I've heard stories that Pixar builds these crazy custom apps that rival things like 3D Studio and Maya but that never leave their campus!
Yes, Menv, for Modelling ENVironment, although it was a full animation package, not just a modeler. It has a new name now and has been extensively rewritten.
I think it might have turned out better if they had lost the movie - Toy Story I and III have really good plots, but the screenplay of Toy Story II isn't that stellar; is it possible that they would have changed the story of II if it had been lost? (Mr. Potato Head might have said that they lost the movie on purpose)
The biggest difference, I think, was leaving the hunt for a head to roll for later, or even not doing it at all.
Commitment would be very different if people were being asked to help while heads were rolling, because you're only a real team when everybody is going in the same direction. Any call of "people, work hard to recover while we go after the moron who deleted everything" wouldn't have done it.
You just commit to something when you know that you won't be under fire if you do something wrong without knowing it.
I never understood the attitude of some companies to fire an employee immediately if they make a mistake such as accidentally deleting some files. If you keep this employee, then you can be pretty sure he'll never make that mistake again. If you fire him and hire someone else, that person might not have had the learning experience of completely screwing up a system.
I think that employees actually make fewer mistakes and are more productive if they don't have to worry about being fired for making a mistake.
There is a great quote from Tom Watson Jr (IBM CEO):
> A young executive had made some bad decisions that cost the company several million dollars. He was summoned to Watson’s office, fully expecting to be dismissed. As he entered the office, the young executive said, “I suppose after that set of mistakes you will want to fire me.” Watson was said to have replied,
> “Not at all, young man, we have just spent a couple of million dollars educating you.” [1]
It all depends on how leadership views employee growth.
[1] http://the-happy-manager.com/articles/characteristic-of-lead...
> I never understood the attitude of some companies to fire an employee immediately if they make a mistake such as accidentally deleting some files. If you keep this employee, then you can be pretty sure he'll never make that mistake again.
I did fire an employee who deleted the entire CVS repository.
Actually, as you say, I didn't fire him for deleting the repo. I fired him the second time he deleted the entire repo.
However there's a silver lining: this is what led us (actually Ian Taylor IIRC) to write the CVS remote protocol (client / server source control). Before that it was all over NFS, though the perp in question had actually logged into the machine and done rm -rf on it directly(!).
(Nowadays we have better approaches than CVS but this was the mid 90s)
- Employee was error prone and this mistake was just the biggest one to make headlines. Could be from incompetence or apathy.
- Impacted clients demanded the employee at-fault be terminated.
- Deterrence: fire one guy, everyone else knows to take that issue seriously. Doesn't Google do this? If you leak something to press, you're fired, then a company email goes out "Hey we canned dude for running his mouth..."
It's better to engage the known and perhaps questionable justifications than to "never understand".
It's incentives. The true benefit to the company comes when people can make mistakes and learn from them. But often, the forces on management are not in alignment. Imagine Manager Mark has a direct report, Ryan, commit a big and public error. And then 1.5 years later Ryan commits another public error.
"What kind of Mickey Mouse show is Manager Mark running over there?" asks colleague Claire, "Isn't that guy Ryan the same one that screwed up the TPS reports last year?"
On the other hand, if Mark fires Ryan, then Mark is a decisive manager. Even if the total number of major errors ends up higher, he won't risk becoming known as a manager who lets people screw up multiple times.
Very good point. Aviation is awesome in that sense - accident investigations are focused on understanding what happened and preventing recurrence. Allocating blame or punishment is not part of it, at least in enlightened jurisdictions.
Furthermore, a lot of individual errors are seen in an institutionalised "systems" framework - given that people invariably will make mistakes, how can we set up the environment/institutions/systems so that errors are not catastrophic?
Not sure how that applies to movie animation, to be honest, but not primarily looking for whom to blame was certainly a very good move.
Rotorcraft PPL here: actually assigning blame is a very important part of accident analysis because it's critical to determining whether resolution will involve modifying training curriculum or if there was a mechanical failure. CFIT, engine failure due to fuel exhaustion, flight into wires, low G mast bumping, and settling with power are all sure ways to die from pilot error. And if the pilot did make a serious error their license could be suspended or revoked.
"Aviation is awesome in that sense - accident investigations are focused on understanding what happened, and preventing re-occurence. Allocating blame or punishment are not part of it, at least in enlightened jurisdictions."
Same goes for every tech company I have worked at. I have never been in a post-mortem meeting where the goal was to allocate blame. It was always emphasized that the goal of the meeting was to improve our process to make sure it never happens again, not punish the party responsible.
I remember this from the Field Guide to Understanding Human Error. Making recovering from human error a well-understood process is important, and as you point out, that process will work best if people aren't distracted by butt-covering.
The fault here does not lie with just one person. One person ran the rm -rf command. Other people failed to check the backups. Others made the decision to give everyone full root access. Really it was a large part of the company that was to blame.
Whenever there's a bug in code I reviewed, there are at least two people responsible: The person who wrote the code and me, the person who reviewed it.
I've found that that helps morale, as there's a sense of shared responsibility, but there's no blaming people for problems where I work, so I haven't actually seen what happens when people are searching for the culprit.
The usual process is "this happened because of this, this and this all went wrong to cause us to not notice the problem, and we've added this to make it less likely that it will happen again". If you have smart, experienced people and you get problems, it's because of your processes, not the people, so the former is what you should fix.
Oh god. I know someone who made all of these mistakes by themselves in a certain week. And kept his job. He was pretty.
I left and found out two months later from a friend that he had managed to take down almost every single server in the place to which he had access. Even the legacy don't-touch systems that just boot and run equipment.
Ed Catmull discusses this incident thoroughly in Creativity Inc.. He believed seeking retribution for this incident would've been counterproductive and discouraged Pixar's overall ethos as a safe place to experiment and make mistakes. It is this ethos and culture of vociferous, thorough experimentation and casting everyone's performance in the light of "What can we learn from that?" rather than "What ROI did we get from the last 3 months?" that Catmull credits for Pixar's success (paraphrasing here, but I believe this is an accurate summary).
Since Catmull has an engineering background (his PhD involved the invention of the Z-buffer, and he was doing computer graphics before anyone knew anything about it), he understands that mistakes and failed projects, when combined with a forthright and collaborative feedback loop, are not problematic detours, but rather necessary mile markers on the path to real innovation. We'd be so much further ahead if we put more men like Catmull in charge of things.
The biggest problem with reading Creativity Inc. is that it will rekindle the hope that there may be a sane workplace out there somewhere, when practically speaking, we know that few of us will ever find employment in one. It gave me a number of disquieting feelings as I read that the attributes of a workplace that all good engineers crave actually can and sometimes do exist out there. I had convinced myself that these things were myths, so now I'm sad that my boss isn't Ed Catmull.
That said, I do believe some evaluation and/or discipline would've been appropriate in this case, not for the person who accidentally executed a command in the wrong directory, but for the people who were supposed to be maintaining backups and data integrity.
Assuming that your primary job duties involve data integrity and system uptime, having non-functional backups of truly critical data stretches beyond the scope of "mistakes" and into the scope of incompetence.
It is, I'm sure, very possible that no one was really assigned this task at Pixar and that it would therefore be improper to punish anyone in particular for the failure to execute it, but I do believe there is a limit between mistakes en route to innovation and negligence. My experience has been that most companies strongly take one tack or the other: they either let heads roll for minor infractions (and thus never allow good people to get established and comfortable), or they never fire anyone and let the dead weight and political fiefdoms gum up the works until the gears stop altogether. There needs to be a balance, and that's a very hard thing to find out there.
> It is, I'm sure, very possible that no one was really assigned this task at Pixar and that it would therefore be improper to punish anyone in particular for the failure to execute it, but I do believe there is a limit between mistakes en route to innovation and negligence.
If indeed there was no-one assigned this task, then it was a mistake of negligence on the part of Pixar's management at the time. I'm not saying that to be snippy — that is exactly the job of management: to build the systems and processes required for employees to achieve the firm's goals. Proper backup and restore of data is one of those processes.
Creativity Inc is a great book and I also wish I worked for Ed Catmull when I finished it. Didn't somebody have an illicit backup copy of most of it and they were able to get most of what they needed back from that?
There was an incident where I work where an employee (a new hire) set up a cron job to delete his local code tree, re-sync a fresh copy, and then re-build it every night. A completely reasonable thing for a coder to automate.
In his crontab he put:
rm -rf /$MY_TREE_ROOT
and as everyone undoubtedly first discovered by accident, the crontab environment is stripped bare of all your ordinary shell environment. So $MY_TREE_ROOT expanded to "".
The crontab ran on Friday, IIRC, and got most of the way through deleting the entire project over the weekend before a lead came in and noticed things were disappearing. Work was more or less halted for several days while the supes worked to restore everything from backup.
Could the blunder have been prevented? Yes, probably with a higher degree of caution, but that level of subtlety in a coding mistake is made regularly by most people (especially someone right out of university); he was just unlucky that the consequences were catastrophic, and that he tripped over the weird way crontab works in the worst possible usage case. He probably even tested it in his shell. We all know to quadruple-check our rm-rfs, but we know that because we've all made (smaller) mistakes like his. It could have been anyone.
Dragging him to the guillotine would have solved nothing. In fact, the real question is "how is it possible for such an easy mistake to hose the entire project?" Some relatively small permissions changes at the directory root of the master project probably would have stopped stray `rm -rf`s from escaping local machines without severely sacrificing the frictionless environment that comes from an open-permissions shop. So if anything, the failure was the systems team's fault for setting up a filesystem that can be hosed so easily and so completely by an honest mistake.
The correct thing to do was (and is) to skip the witch hunt, and focus on problem-solving. I am not sure, but I think the employee was eventually hired on at the end of his internship.
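A cheap belt-and-braces guard against exactly that class of mistake is to refuse obviously dangerous targets before deleting anything. A toy sketch - not what that team actually ran, and the variable name is just borrowed from the story:

    import os, shutil, sys

    def guarded_rmtree(target: str) -> None:
        """Delete a tree, but refuse the kinds of paths an unset variable produces."""
        if not target:
            sys.exit("refusing to delete: target is empty (unset variable?)")
        target = os.path.realpath(target)
        if target == "/" or target.count(os.sep) < 2:
            # "/" or a top-level directory like "/home" is almost certainly a mistake
            sys.exit(f"refusing to delete suspicious path: {target!r}")
        shutil.rmtree(target)

    guarded_rmtree(os.environ.get("MY_TREE_ROOT", ""))  # safe even under cron's bare environment

In the shell, the equivalent habit is ${MY_TREE_ROOT:?} so an unset variable aborts the command instead of expanding to nothing.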
For me the principle is: Standards and habits are teachable. Competence and attitude, less so. Educate and train for the former, and a failure of the former should cause you to look first at your procedures, not the people. Hire and fire for the latter.
> You just commit to something when you know that you won't be under fire if you do something wrong without knowing it.
The other side is that if you play a key role (and heads could roll after the hard work is done), you can simply leverage that fact (perhaps with others) as an advantage, so that you get a new contract and can't be fired for X amount of time.
In the comments, Oren addressed exactly that particular question:
"We didn't scrap the models, but yes, we scrapped almost all the animation and almost all the layout and all the lighting. And it was worth it.
Changing the script saved the film, which in turn allowed Buzz and Woody to carry on for future generations (see Toy Story 3 for how awesome that universe continues to be - well done to everyone who worked on the latest installment!) and, in some ways, set a major cornerstone in the culture of Pixar. You may have heard John or Steve or Ed mention "Quality is a good business model" over the years. Well, that moment in Pixar's history was when we tested that, and it was hard, but thankfully I think we passed the test. Toy Story 2 went on to become one of the most successful films we ever made by almost any measure.
So, suffice it to say that yes, the 2nd version (which you saw in theatres and is now on the BluRay) is about a bagillion times better than the film we were working on. The story talent at the studio came together in a pretty incredible way and reworked everything. When they came back from the winter holidays in January '99, their pitch, and Steve Jobs's rallying cry that we could in fact get it done by the deadline later that year, are a few of the most vivid and moving memories I have of my 20 years at the studio."
https://www.quora.com/Did-Pixar-accidentally-delete-Toy-Stor...
How is it possible to get a remake done by deadline? How did the original have so much extra time padded into its schedule?
> Steve Jobs's rallying cry that we could in fact get it done by the deadline later that year
The interesting bit here is that Jobs didn't know if his cry was true. But he needed it to be true, so it was. Jobs was a member of the "action-based community", not the "reality-based community": https://en.wikipedia.org/wiki/Reality-based_community
Right, but I'm thinking from the perspective of someone who's been working on something for ages, gone through the stress of nearly losing it, then miraculously recovering it ... only to have found that a lot of their work was ditched. You're right that it probably ended up better, and sister comments are right in that it wasn't ALL for naught ... but can you imagine the moment you found out it was being reworked?
A large percentage of files in a CG production are not shot specific, so no, the recovery work was definitely not wasted. There are sets, lighting setups, props, layouts, models, textures, shaders, character rigs, procedural and effects systems, etc., etc. A few of those things might have to be redone, but when those things are set up and the script changes, the main bulk of the work is cameras and character animation, and then re-rendering.
I made a mistake like this once; I feel like most people who are really dogmatic about backups have a story like this.
A long time ago I had a hard drive fail that had a bitcoin wallet with about 10 bitcoins in it.
At the time it was worth a hundred USD or so. I tried to fix it myself, ended up failing and throwing the drive out.
Right after that, bitcoin started its meteoric climb. Every now and then I check the prices, then I go check that my backups are running, that my restores work, that my offsite backup is set up correctly, and that every single one of my devices is backed up.
It was a $9,000 life lesson (as of right now...)
I believe that the original version that was scrapped was intended to be a straight-to-video release. It was completely reworked when the company decided to give the project bigscreen treatment.
Almost. TS2 was originally a direct-to-video film. But Disney liked the work-in-progress so much that they approved making it a feature film. And Pixar management at the time wasn't really thrilled with the idea of having an "A" team that made feature films and a "B" team that made DTV films. That could lead to morale problems in the B team. It was much later that problems with the story led to replacing the director and doing the restart.
I worked at a few VFX studios, and everyone has deleted large swathes of shit by accident.
My favourite was when an rsync script went rogue and started deleting the entire /job directory in reverse alphabetical order. Mama-mia[1] was totally wiped out, as was Wanted (that one was missing some backups, so some poor fucker had to go round fishing assets out of /tmp, from around 2000 machines.)
From what I recall (this was ~2008) there was some confusion as to what was doing the deleting. Because we had at the time a large (100 or 300TB[2]) Lustre file system, it didn't really give you that many clues. They had to wait till it got to a plain old NFS box before they could figure out what was causing it.
Another time-honoured classic is matte painters on OSX boxes accidentally dragging whole film directories into random places.
[1] Some films have code names, hence why this one went first.
[2] That Lustre system was big, both physically and in I/O: it could sustain something like 2-5 gigabytes a second and had at least 12 racks of disks. Now a 4U disk shelf and one server can do ~2 gigabytes/s sustained.
We lost a good chunk of Tintin (I think) when someone tried to use the OSX migration assistant to upgrade a Macbook that had some production volumes NFS mounted. It was trying in vain to copy several PB of data (I am convinced that nobody at Apple has ever seen or heard of NFS), and because it was so slow the user hit cancel, and it somehow tried to undo the copy and started deleting all the files on the NFS servers.
There was another incident where there was a script that ran out of cron to cleanup old files in /tmp, and someone NFS mounted a production volume into /tmp...
Eventually we put tarpit directories at the top of each filesystem (a directory with 1000 subdirectories each with 1000 subdirectories, several layers deep) to try and catch deletes like the one you saw, then we would alert if any directories in there were deleted so we could frantically try and figure out which workstation was doing the damage.
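For the curious, the tarpit trick is roughly the following - sizes and paths invented, and a real deployment would page someone rather than print:

    import itertools, os, time

    TARPIT = "/filer/projects/000-tarpit"  # a name that sorts early, so a stray delete chews on it first

    def build_tarpit(root=TARPIT, fanout=50, depth=3):
        """Create fanout**depth empty directories purely to slow a runaway delete down."""
        for parts in itertools.product(range(fanout), repeat=depth):
            os.makedirs(os.path.join(root, *(f"d{p:03d}" for p in parts)), exist_ok=True)

    def watch(root=TARPIT, interval=30):
        """Alert as soon as any top-level tarpit directory goes missing."""
        expected = set(os.listdir(root))
        while True:
            time.sleep(interval)
            if set(os.listdir(root)) != expected:
                print("ALERT: something is deleting the tarpit - go find that workstation!")
                return

The early-sorting name matters for deletes that walk in directory order: the damage is spent on directories nobody cares about while the alert fires, though exactly how much time it buys depends on what is doing the deleting.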
I had a client with a Linux server who wanted to automount the share on their OS X workstations. I cannot believe the hoops I had to jump through to make something as simple as NFS work. Every iteration of OS X seems to make traditional *nix utilities less and less compatible and remove valuable tools for no reason other than obstinacy.
In the most recent VFX company I worked at, with some of the same guys, the backup sys was fucking ace. Firstly rm was aliased to a script that just moved stuff, not deleted it.
Second, there were very large nearlines that took hourly snapshots. Finally, lots and lots of tape for archive.
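The "rm just moves stuff" alias is easy to approximate; something along these lines, with the holding-area path invented and purging left to a separate scheduled job:

    import os, shutil, sys, time

    TRASH = "/nearline/trash"  # hypothetical holding area, purged later by a separate job

    def soft_rm(paths):
        """Move targets into a timestamped trash directory instead of deleting them."""
        dest = os.path.join(TRASH, time.strftime("%Y%m%d-%H%M%S"))
        os.makedirs(dest, exist_ok=True)
        for p in paths:
            shutil.move(p, os.path.join(dest, os.path.basename(p)))
        print(f"moved {len(paths)} item(s) to {dest}; nothing was actually deleted")

    if __name__ == "__main__":
        soft_rm(sys.argv[1:])

It doesn't save you from filling the nearline, but an "oops" becomes a move back rather than a restore.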
> The command that had been run was most likely ‘rm -r -f *’, which—roughly speaking—commands the system to begin removing every file below the current directory. This is commonly used to clear out a subset of unwanted files. Unfortunately, someone on the system had run the command at the root level of the Toy Story 2 project and the system was recursively tracking down through the file structure and deleting its way out like a worm eating its way out from the core of an apple.
As a linux neophyte, I once struggled to remember whether the trailing slash on a directory was important. So I typed "rm -rf", and pasted the directory name "schoolwork/project1 " (with a trailing space), but then I waffled and decided to add a trailing slash. So I changed it to "rm -rf schoolwork/project1 /".
It's one thing to shoot yourself in the foot, but without safe-rm[1] you (or someone else less cautious) will eventually fire the accidental head-shot. It's happened to me a couple of times; but never again since I started using safe-rm everywhere.
[1] http://serverfault.com/questions/337082/how-do-i-prevent-acc...
I have totally never done that. If I had done that, I might have been saved by a lack of permissions. Like if I was on a mounted external drive, so not in my home directory and it didn't get too far.
Edit: What would have been much worse, if I had done it, would have been
Or, maybe you forget which directory you are in, and delete the wrong one that way: `rm -rf ../*` when you think you're in `/mnt/work/grkvlt/tmp` but are actually in `/mnt/work/grkvlt` at the time.
rm just unlinks the files at the inode level, so it seems like a disk forensics utility like the imagerec suite could have restored a lot of the 'lost' data. In fact I've done it on source code after learning that the default behavior of untar was to overwrite all of your current directory structure. Since it was text I didn't need anything fancy like imagerec; instead I just piped the raw disk to grep, looked for parts I knew were in files I needed, then output them and the surrounding data to an external hard drive.
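That raw-device approach is blunt but it really does work for plain text. Roughly, with the device and the marker string as placeholders (run it as root, against an unmounted disk, and write the output somewhere else):

    DEVICE = "/dev/sdb1"        # placeholder: the partition the lost files lived on
    MARKER = b"int main("       # something you know appeared in the lost sources
    CHUNK = 16 * 1024 * 1024
    CONTEXT = 64 * 1024         # bytes of surrounding data to keep around each hit

    with open(DEVICE, "rb") as disk, open("recovered.bin", "wb") as out:
        prev = b""
        while True:
            chunk = disk.read(CHUNK)
            if not chunk:
                break
            buf = prev + chunk
            i = buf.find(MARKER)
            while i != -1:
                out.write(buf[max(0, i - CONTEXT):i + CONTEXT])  # dump the neighbourhood of the hit
                i = buf.find(MARKER, i + 1)
            prev = buf[-(len(MARKER) - 1):]  # catch markers split across chunk boundaries

Then you sift recovered.bin by hand, which is tedious but a lot better than retyping the code.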
Back then yes. These days with SSDs, the OS will issue trim commands to the disk, zeroing the blocks from the OS point of view, and on SSDs with "secure delete", from a forensics point of view as well.
Perhaps I'm misunderstanding you, but trim doesn't secure delete anything. It merely indicates to the SSD that the sector is unused so that it can reap the underlying NAND location the next time it decides to garbage collect.
In other words, the data is still there in the flash, but only the SSD firmware (and physical access) has access to it.
That is very interesting, I didn't know there were SSDs with "secure delete."
I remember that Apple removed the "Secure Empty Trash" feature in OS X 10.11 because they didn't feel like they could guarantee secure deletion with their new fast SSDs present in most of their computers.
Yup, disk arrays can be quite a problem when it comes to forensics recovery. This has been a bit of a nemesis of mine over the years. Friends or family will decide to buy a single RAID solution for backup and configure it to write files across the disks for performance because they don't know any better. Four years later they'll come to me because something happened to the array like a failed or corrupted controller NVRAM and they want to recover the files. For backup I recommend mirrored single individual spinning disks, preferably in multiple locations.
That's really only possible on a single user system. VFX studios usually have large NFS servers (filers) typically running proprietary filesystems, and with hundreds or thousands of clients writing files simultaneously they don't get all laid down on the attached arrays in nice neat chunks representing entire files. Typically they would even try and distribute writes to multiple raid groups/arrays for a single file. Also consider the size, you were able to copy it onto an external drive to recover. Studios don't have a spare, empty filer sitting around to dump the random contents of a few racks worth of disks onto.
John Cleese's talk on Creativity recently made it to the front page of HN again https://www.youtube.com/watch?v=9EMj_CFPHYc and if you haven't watched it I highly recommend it.
I believe it was in this talk that he says the best work he ever did was when he scrapped it and started over. From practice I think we can all admit that while it's the hardest thing to do, it is always for the best.
Not necessarily. People often underestimate (in engineering fields) how much work it will take to rebuild something. In software there is a high degree of creativity which can have large downstream effects. You need to architect your system in such a way to make it possible to replace components when needed, this is where strong separation of concerns is important.
One thing that I've seen happen time and again is an organization bifurcating itself, so that there is one team working on the cool new replacement, and the other working on the old dead thing that everyone hates. Needless to say this creates animosity and severely limits an organization's ability to respond to customer demands.
Starting over should be taken very seriously.
Because sure, it's basic. To someone who knows that it's basic.
From a brief stint in the gfx industry, you are correct.
Pixar isn't a model of software excellence, it's a model of process and (ugh) culture excellence.
https://www.youtube.com/watch?v=XuL-_yOOJck
https://www.amazon.com/dp/1472439058
Be pretty.
Wasn't Catmull involved in wage-fixing? [http://www.cartoonbrew.com/artist-rights/ed-catmull-on-wage-...]
So that effort to recreate it (not to mention produce it in the first place) was pretty much all for naught? That must have been soul destroying
"We didn't scrap the models, but yes, we scrapped almost all the animation and almost all the layout and all the lighting. And it was worth it.
Changing the script saved the film, which in turn allowed Buzz and Woody to carry on for future generation (see ToyStory3 for how awesome that universe continues to be - well done to everyone who worked on the lastest installment!) and, in some ways, set a major cornerstone in the culture of Pixar. You may have heard John or Steve or Ed mention "Quality is a good business model" over the years. Well, that moment in Pixar's history was when we tested that, and it was hard, but thankfully I think we passed the test. Toy Story 2 went on to became one of the most successful films we ever made by almost any measure.
So, suffice it to say that yes, the 2nd version (which you saw in theatres and is now on the BluRay) is about a bagillion times better than the film we were working on. The story talent at the studio came together in a pretty incredible way and reworked everything. When they came back from the winter holidays in January '99, their pitch, and Steve Jobs's rallying cry that we could in fact get it done by the deadline later that year, are a few of the most vivid and moving memories I have of my 20 years at the studio."
https://www.quora.com/Did-Pixar-accidentally-delete-Toy-Stor...
"We have to keep this scene even though it's not quite perfect because otherwise it's a waste of money".
Maybe this is a bad example actually - the movie industry is something where you launch, market, and then leave it behind.
But the best architectures I've seen have been demolished, destroyed and rebuilt from the ground up for their purpose.
Same with code.
This can also be a destructive siren call:
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
But I think it's not absolute. Sometimes rewrites are imperative.
So often something happens which seems like a total disaster, the end of the world, and you struggle desperately to fix it.
In hindsight it turns out it didn't matter as much as you thought it did anyway. Has happened in so many startups I've worked at.
Life goes on.
I bet you they were suddenly industry experts in source control and data backups.
Reminds me of the Fred Brooks quote: "Plan to throw one away; you will, anyhow."
That's my theory as to what they did.
Under some shells (e.g. bash) that will expand to include `..` and `.`.
Hopefully, it was not at the root directory and we have frequent snapshots.