Bringing Avatar to Life
Avatar smashed all box office records with a worldwide gross in excess of $2.7 billion and still climbing. Weta Digital, the visual effects company responsible for the film, had to break some records of its own to create Avatar’s stunning 3D visuals. Weta Digital was well versed in intensive graphics rendering from its work on The Lord of the Rings trilogy and other recent films, but creating Avatar was still a tremendous technical effort.
Weta Digital pushed the limits of its compute and storage infrastructure far beyond anything it had done before. When it began working on Avatar in 2006, Weta Digital was just finishing production on King Kong. At that time it had roughly 4,400 CPU cores in its “renderwall” and about 100TB of storage. By the end of producing Avatar, the company had roughly 35,000 CPU cores in its renderwall and 3,000TB of storage. The RAM in the renderwall alone now exceeds Weta Digital’s total disk storage capacity at the end of producing King Kong.
I started working at Weta Digital in 2003 as a system administrator, just as the last Lord of the Rings movie was being completed. Since then my primary role has been to lead Weta Digital’s infrastructure team, which is responsible for all servers, networks, and storage. It was our job to create the infrastructure that made Avatar possible, and to solve any technical problems that came along.
Coping with Scale
Despite the tremendous growth that Weta Digital went through during the making of Avatar, managing the change in scale wasn’t as challenging as we’d feared. Much of this was due to having a well-seasoned team that knew how to work together. The team pulled together and, when something went wrong, we jumped in and fixed it. We worked hard and, for the most part, managed to be proactive rather than reactive.
We quickly realized that we were going to have to take two big steps to get to where we needed to be for Avatar.
These two elements gave us the physical capacity to scale our infrastructure as it grew and the bandwidth to move data freely between locations. New server infrastructure for the updated renderwall was created using HP Blade servers. With 8 cores and 24GB of RAM per blade, we were able to provision 1,024 cores and 3TB of RAM per rack. The new data center was organized in rows of 10 racks, so we built out our servers in units of 10 racks or 10,240 cores. We put in the first 10,000, waited a while, added another 10,000, waited a while more to add another 10,000, and finally put in the last 5,000 cores for the final push to completion.
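The per-rack and per-row figures above follow from simple arithmetic, which the short Python sketch below works through. The constants come from the text; the function names, and the assumption that every rack holds only identical blades, are ours for illustration.

```python
# Figures quoted in the article; the helper names are hypothetical.
CORES_PER_BLADE = 8
RAM_GB_PER_BLADE = 24
CORES_PER_RACK = 1024
RACKS_PER_ROW = 10


def rack_capacity(cores_per_rack=CORES_PER_RACK):
    """Return (blades, cores, ram_gb) for one rack of identical blades."""
    blades = cores_per_rack // CORES_PER_BLADE
    return blades, blades * CORES_PER_BLADE, blades * RAM_GB_PER_BLADE


def row_capacity(racks=RACKS_PER_ROW):
    """Return (cores, ram_gb) for one row of racks."""
    _, cores, ram_gb = rack_capacity()
    return racks * cores, racks * ram_gb


def renderwall_ram_gb(total_cores):
    """Estimate total RAM, assuming every core sits on an 8-core, 24GB blade."""
    return (total_cores // CORES_PER_BLADE) * RAM_GB_PER_BLADE


blades, cores, ram_gb = rack_capacity()
print(blades, cores, ram_gb / 1024)       # blades, cores, TB of RAM per rack
row_cores, row_ram_gb = row_capacity()
print(row_cores, row_ram_gb / 1024)       # cores, TB of RAM per 10-rack row
print(renderwall_ram_gb(35_000) / 1024)   # rough TB of RAM across the renderwall
```

Under these assumptions a rack holds 128 blades, and a renderwall of roughly 35,000 cores carries a little over 100TB of RAM, which is consistent with the earlier remark that RAM alone now exceeds the disk capacity we had at the end of King Kong.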
We have a multivendor storage infrastructure, but the core of the storage was made up of NetApp® storage systems providing about 1,000TB. By the end of making Avatar, we’d replaced all of our old FAS980 and FAS3050 systems with FAS6080 clusters. In the last eight months of the project we also added four SA600 storage acceleration appliances to solve one particularly troublesome performance problem.
Accelerating Texture File Access Times with Adaptive Caching
In the visual effects industry a texture is an image that gets applied to a 3D model to make it look real. Textures are wrapped around the model to give it detail, color, and shading so it looks like more than just a smooth gray model. A “texture set” is all the different pictures that must be applied to a particular model to make it look like a tree, person, or creature. Most renders that include an object also apply textures to the object; thus, textures are in high demand from the renderwall and they get used over and over again.
A given group of texture sets could be in demand by several thousand cores at any one time. An overlapping group could be in demand by another thousand cores and so on. Anything we can do to improve the speed with which textures are served has a dramatic impact on the performance of the renderwall as a whole.
No single file server could deliver the bandwidth necessary to serve our texture sets, so we developed a publishing process that was designed to create replicas of each new texture set after it was created. This is illustrated in Figure 1.
Figure 1) Old method of increasing bandwidth for texture sets.
When a job running on the renderwall needed to access a texture set, it chose a random file server and accessed the textures from that replica. By allowing us to spread the texture load across multiple file servers, this process improved performance significantly. While it was a better solution than relying on a single file server, the publishing and replication processes were complex and required time-consuming consistency checks to make sure that replicas stayed identical.
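The random-replica idea can be sketched in a few lines of Python. This is an illustration only, not our production code: the server names, the path layout, and the function names are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical replica hosts; the article does not name the file servers.
REPLICAS = ["filer-a", "filer-b", "filer-c", "filer-d"]


def pick_replica(replicas, rng=random):
    """Choose one file server uniformly at random for a render job."""
    if not replicas:
        raise ValueError("no replicas published for this texture set")
    return rng.choice(replicas)


def texture_path(texture_set, filename, replicas=REPLICAS, rng=random):
    """Build the path a job would read from (the layout is illustrative)."""
    server = pick_replica(replicas, rng)
    return f"/net/{server}/textures/{texture_set}/{filename}"


def simulate(jobs, replicas=REPLICAS, seed=0):
    """Count how many of `jobs` simulated render jobs land on each replica."""
    rng = random.Random(seed)
    return Counter(pick_replica(replicas, rng) for _ in range(jobs))


if __name__ == "__main__":
    print(texture_path("tree_bark", "diffuse.tex"))
    print(simulate(4000))
```

With enough jobs, uniform random choice spreads the load roughly evenly without any coordination between jobs. The cost is exactly what the text describes: every replica must hold an identical copy, so each published change has to be pushed to, and verified on, every server.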
We started looking at NetApp FlexCache® and the SA600 storage accelerator as a simpler way of solving the performance problems created by texture sets. FlexCache software creates a caching layer in your storage infrastructure that automatically adapts to changing usage patterns, eliminating performance bottlenecks. It automatically replicates and serves hot data sets anywhere in your infrastructure using local caching volumes.
Instead of manually copying our texture data to multiple file servers, FlexCache would allow us to dynamically cache the currently popular textures and serve them to the renderwall from the SA600s. We tested the solution and saw that it worked extremely well in our environment, so eight months before Avatar was due to be completed we took a gamble and installed four SA600 systems, each with two 16GB performance accelerator modules (PAMs) installed. (PAM serves as a memory cache to further reduce latency.)
Figure 2) Improved method of increasing bandwidth for texture sets using NetApp FlexCache, SA600, and PAM.
The total texture set was about 5TB, but once FlexCache was in place we discovered that only about 500GB of that was hot at any given time. Each SA600 had enough local disk to accommodate the hot data set and, as the hot data set changed, the caches adapted without us having to do anything. Aggregate throughput was in excess of 4GB/sec, far more than we’d ever achieved before.
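To show why a cache much smaller than the full texture set can still absorb most reads, here is a toy read-through cache in Python with least-recently-used eviction. It is a sketch of the general caching idea only. It is not how FlexCache is implemented, the eviction policy is our assumption, and the class and parameter names are hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy size-bounded read-through cache with LRU eviction."""

    def __init__(self, capacity_bytes, fetch_from_origin):
        self.capacity_bytes = capacity_bytes
        self.fetch_from_origin = fetch_from_origin  # callable: path -> bytes
        self._data = OrderedDict()                  # path -> bytes, oldest first
        self._used = 0
        self.hits = 0
        self.misses = 0

    def read(self, path):
        """Serve from cache when possible; otherwise fetch from the origin."""
        if path in self._data:
            self._data.move_to_end(path)            # mark as most recently used
            self.hits += 1
            return self._data[path]
        self.misses += 1
        blob = self.fetch_from_origin(path)
        self._store(path, blob)
        return blob

    def _store(self, path, blob):
        if len(blob) > self.capacity_bytes:
            return                                  # too large to cache at all
        while self._used + len(blob) > self.capacity_bytes:
            _, evicted = self._data.popitem(last=False)  # drop least recently used
            self._used -= len(evicted)
        self._data[path] = blob
        self._used += len(blob)


if __name__ == "__main__":
    origin_reads = []

    def origin(path):
        origin_reads.append(path)
        return b"x" * 100                           # pretend each texture is 100 bytes

    cache = ReadThroughCache(capacity_bytes=300, fetch_from_origin=origin)
    for _ in range(5):
        for name in ("a", "b", "c"):                # a small, stable hot set
            cache.read(name)
    print(cache.hits, cache.misses, len(origin_reads))
```

As long as the hot set fits in the cache, the origin file server sees each texture once and every later read is served locally. When the hot set shifts, older entries age out as new ones are read, with no manual republishing step.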
Caching textures with FlexCache was a superb solution. It made things run faster, and simplified the job of managing texture sets. We were in the final year of a four-year movie project. If we had put the SA600s in and had problems we couldn’t resolve quickly, we probably would have had to rip them back out. But after a week had passed we pretty much forgot about them until the end of the movie. That’s about as happy as you can make an IT guy.
Storage performance has a big impact on the speed at which renders happen. Storage bottlenecks can choke the throughput of your render farm. In the final years of Avatar we started digging into what that meant and added lots of monitoring capabilities and statistics on every job.
There was a constant backlog of jobs waiting to run; each day there would be many more jobs waiting for render time than the wall could actually complete. Weta Digital’s team of “wranglers” monitors the jobs to make sure everything happens as it should. The morning after we put FlexCache in, the lead wrangler came into my office to report that everything had finished. It had run so fast he figured we’d broken something.
Why Choose NetApp?
I’ve been a NetApp fan for a long time. I first used NetApp when I was working for an ISP in Alaska during the dot-com boom of the late 90s—I was impressed enough that I subsequently introduced NetApp storage at several other companies. I was pleased to find that NetApp storage was already in use when I got to Weta Digital.
For a company like Weta Digital, it’s business as usual to break things, because nobody else pushes infrastructure in quite the same way that Weta Digital does. The key is that, when something breaks, you need vendors who will help you fix it. Even when I worked at smaller companies, there was always someone at NetApp who would take the time to work through a problem with me until it was resolved. You’d think that would be business as usual, but in my experience it’s actually pretty rare.
Storage can be complex. NetApp technology makes storage as simple as it can be. There are things that I wish NetApp did better, but, compared to everything else I’ve seen out there, NetApp makes a versatile product that’s easy to use and then they back it up with support. That’s why we continue to choose NetApp.