Alluxio rolls out new filesystem built for deep learning

Alluxio Inc., which sells a high-performance open-source distributed filesystem, today introduced a completely overhauled version of its product fine-tuned for artificial intelligence workloads.

Alluxio Enterprise AI is aimed at data-intensive deep learning applications such as generative AI, computer vision, natural language processing, large language models and high-performance data analytics. It’s designed for high-performance model training and deployment at scale using an existing technology stack instead of specialized storage. Alluxio said enterprises can expect up to 20 times faster training speed compared to commodity storage, up to a 10-fold improvement in model serving, better than 90% graphics processing unit utilization and up to 90% lower infrastructure costs.

Architected for AI

The product is based on a new architecture called Decentralized Object Repository Architecture, or Dora, which Alluxio said provides infinite scale for AI workloads. Dora enables the platform to handle up to 100 billion objects using commodity object storage while supporting metadata management, high availability and performance. The platform supports deep learning pipelines, from ingestion to extract/transform/load, pre-processing, training and serving, across the training and deployment stages.

Enterprise AI’s distributed cache is tailored to AI workload input/output patterns, which differ markedly from traditional analytics. “Analytics typically works on files that are a few hundred megabytes or even in the gigabyte in size, but computer vision and deep learning work on extremely small files,” said Adit Madan, director of product management. “The concurrency requirements are also much higher than on the analytics base. Architectural changes had to be made to serve multiple [inputs and outputs per second] simultaneously.”

Those changes include intelligent distributed caching that lets AI engines read and write data through the high-performance cache instead of the much slower data lake storage. Alluxio said the caching is tailored to the large-file sequential access, large-file random access and massive small-file access patterns that are typical of AI engines. Training clusters are continuously fed data from the distributed cache to achieve high utilization of GPUs, which can cost over $30,000 each.
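To make the caching idea concrete, here is a minimal sketch of a read-through cache sitting in front of slow storage. It is not Alluxio's implementation: the `SlowStore` and `ReadThroughCache` classes, the file names and the least-recently-used eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class SlowStore:
    """Hypothetical stand-in for data lake storage; counts how often it is read."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._objects[key]


class ReadThroughCache:
    """Serve reads from memory, falling back to the store on a miss."""

    def __init__(self, store, capacity):
        self._store = store
        self._capacity = capacity
        self._entries = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = self._store.get(key)  # cache miss: go to slow storage
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value


store = SlowStore({f"img_{i}.jpg": bytes([i]) for i in range(4)})
cache = ReadThroughCache(store, capacity=4)

# Two passes over a set of small files: only the first pass reaches the store.
for _ in range(2):
    for i in range(4):
        cache.get(f"img_{i}.jpg")

print(store.reads)
```

In this toy setup the second pass is served entirely from memory, which is the effect the article describes for keeping GPUs fed. A real distributed cache would also have to deal with sharding, persistence and write paths, none of which is modeled here.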

“The typical scenario for Alluxio is to have dedicated hardware on-prem and multiple Trino or analytics clusters accessing a single Alluxio cluster,” Madan said, referring to the Trino open-source distributed SQL query engine. “In this architecture, we’re saying you don’t need dedicated hardware for Alluxio because we use the specialized compute infrastructure.

“The distinguishing factor is that we are doing this over commodity data,” Madan said. “What people are doing today is provisioning high-performance storage with different variants of what used to be parallel file systems designed for other purposes and repurposing that for machine learning and deep learning. We are co-locating with compute resources. The product’s technical specifications had to be completely different as a result.”

Different strokes

Alluxio Enterprise AI is the company’s third distributed filesystem product. The existing Alluxio Enterprise Edition will continue to be promoted as the best choice for analytic workloads and Alluxio Enterprise Data as a product for decentralized metadata.

The new platform provides a single pane of glass for enterprises to manage AI workloads across diverse infrastructure environments and enables data sharing across business units and geographical locations while removing the bottleneck of data lake silos.

In model training, for example, a PyTorch data loader can load to the Alluxio cache instead of to a virtual local path. During training, the cached datasets can be used in multiple epochs (complete passes of the training dataset through the algorithm), so training speed isn’t bottlenecked by the need to retrieve data from Amazon Web Services Inc.’s S3 storage. GPU idle time is mostly eliminated, and PyTorch can write the model files to S3 through Alluxio.
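As a rough illustration of the multi-epoch pattern, the sketch below reads every sample twice through ordinary file calls. It assumes the cached data is reachable through a file path, as a POSIX-style interface would allow; the `FileDataset` class and the temporary directory standing in for a mount point are hypothetical. The class follows the `__len__`/`__getitem__` protocol that PyTorch's map-style datasets use, but it is plain Python so that it runs without PyTorch installed.

```python
import os
import tempfile


class FileDataset:
    """Map-style dataset: one sample per file under a root directory."""

    def __init__(self, root):
        self.root = root
        self.names = sorted(os.listdir(root))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, index):
        with open(os.path.join(self.root, self.names[index]), "rb") as f:
            return f.read()


# Temporary directory standing in for a hypothetical cache mount point.
with tempfile.TemporaryDirectory() as mount:
    for i in range(3):
        with open(os.path.join(mount, f"sample_{i}.bin"), "wb") as f:
            f.write(bytes([i]) * 4)

    dataset = FileDataset(mount)
    bytes_read = 0
    for epoch in range(2):  # each epoch is one full pass over the dataset
        for index in range(len(dataset)):
            bytes_read += len(dataset[index])

print(bytes_read)
```

Every epoch rereads the full dataset, so whatever sits behind that path is hit once per sample per epoch. That is why the article's point about serving repeat reads from a cache rather than from S3 matters for training speed.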

Multiple TorchServe instances can read the model files concurrently from S3 storage during inferencing. Alluxio caches the latest model files and serves them to inference clusters with low latency. As a result, downstream AI applications can start inferencing using the most up-to-date models as soon as they are available.

Alluxio said Enterprise AI integrates with popular machine learning frameworks such as PyTorch, Apache Spark, TensorFlow and Ray. It also works with Representational State Transfer, POSIX and S3 application programming interfaces.

The software works on-premises and in the cloud in bare-metal or containerized environments. Supported storage systems include S3, Google LLC’s GCS, Microsoft Corp.’s Azure Blob Storage, MinIO Inc.’s object storage, Ceph software-defined storage and the Hadoop Distributed File System. All three major public cloud platforms are supported. Pricing is based on capacity and downloads are available immediately.

Image: Bing Image Creator
