GPU, RAM and SSD: turbos for analytics

RAM or SSD, GPU or CPU: when it comes to shortening the time needed to return query results over large volumes of data, structured or not, some technology providers rely on hardware acceleration. This is the case, for example, of AeroSpike, GridGain and MapD, which have each chosen to architect their solution around memory, SSDs, or a hybrid of the two, and, in MapD’s case, massively parallel graphics processors.

All three made the same observation: existing solutions are not necessarily suited to the needs of companies whose uses of analytics and data have evolved. They saw the need to place at the heart of their architecture a means of accelerating queries and access to large amounts of data, whether it comes from the transactional world or from the unstructured, sprawling world of Big Data.

MapD: a GPU-boosted SQL engine

Born at MIT, MapD shows that graphics processing units (GPUs) can also be applied to analytics. The company has developed an open source core, MapD Core, an in-memory columnar database with a GPU engine for SQL queries. According to the company, this engine can query billions of rows in milliseconds using SQL, with results “100x” faster than those achieved on traditional CPUs. MapD Core nonetheless also runs on CPUs: an LLVM-based compiler generates the query code for multiple architectures.

The point, obviously, is to be able to browse and query very large volumes of data in record time, including data that requires special processing. Geospatial data is one of MapD’s targets, and the company has made it one of its specialties. Other use cases are operational analytics and, of course, data science. “Placing the data as close as possible to the compute is critical today, which means fine-grained memory management at the GPU level,” says Todd Mostak, CEO of MapD.

While the parallelism of a fleet of GPUs is naturally a major source of performance for the MapD platform, this compute capacity also allows the company to do without indexes, he said. Without pre-aggregates or indexes to build, data can be explored and queried as fast as it is ingested, and ingestion itself is considerably faster.
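To make the "no index, no pre-aggregate" idea concrete, here is a minimal sketch of a filtered aggregate computed by scanning whole columns. It is not MapD code: the table, the column names and both functions are hypothetical, and the chunks are processed one after another in plain Python, whereas a GPU engine would hand independent chunks to many threads at once.

```python
from array import array

# Hypothetical columnar table: one typed array per column (names are illustrative).
columns = {
    "city_id": array("i", [1, 2, 1, 3, 2, 1]),
    "amount": array("d", [10.0, 20.0, 5.0, 7.5, 2.5, 4.0]),
}


def sum_where(table, filter_col, wanted, value_col):
    """Full scan: no index and no pre-aggregate, every row is tested."""
    total = 0.0
    for key, value in zip(table[filter_col], table[value_col]):
        if key == wanted:
            total += value
    return total


def sum_where_chunked(table, filter_col, wanted, value_col, chunk=2):
    """Same scan split into independent chunks whose partial sums are combined.

    Each chunk depends on no other, which is what makes this kind of scan
    a natural fit for parallel hardware (here they simply run sequentially).
    """
    n = len(table[filter_col])
    partials = []
    for start in range(0, n, chunk):
        part = {name: col[start:start + chunk] for name, col in table.items()}
        partials.append(sum_where(part, filter_col, wanted, value_col))
    return sum(partials)


total = sum_where(columns, "city_id", 1, "amount")
chunked_total = sum_where_chunked(columns, "city_id", 1, "amount")
```

Because only the two columns involved are read, a columnar layout keeps the scan compact, and nothing has to be rebuilt when new rows are appended.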

MapD Core is now the foundation of an open source project, the GPU Open Analytics Initiative (GOAI), whose ambition is to build a framework that facilitates data science on GPUs.

To this open core, MapD has added a data visualization tool, MapD Immerse, which makes the most of the GPU’s raw performance. The idea here is to speed up rendering, hence its appeal for geospatial data analysis. Thanks to the GPU, images are rendered directly on the server. MapD can also connect to third-party tools such as Tableau or Qlik.

Finally, MapD claims to be agnostic in terms of GPU technologies. The company has nonetheless attracted Nvidia’s attention: Nvidia has even invested in the company.

AeroSpike: a hybrid memory / SSD architecture

AeroSpike, for its part, also builds on In-Memory for its supercharged key/value database technology, but the company wanted to take advantage of Flash and SSDs as well. The AeroSpike engine makes the two technologies work together natively. This NoSQL database, which is clearly similar to Redis but with a shared-nothing architecture, relies on a mechanism optimized to store data in RAM, on SSDs or even on conventional hard disks if necessary. To speed up data access, the indexes are held in RAM and the data is stored on the SSDs. Technically, AeroSpike explains, the technology accesses disk blocks directly to speed up write operations. The SSD is not there only to provide persistent storage; it is a direct extension of the RAM. The company has also developed a custom, flexible data model that it wanted to structure along lines close to a transactional model while keeping its key-value specificity, as indicated on the vendor’s website. “It’s an illusion to say that adding SSD to any database will speed it up,” said AeroSpike co-founder Srini Srinivasan. “You must also change the data structure.”
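The split between an index held in RAM and records held on flash can be sketched as follows. This is only an illustration under stated assumptions, not AeroSpike’s implementation: the `HybridStore` class and its methods are hypothetical, and an ordinary file stands in for the SSD, whereas the text describes direct access to disk blocks.

```python
import os
import tempfile


class HybridStore:
    """Toy key-value store: index in RAM, values in an append-only file."""

    def __init__(self, path):
        self._index = {}                  # key -> (offset, length), kept in memory
        self._file = open(path, "w+b")    # fresh data file standing in for the SSD

    def put(self, key, value):
        # Always write at the end of the file: sequential writes, no in-place update.
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(value)
        self._file.flush()
        self._index[key] = (offset, len(value))

    def get(self, key):
        entry = self._index.get(key)      # lookup in RAM only
        if entry is None:
            return None
        offset, length = entry
        self._file.seek(offset)           # then a single read from storage
        return self._file.read(length)

    def close(self):
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "data.bin")
store = HybridStore(path)
store.put("user:1", b"alice")
store.put("user:2", b"bob")
store.put("user:1", b"alicia")            # overwrite: the index now points at the newer record
first = store.get("user:1")
second = store.get("user:2")
missing = store.get("user:3")
store.close()
```

A lookup costs one in-memory dictionary access plus one read from storage, which is why the medium holding the records can be slower than RAM without penalizing key resolution. A real engine would also reclaim the space left behind by overwritten records, which this sketch does not attempt.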

Until now, AeroSpike was positioned in the market for high-performance, real-time NoSQL databases for non-transactional systems. With version 4 of its solution, the company decided to add data consistency capabilities. This is a key advance that now allows it to address more transactional use cases, or even to establish itself as the single database powering both transactional and analytical operations, all in real time. “AeroSpike wants to prevent companies from having to add multiple layers between the data and the application. They can remove their cache tier,” says the executive. He adds: “We need to educate the market massively and tell businesses that they do not need this cache layer.” For that, AeroSpike will be able to count in particular on its collaboration with Intel, whose goal is to optimize for the Californian company’s Optane SSDs. For now, says the head of the company, this work focuses on reducing latency and targets the financial services market.

AeroSpike also regularly publishes the list of SSD vendors it recommends for use with its technology. A benchmark tool for evaluating SSD performance with AeroSpike has been developed as well.

With this In-Memory + SSD approach, explains Brian Bulkowski, co-founder and CTO of the company, businesses will be able to reduce their server footprint. According to him, 450 Cassandra nodes would become 60 AeroSpike nodes.

GridGain: all in memory

If SAP showed the way for In-Memory databases, at least in terms of vocabulary, GridGain intends to democratize the concept. The company developed an In-Memory database core that was quickly transferred to open source as the Apache Ignite project, around which it has built a premium offering of dedicated services and components, available in two separate editions.

The principle of GridGain, and therefore of Ignite, is to sit between applications and data sources, whether these are transactional databases, NoSQL stores or Hadoop clusters, and to load their data into its in-memory database in order to accelerate access through an SQL interface. Unlike AeroSpike, for example, GridGain emphasizes a purely In-Memory approach, whereas the NoSQL database is also optimized for SSDs. GridGain sees itself above all as an open source HANA, and considers its approach closer to a Google Spanner “whose joins are scalable,” according to Nikita Ivanov, the company’s founder and CTO. To meet the demands of the financial sector, which is well represented in GridGain’s customer base, the company has also added persistent storage on disk.
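The general pattern of loading data from an external source into memory and then querying it through SQL can be sketched with Python’s standard library. This is an assumption-laden illustration, not the Ignite or GridGain API: SQLite’s in-memory mode stands in for the in-memory layer on a single machine (Ignite is distributed), and `source_rows`, `load_into_memory` and the `trades` table are hypothetical names.

```python
import sqlite3

# Hypothetical rows standing in for an external source (RDBMS, NoSQL store, Hadoop).
source_rows = [
    ("EUR", 120.0),
    ("USD", 80.0),
    ("EUR", 30.0),
]


def load_into_memory(rows):
    """Copy the source rows into a SQL table that lives entirely in RAM."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (currency TEXT, amount REAL)")
    conn.executemany("INSERT INTO trades VALUES (?, ?)", rows)
    conn.commit()
    return conn


conn = load_into_memory(source_rows)

# Queries now run against memory through plain SQL, without touching the source.
totals = conn.execute(
    "SELECT currency, SUM(amount) FROM trades GROUP BY currency ORDER BY currency"
).fetchall()
```

The application keeps speaking SQL while the underlying source is read only at load time. Keeping the in-memory copy in sync with that source, and persisting it to disk, are separate concerns that this sketch leaves out.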
