Overview#

This section provides an overview of using MemryX to accelerate your AI application(s).

Accelerator#

It is helpful to begin with commonly used terms describing the MemryX AI Accelerator.

MXA

MemryX Neural Network Accelerator chips are called ‘MXA’ for short: [M]emry[X] [A]ccelerator

MX3

There were two internal prototypes before the release of our first product; therefore, this generation is called the ‘MX3’.

MCE

Within an MXA are hierarchically organized compute engines containing custom neural logic, MACs, ALUs, and more. Each MemryX Compute Engine is called an MCE. There are hundreds of MCEs in each MXA.

Pure Dataflow#

The MXA is built from the ground up to accelerate neural network inference. Hardware and software were co-designed to enable pure dataflow execution of neural network workloads, so the MXA is optimized for the efficient flow and processing of data. During execution, input data is streamed through the network layers according to the trained AI model until it reaches the output layer. The figure below is an illustrative example of data streaming from the input node to the two output nodes via seven neural network layers.

[Figure: example dataflow graph in which a single input streams through Layers 0-6 along branching paths to two output nodes.]

The MXA has programmable hardware that aligns with the dataflow of the trained AI model(s). Each MXA chipset is made of hundreds of dataflow cores, the MemryX Compute Engines (MCEs). The operation of each MCE and the data routes between them are programmed together during neural network model deployment, eliminating the need for runtime instruction scheduling. Once configured, the MXA efficiently streams data from input to output.

The MXA uses spatial multiplexing of its many cores to accelerate neural network layers. Each layer of the network is assigned a specific number of MCEs to optimize the overall inference performance and dataflow, as shown in the following illustration. Each MCE can be programmed to perform a specific job driven by the data stream, without the need for a universal control unit or runtime scheduler.

[Figure: network layers mapped onto groups of MCEs, with data streaming from inputs to outputs.]
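To make the streaming idea concrete, the following toy Python sketch (purely illustrative, not MXA code) models layers as pipeline stages that consume inputs as they arrive and immediately pass results downstream, with no central scheduler deciding what runs when:

```python
# Illustrative only: a toy streaming pipeline, not the MXA programming model.
# Each stage consumes values as they arrive and immediately passes results
# downstream, so no central scheduler decides what runs when.

def stage(fn, upstream):
    """Apply fn to each item flowing in from the upstream iterator."""
    for item in upstream:
        yield fn(item)

inputs = iter([1, 2, 3])
layer0 = stage(lambda x: x * 2, inputs)   # "Layer 0"
layer1 = stage(lambda x: x + 1, layer0)   # "Layer 1"

print(list(layer1))  # each input streams through both layers: [3, 5, 7]
```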

Concurrent Models and Streams#

Each MXA can seamlessly support concurrent operation of multiple neural network models and data streams. The user only needs to provide the neural compiler with the set of models and the selected number of MXA chips as inputs. The neural compiler automatically maps the concurrent models to the selected number of MXAs, and the SDK optimally distributes hardware resources among the models to achieve the highest possible performance. Note that the exact same software is used to map one large model across many MXAs or many small models onto a single MXA.

[Figure: two input streams (Stream-0, Stream-1) mapped to two models (Model-0, Model-1) running concurrently on one MPU, producing Output-0 and Output-1.]
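As an illustration, compiling two concurrent models might look like the sketch below. The class and argument names (NeuralCompiler, models, run()) reflect our reading of the MemryX Python API and are assumptions that may differ between SDK versions; consult the API reference for the exact interface.

```python
# Sketch: compile two models into a single DFP so both can run concurrently
# on one MXA. Names are illustrative and may differ from your SDK version.
from memryx import NeuralCompiler

nc = NeuralCompiler(
    models=["detector.onnx", "classifier.onnx"],  # two concurrent models
    verbose=1,
)
dfp = nc.run()  # the compiler partitions MCEs between the models automatically
```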

Inherently Scalable#

Scalability is an inherent attribute of MemryX hardware and software. Each MXA has a fixed number of MCEs, so 2X the chips means 2X the MCEs and 2X the AI compute capability, just as 10X the chips provide 10X the computing capability of a single MXA.

The user can cascade any number of MXA chips. Reasons for adding MXAs include supporting larger and/or more models, increasing model performance, or lowering latency. The neural compiler automatically distributes the workload of any number of models over any chosen number of MXA chips. In the diagram below, two chips have been cascaded, and the neural compiler optimally distributes the workload of the NN model to achieve high inference performance.

[Figure: one NN model's workload distributed across two cascaded MXA chips, streaming from inputs to outputs.]
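A sketch of targeting a cascade of chips is shown below; the num_chips parameter name is an assumption and may differ in your SDK version, so check the API reference.

```python
# Sketch: target a cascade of 4 MXA chips; the compiler distributes the
# model's layers across all chips. Argument names are illustrative.
from memryx import NeuralCompiler

nc = NeuralCompiler(
    models="large_model.onnx",
    num_chips=4,   # assumption: selects the number of cascaded MXAs
    verbose=1,
)
dfp = nc.run()
```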

Development Flow#

There are two main stages in deploying the MemryX solution to accelerate your AI workload: an offline stage and a runtime stage.

Offline#

The goal of the offline stage is to generate the dataflow program (DFP) that will be used to configure the MXA. The first step is to compile the neural network model(s) you want to accelerate with the MemryX neural compiler, producing a DFP. Next, program your chip using the generated DFP. The DFP can also be simulated to give a quick and reliable estimate of chip performance and latency without requiring the target hardware.

[Figure: User NN Model → Compile the Model → DFP; the DFP is then used to Program the Chip or to Simulate/Analyze.]
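A minimal offline-stage sketch, assuming a NeuralCompiler class with a dfp_fname argument (names may differ by SDK version; see the API reference):

```python
# Offline stage sketch: compile a trained model into a DFP file that can later
# program the chip or be fed to the simulation/analysis tools.
# Names (NeuralCompiler, dfp_fname) are assumptions; see the API reference.
from memryx import NeuralCompiler

nc = NeuralCompiler(
    models="my_model.onnx",     # trained model to accelerate
    dfp_fname="my_model.dfp",   # dataflow program written to disk
    verbose=1,
)
dfp = nc.run()
```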

Runtime#

MemryX runtime tools and drivers interface with the MXA. Drivers are available for multiple operating systems and provide C/C++ and Python bindings for ease of integration. The tools and drivers are designed to integrate easily with off-the-shelf image pipelines such as GStreamer and OpenCV, as well as IP implementations. Since runtime scheduling is not required, runtime execution on the MXA is simple, deterministic, and adds no overhead to the host system.

[Figure: Stream Source → Pre/Post Processing ↔ Driver (on the Host) ↔ MemryX Accelerator.]
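The sketch below shows one way an OpenCV capture pipeline could feed the accelerator through the Python bindings. The AsyncAccl class and its connect_input/connect_output/wait methods reflect our reading of the MemryX Python API; treat the exact names and callback semantics as assumptions and check the API reference.

```python
# Runtime sketch: feed camera frames to the MXA and consume results as they
# stream back. Class/method names (AsyncAccl, connect_input, connect_output,
# wait) are assumptions that may differ by SDK version.
import cv2
from memryx import AsyncAccl

cap = cv2.VideoCapture(0)  # off-the-shelf OpenCV capture pipeline

def send_frame():
    ok, frame = cap.read()
    if not ok:
        return None                        # assumption: None ends the input stream
    frame = cv2.resize(frame, (224, 224))  # host-side pre-processing
    return frame.astype("float32") / 255.0

def handle_output(*outputs):
    print("received", [o.shape for o in outputs])  # host-side post-processing

accl = AsyncAccl(dfp="my_model.dfp")
accl.connect_input(send_frame)      # frames are pulled as the chip is ready
accl.connect_output(handle_output)  # results are pushed as they stream out
accl.wait()                         # block until the input stream ends
```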

MemryX Solutions Require:#

No Need to Collect Pilot Images and Tune the Chip ✘#

You do not need a pilot dataset to tune the chip. The MXA uses floating-point (Bfloat16) activations and therefore does not need to be tuned for specific runtime conditions, unlike INT8-only systems.

No Need to Hand-Tune or Modify Trained AI Models ✘#

You do not need to spend any time hand-tuning models to achieve target performance. The MemryX neural compiler will optimally map the selected model to the hardware resources within minutes.

No Need to Trade Off Accuracy and Performance ✘#

There are no unique parameters that you need to tune or monitor to obtain the desired accuracy with high performance. For a given number of MXAs, the neural compiler maps the model to obtain the highest possible performance with negligible to no loss in accuracy.

No Need to Prune or retrain models ✘#

There is no need to prune (and retrain) models to achieve the target performance. Note also that the MemryX neural compiler is faithful to the provided models and does not perform any under-the-hood pruning. The MXA efficiently executes the compiled models with high utilization, performance, and accuracy.