Overview#

This section provides an overview of using MemryX to accelerate your AI application(s).

Accelerator#

It is helpful to begin with commonly used terms describing the MemryX AI Accelerator.

MXA

MemryX Neural Network Accelerator chips are called ‘MXA’ for short: [M]emry[X] [A]ccelerator

MX3

There were two internal prototypes before the release of our first product; therefore, this generation is called the ‘MX3’.

MCE

Within an MXA are hierarchically organized compute engines containing custom neural logic, MACs, ALUs, and more. Each MemryX Compute Engine is called an MCE. There are hundreds of MCEs in each MXA.

Pure Dataflow#

The MXA is built from the ground up to accelerate neural network inference. Hardware and software were co-designed to enable pure dataflow execution of neural network workloads, so the MXA is optimized for the efficient flow and processing of data. During execution, input data is streamed through the network layers according to the trained AI model until it reaches the output layer. The figure below is an illustrative example of data streaming from the input node to the two output nodes via seven neural network layers.

[Figure: example dataflow graph in which a single input streams through Layers 0-6 along branching paths to two output nodes.]

The MXA has programmable hardware that aligns with the dataflow of the trained AI model(s). Each MXA chipset is made of hundreds of dataflow cores, the MemryX Compute Engines (MCEs). The operation of each MCE and the data routes between them are programmed together during neural network model deployment, eliminating the need for runtime instruction scheduling. Once configured, the MXA efficiently streams data from input to output.

The MXA uses spatial multiplexing of its many cores to accelerate neural network layers. Each layer of the network is assigned a specific number of MCEs to optimize the overall inference performance and dataflow, as shown in the following illustration. Each MCE can be programmed to perform a specific job driven by the data stream, without the need for a universal control unit or runtime scheduler.

[Figure: network layers mapped onto groups of MCEs, with data streaming from inputs to outputs.]
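To make the streaming idea concrete, the following toy Python sketch (purely illustrative, not MXA code) models layers as pipeline stages that consume inputs as they arrive and immediately pass results downstream, with no central scheduler deciding what runs when:

```python
# Illustrative only: a toy streaming pipeline, not the MXA programming model.
# Each stage consumes values as they arrive and immediately passes results
# downstream, so no central scheduler decides what runs when.

def stage(fn, upstream):
    """Apply fn to each item flowing in from the upstream iterator."""
    for item in upstream:
        yield fn(item)

inputs = iter([1, 2, 3])
layer0 = stage(lambda x: x * 2, inputs)   # "Layer 0"
layer1 = stage(lambda x: x + 1, layer0)   # "Layer 1"

print(list(layer1))  # each input streams through both layers: [3, 5, 7]
```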

Concurrent Models and Streams#

Each MXA can seamlessly support concurrent operation of multiple neural network models and data streams. The user only needs to provide the neural compiler with the set of models and the selected number of MXA chips as inputs. The neural compiler automatically maps the concurrent models to the selected number of MXAs, and the SDK optimally distributes hardware resources among the models to achieve the highest possible performance. Note that the exact same software is used to map one large model across many MXAs or many small models onto a single MXA.

[Figure: two input streams (Stream-0, Stream-1) mapped to two models (Model-0, Model-1) running concurrently on one MPU, producing Output-0 and Output-1.]
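As an illustration, compiling two concurrent models might look like the sketch below. The class and argument names (NeuralCompiler, models, run()) reflect our reading of the MemryX Python API and are assumptions that may differ between SDK versions; consult the API reference for the exact interface.

```python
# Sketch: compile two models into a single DFP so both can run concurrently
# on one MXA. Names are illustrative and may differ from your SDK version.
from memryx import NeuralCompiler

nc = NeuralCompiler(
    models=["detector.onnx", "classifier.onnx"],  # two concurrent models
    verbose=1,
)
dfp = nc.run()  # the compiler partitions MCEs between the models automatically
```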

Inherently Scalable#

Scalability is an inherent attribute of MemryX hardware and software. Each MXA has a fixed number of MCEs, so 2X the chips means 2X the MCEs and 2X the AI compute capability, just as 10X the chips provide 10X the computing capability of a single MXA.

The user can cascade any number of MXA chips. Reasons for adding MXAs include supporting larger and/or more models, increasing model performance, or lowering latency. The neural compiler automatically distributes the workload of any number of models over any chosen number of MXA chips. In the diagram below, two chips have been cascaded, and the neural compiler optimally distributes the workload of the NN model to achieve high inference performance.

[Figure: one NN model's workload distributed across two cascaded MXA chips, streaming from inputs to outputs.]
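A sketch of targeting a cascade of chips is shown below; the num_chips parameter name is an assumption and may differ in your SDK version, so check the API reference.

```python
# Sketch: target a cascade of 4 MXA chips; the compiler distributes the
# model's layers across all chips. Argument names are illustrative.
from memryx import NeuralCompiler

nc = NeuralCompiler(
    models="large_model.onnx",
    num_chips=4,   # assumption: selects the number of cascaded MXAs
    verbose=1,
)
dfp = nc.run()
```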

Development Flow#

There are two main stages in deploying the MemryX solution to accelerate your AI workload: an offline stage and a runtime stage.

Offline#

The goal of the offline stage is to generate the dataflow program (DFP) that will be used to configure the MXA. The first step is to compile the neural network model(s) you want to accelerate with the MemryX neural compiler, producing a DFP. Next, program your chip using the generated DFP. The DFP can also be simulated to give a quick and reliable estimate of chip performance and latency without requiring the target hardware.

[Figure: User NN Model → Compile the Model → DFP; the DFP is then used to Program the Chip or to Simulate/Analyze.]
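A minimal offline-stage sketch, assuming a NeuralCompiler class with a dfp_fname argument (names may differ by SDK version; see the API reference):

```python
# Offline stage sketch: compile a trained model into a DFP file that can later
# program the chip or be fed to the simulation/analysis tools.
# Names (NeuralCompiler, dfp_fname) are assumptions; see the API reference.
from memryx import NeuralCompiler

nc = NeuralCompiler(
    models="my_model.onnx",     # trained model to accelerate
    dfp_fname="my_model.dfp",   # dataflow program written to disk
    verbose=1,
)
dfp = nc.run()
```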

Runtime#

MemryX runtime tools and drivers interface with the MXA. Drivers are available for multiple operating systems and provide C/C++ and Python bindings for ease of integration. The tools and drivers are designed to integrate easily with off-the-shelf image pipelines such as GStreamer and OpenCV, as well as IP implementations. Since runtime scheduling is not required, runtime execution on the MXA is simple, deterministic, and adds no overhead to the host system.

[Figure: Stream Source → Pre/Post Processing ↔ Driver (on the Host) ↔ MemryX Accelerator.]
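The sketch below shows one way an OpenCV capture pipeline could feed the accelerator through the Python bindings. The AsyncAccl class and its connect_input/connect_output/wait methods reflect our reading of the MemryX Python API; treat the exact names and callback semantics as assumptions and check the API reference.

```python
# Runtime sketch: feed camera frames to the MXA and consume results as they
# stream back. Class/method names (AsyncAccl, connect_input, connect_output,
# wait) are assumptions that may differ by SDK version.
import cv2
from memryx import AsyncAccl

cap = cv2.VideoCapture(0)  # off-the-shelf OpenCV capture pipeline

def send_frame():
    ok, frame = cap.read()
    if not ok:
        return None                        # assumption: None ends the input stream
    frame = cv2.resize(frame, (224, 224))  # host-side pre-processing
    return frame.astype("float32") / 255.0

def handle_output(*outputs):
    print("received", [o.shape for o in outputs])  # host-side post-processing

accl = AsyncAccl(dfp="my_model.dfp")
accl.connect_input(send_frame)      # frames are pulled as the chip is ready
accl.connect_output(handle_output)  # results are pushed as they stream out
accl.wait()                         # block until the input stream ends
```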

MemryX Solutions Require:#

No Need to Collect Pilot Images and Tune the Chip ✘#

You do not need a pilot dataset to tune the chip. The MXA uses floating-point (Bfloat16) activations and therefore does not need to be tuned for specific runtime conditions, unlike INT8-only systems.

No Need to Hand-Tune or Modify Trained AI Models ✘#

You do not need to spend any time hand-tuning models to achieve target performance. The MemryX neural compiler will optimally map the selected model to the hardware resources within minutes.

No Need to Trade Off Accuracy and Performance ✘#

There are no unique parameters that you need to tune or monitor to obtain the desired accuracy with high performance. For a given number of MXAs, the neural compiler maps the model to obtain the highest possible performance with negligible to no loss in accuracy.

No Need to Prune or retrain models ✘#

There is no need to prune (and retrain) models to achieve the target performance. Note also that the MemryX neural compiler is faithful to the provided models and does not perform any under-the-hood pruning. The MXA efficiently executes the compiled models with high utilization, performance, and accuracy.