AI Modeling Approaches for Speech BCIs


Over the past several years, multiple research groups have developed systems that decode speech from the electrical activity of neurons. Despite differences in electrode technology, implant location, and target application, these systems share a remarkably similar signal processing pipeline; they diverge most significantly in their neural decoder architectures and in how they integrate language models. The language-model stage is where the most striking recent gains have come from: in the UC Davis system, LLM rescoring reduced word error rates from 10% to 2.5% (Card et al., 2024).

This article gathers my reading on the matter, breaking down both the common signal processing pipeline and the different decoder architectures used across current speech BCIs. I have also enjoyed drawing diagrams depicting various functions in Figma.

Universal Signal Processing Pipeline

The voltage fluctuations recorded from electrodes contain a mixture of genuine neural activity, electrical noise from equipment, physiological artifacts from heartbeat and respiration, and baseline drift from changing electrode properties. The signal processing pipeline transforms this raw data into clean, standardized features that neural networks can learn from.

The pipeline described here represents the consensus approach across UCSF, UC Davis, Stanford, and Precision Neuroscience systems. Individual implementations vary in specific parameters, but the overall structure is consistent.

Raw Signal Acquisition

All systems digitize neural signals at high sampling rates; most sample at 30 kHz (30,000 samples per second) while Neuralink samples at 20 kHz and Precision gives a range of 20-30 kHz. This rate is dictated by the physics of neural signaling.

Diagram depicting spiking activity

Neurons communicate through action potentials, which are brief electrical events lasting 1–2 milliseconds. These "spikes" are the fundamental unit of neural information. To accurately capture a signal, the sampling rate must be at least twice the highest frequency component present (the Nyquist theorem).

The Nyquist-Shannon theorem states that to accurately reconstruct a signal, the sampling rate must exceed twice the highest frequency component present.

Since action potentials contain meaningful information up to approximately 5,000 Hz, sampling at 20–30 kHz provides sufficient headroom to capture spike waveforms without distortion.

The analog-to-digital conversion typically uses 10–16 bit resolution. At 16-bit resolution with a ±5 mV input range, the smallest detectable voltage step is approximately 150 nanovolts. Neural signals of interest range from 10–500 microvolts, well above this quantization floor.

Therefore, a 256-channel array sampled at 30 kHz with 16-bit resolution generates:

\text{Data rate} = 256 \times 30{,}000 \times 16 = 122.88 \text{ Mbps}

For the 1,024-channel arrays used by Precision Neuroscience and Neuralink, this exceeds 400 Mbps. This data volume motivates the subsequent processing steps that reduce dimensionality while preserving relevant information.
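As a quick sanity check on these numbers, the same arithmetic in Python (the Neuralink figures are the ones quoted later in this article):

```python
def data_rate_mbps(channels, sample_rate_hz, bits_per_sample):
    """Raw acquisition bandwidth in megabits per second."""
    return channels * sample_rate_hz * bits_per_sample / 1e6

print(data_rate_mbps(256, 30_000, 16))    # 122.88  (256-channel array above)
print(data_rate_mbps(1_024, 30_000, 16))  # 491.52  (1,024 channels at the same specs)
print(data_rate_mbps(1_024, 20_000, 10))  # 204.8   (Neuralink N1 figures cited later)
```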

Referencing for Noise Removal

Every electrode picks up not only local neural activity but also noise sources that affect the entire array. Referencing removes these global noise sources by exploiting the fact that they appear similarly across all electrodes while genuine neural signals are spatially localized.

When electrodes have similar impedance and noise characteristics, Common Average Referencing (CAR) is the simplest approach. At each moment in time, compute the mean voltage across all electrodes and subtract it from each electrode:

x_{\text{ref}}[i,t] = x_{\text{raw}}[i,t] - \frac{1}{N}\sum_{j=1}^{N} x_{\text{raw}}[j,t]

Here i indexes the electrode, t indexes the time sample, and N is the total number of electrodes. The subtraction removes any signal component that appears identically on all electrodes, which is precisely the definition of common-mode noise.

Diagram depicting Common Average Referencing (CAR)
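A minimal NumPy sketch of CAR as defined above (the array layout is assumed to be electrodes × samples):

```python
import numpy as np

def common_average_reference(x):
    """Common Average Referencing.

    x : (n_electrodes, n_samples) array of raw voltages.
    Subtracts the across-electrode mean at every time sample.
    """
    return x - x.mean(axis=0, keepdims=True)
```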

However, for intracortical arrays where electrode impedance can vary from 50 kΩ to 500 kΩ due to tissue encapsulation and manufacturing variation, noise couples differently to different electrodes.

Linear Regression Referencing (LRR), used by the UC Davis system, addresses this limitation. Instead of assuming equal noise coupling, LRR learns the optimal subtraction weight for each electrode:

\beta_i = \frac{\text{Cov}(x_i, \bar{x})}{\text{Var}(\bar{x})}

x_{\text{ref}}[i,t] = x_{\text{raw}}[i,t] - \beta_i \cdot \bar{x}[t]

where x̄ is the mean (or median) across electrodes. Each electrode gets its own coefficient β_i that reflects how strongly noise couples to that particular electrode. An electrode with high impedance picks up more noise and receives a larger correction; an electrode with low impedance receives less correction. This electrode-specific approach achieves superior noise removal for intracortical recordings.

Diagram depicting Linear Regression Referencing (LRR)
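A corresponding sketch of LRR; published implementations may fit the coefficients on a separate calibration block, but the per-electrode regression below is the core idea:

```python
import numpy as np

def linear_regression_reference(x, use_median=False):
    """Linear Regression Referencing (sketch).

    x : (n_electrodes, n_samples) array.
    Each electrode i is corrected by beta_i * reference, where beta_i is the
    least-squares regression of electrode i onto the array-wide reference.
    """
    ref = np.median(x, axis=0) if use_median else np.mean(x, axis=0)
    ref_c = ref - ref.mean()
    x_c = x - x.mean(axis=1, keepdims=True)
    beta = (x_c @ ref_c) / (ref_c @ ref_c)   # Cov(x_i, ref) / Var(ref), per electrode
    return x - beta[:, None] * ref[None, :]
```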

The order of referencing and bandpass filtering varies between implementations. The UCSF system applies CAR before filtering (Metzger et al., 2023). The UC Davis system applies a 250 Hz high-pass filter before computing LRR coefficients (Card et al., 2024). Both orderings have justification: referencing first removes common-mode noise (such as 60 Hz power line interference) before it can interact with the filter; filtering first ensures that LRR coefficients are computed on the frequency band of interest rather than being influenced by low-frequency drift or high-frequency noise outside the spike band.

Bandpass Filtering

After referencing, the signal still contains frequency components outside the bands of interest. Bandpass filtering isolates specific frequency ranges that carry different types of neural information.

Butterworth Filters

All documented systems use Butterworth filters. A Butterworth filter has a "maximally flat" frequency response in its passband, meaning all frequencies within the desired range are treated identically without ripples or distortion. Neural signals like spike waveforms are composed of multiple frequency components, so unequal treatment would distort their shape.

Filter "order" determines how sharply the filter cuts off at its boundaries. A 4th-order Butterworth filter attenuates signals at 24 dB per octave outside the passband. In practical terms, a signal one octave below the cutoff frequency is reduced by a factor of approximately 16.

Diagram depicting 4th Order Butterworth Filter. Notice how the 4th order has a steeper gradient.
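A sketch of the filter design with SciPy; the 4th-order spike-band and high-gamma bandpasses match the bands discussed in this section, while the data itself is a placeholder:

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 30_000   # sampling rate in Hz

# 4th-order Butterworth bandpass filters, designed as second-order sections
# for numerical stability.
sos_spike = butter(4, [250, 5_000], btype="bandpass", fs=fs, output="sos")  # spike band
sos_hg = butter(4, [70, 150], btype="bandpass", fs=fs, output="sos")        # high-gamma

x = np.random.randn(fs)               # one second of referenced data (placeholder)
spike_band = sosfilt(sos_spike, x)    # causal application, usable in real time
```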

Key Frequency Bands

Three frequency bands are of particular interest:

Band | Frequency Range | What It Captures
Spike band | 250–5,000 Hz | Individual action potentials
High-gamma | 70–170 Hz | Local population activity correlated with firing rate
Low-frequency | 0.3–40 Hz | Slow oscillations, event-related potentials

The spike band (250–5,000 Hz) isolates the frequencies where action potential waveforms have their energy. This is the band used for detecting individual neural spikes. The Stanford system uses this band for threshold crossing detection (Willett et al., 2023).

The high-gamma band (typically 70–150 Hz for UCSF, 130–170 Hz for Precision) correlates with local neuronal population spiking activity. When many neurons in an area fire together, this creates energy in the high-gamma range even when individual spikes cannot be cleanly resolved. Surface electrodes (ECoG) primarily use high-gamma because they lack the spatial resolution to detect individual spikes.

The low-frequency band (0.3–40 Hz) captures slower neural dynamics including theta, alpha, and beta oscillations. These reflect large-scale population activity and event-related potentials. Precision Neuroscience uses 0–40 Hz (Hettick et al., 2025); UCSF uses 0.3–17 Hz (Metzger et al., 2023).

Causal vs. Acausal Filtering

Filters can be applied causally (using only past samples) or acausally (using both past and future samples). Acausal filtering eliminates phase distortion, which is the phenomenon where different frequency components get shifted by different amounts in time, but requires access to future data. For offline analysis, acausal filtering produces cleaner spike waveforms. For real-time operation, only causal filtering is possible.
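In SciPy terms, the distinction is roughly sosfilt (causal, real-time capable) versus sosfiltfilt (zero-phase, offline only); this small sketch reuses a spike-band filter and placeholder data:

```python
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

fs = 30_000
sos = butter(4, [250, 5_000], btype="bandpass", fs=fs, output="sos")
x = np.random.randn(fs)          # placeholder referenced data

online = sosfilt(sos, x)         # causal: past samples only; adds phase delay
offline = sosfiltfilt(sos, x)    # acausal: zero-phase, but needs the full recording
```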

Feature Extraction

After filtering, the pipeline extracts features that summarize neural activity in forms suitable for machine learning. Two complementary features appear across nearly all systems: threshold crossings (or spike counts) and signal power (or envelope amplitude).

Threshold Crossings

When the bandpass-filtered signal drops below a negative threshold, this indicates a likely action potential. The detection rule is simple: count every time the voltage crosses below a threshold set at –4.5 times the root-mean-square (RMS) of the signal:

\text{Threshold} = -4.5 \times \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}

The threshold is negative because extracellular electrodes detect action potentials as negative-going deflections; when a neuron fires, positive ions rush into the cell, making the nearby extracellular space more negative.

The multiplier of 4.5 represents a trade-off between sensitivity and specificity. At –4.5 × RMS, approximately 99.9997% of Gaussian noise samples stay above the threshold and never cross it, while the majority of true action potentials (which typically exceed 100 µV peak, compared to ~30 µV RMS noise) are detected. Lower thresholds (–3.0 × RMS) catch more true spikes but also more false positives; higher thresholds (–6.0 × RMS) miss weaker spikes.

Diagram depicting space under -4.5 RMS

The threshold crossing count within a time window provides a proxy for the firing rate of neurons near that electrode.

To my knowledge, all current processing pipelines use a −4.5 × RMS threshold.
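A minimal sketch of threshold-crossing counting for one electrode; real pipelines typically estimate the RMS on a calibration block rather than on the data being binned:

```python
import numpy as np

def threshold_crossing_counts(x, bin_size, multiplier=-4.5):
    """Count threshold crossings per time bin for one spike-band-filtered electrode.

    bin_size is in samples (e.g. 20 ms at 30 kHz = 600 samples). The threshold
    is multiplier * RMS; a crossing is a transition from above it to below it.
    """
    threshold = multiplier * np.sqrt(np.mean(x ** 2))
    below = x < threshold
    crossing_idx = np.flatnonzero(~below[:-1] & below[1:]) + 1
    counts, _ = np.histogram(crossing_idx, bins=np.arange(0, len(x) + 1, bin_size))
    return counts
```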

Spike Band Power

Spike band power captures the aggregate energy in the spike frequency band, computed as the RMS amplitude within each time bin:

\text{SBP} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}

This feature is useful when individual spikes cannot be cleanly isolated, for example, when multiple neurons near an electrode fire in overlapping patterns. Even without detecting discrete spikes, the overall energy in the spike band reflects population activity.

Threshold crossings give discrete counts; spike band power gives continuous amplitude. Using both provides complementary information.

Diagram depicting Spike Band Power detection
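The corresponding spike band power computation is just a per-bin RMS:

```python
import numpy as np

def spike_band_power(x, bin_size):
    """RMS amplitude of the spike-band signal within each non-overlapping bin."""
    n_bins = len(x) // bin_size
    bins = x[: n_bins * bin_size].reshape(n_bins, bin_size)
    return np.sqrt(np.mean(bins ** 2, axis=1))
```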

High-Gamma Envelope via Hilbert Transform

For surface electrode systems that use high-gamma activity, the feature of interest is the instantaneous amplitude (envelope) of oscillations in the 70–150 Hz band. The Hilbert transform extracts this envelope without requiring arbitrary smoothing parameters.

The mathematical intuition is as follows. Any oscillating signal can be thought of as the projection of a rotating vector onto the real number line. The observed signal is:

x(t) = A(t) \cdot \cos(\omega t + \phi)

where A(t) is the time-varying amplitude (what we want) and ωt + φ is the phase (which varies rapidly and is not of interest).

The Hilbert transform generates the "imaginary" component of this rotating vector:

\mathcal{H}\{x(t)\} = A(t) \cdot \sin(\omega t + \phi)

Combining both components gives the analytic signal:

z(t) = x(t) + i\mathcal{H}\{x(t)\} = A(t) \cdot e^{i(\omega t + \phi)}

The envelope is simply the magnitude of this complex signal:

A(t) = |z(t)| = \sqrt{x(t)^2 + \mathcal{H}\{x(t)\}^2}

The phase cancels out because the magnitude of a complex number is independent of its angle. What remains is only the time-varying amplitude A(t), the envelope that tracks how strong the high-gamma oscillations are at each moment.

The Hilbert transform is only meaningful for narrowband signals, which is why the data must first be bandpass filtered to the high-gamma range before applying it.

Diagram depicting Hilbert Transform
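Putting the two steps together with SciPy (bandpass first, then the analytic-signal magnitude); the sampling rate and data here are placeholders:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 1_000                      # assume data already downsampled to 1 kHz
x = np.random.randn(10 * fs)    # placeholder referenced recording

# 1. Restrict to the high-gamma band first; the Hilbert envelope is only
#    meaningful for narrowband signals.
sos = butter(4, [70, 150], btype="bandpass", fs=fs, output="sos")
hg = sosfiltfilt(sos, x)

# 2. The analytic signal z(t) = x(t) + iH{x(t)}; its magnitude is the envelope A(t).
envelope = np.abs(hilbert(hg))
```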

Feature Dimensionality

With two features per electrode, the feature vector size is:

\text{Features per time bin} = N_{\text{electrodes}} \times 2

For the UC Davis system with 256 electrodes: 512 features. For the UCSF system with 253 electrodes: 506 features. For Precision Neuroscience with 1,024 electrodes: 2,048 features.

Diagram depicting Feature Dimensionality

Downsampling

The raw sampling rate of 20–30 kHz far exceeds what is needed for the features of interest. After feature extraction, systems progressively downsample to reduce computational burden while preserving relevant temporal dynamics.

A typical cascade for a speech BCI:

Stage | Sampling Rate | Purpose
Raw acquisition | 30,000 Hz | Capture spike waveforms
After initial processing | 1,000 Hz | Sufficient for all frequency bands
After feature extraction | 200 Hz | Sufficient for high-gamma (up to 150 Hz)
Decoder input | 33 Hz | Matches phoneme timescales (~80–100 ms per phoneme)
Diagram depicting Downsampling

Each downsampling step requires anti-aliasing filtering first. This is a low-pass filter that removes frequencies above half the new sampling rate. Without this, high-frequency energy would "fold back" into the lower frequencies, corrupting the signal.

The Nyquist theorem states that a sampling rate of f_s can only faithfully represent frequencies up to f_s/2, so at 200 Hz sampling the highest representable frequency is 100 Hz. Since high-gamma extends to 150 Hz, some systems use higher intermediate rates; 200 Hz is sufficient only after the high-gamma envelope has been extracted (the envelope varies slowly even though the underlying oscillation is fast).

The final downsampling to ~33 Hz (one sample every 30 ms) matches the timescale of phonemes. Since phonemes last approximately 80–100 ms, 33 Hz provides roughly 3 samples per phoneme - enough to track phoneme transitions while dramatically reducing sequence length for the neural network.
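A sketch of the cascade with scipy.signal.decimate, which applies an anti-aliasing low-pass before discarding samples; decimating in stages avoids very long filters for large factors:

```python
import numpy as np
from scipy.signal import decimate

fs = 30_000
x = np.random.randn(10 * fs)   # placeholder broadband recording

# decimate() applies an anti-aliasing low-pass filter before keeping every
# q-th sample, preventing energy above the new Nyquist limit from folding back.
x_3k = decimate(x, q=10)       # 30 kHz -> 3 kHz
x_1k = decimate(x_3k, q=3)     # 3 kHz  -> 1 kHz (stages keep the filters short)
x_200 = decimate(x_1k, q=5)    # 1 kHz  -> 200 Hz
# The final step to ~33 Hz happens by averaging features over 30 ms bins,
# not by decimating the raw signal further.
```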

Normalization

Neural signals drift over time due to electrode impedance changes, tissue adaptation, participant arousal and fatigue, and environmental noise variations. Without normalization, a decoder trained on day one would perform poorly on day two as the signal statistics shift.

Z-Score Normalization

Z-scoring transforms raw values into standard deviation units:

z = \frac{x - \mu}{\sigma}

where μ is the mean and σ is the standard deviation. After z-scoring, the data has approximately zero mean and unit variance.

Diagram depicting z-score Normalisation

Sliding Window Implementation

The UCSF system computes z-score statistics within a 30-second sliding window (Metzger et al., 2023). For each time point, the mean and standard deviation are computed over the surrounding 30 seconds of data, and the current sample is normalized relative to those local statistics.

This approach adapts to slow drifts in signal properties while preserving the fast dynamics that encode speech. A 30-second window is long enough to provide stable statistics but short enough to track changes over minutes to hours.

Diagram depicting Sliding Window Implementation
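A sketch of sliding-window z-scoring; this version uses a trailing (causal) window, a real-time-friendly variant of the surrounding-window statistics described above:

```python
import numpy as np

def sliding_zscore(features, window=6_000, eps=1e-6):
    """Causal sliding-window z-score (30 s of 200 Hz features = 6,000 samples).

    features : (n_timesteps, n_features) array.
    Each sample is normalized by the mean and std of the trailing window.
    """
    out = np.empty_like(features, dtype=float)
    for t in range(features.shape[0]):
        win = features[max(0, t - window + 1) : t + 1]
        out[t] = (features[t] - win.mean(axis=0)) / (win.std(axis=0) + eps)
    return out
```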

Cross-Day Normalization

For systems that accumulate training data across multiple recording sessions, z-score statistics may be computed across days rather than within a single session. This allows the model to learn patterns that are stable across the variability that naturally occurs between sessions.

Temporal Binning and Windowing

The final preprocessing step organizes continuous features into discrete chunks for the neural network.

Binning

Features are computed in non-overlapping time bins. Every system I know of uses 20 ms bins, which provides temporal resolution fast enough to track phoneme transitions while containing enough samples for stable statistics (20 ms at 1 kHz = 20 samples).

Within each 20 ms bin, threshold crossing counts are summed and spike band power is averaged.

Diagram depicting Temporal Binning

Stacking Bins for RNN Input

Rather than feeding one bin at a time to the neural network, systems typically stack multiple consecutive bins into a single input vector. The Stanford system stacks 4 bins (80 ms) of features at each input step (Willett et al., 2023):

\text{RNN input} = [\text{Bins}_{t-3},\ \text{Bins}_{t-2},\ \text{Bins}_{t-1},\ \text{Bins}_{t}]

For 256 features per bin, this produces 1,024 values per RNN timestep. The stacking provides the network with immediate temporal context, helping it recognize patterns that span multiple bins.

Diagram depicting Stacked Binning
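A sketch of the binning-and-stacking step, assuming per-sample features at 1 kHz and non-overlapping 80 ms decoder steps:

```python
import numpy as np

def bin_features(threshold_counts, sbp, bin_size=20):
    """Combine per-sample features into non-overlapping bins (20 ms at 1 kHz).

    threshold_counts, sbp : (n_samples, n_electrodes) arrays.
    Crossing counts are summed within a bin; spike band power is averaged.
    """
    n_bins = threshold_counts.shape[0] // bin_size
    tc = threshold_counts[: n_bins * bin_size].reshape(n_bins, bin_size, -1).sum(axis=1)
    pw = sbp[: n_bins * bin_size].reshape(n_bins, bin_size, -1).mean(axis=1)
    return np.concatenate([tc, pw], axis=1)      # (n_bins, 2 * n_electrodes)

def stack_bins(binned, n_stack=4):
    """Stack n_stack consecutive 20 ms bins into one 80 ms decoder input vector."""
    n_bins, n_feat = binned.shape
    n_steps = n_bins // n_stack
    return binned[: n_steps * n_stack].reshape(n_steps, n_stack * n_feat)
```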

Gaussian Smoothing

Neural firing is inherently noisy. Even when encoding the same intended movement, exact spike timing varies from moment to moment. The Stanford system applies Gaussian kernel smoothing to average out this variability (Willett et al., 2023).

A Gaussian kernel weights nearby samples more heavily than distant ones, following a bell-curve shape; Willett et al. use an 80 ms kernel width. This produces smoother output than a flat moving average, which would create abrupt changes when noisy samples enter or exit the window.

Diagram depicting Gaussian Smoothing
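With binned features in hand, the smoothing itself is one call; the kernel width here is illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

binned = np.random.randn(500, 512)   # placeholder (n_bins, n_features)
# sigma is in bins; with 20 ms bins, sigma=2 gives a kernel whose bulk spans
# roughly 80 ms along the time axis.
smoothed = gaussian_filter1d(binned, sigma=2, axis=0)
```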

Output Rate

The RNN typically produces one output every 80 ms (12.5 predictions per second). Since phonemes last roughly 80–100 ms, this rate is fast enough for real-time speech (about one prediction per phoneme) while being slow enough for the network to integrate meaningful context.


Neural Decoder Architectures

The signal processing pipeline produces a sequence of feature vectors, one every 80 ms, each containing several hundred dimensions representing neural activity across all electrodes. The neural decoder's job is to map this sequence to intended output: phonemes, words, cursor movements, or other control signals.

Different decoder architectures represent different points on a trade-off space involving accuracy, latency, training data requirements, and computational complexity. This section presents the major architectures in order of increasing sophistication.

Kalman Filters

Used by the BrainGate consortium for historical motor BCIs and Neuralink as a component of their hybrid decoder, the Kalman filter is a mathematical algorithm developed in 1960 for aerospace navigation and has since become a standard tool in control systems, robotics, and neural engineering. It is primarily used for estimating the state of a system from noisy measurements.

Diagram depicting Kalman Filters used during prediction

Consider decoding intended cursor movement from neural activity. At each moment, we want to estimate the cursor's position and velocity (the "state") from the firing rates of recorded neurons (the "measurements"). Both the neural signals and our models of how they relate to movement are imperfect.

The Kalman filter maintains two quantities:

  1. The current best estimate of the state
  2. The uncertainty in that estimate

At each time step, it performs two operations:

Predict: Use a model of how the state evolves over time to predict the next state. For cursor movement, this might assume that velocity changes smoothly, as the cursor won't teleport.

\hat{x}_{k|k-1} = F \cdot \hat{x}_{k-1|k-1}

P_{k|k-1} = F \cdot P_{k-1|k-1} \cdot F^T + Q

Here x̂ is the state estimate, P is the uncertainty (covariance), F is the state transition matrix that encodes assumptions about motion dynamics, and Q is the process noise covariance representing how much the state can change unpredictably.

Update: Incorporate the new neural measurement to correct the prediction.

K_k = P_{k|k-1} \cdot H^T \cdot (H \cdot P_{k|k-1} \cdot H^T + R)^{-1}

\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \cdot (z_k - H \cdot \hat{x}_{k|k-1})

Here z_k is the measurement (neural features), H is the observation matrix relating states to measurements, R is the measurement noise covariance, and K_k is the "Kalman gain" that determines how much to trust the measurement versus the prediction.

The Kalman gain is adaptive. When measurements are noisy (R is large), the gain is small and the filter trusts the prediction more; when measurements are reliable (R is small), the gain is large and the filter trusts the measurement more.
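The predict/update cycle above is a few lines of NumPy; the matrix shapes and contents are whatever the application defines (for cursor control, F would encode smooth velocity dynamics):

```python
import numpy as np

def kalman_step(x_hat, P, z, F, Q, H, R):
    """One predict/update cycle of a linear Kalman filter.

    x_hat : (n_state,)          previous state estimate (e.g. cursor position/velocity)
    P     : (n_state, n_state)  estimate covariance
    z     : (n_obs,)            current measurement (neural features)
    F, Q  : state transition matrix and process-noise covariance
    H, R  : observation matrix and measurement-noise covariance
    """
    # Predict: propagate the state and its uncertainty forward in time.
    x_pred = F @ x_hat
    P_pred = F @ P @ F.T + Q
    # Update: blend the prediction with the new measurement via the Kalman gain.
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x_hat)) - K @ H) @ P_pred
    return x_new, P_new
```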

The Kalman filter is computationally efficient, requiring only matrix multiplications and inversions. It provides optimal estimates when the system dynamics and measurement relationships are linear and the noise is Gaussian. It produces smooth outputs without the jerky movements that can arise from noisier decoders.

However, the assumption of linear relationships between neural activity and intended output is often violated. The mapping from neural firing to intended speech is highly nonlinear; the same neural pattern can mean different things depending on context. Kalman filters cannot capture the complex temporal dependencies present in speech production.

For motor BCIs controlling cursor movement, where the relationship between motor cortex activity and intended velocity is approximately linear, Kalman filters work well.

Simple Discriminative Classifiers

For tasks that require classifying discrete categories rather than generating continuous sequences, simpler classifiers can be effective, especially when training data is limited.

Gaussian Naive Bayes

The Kunz et al. inner speech study used Gaussian Naive Bayes to classify 7 single-syllable words from neural activity (Kunz et al., 2025). The classifier assumes that features are conditionally independent given the class and that each feature follows a Gaussian distribution.

For a feature vector x and class c, the classifier computes:

P(c|\mathbf{x}) \propto P(c) \prod_{i=1}^{n} P(x_i|c)

where each P(x_i|c) is modeled as a Gaussian with class-specific mean and variance.

The classifier uses a fixed 500 ms analysis window of neural data, sufficient to capture the neural activity associated with producing a single word.

Diagram depicting Gaussian Naive Bayes classification

For 7-word classification, where chance performance is 14.3%, Gaussian Naive Bayes achieved 85–92% accuracy on attempted speech and 72.6% accuracy on inner speech using motoric imagery. However, discriminative classifiers are inherently limited: they output one of a fixed set of classes per input window and cannot generate variable-length output sequences, handle continuous speech, or model temporal dependencies between successive outputs. They are tools for controlled experiments, not continuous communication.
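For completeness, a Gaussian Naive Bayes classifier of this kind is a few lines with scikit-learn; the shapes and data below are placeholders, not the study's:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Placeholder data: one flattened 500 ms window of binned features per trial
# (25 bins x 512 features) and one of 7 word labels per trial.
X_train = np.random.randn(200, 25 * 512)
y_train = np.random.randint(0, 7, size=200)

clf = GaussianNB()                    # class-specific Gaussian per feature
clf.fit(X_train, y_train)
predicted_word = clf.predict(np.random.randn(1, 25 * 512))
```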

Recurrent Neural Networks with CTC Loss

Used by Stanford, UC Davis, UCSF, and the BrainGate inner speech work, RNNs with CTC loss are the dominant architecture for current speech BCIs.

Recurrent neural networks (RNNs) process sequential input by maintaining a "hidden state" that accumulates information over time. Unlike feedforward networks that treat each input independently, RNNs can learn temporal patterns and dependencies.

The GRU Architecture

The Gated Recurrent Unit (GRU) is a specific RNN architecture designed to model long-range dependencies while avoiding the "vanishing gradient" problem that makes training standard RNNs difficult.

Diagram depicting GRU architecture

At each time step, the GRU updates its hidden state based on the current input and previous hidden state:

Update gate z_t controls how much of the new candidate state to use versus keeping the old state:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])

Reset gate r_t controls how much of the previous state to forget when computing the candidate:

r_t = \sigma(W_r \cdot [h_{t-1}, x_t])

Candidate state h̃_t is computed using the reset gate:

\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])

New hidden state blends old and candidate states:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

Here σ is the sigmoid function (outputs between 0 and 1), ⊙ is element-wise multiplication, and W_z, W_r, W are learned weight matrices.

The gates allow the network to selectively remember or forget information. If the update gate is near zero, the hidden state barely changes; if near one, it largely replaces the old state with the new candidate. This gating mechanism enables GRUs to maintain relevant information over many time steps.
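The equations above translate directly into code; here is a single GRU step in NumPy, with biases omitted for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU time step following the equations above (biases omitted).

    x_t    : (n_in,)   current input (e.g. stacked neural features)
    h_prev : (n_hid,)  previous hidden state
    Wz, Wr, W : (n_hid, n_hid + n_in) learned weight matrices
    """
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ concat)                                   # update gate
    r = sigmoid(Wr @ concat)                                   # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                      # new hidden state
```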

Stacked and Bidirectional GRUs

Speech BCIs typically use 4–5 stacked GRU layers with 500–512 hidden units per layer. Each layer operates on the output of the previous layer, learning increasingly abstract representations:

  • Layer 1: Articulator-level activity patterns
  • Layer 2: Phonetic feature combinations
  • Layer 3: Phoneme representations
  • Layer 4: Coarticulatory context
  • Layer 5: Higher-order sequential patterns

Bidirectional GRUs process the sequence in both forward and backward directions simultaneously. This matters for speech because of coarticulation, where each sound is influenced by both preceding and following sounds. The forward pass captures "what came before" (carryover coarticulation); the backward pass captures "what comes after" (anticipatory coarticulation).

Bidirectional processing requires buffering future data, introducing latency. For real-time systems that cannot tolerate this latency (like avatar animation), unidirectional GRUs are used instead.

Connectionist Temporal Classification (CTC) Loss

The key innovation enabling RNN-based speech decoding is CTC loss, which solves the problem of aligning neural signals to the phonemes the user was attempting to say. Manual annotation of phoneme boundaries at millisecond resolution would be impractical and subjective. CTC defines a "valid alignment" as any sequence of per-frame outputs that, after collapsing, produces the target transcript. It then computes the probability of the target as the sum over all valid alignments.

Diagram depicting CTC

CTC introduces a blank token (ε) representing "no output at this timestep." The collapse function removes blanks and merges consecutive identical non-blank symbols:

B(\text{ε ε K K K ε AE AE ε T T ε}) = \text{K AE T}

Multiple paths collapse to the same output. CTC sums over all of them:

P(\text{target}|\text{input}) = \sum_{\text{paths that collapse to target}} P(\text{path}|\text{input})

The number of valid paths grows exponentially with sequence length, and naive enumeration is computationally impossible. CTC uses dynamic programming (the forward-backward algorithm) to compute the sum efficiently in O(T × L) time, where T is the input sequence length and L is the target length.

The algorithm defines a forward variable α(t, s) as the probability of outputting the first s symbols of the target in the first t timesteps. By computing this recursively and summing over all ways to reach each cell, the algorithm efficiently marginalizes over all valid alignments.

The loss is the negative log probability of the target:

\mathcal{L}_{CTC} = -\log P(\text{target}|\text{input})

Gradients are computed using the forward-backward variables and backpropagated through the network. The gradient for a particular output is proportional to how much probability mass flows through that output when generating the target; outputs that appear in many valid paths receive strong gradients.
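A sketch of how the pieces fit together in PyTorch: a stacked bidirectional GRU emitting per-frame log-probabilities, trained with nn.CTCLoss. The layer count, feature size, and 41-class output (40 phonemes plus blank) are illustrative rather than any specific system's configuration:

```python
import torch
import torch.nn as nn

N_FEATURES, N_CLASSES = 512, 41   # e.g. 40 phonemes + CTC blank at index 0

class SpeechDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(N_FEATURES, 512, num_layers=5,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 512, N_CLASSES)

    def forward(self, x):                      # x: (batch, time, features)
        h, _ = self.gru(x)
        return self.out(h).log_softmax(-1)     # per-frame log-probabilities

model = SpeechDecoder()
ctc = nn.CTCLoss(blank=0)

x = torch.randn(8, 100, N_FEATURES)             # 8 trials, 100 decoder timesteps
targets = torch.randint(1, N_CLASSES, (8, 20))  # phoneme label sequences (no blanks)
input_lengths = torch.full((8,), 100)
target_lengths = torch.full((8,), 20)

log_probs = model(x).permute(1, 0, 2)           # CTCLoss expects (time, batch, classes)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```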

RNN + N-gram Language Model

Used by Stanford, UCSF, and UC Davis, the RNN + n-gram language model combination forms the first stage of the standard two-stage decoding approach.

The RNN decoder outputs phoneme probabilities, but these are noisy and ambiguous. The neural signal for /p/ and /b/ may be nearly identical; both are bilabial stops differing only in voicing, which may not have distinct neural signatures. Without additional constraints, the decoder might output "bizza" when the participant intended "pizza."

Language models inject knowledge of English vocabulary and syntax, selecting the most plausible word sequence consistent with the phoneme probabilities.

N-gram Language Models

An n-gram model estimates the probability of a word given the previous (n-1) words:

P(w_t | w_1, ..., w_{t-1}) \approx P(w_t | w_{t-(n-1)}, ..., w_{t-1})

For a 5-gram model, each word's probability depends on the 4 preceding words. The model is trained on large text corpora, counting how often each 5-word sequence appears.

Beam Search Decoding

To find the most likely word sequence given both the neural decoder output and the language model, systems use beam search, a heuristic search algorithm that explores multiple hypotheses in parallel.

At each timestep, beam search:

  1. Expands each hypothesis in the current "beam" by considering all possible next phonemes
  2. Scores each expansion using: neural probability + (weight × language model probability)
  3. Keeps only the top-K scoring hypotheses (typically K=100)
  4. Continues until the sequence ends

The language model weight controls the trade-off between trusting the neural decoder versus trusting linguistic constraints. Higher weight makes output more grammatical but potentially misses words the neural decoder is confident about.

Diagram depicting Beam Search
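A greatly simplified beam search to illustrate the scoring rule; production systems run CTC prefix beam search over a WFST rather than this naive per-step expansion, and lm_score here is a stand-in for an n-gram lookup:

```python
import numpy as np

def beam_search(log_probs, lm_score, beam_width=100, lm_weight=0.5):
    """Greatly simplified beam search (no CTC blanks, no lexicon).

    log_probs : (n_timesteps, n_symbols) per-step log-probabilities from the
                neural decoder.
    lm_score  : callable mapping a symbol sequence (tuple) to a language-model
                log-probability.
    """
    def combined(item):
        hyp, neural = item
        return neural + lm_weight * lm_score(hyp)

    beams = {(): 0.0}       # hypothesis -> accumulated neural score
    for step in log_probs:
        candidates = {hyp + (sym,): score + lp
                      for hyp, score in beams.items()
                      for sym, lp in enumerate(step)}
        # Keep only the top beam_width hypotheses under the combined score.
        beams = dict(sorted(candidates.items(), key=combined, reverse=True)[:beam_width])
    return max(beams.items(), key=combined)[0]
```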

WFST Implementation

Production systems often implement decoding using Weighted Finite State Transducers (WFSTs). A WFST represents the search space as a graph where edges have weights corresponding to probabilities. Three transducers are composed:

  • T: Token transducer (maps CTC outputs to phonemes)
  • L: Lexicon transducer (maps phonemes to words)
  • G: Grammar transducer (encodes the language model)

The composition T ∘ L ∘ G creates a single search graph that jointly optimizes across all three constraints. Standard WFST algorithms find the lowest-cost path through this graph efficiently.

The language model provides substantial accuracy improvements. For the UC Davis system, the neural decoder alone achieved approximately 30% phoneme error rate, which dropped to 10% word error rate with a 5-gram language model, a 3× improvement from the language model correcting ambiguous or incorrect phoneme predictions based on lexical and grammatical constraints (Card et al., 2024). However, n-gram models have fundamental limitations: a 5-gram model only sees 4 preceding words; phrases like "I want to eat breakfast" and "I want to eat lunch" are stored as separate entries with no shared representation; and most possible 5-word sequences never appear in training data. These limitations motivate the use of neural language models for rescoring.

Convolutional-Recurrent Networks (CRNN)

Precision Neuroscience uses CRNNs for its high-density cortical arrays.

The CRNN architecture augments the standard RNN approach with convolutional layers for local feature extraction before temporal modeling.

Diagram depicting Convolutional RNN model with 1D convolution and multiple feature layers. You can see the forward and backward passes on the right side of the model.

Convolutional layers apply learned filters that slide across the input. In the temporal domain, a 1D convolution with kernel size K processes K consecutive timesteps at once, producing a single output that summarizes local patterns:

y_t = \sum_{k=0}^{K-1} W_k \cdot x_{t+k} + b

where W_k are the learned filter weights and b is a bias term.

Convolutions are applied across multiple filters (typically 256–500), each learning to detect different local patterns. The output is a transformed representation where each dimension encodes the presence of a particular temporal motif.

For high-density arrays with 1,024+ channels, the input dimensionality is substantial—2,048 features per timestep for Precision Neuroscience. Convolutional layers address this challenge by capturing short-timescale dynamics (tens of milliseconds) through local filters before passing them to recurrent layers. Parameter sharing also helps: the same filter is applied at all time positions, reducing the number of learnable parameters. This creates a natural hierarchy where convolutions extract low-level features and RNNs model higher-level temporal dynamics.

The Precision Neuroscience decoder stacks a 1D convolution layer, two bidirectional GRU layers, and a fully connected output layer (Hettick et al., 2025). The input is a 600 ms window (600 samples at 1 kHz) with 2,048 features, producing classification over 8–30 classes depending on the task.
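A PyTorch sketch in the spirit of this architecture; the filter count, kernel size, and hidden sizes are illustrative rather than the published configuration:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Conv + bidirectional GRU classifier; layer sizes here are illustrative."""

    def __init__(self, n_features=2048, n_filters=256, n_classes=8):
        super().__init__()
        # 1D convolution over time: each filter spans kernel_size consecutive steps.
        self.conv = nn.Conv1d(n_features, n_filters, kernel_size=11, padding=5)
        self.gru = nn.GRU(n_filters, 256, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 256, n_classes)

    def forward(self, x):            # x: (batch, time, features), e.g. 600 ms at 1 kHz
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.gru(h)           # (batch, time, 2 * hidden)
        return self.fc(h[:, -1])     # classify from the final timestep

logits = CRNN()(torch.randn(4, 600, 2048))   # -> (4, 8)
```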

For somatosensory decoding, the CRNN achieved 92.75% mean accuracy on 8-class classification, where chance performance is 12.5%. For binary speech detection, the system achieved 79.8% accuracy with only 4 minutes of training data—a notable result given that most speech BCIs require hours of data (Hettick et al., 2025).

RNN + LLM Rescoring (State of the Art)

UC Davis achieved 97.5% accuracy using RNN + LLM rescoring—the highest reported for speech BCIs (Card et al., 2024). Large language models like GPT and OPT represent a qualitative advance over n-gram models. They learn distributed representations of words, generalize across similar contexts, and most importantly, incorporate long-range dependencies spanning entire sentences or conversations.

The UC Davis system uses a two-stage approach. First, 5-gram beam search generates the top 100 candidate sentences from the neural decoder output—a computationally efficient step that produces a diverse set of plausible transcriptions. Second, OPT-6.7B rescoring evaluates each candidate using a large language model. OPT (Open Pre-trained Transformer) is a 6.7 billion parameter transformer model trained on large text corpora that assigns a probability to each candidate sentence based on linguistic plausibility. The final output is the candidate that maximizes a weighted combination of neural decoder score and LLM score.

The advantage of LLMs becomes clear with an example. Consider the following sentence:

"Sarah, the doctor, reviewed the chart. The doctor said ___ would call tomorrow."

A 5-gram model sees only:

"The doctor said ___"

and assigns equal probability to "he" and "she."

An LLM sees the entire context, including "Sarah, the doctor," and strongly prefers "she" because it can resolve the coreference. This ability to use full-sentence context is crucial for conversational speech, where pronouns, ellipsis, and discourse structure create dependencies spanning many words.

LLM rescoring is computationally expensive. OPT-6.7B requires approximately 12.4 GB of GPU memory, and scoring 100 candidates takes meaningful inference time. The two-stage approach addresses this by using efficient 5-gram search to prune the search space before applying the expensive LLM; only 100 candidates out of potentially millions are evaluated. The results justify the cost. With 5-gram alone, UC Davis achieved approximately 10% word error rate, while OPT rescoring reduced this to 2.5%, a 4× improvement that approaches the 5% level of commercial dictation software.
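A sketch of the rescoring stage using the Hugging Face transformers API; the combination weight is illustrative, and the per-sentence log-probability is recovered approximately from the mean token loss:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
llm = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b",
                                           torch_dtype=torch.float16)
llm.eval()

def llm_log_prob(sentence):
    """Approximate total log-probability the LLM assigns to a sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = llm(ids, labels=ids)
    # The Hugging Face loss is the mean token cross-entropy over the shifted
    # sequence; scale back up to a (negative) total log-probability.
    return -out.loss.item() * (ids.shape[1] - 1)

def rescore(candidates, neural_scores, llm_weight=0.5):
    """Pick the candidate maximizing neural score + weight * LLM log-probability."""
    scored = [(neural + llm_weight * llm_log_prob(sent), sent)
              for sent, neural in zip(candidates, neural_scores)]
    return max(scored)[1]
```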

Neuralink's Hybrid Approach

Neuralink's approach deserves special mention for how significantly it diverges from the academic systems described above. The core constraint is full implantability: no external computers, no percutaneous connectors. This reshapes nearly every design decision, starting with the bandwidth problem. The N1 implant records from 1,024 electrodes at 20 kHz with 10-bit resolution, generating 204.8 Mbps of raw data. This far exceeds what can be wirelessly transmitted from an implanted device, so the implant must compress data dramatically before transmission.

Neuralink addresses this through a custom compression engine implemented on an ASIC. Like other systems, it uses threshold crossing for spike detection, but computes the threshold using Mean Absolute Deviation (MAD) rather than RMS, a choice that provides more robust noise estimation when occasional large spikes are present. The Buffered Online Spike Detector (BOS) performs pattern matching using a ring buffer architecture with 172 bits per channel, enough to hold one complete spike waveform. Multiplexed dynamic programming allows a single processing unit to cycle through channels at high speed, achieving spike detection in under 900 nanoseconds. All computations use fixed-point arithmetic rather than floating-point, yielding greater than 99.7% match to floating-point accuracy while dramatically reducing power consumption. After detection, candidate spikes undergo probability evaluation, comparing characteristics like peak amplitude and waveform width against learned thresholds to filter out noise.
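For reference, a MAD-based noise estimate looks like this in NumPy; the scaling by 0.6745 makes it comparable to a standard deviation for Gaussian noise, while the multiplier here is illustrative rather than Neuralink's actual value:

```python
import numpy as np

def mad_threshold(x, multiplier=-4.5):
    """Spike threshold from the median absolute deviation (MAD).

    Dividing the MAD by 0.6745 turns it into a consistent estimate of the
    noise standard deviation for Gaussian noise, but unlike RMS it is barely
    inflated when large spikes are present in the window. The multiplier is
    illustrative, not Neuralink's published value.
    """
    sigma_est = np.median(np.abs(x - np.median(x))) / 0.6745
    return multiplier * sigma_est
```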

For downstream decoding, Neuralink employs a hybrid architecture combining Kalman filters for smooth trajectory estimation, LSTMs for nonlinear temporal dynamics, and particle filters for probabilistic inference over discrete states. The entire system operates at 24.7 mW, orders of magnitude less than the external computers running 5-layer GRUs in academic systems. Whether this on-chip approach can match the 97.5% accuracy achieved by RNN + LLM rescoring remains to be demonstrated.

References

  1. Card, N., et al. (2024). An accurate and rapidly calibrating speech neuroprosthesis. New England Journal of Medicine, 391(7), 609-618. doi:10.1056/NEJMoa2314132
  2. Metzger, S., et al. (2023). A high-performance neuroprosthesis for speech decoding and avatar control. Nature, 620, 1037-1046. doi:10.1038/s41586-023-06443-4
  3. Willett, F., et al. (2023). A high-performance speech neuroprosthesis. Nature, 620, 1031-1036. doi:10.1038/s41586-023-06377-x
  4. Hettick, M., et al. (2025). A high-density cortical recording and stimulation platform for neural interface applications. Journal of Neural Engineering, 22(1), 016021. doi:10.1088/1741-2552/ada627
  5. Kunz, E., et al. (2025). Decoding inner speech from single neurons in a paralyzed person. Nature. doi:10.1038/s41586-024-08280-7