MLX

Use this skill for MLX or MLX-LM engineering work where correctness depends on current upstream behavior, not model memory.

When to Use

Auditing or patching MLX or MLX-LM repos
Fact-checking "latest" MLX or MLX-LM behavior
Porting PyTorch or JAX code to MLX
Debugging MLX indexing, lazy evaluation, compilation, or stream behavior
Deciding when to use stock ops, mx.fast.*, mx.fast.metal_kernel(...), or a deeper extension path
Profiling or debugging MLX GPU execution with Metal capture hooks
Profiling MLX memory usage or allocator/cache behavior on Apple silicon
Reviewing MLX-LM model load, cache, prompt-cache, quantization, or generation code
Validating local MLX model paths on Apple silicon

Core Rules

If the user asks for current or latest MLX facts, verify releases and source first.
Prefer upstream docs/source plus runtime checks over memory.
Treat undocumented runtime behavior as unstable.
Distinguish documented contracts from observed caveats.
Keep MLX-LM inference checks local and minimal: lazy=True, short prompts, small max_tokens.

Quick Start

Set the skill path once:

export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export MLX_SKILL="$CODEX_HOME/skills/mlx"

Check latest upstream releases:

"$MLX_SKILL/scripts/mlx_release_info.sh"

Run the bundled runtime probe:

"$MLX_SKILL/scripts/mlx_probe.sh"

The launcher checks both python3 and python and picks one that can import mlx.

Run the probe with a local MLX model:

MLX_LM_LOCAL_MODEL=/path/to/model "$MLX_SKILL/scripts/mlx_probe.sh"

Workflow

1. Classify the task

Current facts: verify latest mlx / mlx-lm releases, then inspect source
Repo validation: run the repo's own validator if it exists; otherwise use the bundled probe
Porting or debugging: check the current facts reference, then validate the specific behavior locally
Local model inference: use a local MLX model path and keep decode checks short

2. For current upstream facts

Use authenticated GitHub workflows when possible:

"$MLX_SKILL/scripts/mlx_release_info.sh"
gh repo clone ml-explore/mlx /tmp/mlx-upstream -- --depth 1
gh repo clone ml-explore/mlx-lm /tmp/mlx-lm-upstream -- --depth 1

Inspect only the files relevant to the question. Typical targets:

MLX: docs/src/usage/indexing.rst, lazy_evaluation.rst, compile.rst, numpy.rst, python/data_types.rst, python/memory_management.rst, python/mlx/nn/layers/convolution.py, docs/src/dev/custom_metal_kernels.rst, docs/src/dev/metal_debugger.rst, docs/src/dev/extensions.rst
MLX-LM: mlx_lm/generate.py, mlx_lm/utils.py, mlx_lm/models/base.py, mlx_lm/models/cache.py

3. For runtime validation

If the repo already has an MLX validator, prefer that first.

Otherwise run:

"$MLX_SKILL/scripts/mlx_probe.sh"

The bundled probe checks high-signal MLX and MLX-LM behavior:

indexing and mask limitations
slice-copy vs aliasing
compile and retracing rules
training flow and optimizer semantics
channels-last inputs
stream APIs
custom Metal kernel and capture-hook surface
MLX-LM API surface, attention mask, caches, prompt-cache roundtrip
AutoAWQ/GPTQ transform helpers

4. For local model checks

Use a local MLX model path when load/generate behavior matters:

MLX_LM_LOCAL_MODEL=/path/to/model "$MLX_SKILL/scripts/mlx_probe.sh"

This adds:

real load(..., lazy=True)
one-step generate(...)
stream_generate(...) response validation
prompt-cache save/load on the actual model cache
generation-stream / async_eval / clear_cache checks

5. For porting or reviews

Check current-facts.md first.

Then use porting-checklist.md for the common MLX-specific failure modes:

boolean mask selection unsupported
slices are copies, not views
no tensor backward() pattern
explicit mx.eval(...) required in training and timing
channels-last activations
stream-aware benchmarking
MLX-LM cache and generation API differences

Kernel Escalation Path

Start with stock MLX ops.
If there is already a tuned kernel in mx.fast.*, prefer that first.
Use mx.fast.metal_kernel(...) for Apple-only fused hot paths when the stock op graph is the bottleneck.
Be explicit about contiguity: ensure_row_contiguous=True can hide copies.
Use @mx.custom_function when the custom kernel also needs custom gradient logic.
Move to C++ Primitive extensions only when Python-level Metal kernels are not enough.
For serious GPU profiling, capture a .gputrace with mx.metal.start_capture(...) / mx.metal.stop_capture() and inspect it in Xcode.

High-Signal MLX Differences

Training is nn.value_and_grad(...) plus optimizer.update(...) plus mx.eval(model.parameters(), optimizer.state).
Module parameters are created lazily; explicit mx.eval(model.parameters()) matters before timing and export.
Conv inputs are channels-last: NLC, NHWC, NDHWC.
mx.compile(...) retraces on dtype, rank, and input-arity changes.
shapeless=True avoids shape-only retracing but can break shape-dependent code.
Streams are first-class, and timing without mx.eval(...) or mx.synchronize(...) is often wrong.
Memory profiling should use the top-level mx.get_*_memory() helpers and mx.device_info(), not deprecated mx.metal.* aliases.
MLX has a real Python-level fused-kernel escape hatch in mx.fast.metal_kernel(...).

High-Signal MLX-LM Differences

generate(...) and stream_generate(...) accept strings or token IDs.
batch_generate(...) expects token ID lists, not raw strings.
stream_generate(...) yields GenerationResponse objects.
Prompt caches are not always pure KV caches; hybrid models can mix ArraysCache and KVCache.
Current mlx-lm==0.31.0 caveat: batch_generate(..., max_tokens=1) can hit a ZeroDivisionError.

References

Current validated facts and caveats: current-facts.md
Porting and review checklist: porting-checklist.md

Helpers

Release helper: scripts/mlx_release_info.sh
Runtime probe launcher: scripts/mlx_probe.sh
Runtime probe implementation: scripts/mlx_probe.py