SKILL.md

MLX

Use this skill for MLX or MLX-LM engineering work where correctness depends on current upstream behavior, not model memory.

When to Use

  • Auditing or patching MLX or MLX-LM repos
  • Fact-checking "latest" MLX or MLX-LM behavior
  • Porting PyTorch or JAX code to MLX
  • Debugging MLX indexing, lazy evaluation, compilation, or stream behavior
  • Deciding when to use stock ops, mx.fast.*, mx.fast.metal_kernel(...), or a deeper extension path
  • Profiling or debugging MLX GPU execution with Metal capture hooks
  • Profiling MLX memory usage or allocator/cache behavior on Apple silicon
  • Reviewing MLX-LM model load, cache, prompt-cache, quantization, or generation code
  • Validating local MLX model paths on Apple silicon

Core Rules

  • If the user asks for current or latest MLX facts, verify releases and source first.
  • Prefer upstream docs/source plus runtime checks over memory.
  • Treat undocumented runtime behavior as unstable.
  • Distinguish documented contracts from observed caveats.
  • Keep MLX-LM inference checks local and minimal: lazy=True, short prompts, small max_tokens.

Quick Start

Set the skill path once:

export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export MLX_SKILL="$CODEX_HOME/skills/mlx"

Check latest upstream releases:

"$MLX_SKILL/scripts/mlx_release_info.sh"

Run the bundled runtime probe:

"$MLX_SKILL/scripts/mlx_probe.sh"

The launcher checks both python3 and python and picks one that can import mlx.
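
The selection step can be pictured with a small standard-library sketch. This is an illustration only, not the bundled launcher's actual logic: the helper names `can_import` and `pick_interpreter` are hypothetical, and taking the first working candidate is an assumption.

```python
import shutil
import subprocess
import sys


def can_import(interpreter: str, module: str) -> bool:
    """Return True if `interpreter -c "import <module>"` exits with status 0."""
    try:
        result = subprocess.run(
            [interpreter, "-c", f"import {module}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def pick_interpreter(module: str = "mlx", candidates=("python3", "python")):
    """Return the first candidate found on PATH that can import `module`, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path and can_import(path, module):
            return path
    return None


if __name__ == "__main__":
    chosen = pick_interpreter()
    if chosen is None:
        print("no interpreter on PATH can import mlx", file=sys.stderr)
    else:
        print(chosen)
```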

Run the probe with a local MLX model:

MLX_LM_LOCAL_MODEL=/path/to/model "$MLX_SKILL/scripts/mlx_probe.sh"

Workflow

1. Classify the task

  • Current facts: verify latest mlx / mlx-lm releases, then inspect source
  • Repo validation: run the repo's own validator if it exists; otherwise use the bundled probe
  • Porting or debugging: check the current facts reference, then validate the specific behavior locally
  • Local model inference: use a local MLX model path and keep decode checks short

2. For current upstream facts

Prefer authenticated GitHub access (the gh CLI) when possible:

"$MLX_SKILL/scripts/mlx_release_info.sh"
gh repo clone ml-explore/mlx /tmp/mlx-upstream -- --depth 1
gh repo clone ml-explore/mlx-lm /tmp/mlx-lm-upstream -- --depth 1

Inspect only the files relevant to the question. Typical targets:

  • MLX: docs/src/usage/indexing.rst, docs/src/usage/lazy_evaluation.rst, docs/src/usage/compile.rst, docs/src/usage/numpy.rst, docs/src/python/data_types.rst, docs/src/python/memory_management.rst, docs/src/dev/custom_metal_kernels.rst, docs/src/dev/metal_debugger.rst, docs/src/dev/extensions.rst, python/mlx/nn/layers/convolution.py
  • MLX-LM: mlx_lm/generate.py, mlx_lm/utils.py, mlx_lm/models/base.py, mlx_lm/models/cache.py
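
Upstream release numbers are most useful next to the versions installed locally. A minimal sketch of that runtime check, using only the standard library, might look like the following. The helper names `installed_version` and `report` are hypothetical and not part of the bundled scripts; the distribution names `mlx` and `mlx-lm` are the ones used elsewhere in this skill.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string for `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("mlx", "mlx-lm")):
    """Map each distribution name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```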

3. For runtime validation

If the repo already has an MLX validator, prefer that first.

Otherwise run:

"$MLX_SKILL/scripts/mlx_probe.sh"

The bundled probe checks high-signal MLX and MLX-LM behavior:

  • indexing and mask limitations
  • slice-copy vs aliasing
  • compile and retracing rules
  • training flow and optimizer semantics
  • channels-last inputs
  • stream APIs
  • custom Metal kernel and capture-hook surface
  • MLX-LM API surface, attention mask, caches, prompt-cache roundtrip
  • AutoAWQ/GPTQ transform helpers

4. For local model checks

Use a local MLX model path when load/generate behavior matters:

MLX_LM_LOCAL_MODEL=/path/to/model "$MLX_SKILL/scripts/mlx_probe.sh"

This adds:

  • real load(..., lazy=True)
  • one-step generate(...)
  • stream_generate(...) response validation
  • prompt-cache save/load on the actual model cache
  • generation-stream / async_eval / clear_cache checks
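
The first three items can be reproduced by hand with a short script. This is a sketch under stated assumptions, not the probe's implementation: the function name `short_decode` and the default prompt are hypothetical, and the model path is read from the same MLX_LM_LOCAL_MODEL variable used above.

```python
import os


def short_decode(prompt: str = "Hello", max_tokens: int = 8):
    """Load a local model lazily and run one short generate and stream_generate."""
    model_path = os.environ.get("MLX_LM_LOCAL_MODEL")
    if not model_path:
        return None  # nothing to check without a local model path

    # Lazy import: requires mlx-lm to be installed.
    from mlx_lm import generate, load, stream_generate

    model, tokenizer = load(model_path, lazy=True)
    text = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)

    # stream_generate yields response objects; collect their text pieces.
    pieces = [
        response.text
        for response in stream_generate(model, tokenizer, prompt, max_tokens=max_tokens)
    ]
    return text, "".join(pieces)


if __name__ == "__main__":
    print(short_decode())
```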

5. For porting or reviews

Check current-facts.md first.

Then use porting-checklist.md for the common MLX-specific failure modes:

  • boolean mask selection unsupported
  • slices are copies, not views
  • no tensor backward() pattern
  • explicit mx.eval(...) required in training and timing
  • channels-last activations
  • stream-aware benchmarking
  • MLX-LM cache and generation API differences
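
The first two failure modes are easy to see in a few lines. The sketch below is illustrative only: the function name is hypothetical, and replacing a boolean-mask read with mx.where(...) is one common workaround, not the only one.

```python
def mask_and_slice_demo():
    """Show a fixed-shape alternative to x[mask] and that slices are copies."""
    import mlx.core as mx  # lazy import: requires MLX to be installed

    x = mx.array([1.0, -2.0, 3.0, -4.0])
    mask = x > 0

    # A boolean-mask read such as x[mask] is the unsupported pattern.
    # Keep the shape fixed and zero out the rejected entries instead.
    kept = mx.where(mask, x, mx.zeros_like(x))
    positive_sum = kept.sum()

    # A slice is a new array: writing to it leaves the source untouched.
    head = x[:2]
    head[0] = 100.0

    mx.eval(kept, positive_sum, head, x)
    return kept.tolist(), positive_sum.item(), head.tolist(), x.tolist()
```

Code ported from PyTorch or NumPy that relies on writing through a slice needs an explicit write back into the source array.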

Kernel Escalation Path

  • Start with stock MLX ops.
  • If there is already a tuned kernel in mx.fast.*, prefer that first.
  • Use mx.fast.metal_kernel(...) for Apple-only fused hot paths when the stock op graph is the bottleneck.
  • Be explicit about contiguity: ensure_row_contiguous=True (the default) silently copies inputs that are not row-contiguous, which can hide a real cost.
  • Use @mx.custom_function when the custom kernel also needs custom gradient logic.
  • Move to C++ Primitive extensions only when Python-level Metal kernels are not enough.
  • For serious GPU profiling, capture a .gputrace with mx.metal.start_capture(...) / mx.metal.stop_capture() and inspect it in Xcode. Launch the process with MTL_CAPTURE_ENABLED=1 so capture is allowed.
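
For orientation, the Python-level kernel step looks roughly like this. The sketch follows the elementwise exp example in the upstream custom Metal kernels documentation; the function name `fused_exp` and the kernel name `sketch_exp` are hypothetical, and a Metal-capable GPU is assumed. Verify the signature against the installed MLX version before relying on it.

```python
def fused_exp(values):
    """Apply exp elementwise through a small custom Metal kernel."""
    import mlx.core as mx  # lazy import: requires MLX with a Metal GPU

    a = mx.array(values, dtype=mx.float32)

    kernel = mx.fast.metal_kernel(
        name="sketch_exp",
        input_names=["inp"],
        output_names=["out"],
        # Only the kernel body is supplied; MLX generates the signature.
        source="""
            uint elem = thread_position_in_grid.x;
            T tmp = inp[elem];
            out[elem] = metal::exp(tmp);
        """,
        ensure_row_contiguous=True,  # stated explicitly: may copy inputs
    )
    (out,) = kernel(
        inputs=[a],
        template=[("T", mx.float32)],
        grid=(a.size, 1, 1),          # one thread per element
        threadgroup=(256, 1, 1),
        output_shapes=[a.shape],
        output_dtypes=[a.dtype],
    )
    mx.eval(out)
    return out.tolist()
```

A stock mx.exp(...) already covers this case; the point is only to show the shape of the API before fusing a real hot path.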

High-Signal MLX Differences

  • Training is nn.value_and_grad(...) plus optimizer.update(...) plus mx.eval(model.parameters(), optimizer.state).
  • Module parameters are created lazily; explicit mx.eval(model.parameters()) matters before timing and export.
  • Conv inputs are channels-last: NLC, NHWC, NDHWC.
  • mx.compile(...) retraces when an input's shape, rank, or dtype changes, or when the number of inputs changes.
  • shapeless=True avoids shape-only retracing but can break shape-dependent code.
  • Streams are first-class. Because evaluation is lazy, timing without mx.eval(...) or mx.synchronize(...) mostly measures graph construction, not execution.
  • Memory profiling should use the top-level mx.get_*_memory() helpers and mx.device_info(), not deprecated mx.metal.* aliases.
  • MLX has a real Python-level fused-kernel escape hatch in mx.fast.metal_kernel(...).
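
The training and timing points above combine into one short loop. This is a minimal sketch, assuming a toy nn.Linear model, random data, and plain SGD; the function name and hyperparameters are hypothetical.

```python
import time


def train_and_time(steps: int = 3):
    """Run a few training steps and time them with explicit evaluation."""
    import mlx.core as mx  # lazy imports: require MLX to be installed
    import mlx.nn as nn
    import mlx.optimizers as optim

    model = nn.Linear(4, 1)
    mx.eval(model.parameters())  # materialize lazily created parameters
    optimizer = optim.SGD(learning_rate=0.1)

    def loss_fn(model, x, y):
        return nn.losses.mse_loss(model(x), y)

    # No tensor.backward(): gradients come from a transformed function.
    loss_and_grad = nn.value_and_grad(model, loss_fn)

    x = mx.random.normal((16, 4))
    y = mx.random.normal((16, 1))
    mx.eval(x, y)  # keep data generation out of the timed region

    losses = []
    start = time.perf_counter()
    for _ in range(steps):
        loss, grads = loss_and_grad(model, x, y)
        optimizer.update(model, grads)
        # Without this eval the loop only builds a graph.
        mx.eval(model.parameters(), optimizer.state)
        losses.append(loss.item())
    elapsed = time.perf_counter() - start
    return losses, elapsed
```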

High-Signal MLX-LM Differences

  • generate(...) and stream_generate(...) accept strings or token IDs.
  • batch_generate(...) expects token ID lists, not raw strings.
  • stream_generate(...) yields GenerationResponse objects.
  • Prompt caches are not always pure KV caches; hybrid models can mix ArraysCache and KVCache.
  • Observed caveat in mlx-lm==0.31.0: batch_generate(..., max_tokens=1) can hit a ZeroDivisionError. Re-check against the installed version before relying on this.

References

Helpers