# MLX
Use this skill for MLX or MLX-LM engineering work where correctness depends on current upstream behavior, not model memory.
## When to Use
- Auditing or patching MLX or MLX-LM repos
- Fact-checking "latest" MLX or MLX-LM behavior
- Porting PyTorch or JAX code to MLX
- Debugging MLX indexing, lazy evaluation, compilation, or stream behavior
- Deciding when to use stock ops, `mx.fast.*`, `mx.fast.metal_kernel(...)`, or a deeper extension path
- Profiling or debugging MLX GPU execution with Metal capture hooks
- Profiling MLX memory usage or allocator/cache behavior on Apple silicon
- Reviewing MLX-LM model load, cache, prompt-cache, quantization, or generation code
- Validating local MLX model paths on Apple silicon
## Core Rules
- If the user asks for current or latest MLX facts, verify releases and source first.
- Prefer upstream docs/source plus runtime checks over memory.
- Treat undocumented runtime behavior as unstable.
- Distinguish documented contracts from observed caveats.
- Keep MLX-LM inference checks local and minimal: `lazy=True`, short prompts, small `max_tokens`.
## Quick Start
Set the skill path once:

```sh
export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export MLX_SKILL="$CODEX_HOME/skills/mlx"
```
Check the latest upstream releases:

```sh
"$MLX_SKILL/scripts/mlx_release_info.sh"
```
Run the bundled runtime probe:

```sh
"$MLX_SKILL/scripts/mlx_probe.sh"
```
The launcher checks both `python3` and `python` and picks the first one that can import `mlx`.
Run the probe with a local MLX model:

```sh
MLX_LM_LOCAL_MODEL=/path/to/model "$MLX_SKILL/scripts/mlx_probe.sh"
```
## Workflow
1. Classify the task
- Current facts: verify the latest `mlx` / `mlx-lm` releases, then inspect source
- Repo validation: run the repo's own validator if it exists; otherwise use the bundled probe
- Porting or debugging: check the current facts reference, then validate the specific behavior locally
- Local model inference: use a local MLX model path and keep decode checks short
2. For current upstream facts
Use authenticated GitHub workflows when possible:

```sh
"$MLX_SKILL/scripts/mlx_release_info.sh"
gh repo clone ml-explore/mlx /tmp/mlx-upstream -- --depth 1
gh repo clone ml-explore/mlx-lm /tmp/mlx-lm-upstream -- --depth 1
```
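If you need the tag programmatically rather than via the helper script, the GitHub releases payload parses with the standard library alone. This is an illustrative sketch (the endpoint shape is GitHub's standard REST API; the helper script's actual logic may differ):

```python
import json

# The latest-release endpoint has the form:
#   https://api.github.com/repos/ml-explore/mlx/releases/latest
# and its JSON payload carries the release tag under "tag_name".

def latest_tag(payload: str) -> str:
    """Extract the release tag from a GitHub /releases/latest JSON payload."""
    return json.loads(payload)["tag_name"]

# Example with a trimmed payload:
print(latest_tag('{"tag_name": "v0.x.y", "name": "..."}'))  # → v0.x.y
```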
Inspect only the files relevant to the question. Typical targets:
- MLX: `docs/src/usage/indexing.rst`, `lazy_evaluation.rst`, `compile.rst`, `numpy.rst`, `python/data_types.rst`, `python/memory_management.rst`, `python/mlx/nn/layers/convolution.py`, `docs/src/dev/custom_metal_kernels.rst`, `docs/src/dev/metal_debugger.rst`, `docs/src/dev/extensions.rst`
- MLX-LM: `mlx_lm/generate.py`, `mlx_lm/utils.py`, `mlx_lm/models/base.py`, `mlx_lm/models/cache.py`
3. For runtime validation
If the repo already has an MLX validator, prefer that first.
Otherwise run:

```sh
"$MLX_SKILL/scripts/mlx_probe.sh"
```
The bundled probe checks high-signal MLX and MLX-LM behavior:
- indexing and mask limitations
- slice-copy vs aliasing
- compile and retracing rules
- training flow and optimizer semantics
- channels-last inputs
- stream APIs
- custom Metal kernel and capture-hook surface
- MLX-LM API surface, attention mask, caches, prompt-cache roundtrip
- AutoAWQ/GPTQ transform helpers
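Conceptually, each of these is a small named check; one way such a probe can be structured (a hypothetical harness shape, not the bundled probe's actual code) is to collect pass/fail results instead of stopping at the first error:

```python
def run_checks(checks):
    """Run (name, fn) pairs; record pass/fail rather than aborting on failure."""
    results = {}
    for name, fn in checks:
        try:
            fn()
            results[name] = "pass"
        except Exception as exc:
            results[name] = f"fail: {exc!r}"
    return results

# Hypothetical usage: each check asserts one high-signal behavior.
def check_slice_is_copy():
    # With MLX available, this would mutate a slice of an array and then
    # assert the parent array is unchanged (slices are copies, not views).
    pass

results = run_checks([("slice-copy", check_slice_is_copy)])
```

Keeping checks independent this way is what lets the probe report all failing behaviors in one run.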
4. For local model checks
Use a local MLX model path when load/generate behavior matters:

```sh
MLX_LM_LOCAL_MODEL=/path/to/model "$MLX_SKILL/scripts/mlx_probe.sh"
```
This adds:
- a real `load(..., lazy=True)`
- one-step `generate(...)` and `stream_generate(...)` response validation
- prompt-cache save/load on the actual model cache
- generation-stream / `async_eval` / `clear_cache` checks
5. For porting or reviews
Check `current-facts.md` first.
Then use `porting-checklist.md` for the common MLX-specific failure modes:
- boolean mask selection unsupported
- slices are copies, not views
- no tensor `.backward()` pattern
- explicit `mx.eval(...)` required in training and timing
- channels-last activations
- stream-aware benchmarking
- MLX-LM cache and generation API differences
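As an example of the channels-last item: when porting from PyTorch, the activation layout itself has to move, and a tiny shape helper (hypothetical, not an MLX API) makes the mapping explicit:

```python
def to_channels_last(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Map a PyTorch-style (N, C, *spatial) shape to MLX's (N, *spatial, C)."""
    n, c, *spatial = shape
    return (n, *spatial, c)

# NCL -> NLC, NCHW -> NHWC, NCDHW -> NDHWC
print(to_channels_last((8, 3, 32, 32)))  # → (8, 32, 32, 3)
```

The same axis permutation has to be applied to the real activation tensors (a transpose), not just to shapes; the helper only documents which axis goes where.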
## Kernel Escalation Path
- Start with stock MLX ops.
- If there is already a tuned kernel in `mx.fast.*`, prefer that first.
- Use `mx.fast.metal_kernel(...)` for Apple-only fused hot paths when the stock op graph is the bottleneck.
- Be explicit about contiguity: `ensure_row_contiguous=True` can hide copies.
- Use `@mx.custom_function` when the custom kernel also needs custom gradient logic.
- Move to C++ `Primitive` extensions only when Python-level Metal kernels are not enough.
- For serious GPU profiling, capture a `.gputrace` with `mx.metal.start_capture(...)` / `mx.metal.stop_capture()` and inspect it in Xcode.
## High-Signal MLX Differences
- Training is `nn.value_and_grad(...)` plus `optimizer.update(...)` plus `mx.eval(model.parameters(), optimizer.state)`.
- Module parameters are created lazily; an explicit `mx.eval(model.parameters())` matters before timing and export.
- Conv inputs are channels-last: `NLC`, `NHWC`, `NDHWC`.
- `mx.compile(...)` retraces on shape, dtype, rank, and input-arity changes; `shapeless=True` avoids shape-only retracing but can break shape-dependent code.
- Streams are first-class, and timing without `mx.eval(...)` or `mx.synchronize(...)` is often wrong.
- Memory profiling should use the top-level `mx.get_*_memory()` helpers and `mx.device_info()`, not the deprecated `mx.metal.*` aliases.
- MLX has a real Python-level fused-kernel escape hatch in `mx.fast.metal_kernel(...)`.
## High-Signal MLX-LM Differences
- `generate(...)` and `stream_generate(...)` accept strings or token IDs.
- `batch_generate(...)` expects token ID lists, not raw strings.
- `stream_generate(...)` yields `GenerationResponse` objects.
- Prompt caches are not always pure KV caches; hybrid models can mix `ArraysCache` and `KVCache`.
- Current `mlx-lm==0.31.0` caveat: `batch_generate(..., max_tokens=1)` can hit a `ZeroDivisionError`.
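Because `batch_generate(...)` rejects raw strings, a defensive pattern when reviewing batching code is to normalize inputs up front. The helper below is hypothetical (`encode` stands in for the tokenizer's encode method; it is not an MLX-LM API):

```python
def to_token_batches(prompts, encode):
    """Ensure every prompt is a token ID list, encoding raw strings on the way in."""
    return [encode(p) if isinstance(p, str) else list(p) for p in prompts]

# With a stand-in encoder (real code would use the tokenizer from load(...)):
fake_encode = lambda s: [ord(ch) for ch in s]
print(to_token_batches(["hi", [1, 2, 3]], fake_encode))  # → [[104, 105], [1, 2, 3]]
```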
## References

- Current validated facts and caveats: `current-facts.md`
- Porting and review checklist: `porting-checklist.md`
## Helpers

- Release helper: `scripts/mlx_release_info.sh`
- Runtime probe launcher: `scripts/mlx_probe.sh`
- Runtime probe implementation: `scripts/mlx_probe.py`