Explore projects
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
- [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch-to-CUDA translation problems.
- llmc: an efficient LLM compression toolkit offering a range of advanced compression methods and supporting multiple inference backends.
- AutoAWQ: implements the AWQ algorithm for 4-bit weight quantization, with up to a 2x speedup during inference.
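Several of the projects above (QServe, llmc, AutoAWQ) center on low-bit weight quantization. As a rough illustration of the core idea behind 4-bit group-wise quantization — not the actual implementation of any of these libraries — the sketch below quantizes weights in groups of 128 to 4-bit integers with a per-group scale and zero point, then dequantizes them back. All function names here are hypothetical.

```python
import numpy as np

def quantize_w4_groupwise(w, group_size=128):
    """Illustrative asymmetric 4-bit quantization with per-group scale/zero point.

    Not the API of AutoAWQ/llmc/QServe -- just the basic numeric idea.
    """
    groups = w.reshape(-1, group_size)            # split weights into groups
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0                # 4 bits -> 16 levels (0..15)
    zero = np.round(-w_min / scale)               # integer zero point per group
    q = np.clip(np.round(groups / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_w4(q, scale, zero, shape):
    """Recover approximate float weights from the 4-bit codes."""
    return ((q.astype(np.float32) - zero) * scale).reshape(shape)

# Quantize a random weight row and check the reconstruction error.
w = np.random.randn(1, 512).astype(np.float32)
q, scale, zero = quantize_w4_groupwise(w)
w_hat = dequantize_w4(q, scale, zero, w.shape)
```

The reconstruction error per element is bounded by roughly one quantization step (the group's scale); real systems like AWQ additionally rescale salient channels before quantizing to shrink that error where it matters most.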