Ruff v0.4.0: a hand-written recursive descent parser for Python
19 April 2024 | 5:00 am

The latest release of Ruff - a Python linter and formatter, written in Rust - includes a complete rewrite of the core parser. Previously Ruff used a parser borrowed from RustPython, generated using the LALRPOP parser generator. Victor Hugo Gomes contributed a new parser written from scratch, which provides a 2x speedup and adds error recovery, allowing it to parse invalid Python - super-useful for a linter.

I tried Ruff 0.4.0 just now against Datasette - a reasonably large Python project - and it ran in less than 1/10th of a second. This thing is Fast.
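
To get a feel for the technique, here's a toy sketch in Python (not Ruff's actual implementation, which is hand-written in Rust): in a recursive descent parser each grammar rule becomes a function, and error recovery means recording a syntax error and skipping ahead to a synchronization point rather than aborting, so the rest of the input still gets parsed.

# Toy recursive descent parser with error recovery, for comma-separated
# sums of integers like "1 + 2, 3 + x, 4". Illustrative only.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.errors = []  # syntax errors are collected, not raised

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_number(self):
        tok = self.peek()
        if tok is not None and tok.isdigit():
            return int(self.advance())
        # Error recovery: record the problem, skip ahead to the next comma
        # (a synchronization point), and keep parsing what follows.
        self.errors.append(f"expected number at token {self.pos}: {tok!r}")
        while self.peek() not in (None, ","):
            self.advance()
        return None

    def parse_sum(self):
        total = self.parse_number()
        while self.peek() == "+":
            self.advance()
            right = self.parse_number()
            total = total + right if None not in (total, right) else None
        return total

    def parse_all(self):
        sums = [self.parse_sum()]
        while self.peek() == ",":
            self.advance()
            sums.append(self.parse_sum())
        return sums

p = Parser(["1", "+", "2", ",", "3", "+", "x", ",", "4"])
print(p.parse_all())  # [3, None, 4] - parsing continued past the error
print(p.errors)       # ["expected number at token 6: 'x'"]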


A POI Database in One Line
19 April 2024 | 2:44 am

Overture Maps offers an extraordinarily useful, freely licensed database of POI (point of interest) listings, principally derived from partners such as Facebook and covering restaurants, shops, museums and other locations from all around the world.

Their new "overturemaps" Python CLI utility makes it easy to quickly pull subsets of their data... but requires you to provide a bounding box to do so.

Drew Breunig came up with this delightful recipe for fetching data, using LLM and gpt-3.5-turbo to fill in the bounding box:

overturemaps download \
  --bbox=$(llm 'Give me a bounding box for Alameda, California expressed as only four numbers delineated by commas, with no spaces, longitude preceding latitude.') \
  -f geojsonseq --type=place \
  | geojson-to-sqlite alameda.db places - --nl --pk=id
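
Once that finishes, a quick sanity check on the resulting SQLite database from Python (the places table name comes from the geojson-to-sqlite invocation above; any further column names would need checking against the actual schema):

import sqlite3

db = sqlite3.connect("alameda.db")
count = db.execute("select count(*) from places").fetchone()[0]
print(f"{count} places imported")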

Via @dbreunig


Andrej Karpathy's Llama 3 review
18 April 2024 | 8:50 pm

The most interesting coverage I've seen so far of Meta's Llama 3 models (8b and 70b now, with a 400b promised later).

Andrej notes that Llama 3 was trained on 15 trillion tokens - up from 2 trillion for Llama 2 - and that Meta used that many even for the smaller 8b model, roughly 75x more than the Chinchilla scaling laws would suggest. (Chinchilla's rule of thumb is about 20 training tokens per parameter, which would put the "compute optimal" budget for an 8b model in the low hundreds of billions of tokens.)

The tokenizer has also changed: the vocabulary is now 128,000 tokens, up from 32,000. The result is roughly 15% fewer tokens needed to represent the same string of text.
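
One way to verify that kind of claim yourself is to run the same text through both tokenizers and compare the counts. A minimal sketch, assuming you have the transformers library installed and access to the gated meta-llama models on Hugging Face:

from transformers import AutoTokenizer

text = open("sample.txt").read()  # any representative chunk of text

llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

n2 = len(llama2.encode(text))
n3 = len(llama3.encode(text))
print(f"Llama 2: {n2} tokens, Llama 3: {n3} tokens ({1 - n3 / n2:.0%} fewer)")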

The one disappointment is the context length - just 8,192 tokens, 2x that of Llama 2 and 4x that of LLaMA 1, but still pretty small by today's standards.

If early indications hold, the 400b model could be the first genuinely GPT-4 class openly licensed model. We'll have to wait and see.


