Sunday 5 November 2023

PrivateGPT Installation Notes

These notes work as of 7 November 2023 using Xubuntu 22.04 - your mileage may vary.

PrivateGPT

PrivateGPT is a production-ready AI project that allows you to ask questions about your documents using the power of Large Language Models (LLMs), even in scenarios without an Internet connection. 100% private, no data leaves your execution environment at any point.

Repo

https://github.com/imartinez/privateGPT

Docs

https://docs.privategpt.dev

Install

https://docs.privategpt.dev/#section/Installation-and-Settings

Install git

sudo apt install git

Install python

sudo apt install python3

Install pip

sudo apt install python3-pip

Install pyenv

cd ~
curl https://pyenv.run | bash

Add the commands to ~/.bashrc by running the following in your terminal:

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc

If you have ~/.profile, ~/.bash_profile or ~/.bash_login, add the commands there as well. If you have none of these, add them to ~/.profile:

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.profile
echo 'eval "$(pyenv init -)"' >> ~/.profile

Restart your shell for the changes to take effect.

Install Python 3.11

pyenv install 3.11
pyenv local 3.11

If you see these errors and warnings, install the required dependencies:

ModuleNotFoundError: No module named '_bz2'
WARNING: The Python bz2 extension was not compiled. Missing the bzip2 lib?
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/adrian/.pyenv/versions/3.11.6/lib/python3.11/curses/__init__.py", line 13, in <module>
    from _curses import *

ModuleNotFoundError: No module named '_curses'
WARNING: The Python curses extension was not compiled. Missing the ncurses lib?
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/adrian/.pyenv/versions/3.11.6/lib/python3.11/ctypes/__init__.py", line 8, in <module>
    from _ctypes import Union, Structure, Array

ModuleNotFoundError: No module named '_ctypes'
WARNING: The Python ctypes extension was not compiled. Missing the libffi lib?
Traceback (most recent call last):
  File "<string>", line 1, in <module>

ModuleNotFoundError: No module named 'readline'
WARNING: The Python readline extension was not compiled. Missing the GNU readline lib?
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/adrian/.pyenv/versions/3.11.6/lib/python3.11/ssl.py", line 100, in <module>
    import _ssl             # if we can't import it, let the error propagate
    ^^^^^^^^^^^
ModuleNotFoundError: No module named '_ssl'
ERROR: The Python ssl extension was not compiled. Missing the OpenSSL lib?

ModuleNotFoundError: No module named '_sqlite3'
WARNING: The Python sqlite3 extension was not compiled. Missing the SQLite3 lib?
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/adrian/.pyenv/versions/3.11.6/lib/python3.11/tkinter/__init__.py", line 38, in <module>
    import _tkinter # If this fails your Python may not be configured for Tk
    ^^^^^^^^^^^^^^^

ModuleNotFoundError: No module named '_tkinter'
WARNING: The Python tkinter extension was not compiled and GUI subsystem has been detected. Missing the Tk toolkit?
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/adrian/.pyenv/versions/3.11.6/lib/python3.11/lzma.py", line 27, in <module>
    from _lzma import *

ModuleNotFoundError: No module named '_lzma'
WARNING: The Python lzma extension was not compiled. Missing the lzma lib?

Install dependencies:

sudo apt update
sudo apt install libbz2-dev
sudo apt install libncurses-dev
sudo apt install libffi-dev
sudo apt install libreadline-dev
sudo apt install libssl-dev
sudo apt install libsqlite3-dev
sudo apt install tk-dev
sudo apt install liblzma-dev

Try installing Python 3.11 again:

pyenv install 3.11
pyenv local 3.11

Install pipx

python3 -m pip install --user pipx
python3 -m pipx ensurepath

Restart your shell for the changes to take effect.

Install poetry

pipx install poetry

Clone the privateGPT repo

cd ~
git clone https://github.com/imartinez/privateGPT
cd privateGPT

Install dependencies

poetry install --with ui,local

Download Embedding and LLM models

poetry run python scripts/setup

Run the local server

PGPT_PROFILES=local make run

Navigate to the UI

http://localhost:8001/

Shutdown

ctrl-c

GPU Acceleration

Verify the machine has a CUDA-Capable GPU

lspci | grep -i nvidia

Install the NVIDIA CUDA Toolkit

sudo apt update
sudo apt upgrade
sudo apt install nvidia-cuda-toolkit

Verify installation

nvcc --version
nvidia-smi

Install llama.cpp with GPU support

Find your version of llama_cpp_python:

poetry run pip list | grep llama_cpp_python

Substitute your version in the next command:

cd ~/privateGPT
CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.13

If you see an error like this, try specifying the location of nvcc:

Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [35 lines of output]
      *** scikit-build-core 0.6.0 using CMake 3.27.7 (wheel)
      *** Configuring CMake...
      loading initial cache file /tmp/tmp591ifmq4/build/CMakeInit.txt
      -- The C compiler identification is GNU 11.4.0
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Found CUDAToolkit: /usr/local/cuda/include (found version "12.3.52")
      -- cuBLAS found
      -- The CUDA compiler identification is unknown
      CMake Error at /tmp/pip-build-env-h3vy91ne/normal/lib/python3.11/site-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
        Failed to detect a default CUDA architecture.
      
      
      
        Compiler output:
      
      Call Stack (most recent call first):
        vendor/llama.cpp/CMakeLists.txt:258 (enable_language)
      
      
      -- Configuring incomplete, errors occurred!
      
      *** CMake configuration failed
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Build with the location of nvcc:

CUDACXX=/usr/local/cuda-12/bin/nvcc CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.13

Start the server

cd ~/privateGPT
pyenv local 3.11
PGPT_PROFILES=local make run

If you see this error, configure the number of layers offloaded to VRAM:

CUDA error 2 at /tmp/pip-install-pqg0kmzj/llama-cpp-python_a94e4e69cdce4224adec44b01749f74a/vendor/llama.cpp/ggml-cuda.cu:7636: out of memory
current device: 0
make: *** [Makefile:36: run] Error 1

Configure the number of layers offloaded to VRAM:

cp ~/privateGPT/private_gpt/components/llm/llm_component.py ~/privateGPT/private_gpt/components/llm/llm_component.py.backup
vim ~/privateGPT/private_gpt/components/llm/llm_component.py

change:

model_kwargs={"n_gpu_layers": -1},

to:

model_kwargs={"n_gpu_layers": 10},

Try to start the server again:

cd ~/privateGPT
pyenv local 3.11
PGPT_PROFILES=local make run

If the server is using the GPU you will see something like this in the output:

...
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA RTX A1000 Laptop GPU, compute capability 8.6
...
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 2902.35 MB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/35 layers to GPU
llm_load_tensors: VRAM used: 1263.12 MB
...............................................................................................
llama_new_context_with_model: n_ctx      = 3900
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  487.50 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 282.00 MB
llama_new_context_with_model: VRAM scratch buffer: 275.37 MB
llama_new_context_with_model: total VRAM used: 1538.50 MB (model: 1263.12 MB, context: 275.37 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
...

Ingest

For example, to download and ingest an HTML copy of A Little Riak Book:

cd ~/privateGPT
mkdir ${PWD}/ingest
wget -P ${PWD}/ingest https://raw.githubusercontent.com/basho-labs/little_riak_book/master/rendered/riaklil-en.html
poetry run python scripts/ingest_folder.py ${PWD}/ingest

Configure Temperature

cp ~/privateGPT/private_gpt/components/llm/llm_component.py ~/privateGPT/private_gpt/components/llm/llm_component.py.backup
vim ~/privateGPT/private_gpt/components/llm/llm_component.py

change:

temperature=0.1

to:

temperature=0.2

Restart the server

ctrl-c
cd ~/privateGPT
pyenv local 3.11
PGPT_PROFILES=local make run

Sunday 10 September 2023

Riak-like secondary index queries for S3

This is an idea for how to provide secondary index queries, similar to Riak 2i, on top of Amazon S3, using nothing but S3, boto3 and some Python.

This code hasn't been anywhere near a production environment, has never been benchmarked, has only processed trivial amounts of data and has only been tested against localstack. It's not even commented. As such, it should not be used by anybody for any reason - ever.

If you do give it a try, let me know how it went.

s32i.py

import os
import re
from concurrent.futures import ThreadPoolExecutor

from botocore.exceptions import ClientError


class S32iDatastore:

    __EXECUTOR = ThreadPoolExecutor(max_workers=os.cpu_count() - 1)

    INDEXES_FOLDER = 'indexes'
    LIST_OBJECTS = 'list_objects_v2'

    def __init__(self, s3_resource, bucket_name):

        self.s3_resource = s3_resource
        self.bucket_name = bucket_name

    def __run_in_thread(self, fn, *args):

        return self.__EXECUTOR.submit(fn, *args)

    def get(self, key):

        record = self.s3_resource.Object(self.bucket_name, key).get()
        indexes = record['Metadata']
        data = record['Body'].read()

        return data, indexes

    def head(self, key):

        record = self.s3_resource.meta.client.head_object(Bucket=self.bucket_name, Key=key)
        return record['Metadata']

    def exists(self, key):

        try:
            self.head(key)
            return True
        except ClientError:
            return False

    def put(self, key, data='', indexes={}):

        self.__run_in_thread(self.create_secondary_indexes, key, indexes)

        return self.s3_resource.Object(self.bucket_name, key).put(
            Body=data,
            Metadata=indexes)

    def delete(self, key):

        self.__run_in_thread(self.remove_secondary_indexes, key, self.head(key))

        return self.s3_resource.Object(self.bucket_name, key).delete()

    def create_secondary_indexes(self, key, indexes):
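        # For each comma-separated index value, write an empty marker object
        # under indexes/<index>/<value>/<key>; these marker keys are what the
        # range query below scans.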

        for index, values in indexes.items():
            for value in values.split(','):
                self.put(f'{self.INDEXES_FOLDER}/{index}/{value}/{key}')

    def remove_secondary_indexes(self, key, indexes):

        for index, values in indexes.items():
            for value in values.split(','):
                self.s3_resource.Object(self.bucket_name, f'{self.INDEXES_FOLDER}/{index}/{value}/{key}').delete()

    def secondary_index_range_query(self,
                                    index,
                                    start, end=None,
                                    page_size=1000, max_results=10000,
                                    term_regex=None, return_terms=False):
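        # List index marker objects starting after indexes/<index>/<start>,
        # stop once a key sorts beyond indexes/<index>/<end>, optionally
        # filter the index term with term_regex, and yield either the record
        # key or (key, term) pairs.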

        if end is None:
            end = start

        if term_regex:
            pattern = re.compile(f'^{self.INDEXES_FOLDER}/{index}/{term_regex}$')

        start_key = f'{self.INDEXES_FOLDER}/{index}/{start}'
        end_key = f'{self.INDEXES_FOLDER}/{index}/{end}'

        paginator = self.s3_resource.meta.client.get_paginator(self.LIST_OBJECTS)
        pages = paginator.paginate(
            Bucket=self.bucket_name,
            StartAfter=start_key,
            PaginationConfig={
                'MaxItems': max_results,
                'PageSize': page_size})

        for page in pages:
            for result in page.get('Contents', []):  # 'Contents' is absent when a page has no results

                result_key = result['Key']

                if result_key[0:len(end_key)] > end_key:
                    return

                if term_regex and not pattern.match(result_key):
                    continue

                parts = result_key.split('/')

                if return_terms:
                    yield (parts[-1], parts[-2])
                else:
                    yield parts[-1]
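
A rough usage sketch (assuming a localstack S3 endpoint, as in the tests below, and a hypothetical bucket name):

import json
import time

import boto3

from s32i import S32iDatastore

# Hypothetical localstack endpoint and bucket name, for illustration only.
s3 = boto3.resource('s3', endpoint_url='http://localhost.localstack.cloud:4566')
s3.create_bucket(Bucket='s32i-example-bucket')

store = S32iDatastore(s3, 's32i-example-bucket')

# Writes the record and, via a background thread, an empty index marker
# object at indexes/idx-gender-dob/2|19700101/KEY0001.
store.put(
    'KEY0001',
    json.dumps({'name': 'Alice', 'dob': '19700101', 'gender': '2'}),
    {'idx-gender-dob': '2|19700101'})

time.sleep(1)  # index markers are written asynchronously

# Range query: keys indexed with gender 2 and a date of birth in the 1970s.
print(list(store.secondary_index_range_query('idx-gender-dob', '2|1970', '2|1979')))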

s32i_test.py

import json
import unittest

import boto3

from s32i import S32iDatastore


class S32iDatastoreTest(unittest.TestCase):

    LOCALSTACK_ENDPOINT_URL = "http://localhost.localstack.cloud:4566"
    TEST_BUCKET = 's32idatastore-test-bucket'

    @classmethod
    def setUpClass(cls):

        cls.s3_resource = cls.create_s3_resource()
        cls.bucket = cls.create_bucket(cls.TEST_BUCKET)
        cls.datastore = S32iDatastore(cls.s3_resource, cls.TEST_BUCKET)
        cls.create_test_data()

    @classmethod
    def tearDownClass(cls):

        cls.delete_bucket()

    @classmethod
    def create_s3_resource(cls, endpoint_url=LOCALSTACK_ENDPOINT_URL):

        return boto3.resource(
            's3',
            endpoint_url=endpoint_url)

    @classmethod
    def create_bucket(cls, bucket_name):

        return cls.s3_resource.create_bucket(Bucket=bucket_name)

    @classmethod
    def delete_bucket(cls):

        cls.bucket.objects.all().delete()

    @classmethod
    def create_test_data(cls):

        cls.datastore.put(
            'KEY0001',
            json.dumps({'name': 'Alice', 'dob': '19700101', 'gender': '2'}),
            {'idx-gender-dob': '2|19700101'})

        cls.datastore.put(
            'KEY0002',
            json.dumps({'name': 'Bob', 'dob': '19800101', 'gender': '1'}),
            {'idx-gender-dob': '1|19800101'})

        cls.datastore.put(
            'KEY0003',
            json.dumps({'name': 'Carol', 'dob': '19900101', 'gender': '2'}),
            {'idx-gender-dob': '2|19900101'})

        cls.datastore.put(
            'KEY0004',
            json.dumps({'name': 'Dan', 'dob': '20000101', 'gender': '1'}),
            {'idx-gender-dob': '1|20000101'})

        cls.datastore.put(
            'KEY0005',
            json.dumps({'name': 'Eve', 'dob': '20100101', 'gender': '2'}),
            {'idx-gender-dob': '2|20100101'})

        cls.datastore.put(
            'KEY0006',
            json.dumps({'name': ['Faythe', 'Grace'], 'dob': '20200101', 'gender': '2'}),
            {'idx-gender-dob': '2|20200101', 'idx-name': 'Faythe,Grace'})

        cls.datastore.put('KEY0007', indexes={'idx-same': 'same'})
        cls.datastore.put('KEY0008', indexes={'idx-same': 'same'})
        cls.datastore.put('KEY0009', indexes={'idx-same': 'same'})

        cls.datastore.put(
            'KEY9999',
            json.dumps({'name': 'DELETE ME', 'dob': '99999999', 'gender': '9'}),
            {'idx-gender-dob': '9|99999999'})

    def test_get_record(self):

        data, indexes = self.datastore.get('KEY0001')

        self.assertDictEqual({'name': 'Alice', 'dob': '19700101', 'gender': '2'}, json.loads(data))
        self.assertDictEqual({'idx-gender-dob': '2|19700101'}, indexes)

    def test_head_record(self):

        indexes = self.datastore.head('KEY0002')

        self.assertDictEqual({'idx-gender-dob': '1|19800101'}, indexes)

    def test_2i_no_results(self):

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '3|30100101')

        self.assertListEqual([], list(keys))

    def test_2i_index_does_not_exist(self):

        keys = self.datastore.secondary_index_range_query('idx-does-not-exist', '3|30100101')

        self.assertListEqual([], list(keys))

    def test_2i_exact_value(self):

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '2|20100101')

        self.assertListEqual(['KEY0005'], list(keys))

    def test_2i_gender_2(self):

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '2|')

        self.assertListEqual(['KEY0001', 'KEY0003', 'KEY0005', 'KEY0006'], sorted(list(keys)))

    def test_2i_gender_2_max_results_2(self):

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '2|', max_results=2)

        self.assertListEqual(['KEY0001', 'KEY0003'], sorted(list(keys)))

    def test_2i_gender_1_dob_19(self):

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '1|19')

        self.assertListEqual(['KEY0002'], list(keys))

    def test_2i_gender_2_dob_19(self):

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '2|19')

        self.assertListEqual(['KEY0001', 'KEY0003'], sorted(list(keys)))

    def test_2i_gender_2_dob_1990_2000(self):

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '2|1990', '2|2000')

        self.assertListEqual(['KEY0003'], list(keys))

    def test_2i_term_regex(self):

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '1|', '2|', term_regex=r'[1|2]\|20[1|2]0.*')

        self.assertListEqual(['KEY0005', 'KEY0006'], list(keys))

    def test_2i_return_terms(self):

        key_terms = self.datastore.secondary_index_range_query(
            'idx-gender-dob', '1|', '2|',
            return_terms=True)

        self.assertListEqual([
            ('KEY0001', '2|19700101'),
            ('KEY0002', '1|19800101'),
            ('KEY0003', '2|19900101'),
            ('KEY0004', '1|20000101'),
            ('KEY0005', '2|20100101'),
            ('KEY0006', '2|20200101')],
            sorted(list(key_terms)))

    def test_2i_term_regex_return_terms(self):

        key_terms = self.datastore.secondary_index_range_query(
            'idx-gender-dob', '1|', '2|',
            term_regex=r'[1|2]\|20[1|2]0.*',
            return_terms=True)

        self.assertListEqual([('KEY0005', '2|20100101'), ('KEY0006', '2|20200101')], list(key_terms))

    def test_exists(self):

        self.assertTrue(self.datastore.exists('KEY0001'))
        self.assertFalse(self.datastore.exists('1000YEK'))

    def test_multiple_index_values(self):

        indexes = self.datastore.head('KEY0006')
        self.assertDictEqual({'idx-gender-dob': '2|20200101', 'idx-name': 'Faythe,Grace'}, indexes)

        keys = self.datastore.secondary_index_range_query('idx-name', 'Faythe')
        self.assertListEqual(['KEY0006'], list(keys))

        keys = self.datastore.secondary_index_range_query('idx-name', 'Grace')
        self.assertListEqual(['KEY0006'], list(keys))

    def test_multiple_keys_same_index(self):

        keys = self.datastore.secondary_index_range_query('idx-same', 'same')
        self.assertListEqual(['KEY0007', 'KEY0008', 'KEY0009'], sorted(list(keys)))

    def test_delete(self):

        self.assertTrue(self.datastore.exists('KEY9999'))

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '9|99999999')
        self.assertListEqual(['KEY9999'], list(keys))

        self.datastore.delete('KEY9999')

        self.assertFalse(self.datastore.exists('KEY9999'))

        keys = self.datastore.secondary_index_range_query('idx-gender-dob', '9|99999999')
        self.assertListEqual([], list(keys))

Saturday 27 March 2021

National Statistics Postcode Lookup Radius Search With Redis

Of all the questions posed by Plato, the profundity of one stands head and shoulders above the rest:

"How many postcodes are within a given radius of a given postcode?"

To answer Plato's question we're going to need some geographic information about UK postcodes:

National Statistics Postcode Lookup

This data set is probably the right one for the job. It's from a reliable source, it contains longitude and latitude for 2.6 million postcodes, and best of all - it's free.

The data is downloadable from geoportal.statistics.gov.uk, first item under the 'Postcodes' menu. The dataset appears to be released quarterly, in February, May, August and November.

At the time of writing, the latest download link points to:

www.arcgis.com/sharing/rest/content/items/7606baba633d4bbca3f2510ab78acf61/data

Interestingly, the domain is www.arcgis.com, the website for a well known commercial Geographic Information System - ArcGIS, from Esri.

Other data sets are available

Code-Point Open

Code-Point Open from Ordnance Survey is free, but location information is coded as eastings and northings - not ideal for this project.

PostZon

Part of the PAF datasets from Royal Mail, mentioned in the PAF Programmers Guide. It has longitude and latitude, but not much information beyond that. Non-free, and apparently leaked by WikiLeaks in 2009:

Was the leak of Royal Mail's PostZon database a good or bad thing?

UK Postcodes to Longitudes Latitudes Table

Provided by postcodeaddressfile.co.uk - a Royal Mail reseller. Appears to be a combination of PAF and OS data; it has longitude and latitude data but costs £199 for an Organisation Licence.

Geospatial Index

Redis provides geospatial indexing and a bunch of related commands - awesome, as long as you can provide it with longitude and latitude data.

Ideal for answering the question "How many postcodes are within a given radius of a given postcode" is the GEORADIUSBYMEMBER command.
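
For example, the same query can be run from Python with redis-py (a minimal sketch, assuming Redis is running locally and the nspl key has been loaded as described in the Data Load section below):

from redis import Redis

r = Redis(decode_responses=True)

# Postcodes within 100 metres of YO24 1AB, with their distances in metres.
print(r.georadiusbymember('nspl', 'YO24 1AB', 100, unit='m', withdist=True))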

Data Load

This bash script downloads the February 2021 release of the National Statistics Postcode Lookup ZIP file, unzips the file we need, parses the data and formats it into Redis commands, which are piped to Redis.

The script uses the csvtool command-line utility, which will need to be installed if you don't already have it.

load-nspl.sh

#!/bin/bash
# Data URL from: https://geoportal.statistics.gov.uk/datasets/national-statistics-postcode-lookup-february-2021
DATA_URL='https://www.arcgis.com/sharing/rest/content/items/7606baba633d4bbca3f2510ab78acf61/data'
ZIP_FILE='/tmp/nspl.zip'
CSV_FILE='/tmp/nspl.csv'
CSV_REGEX='NSPL.*UK\.csv'
REDIS_KEY='nspl' # NSPL - National Statistics Postcode Lookup
POSTCODE_FIELD=3 # PCDS - Unit postcode variable length version
LAT_FIELD=34 # LAT - Decimal degrees latitude
LONG_FIELD=35 # LONG - Decimal degrees longitude
START_TIME="$(date -u +%s)"

# Download data file if it doesn't exist
if [ -f "$ZIP_FILE" ]
then
    echo "'$ZIP_FILE' exists, skipping download"
else
    echo "Downloading '$ZIP_FILE'"
    wget $DATA_URL -O $ZIP_FILE
fi

# Unzip data if it doesn't exist
if [ -f "$CSV_FILE" ]
then
    echo "'$CSV_FILE' exists, skipping unzipping"
else
    echo "Unzipping data to '$CSV_FILE'"
    unzip -p $ZIP_FILE $(unzip -Z1 $ZIP_FILE | grep -E $CSV_REGEX) > $CSV_FILE
fi

# Process data file, create Redis commands, pipe to redis-cli
echo "Processing data file '$CSV_FILE'"
csvtool format "GEOADD $REDIS_KEY %($LONG_FIELD) %($LAT_FIELD) \"%($POSTCODE_FIELD)\"\n" $CSV_FILE \
| redis-cli --pipe

# Done
END_TIME="$(date -u +%s)"
ELAPSED_TIME="$(($END_TIME-$START_TIME))"
MEMBERS=$(echo "zcard nspl" | redis-cli | cut -f 1)
echo "$MEMBERS postcodes loaded"
echo "Elapsed: $ELAPSED_TIME seconds"

Expect output from the script similar to this:

Downloading '/tmp/nspl.zip'
...
196050K ......                                                100% 47.2M=54s
...
Unzipping data to '/tmp/nspl.csv'
Processing data file '/tmp/nspl.csv'
...
ERR invalid longitude,latitude pair 0.000000,99.999999
...
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 23258, replies: 2656252
2632994 postcodes loaded
Elapsed: 18 seconds

Don't worry about the errors:

ERR invalid longitude,latitude pair 0.000000,99.999999

There are about 23,000 entries in the data file with invalid longitude and latitude values, which Redis will reject. The NSPL User Guide (available in the downloaded ZIP file - NSPL User Guide Feb 2021.pdf) has this to say about them:

"Decimal degrees latitude - The postcode coordinates in degrees latitude to six decimal places; 99.999999 for postcodes in the Channel Islands and the Isle of Man, and for postcodes with no grid reference."

and

"Decimal degrees longitude - The postcode coordinates in degrees longitude to six decimal places; 0.000000 for postcodes in the Channel Islands and the Isle of Man, and for postcodes with no grid reference."

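To see how many of these rows are in the file yourself, a quick Python sketch (assuming the extracted /tmp/nspl.csv and the same 1-indexed latitude column, 34, used by load-nspl.sh above):

import csv

LAT_COLUMN = 34  # 1-indexed, as used by csvtool in load-nspl.sh

with open('/tmp/nspl.csv', newline='') as f:
    rows = csv.reader(f)
    next(rows)  # skip the header row
    invalid = sum(1 for row in rows if row[LAT_COLUMN - 1] == '99.999999')

print(f'{invalid} postcodes with no usable grid reference')
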
Queries

Once we've got a full dataset loaded we can run some queries with redis-cli:

127.0.0.1:6379> geopos nspl "YO24 1AB"
1) 1) "-1.0930296778678894"
   2) "53.95831391882791195"
127.0.0.1:6379> geopos nspl "YO1 7HH"
1) 1) "-1.0816839337348938"
   2) "53.96135558421912037"
127.0.0.1:6379> geodist nspl "YO24 1AB" "YO1 7HH" km
"0.8159"
127.0.0.1:6379> georadiusbymember nspl "YO24 1AB" 100 m WITHDIST
1) 1) "YO24 1AY"
   2) "29.0576"
2) 1) "YO1 6HT"
   2) "2.0045"
3) 1) "YO2 2AY"
   2) "2.0045"
4) 1) "YO24 1AB"
   2) "0.0000"
5) 1) "YO24 1AA"
   2) "69.7119"
127.0.0.1:6379> georadiusbymember nspl "YO1 7HH" 50 m WITHDIST
1) 1) "YO1 2HT"
   2) "32.6545"
2) 1) "YO1 7HT"
   2) "32.6545"
3) 1) "YO1 7HH"
   2) "0.0000"
4) 1) "YO1 2HZ"
   2) "40.3405"
5) 1) "YO1 2HL"
   2) "37.6516"
6) 1) "YO1 7HL"
   2) "38.9421"

REST API

Here's a super basic Flask-based REST service to query the geographic index. Postcode, distance and unit are provided as parameters in the request URL. Postcodes within the requested radius are returned as JSON, along with their distance from the provided postcode.

nspl-rest.py

from flask import Flask, jsonify
from redis import Redis


REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_KEY = 'nspl'

app = Flask(__name__)
r = Redis(host=REDIS_HOST, port=REDIS_PORT, db=REDIS_DB, decode_responses=True)


@app.route('/radius/<postcode>/<distance>/<unit>', methods=['GET'])
def radius(postcode, distance, unit):

    try:
        results = r.georadiusbymember(REDIS_KEY,
                                      postcode, distance, unit,
                                      withdist=True)
    except Exception:
        results = []

    return jsonify([{
        'postcode': result[0],
        'distance': result[1]
    } for result in results])


app.run()

API Example Usage

$ curl localhost:5000/radius/YO24%201AB/100/m | json_pp
[
   {
      "distance" : 29.0576,
      "postcode" : "YO24 1AY"
   },
   {
      "distance" : 2.0045,
      "postcode" : "YO1 6HT"
   },
   {
      "distance" : 2.0045,
      "postcode" : "YO2 2AY"
   },
   {
      "distance" : 0,
      "postcode" : "YO24 1AB"
   },
   {
      "distance" : 69.7119,
      "postcode" : "YO24 1AA"
   }
]
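
The same query from Python, using the requests library (a small sketch, assuming the service above is running locally on Flask's default port 5000):

import requests

# Flask's development server listens on localhost:5000 by default.
response = requests.get('http://localhost:5000/radius/YO24%201AB/100/m')

for result in response.json():
    print(result['postcode'], result['distance'])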

Source Code

Saturday 3 October 2020

Code-Point Open Postcode Distance AWS Lambda

Redis supports calculating distances using longitude and latitude with GEODIST, but I wanted to use eastings and northings to calculate distance between postcodes.

This project uses the Code-Point Open dataset, loaded into AWS ElastiCache (Redis) from an AWS S3 bucket, and provides an AWS Lambda REST API to query the distance between two given postcodes.

The Code-Point Open dataset is available as a free download from the Ordnance Survey Data Hub.

Dataset

CSV Zip Download - Code-Point Open

Source Code

Code available in GitHub - codepoint-distance

Build and Run

Build using Maven:
mvn clean install

See the README.md file on GitHub for AWS deployment instructions using the AWS Command Line Interface.

Example Usage

The REST API takes two postcodes as URL parameters and returns the distance in meters, along with each postcode's eastings and northings.

Using curl from the Linux command line:

curl -s https://77waizvyq3.execute-api.eu-west-2.amazonaws.com/Prod/codepoint/distance/YO241AB/YO17HH | json_pp
{
   "distance" : 817.743235985477,
   "toCodePoint" : {
      "postcode" : "YO1 7HH",
      "eastings" : 460350,
      "northings" : 452085
   },
   "fromCodePoint" : {
      "postcode" : "YO24 1AB",      
      "eastings" : 459610,
      "northings" : 451737
   }
}
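
As a quick sanity check, the returned distance is consistent with plain Euclidean distance on the eastings and northings (a short sketch using the values from the response above):

import math

# Eastings and northings taken from the example response above.
from_eastings, from_northings = 459610, 451737  # YO24 1AB
to_eastings, to_northings = 460350, 452085      # YO1 7HH

distance = math.hypot(to_eastings - from_eastings, to_northings - from_northings)
print(distance)  # ~817.743, matching the API response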

Saturday 16 November 2019

Start Stop Continue

Start Stop Continue is a virtual post-it note board for Start / Stop / Continue style retrospectives. It is implemented using Java, jQuery, and JSON files for persistence.

The project is designed for simplicity and ease of extension, rather than scalability. Even logging and error handling are secondary concerns at this point in the project.

An example instance of the site is hosted here: ststpcnt.com.

Source Code

Code available in GitHub - start-stop-continue

Setup

This project requires a minimum of Java 11 JDK to build.

Build and Run

Build using Maven:
mvn clean install

Run by executing the built jar file:
java -jar start-stop-continue-jar-with-dependencies.jar

Browse to:
http://localhost:8080/startstopcontinue

A new post-it note board with a unique URL will be created and notes can be added, edited and deleted. If this project is deployed to a publicly available host, the URL can be shared with other retrospective participants.

Future Improvements

Possible future improvements may include:

  • Add logging and more robust error handling
  • Integrate with a scalable datastore such as Apache Cassandra
  • Integrate with a scalable caching solution such as Redis
  • Use websockets for add/edit/delete live updates without refreshing the page
  • Port to AWS or other cloud based hosting provider

Saturday 27 July 2019

Raspberry Pi 4 Official Case Temperature

My Raspberry Pi 4, running without a case, has an idle temperature of 54°C. With the official Pi 4 case the idle temperature jumps to 72°C.

The official case is completely hotboxed, allowing for absolutely no airflow. Since the Pi 4 begins to throttle the CPU at 80°C, this makes the official case a design disaster and useless without the addition of active cooling.

The Noctua range of fans gets great reviews and they are super well made, but you pay a premium for quality; they're pricey compared to other brands. I picked the 40mm x 20mm NF-A4x20 5V for mounting on the outside of the Pi case.

If you want a slimmer fan to mount inside the case, go for the 40mm x 10mm NF-A4x10 5V.

Case Modding

I cut a 38mm hole in the top part of the case with a hole saw, at the end of the case away from where the Pi's USB and Ethernet ports are. Placing the fan over the hole, I marked out and drilled some screw holes for the screws provided with the fan.

In the side of the Pi case base, I've drilled six 2mm holes at 1cm intervals as an air inlet/exhaust.

Fan Connector Modding

The fan comes with a big fat 3-pin connector, too big to fit on the Pi's GPIO pins. The fan does come with a 2-pin adapter which you can add your own connectors to, but I chose not to use it as it would just take up space in the Pi case. Instead, I cut off the original connector, removed some of the wire insulation and crimped on some new DuPont connectors.

The black wire connects to one of the Pi's ground pins. The red wire connects to one of the Pi's 5V pins. The yellow wire is not required - I crimped a connector anyway, but just keep it out of the way with some tape.

Suck vs Blow

Should you mount the fan to blow cooler air on to the Pi board and vent the warmer air through the side holes, or use the side holes as an inlet for cooler air and suck the warmer air away from the Pi board?

The only way to really know is to mount the fan both ways, stress-test the Pi, measure the temperature and compare the results. Install the stress package on the Pi using apt with the command:

sudo apt-get install stress

For the tests below I have used the stress command with the cpu, io, vm and hdd parameters, with 4 workers for each, running for 5 minutes (300 seconds):

stress -c 4 -i 4 -m 4 -d 4 -t 300

The Pi's temperature can be measured with:

vcgencmd measure_temp

For the tests below, I sample the temperature every 5 seconds in a loop for 7 minutes (84 iterations) to record temperature rise and drop off:

for i in {1..84}; do printf "`date "+%T"`\t`vcgencmd measure_temp | sed "s/[^0-9.]//g"`\n"; sleep 5; done
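
If you'd rather log the samples from Python, here's a small sketch that reads the SoC temperature from sysfs (assuming the standard /sys/class/thermal/thermal_zone0/temp path on Raspbian, which reports millidegrees):

import time
from datetime import datetime

# Sample every 5 seconds for 7 minutes (84 iterations), as in the loop above.
for _ in range(84):
    with open('/sys/class/thermal/thermal_zone0/temp') as f:
        temp_c = int(f.read()) / 1000
    print(f'{datetime.now():%H:%M:%S}\t{temp_c:.1f}')
    time.sleep(5)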

Test 1 – Blow

Mounting the fan with the sticker side down to blow air onto the board, connecting the power pins, closing the case and running the stress test gave the following results:

$ stress -c 4 -i 4 -m 4 -d 4 -t 300
stress: info: [1074] dispatching hogs: 4 cpu, 4 io, 4 vm, 4 hdd
stress: info: [1074] successful run completed in 303s
$ for i in {1..84}; do printf "`date "+%T"`\t`vcgencmd measure_temp | sed "s/[^0-9.]//g"`\n"; sleep 5; done
10:59:42        38.0
10:59:47        37.0
10:59:52        43.0
10:59:57        45.0
11:00:02        47.0
11:00:07        48.0
11:00:12        48.0
11:00:17        49.0
11:00:22        49.0
11:00:27        50.0
11:00:32        50.0
11:00:37        51.0
11:00:42        51.0
11:00:48        52.0
11:00:53        52.0
11:00:58        51.0
11:01:03        53.0
11:01:08        52.0
11:01:13        52.0
11:01:18        53.0
11:01:23        53.0
11:01:28        53.0
11:01:34        53.0
11:01:42        52.0
11:01:48        53.0
11:01:55        52.0
11:02:00        54.0
11:02:05        54.0
11:02:10        54.0
11:02:15        53.0
11:02:20        53.0
11:02:25        53.0
11:02:30        53.0
11:02:35        54.0
11:02:41        54.0
11:02:46        54.0
11:02:51        53.0
11:02:56        52.0
11:03:01        54.0
11:03:06        53.0
11:03:11        54.0
11:03:16        53.0
11:03:21        54.0
11:03:26        54.0
11:03:31        54.0
11:03:36        54.0
11:03:41        54.0
11:03:46        54.0
11:03:51        54.0
11:03:56        54.0
11:04:01        53.0
11:04:06        54.0
11:04:11        53.0
11:04:16        54.0
11:04:21        53.0
11:04:26        54.0
11:04:31        53.0
11:04:37        54.0
11:04:42        53.0
11:04:47        54.0
11:04:52        49.0
11:04:57        46.0
11:05:02        45.0
11:05:07        44.0
11:05:12        46.0
11:05:17        43.0
11:05:22        42.0
11:05:27        42.0
11:05:32        41.0
11:05:37        40.0
11:05:42        41.0
11:05:47        40.0
11:05:52        40.0
11:05:57        41.0
11:06:02        39.0
11:06:07        40.0
11:06:12        39.0
11:06:17        39.0
11:06:22        38.0
11:06:27        38.0
11:06:32        38.0
11:06:37        38.0
11:06:42        39.0
11:06:47        38.0

Test 2 – Suck

Re-mounting the fan with the sticker side up to suck air away from the board, connecting the power pins, closing the case and running the stress test gave the following results:

$ stress -c 4 -i 4 -m 4 -d 4 -t 300
stress: info: [1041] dispatching hogs: 4 cpu, 4 io, 4 vm, 4 hdd
stress: info: [1041] successful run completed in 302s
$ for i in {1..84}; do printf "`date "+%T"`\t`vcgencmd measure_temp | sed "s/[^0-9.]//g"`\n"; sleep 5; done
11:22:41        39.0
11:22:46        40.0
11:22:51        46.0
11:22:56        49.0
11:23:01        50.0
11:23:06        51.0
11:23:11        52.0
11:23:16        52.0
11:23:21        52.0
11:23:26        52.0
11:23:31        53.0
11:23:36        54.0
11:23:41        54.0
11:23:46        54.0
11:23:51        55.0
11:23:56        55.0
11:24:01        55.0
11:24:06        54.0
11:24:11        55.0
11:24:16        55.0
11:24:22        55.0
11:24:27        54.0
11:24:37        55.0
11:24:42        56.0
11:24:47        57.0
11:24:52        56.0
11:24:57        57.0
11:25:02        55.0
11:25:07        56.0
11:25:12        56.0
11:25:17        57.0
11:25:22        56.0
11:25:27        57.0
11:25:32        56.0
11:25:37        57.0
11:25:42        58.0
11:25:47        58.0
11:25:53        58.0
11:25:58        58.0
11:26:03        57.0
11:26:08        58.0
11:26:13        57.0
11:26:18        58.0
11:26:23        58.0
11:26:28        57.0
11:26:33        58.0
11:26:38        57.0
11:26:43        57.0
11:26:48        58.0
11:26:53        58.0
11:26:58        59.0
11:27:03        58.0
11:27:08        58.0
11:27:13        57.0
11:27:18        58.0
11:27:23        59.0
11:27:28        58.0
11:27:33        58.0
11:27:38        58.0
11:27:43        58.0
11:27:48        55.0
11:27:53        51.0
11:27:58        49.0
11:28:03        48.0
11:28:09        47.0
11:28:14        46.0
11:28:19        46.0
11:28:24        46.0
11:28:29        45.0
11:28:34        45.0
11:28:39        44.0
11:28:44        44.0
11:28:49        43.0
11:28:54        44.0
11:28:59        44.0
11:29:04        42.0
11:29:09        42.0
11:29:14        42.0
11:29:19        42.0
11:29:24        43.0
11:29:29        43.0
11:29:34        42.0
11:29:39        42.0
11:29:44        42.0

Comparison

Blowing air keeps the Pi cooler than sucking air, with temperature ranges of 37°C-54°C and 39°C-59°C respectively for this fan/vent combination.

When sucking air, the Pi still hasn't returned to its original idle temperature 2 minutes after the stress test has ended.

Parts list and prices

Part Price Link
38mm Hole Saw £4.59 https://www.ebay.co.uk/itm/143196534863
DuPont Connectors £2.60 https://www.ebay.co.uk/itm/264250195674
Noctua NF-A4x20 5V £13.40 https://www.amazon.co.uk/gp/product/B071W6JZV8

Saturday 6 July 2019

Raspberry Pi Backup Server

Getting Old

Recently I've found myself lying awake at night worrying if my documents, code and photos are backed up and recoverable. Or to put it another way - I've officially become old :-(

With a new Raspberry Pi 4B on order it's time to re-purpose the old Raspberry Pi 3B to create a backup solution.

Hardware

I want my backup solution and backup media to be small, cheap and redundant. Speed isn't really an issue, so I've chosen micro SD as my backup media for this project.

I've picked up an Anker 4-Port USB hub, 2 SanDisk 64 GB micro SD cards and 2 SanDisk MobileMate micro SD card readers. I ordered this kit from Amazon and the prices at the time of writing were:

Component Price
Anker 4-Port USB 3.0 Ultra Slim Data Hub £10.99
SanDisk Ultra 64 GB microSDXC £11.73
SanDisk MobileMate USB 3.0 Reader £7.50

They fit together really well, with room for two more SD cards and readers if I need to expand:

The plan is to make one of the SD cards available over the network as a share, via the Pi using SAMBA. The share can be mapped as a Windows network drive and files can easily be dragged and dropped for backup. In case the first backup SD card fails, the Pi will copy the files and folders from the first SD card to the second SD card using rsync to create a backup of the backup.

Software

Download the latest version of Raspbian and upgrade the Pi 3B. I've chosen Raspbian Lite to save a bit of space on the Pi's SD card:

https://downloads.raspberrypi.org/raspbian_lite_latest

At the time of writing the latest download was: 2019-06-20-raspbian-buster-lite.zip

Write the OS to the Pi's SD card using Etcher. Top tip - Etcher can write a .zip file, but it's much quicker to extract the .img file from the .zip file and write that instead.

Don't forget to add an empty ssh file to the boot partition on the Pi's SD card if you are going to run the Pi headless.

Put the Pi's SD card into the Pi, attach the USB hub and micro SD cards, boot the Pi and log in via SSH. Update and upgrade packages first, enable unattended security updates and install your editor of choice:

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install unattended-upgrades
$ sudo apt-get install vim

Because I've got a Pi 4 on the way, I want to call this Pi 'raspberrypi3'. Modify the /etc/hostname and /etc/hosts files:

$ sudo vim /etc/hostname

raspberrypi3
$ sudo vim /etc/hosts

127.0.1.1       raspberrypi3
$ sudo reboot

At this point, the backup SD cards should be available to Linux as devices /dev/sda and /dev/sdb.

I want the backup SD cards to be readable on Linux and Windows machines using the exFAT file system. A good tutorial on how to do this on Linux using FUSE and gdisk is available here:

https://matthew.komputerwiz.net/2015/12/13/formatting-universal-drive.html

$ sudo apt-get install exfat-fuse exfat-utils
$ sudo apt-get install gdisk

Use gdisk to remove any existing partitions, create a new partition and write this to the SD cards. Make sure to create the new partition as type 0700 (Microsoft basic data) when prompted:

$ sudo gdisk /dev/sda

GPT fdisk (gdisk) version 0.8.8

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries.

Command (? for help):
Command (? for help): o
This option deletes all partitions and creates a new protective MBR.
Proceed? (Y/N): Y
Command (? for help): n
Partition number (1-128, default 1):
First sector (34-16326462, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-16326462, default = 16326462) or {+-}size{KMGTP}:
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300): 0700
Changed type of partition to 'Microsoft basic data'
Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): Y
OK; writing new GUID partition table (GPT) to /dev/sda.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.

Repeat for the second SD card:

$ sudo gdisk /dev/sdb

Create exFAT partitions on both SD cards and label the partitions PRIMARY and SECONDARY:

$ sudo mkfs.exfat /dev/sda1
$ sudo exfatlabel /dev/sda1 PRIMARY
$ sudo mkfs.exfat /dev/sdb1
$ sudo exfatlabel /dev/sdb1 SECONDARY

Create directories to mount the new partitions on:

$ sudo mkdir -p /media/usb/backup/primary
$ sudo mkdir -p /media/usb/backup/secondary

Modify /etc/fstab to mount the SD cards by partition label. This allows us to mount the correct card regardless of its device path or UUID:

$ sudo vim /etc/fstab

LABEL=PRIMARY /media/usb/backup/primary exfat defaults 0 0
LABEL=SECONDARY /media/usb/backup/secondary exfat defaults 0 0

Mount the SD cards:

$ sudo mount /media/usb/backup/primary
$ sudo mount /media/usb/backup/secondary

Create a cron job to rsync files from the primary card to the secondary card. The following entry syncs the files every day at 4am:

$ sudo crontab -e

0 4 * * * rsync -av --delete /media/usb/backup/primary/ /media/usb/backup/secondary/

To sync files immediately, rsync can be run from the command line at any time with:

$ sudo rsync -av --delete /media/usb/backup/primary/ /media/usb/backup/secondary/

To make the primary SD card available as a Windows share, install and configure SAMBA:

$ sudo apt-get install samba samba-common-bin
$ sudo vim /etc/samba/smb.conf

[backup]
   comment = Pi backup share
   path = /media/usb/backup/primary
   public = yes
   browseable = yes
   writable = yes
   create mask = 0777
   directory mask = 0777

$ sudo service smbd restart

Finally, install and configure UFW firewall, allowing incoming connections for SSH and SAMBA only:

$ sudo apt-get install ufw
$ sudo ufw default deny incoming
$ sudo ufw default allow outgoing
$ sudo ufw allow ssh
$ sudo ufw allow samba
$ sudo ufw enable