$ sudo dmidecode -t cache
Saturday, March 9, 2024
Big Data and CPU Caches
$ sudo dmidecode -t cache
Saturday, February 24, 2024
Home made Kubernetes cluster
When trying to run ArgoCD, I came across this problem that was stopping me from connecting. Using kubectl port-forward..., I was able to finally connect. But even then, if I ran:
$ kubectl get services --namespace argocd
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
argocd-applicationset-controller ClusterIP 10.98.20.142 <none> 7000/TCP,8080/TCP 19h
argocd-dex-server ClusterIP 10.109.252.231 <none> 5556/TCP,5557/TCP,5558/TCP 19h
argocd-metrics ClusterIP 10.106.130.22 <none> 8082/TCP 19h
argocd-notifications-controller-metrics ClusterIP 10.109.57.97 <none> 9001/TCP 19h
argocd-redis ClusterIP 10.100.158.58 <none> 6379/TCP 19h
argocd-repo-server ClusterIP 10.111.224.112 <none> 8081/TCP,8084/TCP 19h
argocd-server LoadBalancer 10.102.214.179 <pending> 80:30081/TCP,443:30838/TCP 19h
argocd-server-metrics ClusterIP 10.96.213.240 <none> 8083/TCP 19h
Anyway, this was the puppy and now the cluster seems to be behaving well.
kubectl cluster-info dump
Thursday, February 15, 2024
Spark and Schemas
I helped somebody on Discord with a tricksy problem. S/he was using a Python UDF in PySpark and seeing NullPointerExceptions. This suggests a Java problem as the Python error message for an NPE looks more like "AttributeError: 'NoneType' object has no attribute ..." But why would Python code cause Spark to throw an NPE?
The problem was the UDF was defining a returnType struct that stated a StructField was not nullable.
Note that Spark regards the nullable field as advisory only.
When you define a schema where all columns are declared to not have null values , Spark will not enforce that and will happily let null values into that column. The nullable signal is simply to help Spark SQL optimize for handling that column.- Spark, The Definitive Guide
Sunday, January 28, 2024
The Death of Data Locality?
So, why have many architects largely abandoned data locality? It's generally a matter of economics as the people at MinIO point out here. The idea is that if your data is not homogenous, you might be paying for, say, 16 CPUs on a node that's just being used for storage. An example might be that you have a cluster with 10 years of data but you mainly use that last two years. If the data for the first eight years is living on expensive hardware and rarely accessed, that could be a waste of money.
Tuesday, January 23, 2024
Avoiding Spark OOMEs
By using a lazy Iterator, Spark can write far more memory than it has to disk. As Spark consumes from the Iterator, it measures its memory. When it starts looking a bit full, it flushes to disk. Here is the memory usage of this code that uses mapPartitions to write to /tmp/results_parquet a data set that is much larger than the JVMs heap:
Spark with 0.5gb heap writing 1.3gb files |
watch "du -sh /tmp/results_parquet"
Thursday, January 11, 2024
Hilbert Curves
When you want to cluster data together over multiple dimensions, you can use Z-Order. But a better algorithm is the Hilbert Curve, a fractal that makes a best attempt to keep adjacent points together in a 1-dimensional space.
From DataBrick's Liquid Cluster design doc we get this graphical representation of what it looks like:
Dotted line squares represent files |
Z-ordering. Lines indicate contiguous data. Colours indicate different files. |
A Hilbert curve over sparse data |
A Hilbert curve over sparse data |
Z-Order over a similar sparse space |
Wednesday, January 3, 2024
GPU vs CPU vs AVX
Vector databases are all the rage. So, I looked at three different ways of multiplying vectors: CPU, GPU and Advanced Vector Extensions that leverages SIMD instructions if your hardware supports them. To access the GPU, I'm using the Tornado Java VM. For AVX, I'm using the JVM's jdk.incubator.vector module, available since JDK16.