
Learning Go: Building a Universal API Syncer


A while ago I set out to learn Go. I'm not sure exactly what it is about the language, but learning a paradigm that differs so much from the way I'm used to doing things entices me.

Then I got married (fortunately!), which put Go on hold. Now that things have settled down I am excited to get back to it.

I always like learning by trying to solve a problem or need I actually have. And right now, as I am building out my data warehouse, I need some scripts to sync in data. That made me think: since most REST APIs follow similar structures, can I build something that will let me sync… everything?

The Idea

I'm not building a web UI, just an easy-to-configure tool that automatically syncs APIs into my data warehouse. Starting off with a config.json that looks like this:

{
  "output_dir": "output",
  "sources": [
    {
      "name": "JSON Placeholder",
      "url": "https://jsonplaceholder.typicode.com/posts",
      "method": "GET",
      "headers": {},
      "returns": "array"
    },
    {
      "name": "Scam Archive Scams",
      "url": "https://scamarchive.org/api/scams",
      "method": "GET",
      "headers": {},
      "returns": "object",
      "list-field": "rows"
    }
  ]
}

Obviously as it grows I will need to specify how to handle rate limits, API keys, etc. But let’s get started with this first.

Two endpoints that return JSON — one returns a direct array of records, and the other returns an object that has a field for rows.

Getting Started with the Fetch

I can write Python and Node/Bun applications for the most part without checking documentation at this point. But even to do a basic thing like fetching data from a URL I need to learn a completely different way of doing it.

What’s critical for me is that I understand what is going on, because once you learn that, the rest is just syntax.

We start off in main.go with just:

package main

func main() {
    fmt.Println("Hello World")
}

Running go run main.go — if that doesn’t work, then nothing else down the line will.

Now that we have the base, let’s fill in our main. Go comes with a great standard library and we can create a request with:

req, err := http.NewRequest(http.MethodGet, "https://scamarchive.org/api/scams", nil)

There’s a bit here that helps us better understand Go.

First, req, err — Go can return multiple values from a function. Instead of .then or other error handling chains, the function returns an err value that you handle right after:

if err != nil {
    fmt.Printf("Error when creating the request %s\n", err)
}

A lot of people dislike this because it creates a lot of boilerplate, but personally I find it keeps the code simple and readable, with a consistent, explicit route for error handling.

After we create the request, we should probably use it:

res, err := http.DefaultClient.Do(req)

if err != nil {
    fmt.Printf("Error making HTTP request %s\n", err)
}

That was the fetch! Now we read the body (remembering to close it when we're done):

defer res.Body.Close()

resBody, err := io.ReadAll(res.Body)

Parsing the Response

Now that we have the response, we need to do something with it.

Normally you would create a struct with the structure of the incoming JSON. But since we want this to handle any data that comes through and treat it more or less the same, I will forsake my ancestors and use any.

any is an alias for the empty interface (interface{}), which every type satisfies — so a variable of type any can hold any value. This is very much a cop-out, but when dealing with data that is dynamic in structure it comes in handy.

var data any

// Unmarshal only returns an error. The & passes a pointer to data,
// so the function can write the decoded value into our variable.
err := json.Unmarshal(resBody, &data)

Normalizing the Data

How do we get unstructured data into a database? Well… we need to structure it, unless you want to use a document database or JSONB. Which I don’t want to do — I want this to output into a clean Postgres table.

My thought process is to use DuckDB to turn a .jsonl file holding the data into a Parquet file (structured, with a defined schema), and then copy that Parquet into my PG database. Since I'll be working with Parquet files elsewhere in the warehouse anyway, I might as well use them here.

The two situations I anticipated earlier: the API directly returns an array, or it returns an object with an array in a field. Let’s normalize those into a list of rows:

var rows []any // a slice of 'any' items

if source.Returns == "array" {
    rows = data.([]any) // type assertion: treat data as a slice
} else {
    m := data.(map[string]any) // type assertion: treat data as a map
    rows = m["OurFieldName"].([]any) // placeholder; this will come from the config's list-field
}

Writing to JSONL and Converting to Parquet

Since I want DuckDB to do a clean copy to Parquet, let’s make a temporary JSONL file:

tmpFile, _ := os.CreateTemp("", "sync-everything-*.jsonl")
for _, row := range rows {
    line, _ := json.Marshal(row)
    tmpFile.Write(line)
    tmpFile.Write([]byte("\n"))
}
tmpFile.Close()
defer os.Remove(tmpFile.Name())

Then convert it to Parquet using DuckDB:

db, err := sql.Open("duckdb", "")
if err != nil {
    fmt.Printf("Error starting DuckDB %s\n", err)
    os.Exit(1)
}
defer db.Close()

query := fmt.Sprintf(
    "COPY (SELECT * FROM read_json_auto('%s')) TO 'sync-%s-%d.parquet' (FORMAT PARQUET)",
    tmpFile.Name(),
    source.Name,
    time.Now().Unix(),
)

_, err = db.Exec(query)
if err != nil {
    fmt.Printf("Error executing DuckDB script %s\n", err)
    os.Exit(1)
}

Parsing the Config

While it should have been the first step, for me it’s last. Let’s make it actually parse the config.json and operate off that.

First I defined my structs:

type Source struct {
    Name      string `json:"name"`
    Url       string `json:"url"`
    Method    string `json:"method"`
    Returns   string `json:"returns"`
    ListField string `json:"list-field"`
    // Headers we will do later
}

type Config struct {
    OutputDirectory string   `json:"output_dir"`
    Sources         []Source `json:"sources"`
}

Config represents the file as a whole, and Source is each entry in sources. The backtick tags at the end (like json:"sources") tell the encoding/json package which JSON key maps to which struct field.

Now let’s open the file, read it, and unmarshal it:

jsonFile, err := os.Open("config.json")
if err != nil {
    fmt.Printf("Error opening JSON file %s\n", err)
    return
}
defer jsonFile.Close()

fmt.Println("Opened config file successfully")

byteValue, err := io.ReadAll(jsonFile)
if err != nil {
    fmt.Printf("Error reading file %s\n", err)
    return
}

var config Config

err = json.Unmarshal(byteValue, &config)
if err != nil {
    fmt.Printf("Error unmarshaling configuration %s\n", err)
    return
}

Putting It All Together

package main

import (
    "database/sql"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "os"
    "time"

    _ "github.com/duckdb/duckdb-go/v2"
)

type Source struct {
    Name      string `json:"name"`
    Url       string `json:"url"`
    Method    string `json:"method"`
    Returns   string `json:"returns"`
    ListField string `json:"list-field"`
    // Headers we will do later
}

type Config struct {
    OutputDirectory string   `json:"output_dir"`
    Sources         []Source `json:"sources"`
}

func main() {
    jsonFile, err := os.Open("config.json")
    if err != nil {
        fmt.Printf("Error opening JSON file %s\n", err)
        return
    }
    defer jsonFile.Close()

    fmt.Println("Opened config file successfully")

    byteValue, err := io.ReadAll(jsonFile)
    if err != nil {
        fmt.Printf("Error reading file %s\n", err)
        return
    }

    var config Config

    err = json.Unmarshal(byteValue, &config)
    if err != nil {
        fmt.Printf("Error unmarshaling configuration %s\n", err)
        return
    }

    for _, source := range config.Sources {
        fmt.Printf("Now running source %s\n", source.Name)

        req, err := http.NewRequest(http.MethodGet, source.Url, nil)
        if err != nil {
            fmt.Printf("error creating http request: %s\n", err)
            os.Exit(1)
        }

        res, err := http.DefaultClient.Do(req)
        if err != nil {
            fmt.Printf("error making http request: %s\n", err)
            os.Exit(1)
        }

        resBody, err := io.ReadAll(res.Body)
        res.Body.Close()
        if err != nil {
            fmt.Printf("Could not read response body %s\n", err)
            os.Exit(1)
        }

        var data any
        if err := json.Unmarshal(resBody, &data); err != nil {
            fmt.Printf("Could not unmarshal response body %s\n", err)
            os.Exit(1)
        }

        var rows []any
        if source.Returns == "array" {
            rows = data.([]any)
        } else {
            m := data.(map[string]any)
            rows = m[source.ListField].([]any)
        }

        tmpFile, _ := os.CreateTemp("", "sync-everything-*.jsonl")
        for _, row := range rows {
            line, _ := json.Marshal(row)
            tmpFile.Write(line)
            tmpFile.Write([]byte("\n"))
        }
        tmpFile.Close()
        defer os.Remove(tmpFile.Name())

        db, err := sql.Open("duckdb", "")
        if err != nil {
            fmt.Printf("Error starting DuckDB %s\n", err)
            os.Exit(1)
        }
        defer db.Close()

        query := fmt.Sprintf(
            "COPY (SELECT * FROM read_json_auto('%s')) TO '%s/sync-%s-%d.parquet' (FORMAT PARQUET)",
            tmpFile.Name(),
            config.OutputDirectory,
            source.Name,
            time.Now().Unix(),
        )

        _, err = db.Exec(query)
        if err != nil {
            fmt.Printf("Error executing DuckDB script %s\n", err)
            os.Exit(1)
        }
    }
}

That’s it for today. I have some other work to jump to but I am happy with this progress. It fetches a URL, unmarshals the JSON, handles both array and object responses, writes line by line to a JSONL file, and then executes a DuckDB command that copies it to a Parquet. Next time: API keys!