Last Friday at 11 PM, I was scrolling Reddit when my phone started vibrating like a jackhammer. Production monitoring was screaming—a core reporting query had timed out, P99 jumping from 300ms to 12 seconds. My first thought was “who the hell ran a full table scan again?” Turns out, the culprit was a seemingly innocent SELECT statement.
The problem was a settings column. It was defined as text in PostgreSQL, but we’d been writing YAML-formatted configuration data into it. Yeah, you read that right—YAML strings stuffed into a text column, parsed only at the application layer. The real kicker? We wanted to leverage JSONB’s query capabilities directly in the database, but the column was completely unusable with ->> operators.
This article is about exactly that: how to parse YAML-formatted strings into JSON using a SELECT statement, then safely migrate to a JSONB column. I hit every pothole along the way, so here’s the full playbook.
Symptoms: You Think You Can Query, But You Can’t
Here’s what the table looked like:
CREATE TABLE user_settings (
id SERIAL PRIMARY KEY,
user_id INTEGER,
settings TEXT -- Contains YAML strings
);
The settings column had data like this:
language: en-US
theme: dark
notifications:
email: true
push: false
We wanted to query a user’s email notification toggle directly in SQL:
SELECT settings->>'notifications'->>'email' FROM user_settings;
PostgreSQL immediately shot back: operator does not exist: text ->> unknown. Of course—text doesn’t support JSON operators. What about casting to jsonb?
SELECT settings::jsonb->'notifications' FROM user_settings;
Error: invalid input syntax for type json. Because YAML isn’t JSON, and PostgreSQL’s ::jsonb casting has no idea what to do with it.
So here’s the symptom summary: Your data is YAML, your column type is text, and you can neither query it as JSON nor cast it to JSONB directly. What makes it worse? If you’re using an ORM like Rails with serialize, it defaults to YAML serialization and stores it in a text column. This has been roasted on Reddit endlessly, yet people still do it.
Root Cause: YAML Is Not JSON, and PostgreSQL Won’t Clean Up After You
The root cause is dead simple, but most people (myself included) overlook it:
- PostgreSQL natively supports JSON/JSONB, not YAML. There’s no built-in YAML parsing function. You can’t expect
::jsonbto handle YAML syntax. - YAML and JSON have syntax differences. YAML allows indentation, comments, multi-line strings, anchors, aliases—none of which JSON supports. Direct conversion will fail.
- Legacy baggage. Many older projects (especially Rails 3/4 era apps) love
serialize :settings, Hash, which defaults to YAML. Then everyone discovered JSONB was better, but the data was already rotting in the database.
There’s a thread on Reddit in /r/rails where someone posted “35 days, 0 dollars spent on ads. Here’s my realistic SEO playbook without selling you a SEO product.” One reply said: “I spent 3 months migrating YAML columns to JSONB. Worst decision ever was using serialize in the first place.” Dozens of +1’s followed.
Honestly, this is the classic “saved time now, pay debt forever” situation.
The Fix: Converting YAML Strings to JSONB Step by Step
Alright, enough complaining. Here’s the meat. The core approach is: Write a YAML parsing function in PostgreSQL using PL/Python (or PL/Perl), then run a single UPDATE statement to batch-convert everything. If you can’t install extensions, you’ll need to handle it at the application layer—but this is the pure database solution.
Step 1: Install the PL/Python Extension
PL/Python lets you write Python functions inside PostgreSQL, which means you can use Python’s yaml library directly.
# Make sure plpython3u is installed
sudo apt-get install postgresql-plpython3-16 # Adjust version to match your PG
# Create the extension in your database
psql -U your_user -d your_db -c "CREATE EXTENSION IF NOT EXISTS plpython3u;"
Heads up: PL/Python isn’t installed by default. You’ll need server root access. If you’re on a managed DB like RDS, you’re probably out of luck—go with the application-layer approach instead.
Step 2: Create the YAML-to-JSON Function
CREATE OR REPLACE FUNCTION yaml_to_jsonb(yaml_text TEXT)
RETURNS JSONB
LANGUAGE plpython3u
AS $$
import yaml
import json
if yaml_text is None:
return None
try:
parsed = yaml.safe_load(yaml_text)
return json.dumps(parsed)
except Exception as e:
raise ValueError(f"YAML parsing error: {str(e)}")
$$;
Critical pitfalls:
- Always use
yaml.safe_load(), neveryaml.load(). The latter can execute arbitrary Python code. There’s a Reddit thread that says it plainly: “If you use yaml.load() in production, you deserve to get pwned.” - The return type is TEXT that gets implicitly cast to JSONB.
json.dumps()returns a string, and PostgreSQL handles the conversion. - Handle exceptions carefully. If the YAML is malformed, the function throws. Always test with SELECT first.
Step 3: Test the Function
Run it against a small sample:
SELECT
id,
settings,
yaml_to_jsonb(settings) AS settings_jsonb
FROM user_settings
WHERE id IN (1, 2, 3)
LIMIT 10;
If settings_jsonb returns valid JSON (no errors), you’re good to go.
Step 4: Batch Convert and Update
-- Add a new JSONB column
ALTER TABLE user_settings ADD COLUMN settings_jsonb JSONB;
-- Batch convert using the function
UPDATE user_settings
SET settings_jsonb = yaml_to_jsonb(settings)
WHERE settings IS NOT NULL;
-- Verify the conversion
SELECT
COUNT(*) AS total,
COUNT(settings_jsonb) AS converted,
COUNT(*) - COUNT(settings_jsonb) AS failed
FROM user_settings;
If failed isn’t 0, some rows have malformed YAML. Find them:
SELECT id, settings
FROM user_settings
WHERE yaml_to_jsonb(settings) IS NULL AND settings IS NOT NULL;
Step 5: Drop the Old Column, Rename, and Index
-- After confirming everything works, drop the old column
ALTER TABLE user_settings DROP COLUMN settings;
-- Rename the new column
ALTER TABLE user_settings RENAME COLUMN settings_jsonb TO settings;
-- Create a GIN index for fast JSONB queries
CREATE INDEX idx_user_settings_gin ON user_settings USING GIN (settings);
Step 6: Query Like a Pro
Now you can finally use JSONB operators:
-- Find all users with email notifications enabled
SELECT user_id, settings->>'language' AS lang
FROM user_settings
WHERE settings @> '{"notifications": {"email": true}}';
Query time dropped from 12 seconds to under 50ms. That’s the kind of improvement that gets you a beer from your team.
Comparison: Which Approach Should You Use?
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| PL/Python Function | Pure database, no app changes | Requires PL/Python extension, potential permission issues | Large datasets with DB admin access |
| Application-Layer Batch Script | No DB extensions needed, flexible | Data transfer overhead, needs separate script | Small datasets or restricted DB permissions |
| PL/Perl Function | Perl has built-in YAML::XS module | More complex syntax, harder to maintain | Teams that know Perl (rare) |
| Rails Migration Script | Most natural for Rails projects | Rails-only, requires downtime | Rails projects that can handle brief downtime |
I strongly recommend the PL/Python approach—it’s the cleanest. But if you’re on a managed database like RDS without extension privileges, go with the application layer.
FAQ
Q: How to convert string to jsonb in PostgreSQL?
A: If the string is valid JSON, use ::jsonb casting: SELECT '{"key": "value"}'::jsonb;. If it’s YAML, you need to parse it first. Use the yaml_to_jsonb() function I described above, or handle it in your application code with a Python script. Note that to_jsonb() converts SQL values to JSONB—it won’t parse string formats.
Q: How to query jsonb column in PostgreSQL?
A: Use the -> and ->> operators. -> returns a JSONB object, ->> returns text. Example: SELECT settings->'notifications'->>'email' FROM user_settings;. For more complex queries, use the @> containment operator: SELECT * FROM user_settings WHERE settings @> '{"theme": "dark"}';. Always create a GIN index on your JSONB column—without it, you’re doing full table scans.
Q: How to fix JSON parse unexpected character at line 1 column 1 of the JSON data?
A: This error usually means your data isn’t JSON. Common causes:
- Data is YAML (exactly our scenario). Solution: convert to JSON first.
- Data has a BOM header. Strip it with
regexp_replace:SELECT regexp_replace(data, '\xEF\xBB\xBF', '', 'g')::jsonb; - Data is an empty string. Check for NULL and empty strings:
SELECT CASE WHEN data IS NULL OR data = '' THEN NULL ELSE data::jsonb END; - Data is a bare value (number, boolean). JSON requires the top-level to be an object or array. Bare strings/numbers/booleans may fail in some PG versions.
Q: Does PostgreSQL support JSONB?
A: Yes, since PostgreSQL 9.4. JSONB stores data in a binary format, supports indexing (GIN indexes), and is significantly faster than the json type for queries. It also supports operators like @>, <@, ?, ?|, and ?& for complex document queries. For any new project, use JSONB—skip json and definitely skip text.
Final Thoughts
Honestly, I deserved this headache. I stored YAML in a text column because I thought “the app layer can parse it anyway”—and now I’m paying double the time to clean it up. There’s a Reddit comment that nails it: “Technical debt is like a payday loan. It feels good at the moment, but you’ll pay interest forever.”
If you’re still storing YAML in a text column, migrate now. Don’t wait until your monitoring blows up at 11 PM on a Friday.
One more thing: if you’re on Rails, look into ActiveRecord’s attribute method for automatic serialization/deserialization, but keep the underlying column as JSONB. Rails 7.1+ has native JSONB column support—stop using serialize.
And for the love of everything holy, test thoroughly. Our first UPDATE failed because one malformed YAML entry blew up the entire function. It took 30 minutes to roll back. Back up your data, batch your updates, and add proper exception handling.
I’m going to bed now. Good luck.