You can't unit test for taste

(dev.karltryggvason.com)

301 points by kalli 4 days ago|141 comments

•

trjordan 3 days ago

You can't unit test for taste if you haven't written down what you mean by taste. If you can externalize it, then you can.

Follow this line of thinking, and the AI-friendly answer is easy: we just have to externalize everything we know, so Claude can implement what I want.

Except that I can't fully externalize myself. Debugging a system takes more resources than running the system. If I could write down everything I know and hand it to a machine, I'd do that, but it's impossible.

People aren't books or hashmaps. If you want to build something, you need to use the tools, not teach the tools to use you.

[edit: I'm trying to figure out if there's something to be done about this. Email me if you want to chat -- tr at tern dot sh]

•

bonzini 3 days ago

It can't be written down as code, that's the point.

I am more familiar with taste in coding and it can at best be described—that the resulting code is too subtly different from something else in the codebase, that you're masking a different bug, that you're not following what the code tells you. The good part is that while this cannot be unit tested, you can write documentation and code comments about it that tell people what they need to know.

But for taste of the kind described in the article there's not even a definition. The logic ended up being "trust a bunch of opaque weights the most"

•

fragmede 2 days ago

Apple's Human Interface Guidelines say that some things can be written down though. They're a very thorough look at UX, and while Apple doesn't adhere to them perfectly itself, they're very much a north star toward some ideals. You can't unit test for taste, but you can integration test that bad tastes haven't happened.

•

sscaryterry 2 days ago

I think Apple lost a bit of credibility after the round-corner fiasco that still persists on Tahoe.

•

InsideOutSanta 2 days ago

They wrote the HIG before Alan came in and trashed the place.

•

sscaryterry 2 days ago

Indeed, I'm sure Steve Jobs is rolling in his grave.

•

vkou 2 days ago

Steve Jobs was also responsible for brilliant bits of usability like puck mice, and the need to have two functioning hands in order to right-click.

•

InsideOutSanta 2 days ago

As somebody who uses a claw grip, I loved the puck mouse. Now the stupid mouse where the charger plugs in at the bottom, that one actually sucks.

•

usef- 2 days ago

.... and ensuring the entire UI did not require right click to function. Everything was visible to click.

The usability of iPhones and iPads is a great example of how he was right. They're very easy to use and no functionality was hidden in a right click menu: it had to be visible somewhere.

Right click was still always available as a shortcut for advanced users.

•

InsideOutSanta 2 days ago

Yeah, I think people who didn't use Macs at the time misunderstand the whole "second mouse button"/"context menu" thing. If you were on Windows, you literally couldn't use the computer without context menus. But Mac OS at the time was designed such that every action the user could access was visible in the regular UI, either through a button or through the menubar.

When the context menu was introduced, it was initially designed as a shortcut to actions that were already available elsewhere in the UI.

•

vkou 21 hours ago

By the time the context menu was ubiquitous, their mice still did not have two buttons.

Just because the feature is available somewhere else in the UI doesn't mean that the shortcut for it must be a two-handed one.

•

usef- 18 hours ago

I've been able to plug in any two buttoned mouse for the 22 years since I first used a Mac. Their own trackpads and mice allow two finger tap to be enabled for advanced users (but on a laptop one finger can press ctrl while the other taps). I don't know how far back you're talking about when you imply no support for them.

But I remember noticing years ago a large room of tech professionals and 100% of the Windows users had mice plugged into their laptops, and zero percent of Mac users did. It was a failure of the Windows ecosystem that people needed those imho.

•

sscaryterry 17 hours ago

This is only because IMHO, the trackpad is something you can "live with" (edit: on a Mac) temporarily. It beats carrying a mouse around. Having said that, I know a UX designer that only uses the trackpad. Boggles my mind.

•

usef- 16 hours ago

It's not temporary: Mac trackpads are precise and the multitouch gestures are integrated well with the system. Mice don't support them.

•

vkou 17 hours ago

> I don't know how far back you're talking about when you imply no support for them.

It's not 'no support', it was an insane default. For all the talk of 'easy to use', there's a reason context menus exist. You can't just cram every context-specific interaction into an omnibar or a leftclick. Non-trivial software is complicated. Adding that friction to its use does nobody any favours.

Yes, in the decades since... Trackpads have gotten a lot better, but at the time Jobs was pushing for that nonsense, they simply weren't good enough. (And didn't exist at all for non-laptop computers.)

•

usef- 16 hours ago

Defaults are for the normal consumer, non trivial software is not, I think? What's something you think must only exist in a context menu?

Note that in non-trivial or professional software it's typical to have a hand on the keyboard, because not even a second mouse button is enough. Hold 'q' while dragging to adjust exposure in capture one, etc. Or they have dedicated input hardware like mixing consoles. Or they plug in a speciality mouse.

•

vkou 11 hours ago

All software is non-trivial. You weren't buying a $3000 computer in 1999 to only use 'trivial' software.

•

usef- 8 hours ago

Ok. So what's an example of something that should only exist in a right click context menu, for the average consumer?

•

computomatic 2 days ago

Wasn’t it introduced on Tahoe? (Perhaps my memory is failing me here.) Do you mean it still persists on Golden Gate? They seem to have addressed the majority of issues I heard about - unless you mean the issue is that rounded corners exist at all.

•

sscaryterry 2 days ago

See: https://medium.com/@makalin/reclaiming-the-screen-a-develope...

•

trumpdong 2 days ago

AI written article with two fullscreen popups?

•

koiueo 2 days ago

Apple lost all credibility in UI around the time they introduced colorful vomit instead of app icons.

•

Chris2048 3 days ago

Technically, AI is code, just very complex code.

I'd say there are "simple" things you can do though, like taking automated screenshots and detecting colours to flag jarring colour schemes.
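
A minimal sketch of one such check, assuming the foreground/background colours have already been sampled from a screenshot (the sampling step is not shown). It uses the WCAG 2.x contrast-ratio formula as a stand-in for "jarring", which is only one measurable proxy; `low_contrast_pairs` and its 4.5 threshold are illustrative choices, not anything from the article.

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def luminance(rgb):
    """Relative luminance of an (r, g, b) tuple with 0-255 channels."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a, b):
    """WCAG contrast ratio between two colours, from 1.0 up to 21.0."""
    hi, lo = sorted((luminance(a), luminance(b)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)


def low_contrast_pairs(pairs, minimum=4.5):
    """Return the (foreground, background) pairs that fall below `minimum`."""
    return [(fg, bg) for fg, bg in pairs if contrast_ratio(fg, bg) < minimum]
```

A check like this can only say that a pairing is hard to read, not that a palette is ugly, which is roughly the limit the thread is arguing about.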

•

Chris2048 20 hours ago

Must have hit some nerves.

•

fny 2 days ago

You absolutely cannot unit test for taste.

I had this experience doing a port from Big Query to Postgres using Opus. I had unit tests to guarantee parity with the original code, and Opus insisted on building this bespoke query builder (e.g. `def _where(very_complicated_params)`) on top of sqlglot.

Even with the original code being straightforward and legible and repeated instructions to match, I had to fight with it to get close.

In the end, I ended up doing things the "old-fashioned way", where I copied chunks of code into Claude proper and gave explicit instructions for each piece.

I clearly had externalized the requirements, and yet that wasn't sufficient. The only way to unit test further would be to use an AST to evaluate the output against metrics I couldn't even encode.
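
For what it's worth, here is a sketch of the kind of AST check that can be encoded, using Python's standard `ast` module. It only counts parameters per function; the threshold of 4 and the name `wide_functions` are assumptions for illustration, and it says nothing about the harder-to-state qualities described above.

```python
import ast


def wide_functions(source, max_params=4):
    """Return names of functions that take more than `max_params` parameters."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            a = node.args
            count = len(a.posonlyargs) + len(a.args) + len(a.kwonlyargs)
            # *args and **kwargs each count as one parameter.
            count += (a.vararg is not None) + (a.kwarg is not None)
            if count > max_params:
                offenders.append(node.name)
    return offenders
```

It would flag an over-parameterised helper, but a bespoke query builder with tidy signatures would pass untouched.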

•

ElevenLathe 2 days ago

The bigger problem I have as a worker is that, once I externalize it (by writing a skill or whatever), it becomes a work-for-hire whose copyright is owned by my employer. Technically this is true of a few other things I do for work, like my .emacs and .bashrc files, small scripts I keep in ~/bin on my workstation, etc., but no employer cares to assert this unless they're being assholes for some unrelated reason. Agent skill files, especially ones that seem to semi-reliably do what they say on the tin (the white whale!), are not like that at all, and I can see them pursuing you if you try to use them at a future employer.

•

hammock 2 days ago

This is a solid point, and the only answer to it I can think of is that execution is 99x harder than ideas. Even if you enumerate everything, someone else trying to use it is still going to muck it up.

•

giancarlostoro 3 days ago

What's kind of funny is this is how I implemented "gates" for the ticketing system I built for Claude, because Beads would just close tickets without validation. I have tickets that are literally "Human validation" tier, so it will work on the next available thing until I personally tell the model to close it. So, in that spirit, yeah, you can unit test for taste, if you implement external validation.

Unit test runs, waits for human input before passing or failing, which might seem out of the norm, but we already have QA do manual testing.
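
Roughly what that gate could look like, as a sketch rather than the actual ticketing code: `human_gate` and `check_checkout_page` are made-up names, and the prompt goes through an injectable `ask` callable (defaulting to `input`) so the gate itself can be exercised without a person at the keyboard.

```python
def human_gate(description, ask=input):
    """Block until a reviewer answers; only an explicit 'y' or 'yes' passes."""
    answer = ask(f"{description} -- approve? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


def check_checkout_page(ask=input):
    # Hypothetical "Human validation" ticket: it stays open until a person
    # has looked at the result and said yes.
    assert human_gate("Checkout page layout", ask), "reviewer rejected the change"
```

Anything other than an explicit yes fails, so an unattended run cannot close the ticket by accident.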

•

coldtea 2 days ago

>You can't unit test for taste if you haven't written down what you mean by taste. If you can externalize it, then you can.

If you can externalize it, you only captured the small part of taste that can be externalized in concrete rules.

You can of course pretend anything else doesn't exist, like a person denying anything that can't be measured by their instruments.

•

playorizaya 24 hours ago

Gets into the Hard Problem of Consciousness vs AGI which is a discussion that needs to happen in AI.

Subjective "taste" and "feel" are experiences one has, rather than language one predicts out. Language is only produced to report on the experience, like "Wow, that's an ugly couch".

A vision model doesn't model how it experiences or feels (internally) about the image, just objective information about features of the image itself (external).

There are layers to aesthetics - part of it is functionality, utility, the environment vs your needs, but a big part of your style is directly related to your personality, memories, experiences, and how you physically fit with it. It's not correct/incorrect, it's optimizing for the entire circumstance, internal and external.

It can be hard to find the words to explain why an aesthetic works, or feels right (or wrong). What's even more important is when another person agrees. When you can have cohorts, trends, cliques, and hype.

AI can't do any of these inter/intra social activities, and so, like other acts of creation it can never operate at the cutting edge the way a human mind can. But with better and better vision models paired with good language models, synthetic subjectivity will do the job soon enough for most intents and purposes.

•

Dumblydorr 3 days ago

Randomized trial. Half of them pledge to use AI freely and liberally, half of them to never use it, compare via surveys and off-AI tests after X months. Could even flip it so then the non-users used it for X months and vice versa, see if losses/gains are stable.

•

pydry 3 days ago

I remember reading an interview with a fireman who described a time when his buddy evacuated a team because he "felt" that a floor would collapse imminently.

He couldn't articulate why but they trusted his gut and it did collapse.

A lot of software engineering relies on that kind of intuition and on a good team you can integrate it and benefit from it and avoid all manner of floor collapses.

•

dyarosla 2 days ago

To play devil’s advocate, intuition is still a physical response to stimuli mixed with knowledge of past experience. Hypothetically it could be modeled- the problem here comes down to how to encode it.

•

sigbottle 2 days ago

"Encoding" implies some GOFAI symbolic formal rule machinery.

I'd argue that transformers are a pretty good indication that intelligence isn't "encodable" in the way we think it means. Usually, most "model" vocabulary means that we can explain and constrain the "data" from the "rules". Except the mere "data" is trillions of interacting weights.

That may be encoding in a physical sense, but that still doesn't explain the intuition in any legible way to humans.

Cynically, we've been able to encode everything already by just saying everything's a transition in a huge lookup table. Not very informative though.

•

tmoertel 3 days ago

> You can't unit test for taste if you haven't written down what you mean by taste. If you can externalize it, then you can.

I'm not so sure. For instance, you can write down what it means for a program to be free of XSS and other injection vulnerabilities. Now, how would you unit test for that property?

•

delichon 3 days ago

You may be able to effectively externalize taste by "hot or not" style pair testing. Enough comparisons and I'd expect ML to be able to mimic human taste by latching on to features we're not well aware of influencing us.
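
A sketch of just the bookkeeping side, assuming votes arrive as (winner, loser) pairs. It uses a plain Elo update to turn pairwise picks into a ranking; that is a swapped-in, simple technique, not the feature-learning ML model described above, and `elo_ratings` with its default `k` and `start` values is illustrative.

```python
from collections import defaultdict


def elo_ratings(comparisons, k=32.0, start=1500.0):
    """Rate items from an iterable of (winner, loser) pairwise votes."""
    ratings = defaultdict(lambda: start)
    for winner, loser in comparisons:
        # Probability the winner was expected to win, given current ratings.
        expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        delta = k * (1 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)
```

The resulting scores could serve as training labels for a model, which is where the hard part starts.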

•

trjordan 3 days ago

This is RL, right? Like, this is exactly why models have mostly converged around obvious style, because we train them literally on thumbs-up/thumbs-down data of what good behavior and good code looks like.

And that's why it's so hard to get a model to reproduce the specific taste of a person or an organization. My taste is different than yours, so if we dump our aggregate preferences into RL, it averages out to nothing interesting.

For the code-writing case, this means you end up reviewing every line of code, looking for places where you'd thumbs-down the code. Not every line of code contains a real decision, though, so it feels like a waste of time.

•

paytonjjones 3 days ago

This is, in short, the big current problem with AI.

LLMs are built for scale so they've given up on the kind of online learning / "long term memory" processes that would individualize them.

The LLM is permanently locked to being a really cracked engineer on their first day at your company, looking at your codebase for the first time.

You can scaffold a bit with .md files, but at the moment they lack the ability to do what humans do: go to sleep, encode things from short to long term memory, and wake up the next day with more specific knowledge baked in.

•

trjordan 3 days ago

100%. The problem with them isn't making sure they're doing the right thing, it's making sure they're not making bad assumptions.

IMHO this is where code review goes until we fix the individualized model thing: you need to review the decisions the agent made, where you didn't steer. Most will be right. A few will be disastrously wrong. But decision-by-decision is a lot less to review than line-by-line of code.

•

monknomo 23 hours ago

how are you getting some reviewable artifact with the decisions in it?

•

pixl97 2 days ago

Yea, individual learning is super expensive at this point and scale is the only way for paying for training at this point. Maybe at some point in the future we'll get this.

•

plastic-enjoyer 3 days ago

> LLMs are built for scale so they've given up on the kind of online learning / "long term memory" processes that would individualize them.

I wonder if this is even desirable from a product perspective. You probably don't want online learning in a product that you are selling because you can't guarantee a consistent quality of the product.

•

paytonjjones 3 days ago

You could say the same thing about employees!

And to be fair, the ability to fire employees and hire new ones is pretty important for that reason. In cases where you can't easily fire employees (e.g. unions), you encounter the very problem you're describing, and it often leads to companies preferring more consistent automations.

•

andy99 2 days ago

It’s supervised learning rather than RL, you’re just training to labels. It doesn’t work (doesn’t generalize) because there is no guarantee or even expectation that any causal relationship is learned, it’s just whatever convenient pattern gets the lowest loss. There is lots of research on this for those unaware.

•

eithed 2 days ago

Yes and no.

If I were to ask you - what convention you want to follow for your database columns - camelcase or snakecase? There's no correct global answer. There's no overarching truth that should apply to all databases in existence (even if you'll focus on a certain type of database). Hence the no.

But yes, because in the context of existing system there is a convention. If it's snakecase, you create new tables with snakecase column names.

LLMs will generally follow conventions, but sometimes they will not, because indeed - global truths (or at least, the "last article it read" truths) sometimes win over (I assume).
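
A local convention like that is one of the few parts that can be pinned down mechanically. A minimal sketch, assuming the column names have already been pulled from the schema (how is not shown) and that the project's convention is snake_case; `non_snake_case` is a made-up helper name.

```python
import re

# Lowercase words separated by single underscores, e.g. "order_total".
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def non_snake_case(columns):
    """Return the column names that break the snake_case convention."""
    return [name for name in columns if not SNAKE_CASE.fullmatch(name)]
```

Run against new migrations, it catches the case where the model's "global truth" wins over the existing system.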

•

al_borland 2 days ago

Wouldn't this style of training suffer from the AI learning things the user didn't intend? I may thumbs down something for a specific detail I don't like, while other things in it are great. Certain traits that tend to occur together go along for the ride. We see similar things happen in natural selection, where mates may be chosen for 1 specific feature, and other less desirable things come along for the ride.

Outside of AI, I run into this issue when taking basic personality tests. A question may be written for a specific reason, which influences the results, but the reason for my answer may be completely unrelated to the reason intended by the person who made the test.

•

paytonjjones 2 days ago

This can usually be solved by scale alone (in all three contexts: RL, evolution, and IRT / psychometric testing)

The co-occurrence thing is often not a bug of the algorithm but a genuine part of the stochastic landscape that must be solved. Evolution isn't "failing" when sickle cell vulnerability is ported along with malaria resistance; it's just a real tradeoff being made in the current biological landscape.

•

xboxnolifes 2 days ago

This problem predates AI. If we could externalize such a fickle thing such as good taste it wouldn't be such a valuable skill. And God have people tried. Golden ratios, style guides, naming rules, linters, formatters, templates, margin ratios, color palettes, and on and on and on. And yet here we are.

We can quantize some of the basics, and make a not half bad style guide, but we'll never be able to fully actualize a set of rules to match what humans find generally tasteful. It's too contextual and a moving target.

•

tripzilch 2 days ago

> You can't unit test for taste if you haven't written down what you mean by taste. If you can externalize it, then you can.

> Follow this line of thinking, and the AI-friendly answer is easy: we just have to externalize everything we know, so Claude can implement what I want.

First you have to have it, and if you think this is a tasteful solution, then you didn't.

•

lukan 2 days ago

" I'm trying to figure out if there's something to be done about this."

Yes, it is called accepting the concept of "good enough".

If you go for perfection, with the help of AI or not - you will never be done, at least not if your concept of perfect is like mine.

And more concretely here, well you can feed the LLM with enough context about you, so it can better guess what you want. And in some years maybe use a brain computer interface. But I doubt there is a magic bullet here. Just better tools, that we can build. But they won't be perfect either (hard for me to write that, as I set out building the perfect tools).

•

eithed 2 days ago

I agree and indeed externalize everything you know *that matters*.

Want to follow a certain pattern or convention? Define it (e.g. active record vs repository pattern) and stick it in an ADR! You don't know what you want? Look at what Claude produces, acquire a taste, and mark it as the convention that future sessions will follow, but stick to *one* convention!

Treat your LLMs as junior developers willing to apply various patterns willy nilly, caring only about fulfilling the ACs of given task and not about the longevity or well being of the system in general. They will not look at bigger picture to check if given pattern applies globally, or even if there are any other patterns.

•

joshka 2 days ago

Pattern language sites / books have existed for years.

The right approach is more to work out what the shared patterns are, make sure a bunch of reasonable ones are post-trained into the models so that it's easy to refer to them by name (e.g. "tim pope / chris beams style commit messages", or "make invalid state unrepresentable"), and then you're in a world where you can define your personal taste through labels rather than repetition of the core arguments.

•

RossBencina 2 days ago

But pattern languages don't encode taste they encode known working solutions. Making invalid state unrepresentable is not a matter of taste it's a best practice.

•

vinay_ys 2 days ago

You can externalize the things you consider taste by writing down generalized statements, but those statements also need their boundary conditions and exceptions specified. Except exceptions have exceptions, and when to apply the rule vs. when to use the exception is a contextual judgement. So whatever residual cannot be explicitly, unambiguously, and generally spelled out is what we call taste/judgement.

•

punnerud 3 days ago

If you have enough examples you can train an AI on your preferences, then use that distilled AI as a unit test. Don’t combine multiple into one AI. If they don’t agree you want it to fail so you can decide and retrain the tests.
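
A sketch of the "fail when they disagree" part, with the judges as plain callables standing in for separately trained preference models (the training itself is not shown, and `unanimous_verdict` is a made-up name).

```python
def unanimous_verdict(sample, judges):
    """Return the shared verdict, or fail loudly if the judges split.

    `judges` is an iterable of callables that take the sample and return
    True (acceptable) or False (not acceptable).
    """
    votes = {judge(sample) for judge in judges}
    if len(votes) != 1:
        # A split vote (or no judges at all) is treated as a test failure,
        # so a person decides and the judges can be retrained.
        raise AssertionError("judges disagree; needs a human decision")
    return votes.pop()
```

Keeping the judges separate is what makes the disagreement visible; a single merged model would average it away.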

•

petra 2 days ago

Is there an issue of taste when generating images with AI ? or can we relatively rapidly train people to generate beautiful images with decent amount of variety ?

•

nemomarx 2 days ago

ai generated images and art still seem to look cheap or untasteful to a lot of viewers, so it can't be that easy to train people on fixing that.

•

sigbottle 3 days ago

Exactly. Every single philosophical statement in history runs up against the issue where you can just say, "yeah, it's pretty much this. You just need to do <arbitrarily hard unspecified thing that is basically unfalsifiability>". (Including this one)

And maybe that's just our limits with philosophy, modeling, assumptions, whatever. The danger is not realizing when we're in that zone.

(Fwiw I think unfalsifiability is a limit with any system - "you didn't compile in my syntax/semantics" is a gotcha that's actually valid and useful, but nobody can really determine the hard line)

•

deadbabe 2 days ago

You cannot externalize taste. You could perhaps mimic someone’s taste, but that’s not the taste. Knowing the taste requires actually tasting it. You can’t capture the taste, it’s already gone.

•

cadamsdotcom 2 days ago

Emailed!

•

zamalek 2 days ago

Unrelated to code, but along the same lines. I've been keeping track of the Reckless Ben case to fuel my unhealthy indignation, and we just had a like-for-like comparison between a human and an LLM.

Human: well-scoped argument that does just enough to get the job done with minimal risk.

AI: Extremely clever and correct legal argument that almost any lawyer would have said not to file (at least as written). It tries to burn the world and seriously risks pissing off the judge.

https://www.youtube.com/watch?v=YRXJnKP6Tu0

•

jdlshore 2 days ago

Interesting video, thanks for sharing it.

•

Gosper 3 days ago

Language count is a decent notoriety signal though pretty coarse. The OP/author should take a look at QRank: https://qrank.toolforge.org/

> QRank is a ranking signal for Wikidata entities. It gets computed by aggregating page view statistics for Wikipedia, Wikitravel, Wikibooks, Wikispecies and other Wikimedia projects

from https://github.com/brawer/wikidata-qrank/blob/main/doc/desig...

•

kalli 2 days ago

OP here, that looks really neat, thanks for the link!

•

hei-lima 2 days ago

Cool! Thanks for sharing.

•

ChrisMarshallNY 3 days ago

> but it ended up merely in a supporting role

This has been my experience, as well, but it’s a really big support. It just needs adult supervision. I can’t understand how vibe-coded apps, actually work.

As far as “taste,” goes, I test my stuff constantly, checking for even minor “friction points,” sometimes, refactoring back to design, in order to resolve issues that many folks would ship. I’m pretty anal, and want my work to be the best experience possible.

I can’t see any LLM coming close to being able to evaluate the user experience, like I can.

•

hombre_fatal 2 days ago

> I can’t understand how vibe-coded apps, actually work.

With a better process. e.g. plan->revision cycles, better instructions/docs like an ADR system.

I don't think vibe-coding is relegated to "build me reddit but with blockchain" and then it's done.

I think it instead describes the workflow where the software impl stays opaque but you evaluate the end product as an end user to step the product forward. It basically centers you as the tastemaker.

I'd say I vibe-code all of my personal projects now since December where AI had a breakthrough where it required less babysitting and developed good "taste" like smart sum types without being prompted to do so.

I've accumulated my own best practices like a heavy plan->revise cycle where plans ultimately promote into ./plans/impl/YYYY-MM-DD-{slug}.md, and an ADR system in ./docs/design/*.md that encodes arch/design invariants that accumulate over time, and new decisions/principles are folded back into it as they are discovered (by the AI).
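
For illustration, the naming scheme could be generated like this. The slug rule (lowercase, runs of non-alphanumerics collapsed to single hyphens) is an assumption, as is the helper name `plan_path`; only the `plans/impl/YYYY-MM-DD-{slug}.md` shape comes from the description above.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def plan_path(title, day):
    """Build plans/impl/YYYY-MM-DD-{slug}.md from a plan title and a date."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return PurePosixPath("plans/impl") / f"{day.isoformat()}-{slug}.md"
```

For example, `plan_path("Add OAuth login!", date(2025, 1, 2))` gives `plans/impl/2025-01-02-add-oauth-login.md`.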

During the plan revision cycles, the LLMs may ask me a multiple choice question about which decision branch to take, and lately I've just been responding with "take the ideal option" with good results -- either way it will take a well-reasoned position that I can't really argue with.

Meanwhile, my role is mainly to evaluate the end product and steer it directionally. How much I decide to prescribe and inject myself into technical decisions is a function of how serious the project is, but it's easy to notice that LLMs are simply better and better at arriving at well-reasoned decisions, and my interjections are more and more limited to technical/directional taste rather than necessity.

•

ChrisMarshallNY 2 days ago

That sounds excellent.

I have not encountered anything like that, with my Swift (native iOS) apps, but am pretty close to it, with my backend PHP.

I suspect that it depends on the tech stack. So far, the Swift output closely resembles that of a very inexperienced, but smart, engineer; One that has read up on all the "tips and advanced tricks" you can do with Swift, but has never shipped anything substantial. I need to really keep a close eye on it.

•

paytonjjones 3 days ago

Tools like Playwright and Maestro can already give you a small taste of what that would look like.

But overall I agree, LLMs are currently awful at being beta testers. They miss the most basic stuff that any human would immediately catch as being poor UX, and for all their visual prowess they are terrible at auditing UI.

•

pjmlp 3 days ago

Exactly one of the reasons I never went down with all the TDD dogma of only writing code to fix broken tests.

There is a reason conference talks are always about plain algorithms and data structures.

•

bob1029 3 days ago

The biggest flaw I've seen with TDD is the fact that correctness does not compose upward. Every time two units come into contact, you've got an entirely new kind of unit. The tests from constituents do not cover emergent properties of the new things. You will repeat this same exercise the entire way up to the top, and the moment you come into contact with the customer (they want to change everything), the house of cards comes crumbling down and you have to start your agonizingly-slow process all over from the bottom again.

The only thing that the business seems to care about is top-down UI testing. This is also convenient because you can leave it until the very end after the customer has already seen several prototypes.

I do think TDD makes sense in isolated scopes (prove this specific custom parser works at the edges), but as the general policy for the entire product it's definitely not a viable practice. Much of the time it comes off as an ego trip to see just how cleverly we can mock something so that we can say we technically tested it.

•

bluGill 2 days ago

I tell people you should be testing at the level where a change would be so hard you wouldn't do it anyway. Internal helper functions - they are tested only because the code that calls them passes. Interfaces that are used thousands of places - you better test them well because you wouldn't dare change that anyway: it would break too many others.

Or to put it differently: a test is an assertion that no matter what, for all time this should never change again. Even if customer requirements change in the future they won't change in such a way as to break this test (this isn't always true, but you should believe it is true).

A test is most valuable when it alerts you to a real problem when it fails. If the test fails but there isn't a real problem (either because customer requirements have changed, or it is flaky) it was needless cost to investigate it. If the test passes that gives some hope of correctness, but you can never be sure it is really correct vs a bug in the test (even if you use TDD and so the test failed when you wrote it that doesn't mean a refactoring since didn't make this an always pass test).

Part of the problem is if I tell you to write sort() or your new toy language's list type you have an intuitive idea of what it should look like and probably will get them right the first time (other than bugs you want the tests so you catch). These should have tiny micro tests. These things also are really easy to use as examples of how to do TDD - which they are, but they are not representative: this type of code is generally in your standard library already and you are not writing it.

Instead you are writing code that isn't well defined with lots of industry experience. It is not clear what the exact interface should be (or more likely it is clear customer requirements will change but you don't know how yet). You have no idea what the best implementation is. You don't know if this will be used in this one place, or if it will become a useful key part that many future projects depend on. You have to make guesses.

•

MoreQARespect 2 days ago

That is a flaw with unit tests written at far too low a level, not with TDD.

You would have the same problem if you wrote tests like that after the code.

TDD has no opinion about the level at which you wrote your test, it just assumes it's the correct one.

This is the number one biggest misconception about TDD which I keep seeing repeated on hacker news.

https://news.ycombinator.com/item?id=46810793

https://news.ycombinator.com/item?id=45113016

•

pjmlp 2 days ago

TDD for UI effects?

•

MoreQARespect 2 days ago

snapshot test driven development again. i already wrote a similar answer in response to your other comment.

it follows the definition of TDD and it works really well (with some caveats) but again some people get hung up on what their impression of TDD is (e.g. unit tests checking to see if a car object has a steering wheel or whatever...) rather than what it actually is and what about it is that actually works.

•

pjmlp 2 days ago

How does snapshot do "feels right" from designer point of view?

•

MoreQARespect 2 days ago

Um, show the snapshot to a designer? When it feels right, lock in the snapshot ("green") and then move on to refactor.

Or, probably more likely a group of snapshots.
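
A sketch of that lock-in step, assuming each snapshot is available as raw bytes (a rendered image or HTML dump) and approvals live in a plain dict; a real setup would persist them to disk. The names `check_snapshot` and `approve` are illustrative, not hitchstory's API.

```python
import hashlib


def fingerprint(rendered):
    """Stable digest of a rendered artifact given as bytes."""
    return hashlib.sha256(rendered).hexdigest()


def check_snapshot(name, rendered, approved):
    """Return 'new', 'match', or 'changed' for a rendered artifact."""
    if name not in approved:
        return "new"  # nobody has signed off yet: show it to a designer
    return "match" if approved[name] == fingerprint(rendered) else "changed"


def approve(name, rendered, approved):
    """Record the designer's sign-off so later runs compare against it."""
    approved[name] = fingerprint(rendered)
```

The "feels right" judgement stays with the designer; the test only guards against it silently changing afterwards.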

•

saghm 2 days ago

I feel this, especially with the crazy lengths people go to mock things sometimes. A couple years back I was having a discussion with a friend/former coworker about testing (I was griping about unnecessary mocks I had to deal with for something at a job causing unnecessary extra work), and he asked how I would approach trying to get full unit test coverage instead. I was taken aback and said that I wouldn't try to get literally everything covered by unit tests in the first place. Most of the teams I've worked on have had the approach that test coverage is good, but it isn't necessarily going to be 100% even when considering all tests; I can't even imagine trying to get 100% coverage for unit tests alone being anywhere close to worth the extra effort, let alone the contortions that the code would need to take to support it.

•

simoncion 2 days ago

Yeah.

Some TDD-obsessed companies will write tests in a way that requires you to spend a half hour understanding the web of mocks in order to update the tests to account for even a minor datastructure change. Coincidentally, your code change would cause those same tests to fail if they weren't mocked out, but they all pass until you make your changes to the mocks. This shreds the "if the tests pass, the change is probably correct" confidence that's most of the reason for having automated tests.

I am not a fan of this style of test writing.

•

pjmlp 3 days ago

Exactly, the whole system thinking and large scale architecture also falls apart when writing everything from little working tests.

•

valvar 2 days ago

TDD is perfect for bugs; codify a replication first, then fix it.
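
A minimal sketch of that workflow, assuming a hypothetical `parse_price` function that used to crash on thousands separators. The function, the bug, and the test name are all made up for illustration: the test is written first as the replication, and the one-line fix comes second.

```python
def parse_price(text: str) -> float:
    """Hypothetical function under repair: parse strings such as '1,299.50'."""
    # The buggy version was `return float(text)`, which raised ValueError
    # on thousands separators. The fix strips them before converting.
    return float(text.replace(",", ""))


def test_parse_price_handles_thousands_separator():
    # Written before the fix as a replication of the bug report.
    # It failed ("red") against the old code and passes ("green") now.
    assert parse_price("1,299.50") == 1299.50
```

The test then stays in the suite as a regression guard.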

•

pjmlp 2 days ago

Example for HLSL graphical glitch?

•

MoreQARespect 2 days ago

https://hitchdev.com/hitchstory/approach/snapshot-test-drive...

set up a rendering profile and preconditions that generate a minimal snippet of images/video using a predefined GPU profile.

then test for either a pixel perfect reproduction of the correct behaviour or for the properties you're looking for (if it doesnt reproduce deterministically).

this is one way. i also subscribe to the view that if the type system is modified to become stricter in such a way that it can fail reliably in the presence of this type of bug that this is also good enough.

some people might argue that these arent "strictly" TDD by some definition but they set out a path to follow red green refactor and confer identical benefits so my view is who gives a duck?

I don't have enough domain expertise to know which variant of these approaches is best but I'm enough of a TDD expert to know that what you're implying isn't possible is actually something you would probably derive a lot of value from if you did it.
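
A rough sketch of the two comparison modes described above (pixel-perfect, or a looser property when rendering is not deterministic). It assumes frames are already decoded into nested lists of RGB tuples; capturing them from a GPU is not shown, and the function names and the tolerance value are illustrative only.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels


def pixel_perfect(actual: Image, approved: Image) -> bool:
    """Strict mode: every pixel must equal the approved snapshot."""
    return actual == approved


def mean_abs_diff(actual: Image, approved: Image) -> float:
    """Average per-channel absolute difference between two same-sized images."""
    if len(actual) != len(approved) or any(
        len(a) != len(b) for a, b in zip(actual, approved)
    ):
        raise ValueError("image dimensions differ")
    total = 0
    count = 0
    for row_a, row_b in zip(actual, approved):
        for pixel_a, pixel_b in zip(row_a, row_b):
            for chan_a, chan_b in zip(pixel_a, pixel_b):
                total += abs(chan_a - chan_b)
                count += 1
    return total / count if count else 0.0


def within_tolerance(actual: Image, approved: Image, max_mean_diff: float = 2.0) -> bool:
    """Loose mode for non-deterministic output: small average drift is accepted."""
    return mean_abs_diff(actual, approved) <= max_mean_diff
```

The approved snapshot is whatever was last signed off. A failing comparison is the "red", and re-approving a new snapshot is the "green".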

•

pjmlp 2 days ago

Now do that interactive with feed back from design team and user testing.

•

MoreQARespect 2 days ago

Iterate on the design til the snapshots look the way the design team wants.

That's just an extended red where you get feedback from elsewhere.

•

e12e 2 days ago

> TDD dogma of only writing code to fix broken tests.

Isn't red-green-refactor pretty ingrained in TDD?

Only write code to make a failing test pass; then refactor while making sure the tests still pass?

Then write a test that fails, repeat?

•

pjmlp 2 days ago

Now do a games engine with that approach regarding shaders and the desired visuals.

•

e12e 2 days ago

I'm not familiar with shaders and game engines, so I'm not sure what you are saying.

I had a quick look at godot tests, and seems to me they cover some parts of the shaders?

Anyway, I was more wondering who/how people are dogmatic about TDD, and manage to leave out one out of three core concepts from red/green/refactor ?

•

pjmlp 2 days ago

If I cannot write shader code without broken tests, there is a bunch of yak shaving to make testing possible in first place, and that only covers a small subset of graphics pipeline features.

It also takes zero consideration for the interactive nature of games/graphics development.

•

e12e 2 days ago

I can't immediately think of what a useful shader test would look like (beyond perhaps, shader doesn't crash program) - if this is something worth discussing, it would probably be useful to see some real world shader code; perhaps especially two versions of the same shader as it evolves.

I don't generally test css code to check that a background is now indeed set to "a more mauvey shade of pinky-russet" after a change - but I might want to.

I might at least want to run a test with browser automation to check that any text is readable on the background.

I could at least find an example of looking at the rendered page for text (as opposed to in the DOM); Google AI had some ideas of how to check the contrast in a screenshot - but no idea if that would actually work as written.

https://medium.com/@dzianisv/vibe-engineering-testing-browse...

https://share.google/aimode/mW8ClhqGppfpovRrE
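
For the readability check, one option is the WCAG 2.x contrast ratio, which only needs a text colour and a background colour. A minimal sketch, assuming those two 8-bit sRGB colours have already been sampled from the rendered page (the sampling step is not shown) and using the 4.5:1 AA threshold for normal-size text. The function names are illustrative.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(color: RGB) -> float:
    r, g, b = (_linearize(ch) for ch in color)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: RGB, background: RGB) -> float:
    """Ratio from 1.0 (identical colours) up to about 21.0 (black on white)."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa_normal_text(foreground: RGB, background: RGB) -> bool:
    """WCAG AA for normal-size text requires a ratio of at least 4.5."""
    return contrast_ratio(foreground, background) >= 4.5
```

This says nothing about whether the shade is the right mauve, only that text on it stays legible.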

•

MoreQARespect 2 days ago

yes, there is some yak shaving necessary to make writing tests possible.

There is often a tension between delivering fast and delivering high quality/bug free, and what is necessary for medical software or financial calculations might not be necessary for games.

The question of whether to write tests at all is not really about TDD though.

•

zuzululu 2 days ago

yup and I find it weird that people still remain so defensive of the Church of TDD even against empirical studies that show its limited benefits

https://arxiv.org/abs/2602.07900

•

MoreQARespect 2 days ago

This study seems to be mainly about the value of vibe coded tests and not at all about TDD.

Perhaps unsurprisingly it found that vibe coded tests suck. As a card carrying member of the "church of TDD" (I do think it is practical), this is an empirical result I certainly would agree with.

•

wolvesechoes 2 days ago

Many such churches.

Substitute static typing for TDD in your comment, and it will remain an equally valid statement.

•

mcapodici 2 days ago

Static typing is very useful and time saving. I can rename something, knowing my IDE can propagate the change. I can call a function and know what it expects, not just "a thing, hope they added a comment so I know what type/shape won't blow it up!".

Here I am talking about the basic static typing, and maybe some generics use occasionally, but obviously people also go overboard sometimes with type features and that hinders understanding for newcomers to the codebase.

•

topaz0 2 days ago

Appreciating that there are some benefits is different from adhering to the church.

•

wolvesechoes 2 days ago

It has benefits, yes. And TDD also has some benefits. In both cases, these benefits are limited, and there are costs associated. And in both cases we have empirical evidence showing that neither is a panacea for the problems church members claim they solve.

•

mcapodici 2 days ago

I don't feel like static typing (the kind you use in Go, C#, Java etc) is a church, it is just a tool. I mean I always want to use it, because it is useful. So that might be like being in the church of say adding an electric motor to my bicycle? I can't show you a study or hard evidence saying having the electric motor is "better".

I mean there are people who go nuts with very complicated types/type systems so there is that, and then you have very complicated programs, maybe that is what you mean?

Using static typing all the time is just using the tools. Using TDD for everything feels a bit suboptimal to me and so needs some obsession to do that. It only becomes a church then if they keep pestering everyone else to do it.

•

timroman 3 days ago

https://pureinference.com/insights/taste-is-the-new-skill

I wrote about this a few months back. Rick Rubin is famous for this. I do think it is something that can be trained though, it just needs a lot more context. Taste builds over time through lots of unit tests, through lots of content writing, through an accumulation of product decisions. It’s hard to put it in the individual spec, but it can be teased out of 100 project specs. And when you get to that scale the AI starts to do it pretty well.

•

sesm 3 days ago

> Rick Rubin told Anderson Cooper he has no technical ability. Doesn't play instruments. Can't work a mixing board.

If you watch his interview on Rick Beato's channel, this myth will fall apart. He plays guitar, had his own punk rock band and his guitar playing is featured on some high-profile records he produced. Also, he has a lot of practical experience with all kinds of studio equipment.

•

timroman 3 days ago

That’s exactly it. His taste isn’t in any one thing. It’s the esoteric and accumulated from a variety of things. You can’t package it up. That’s the point on the project specs. I can never get it right in one, but the arc over 100 becomes visible. Especially to an LLM that has the capacity to intake and understand that.

•

pixl97 2 days ago

>You can’t package it up.

Well, you can package it up, otherwise Rick wouldn't exist.

•

themgt 3 days ago

This is exactly it - the ultimate skill now is to be Rick Rubin with an LLM. Not a comfortable transition as a coder.

•

cadamsdotcom 2 days ago

You can’t unit test for all the aspects that make up taste, it’s true.

But if you break off parts of that - eg. by looking at what is codified out there as “good” design, what’s considered best practice etc - you can create tools the agent can call on that let it get critiques of its own work.

What’s really cool about this is those tools can be code, written by agents and committed to your repo. Put together a script that for example makes sure your brand colors are enforced (eg. https://github.com/cadamsdotcom/CodeLeash/blob/main/scripts/...) and then put it in your pre-commit checks (https://github.com/cadamsdotcom/CodeLeash/blob/main/.pre-com...), and the agent will get feedback on its use of tasteless defaults and adjust accordingly (partly because you blocked commits that contain said tasteless defaults!)
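
The linked paths are truncated here, so the following is not that repo's script, just a sketch of the general idea: scan a stylesheet for hex colours and report any that fall outside an allowed palette. The palette values and function names are hypothetical, and colours with an alpha channel (8-digit hex) are deliberately ignored.

```python
import re
from typing import List, Set, Tuple

# Matches #rgb and #rrggbb; 8-digit (alpha) colours are not matched.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")

# Hypothetical brand palette, lower-case six-digit form.
BRAND_PALETTE: Set[str] = {"#1a1a2e", "#e94560", "#ffffff"}


def normalize(color: str) -> str:
    """Lower-case the colour and expand #rgb shorthand to #rrggbb."""
    color = color.lower()
    if len(color) == 4:
        color = "#" + "".join(ch * 2 for ch in color[1:])
    return color


def find_off_brand_colors(
    source: str, allowed: Set[str] = BRAND_PALETTE
) -> List[Tuple[int, str]]:
    """Return (line number, colour) for every hex colour not in the palette."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOR.finditer(line):
            color = normalize(match.group(0))
            if color not in allowed:
                violations.append((lineno, color))
    return violations
```

Wired into a pre-commit hook, a non-empty result would fail the commit and hand the agent a concrete list of lines to fix.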

•

chantepierre 3 days ago

It makes me smile when runners use "X is a marathon, not a sprint" to hint at an effort that accumulates over time and an optimal use of energy.

I do it too because it's a common expression, and a marathon is of course longer than a sprint, but both have in common that properly raced, they are absolutely brutal efforts that leave you without a single additional drop at the end. The effort length and instantaneous power output changes, of course. Maybe "it's a marathon build, not the race" would be more precise at the loss of nearly all its expressive power (but with a lot more pedanticism points) :-p .

Nice project !

•

another-dave 3 days ago

"The effort length and instantaneous power output changes, of course."

but that's what the phrase is meant to convey, right?

Don't run through consumable X (energy/money/etc) like there's no tomorrow - even though there's <some big important milestone> now, we've got dozens more of those that we need to meet, so you're better off getting this one done at 75% than committing 100% to it and failing on all the others.

•

boredumb 3 days ago

Don't work 12 hour days to get milestone X out, because there are dozens more milestones so don't get burnt on trying to get this one out yesterday. It would probably be more like, don't use 200% to get this out and then quit or burn yourself to 0% or a few % in a year when we want you to extend and maintain this stuff.

•

chantepierre 3 days ago

Yeah you're right, I hear it more like "this is a week long hike, not a sprint" as if a marathon included rest. In any length of racing there's no tomorrow. But I'm doing tongue-in-cheek pedanticness here and will stop that right now !

•

dasil003 3 days ago

I'd wager that if a manager says that, they want you to take it more like a real marathon and less like a long hike.

•

jayd16 3 days ago

In a marathon, not sprinting is the rest.

•

chantepierre 2 days ago

There is no rest. There is just (properly done) a continuous output from start to finish, or a very slight increase of output (negative splitting), but effort to maintain it feels exponential. In terms of feeling, it’s a 32km « dynamic run » where you should feel good, then the hardest 10k you can pull off just after that. If paced properly there should not be a « wall » but at all levels you pass people walking who disintegrated around 30km. Even people with sub-elite/elite bibs sometimes explode.

A half is more intense but way easier, you’re just sub threshold but for a time short enough that you cannot really not make it.

•

Jensson 2 days ago

No, it's not a rest; marathon runners are still exhausted at the end, they can't go and run another marathon right after.

•

TimXare 3 days ago

Taste is mostly the part of the spec you forgot to write down, plus the part you couldn't write down even if you tried.

•

zdmgg 2 days ago

You might want to consider looking at Wikipedia's internal article quality assessments (these include Featured Articles, Good Articles, B-class, C-class, Start, and Stub). I use these as a rule of thumb for how popular the topic is, it's a solid proxy for both significance and the richness of the available content.

•

jt2190 2 days ago

> Overall the evaluation of success was one of the most challenging parts of the project. As a developer, I’m used to building features that either work or don’t and there is often an objective way to measure how well a feature performs. For messy real world data it was hard to evaluate how good or bad the pipeline was. Furthermore, it was easy to start optimising for a specific parameter or route and find later that this work led to severe degradations in other areas.

> Verification becomes hard to reason about because there is no ground truth for points of interest, there are no red/green unit tests for taste. I’m sure these are familiar challenges to data scientists and that there are frameworks and evals for working on them. This will require more iteration and manual overrides. Hopefully with feedback and collaboration from the community. But for now I’ve shipped V1…

I suspect LLMs may be able to help us quantify our taste because they can keep track of so many data points all at once, where we have to lossily abstract these details away.

•

sgarland 2 days ago

I’m confused about the choice for Parquet and DuckDB here. PostGIS is arguably a better match for what this project is doing, and would let you skip most of if not all uses of Shapely and Pyproj.

•

fotoblur 2 days ago

No but you can add selection as part of your workflow. Governance is something AI agents have allowed me to focus on more and more and this IMHO is where taste lands for me: https://github.com/lramoth/infoPipeline/blob/main/governance...

•

layer8 2 days ago

You can’t even unit-test for correct program logic, unless you’re able to enumerate all possible inputs and states within a short time frame.

•

bluGill 2 days ago

You can get close enough by testing only the known edge cases. If you need more, mathematical proofs can give it, but they are much harder.

•

brap 2 days ago

The thing I struggle with wrt taste is that LLMs just don’t get it.

Even if I write down every single thing it did wrong and how I’d do it, and even if I turn those into rules, it will know how to follow these specific rules, but for some reason it can’t seem to generalize beyond that. And the real list of rules seems truly infinite.

•

ahmedehab_01 2 days ago

I think, when using LLMs, learning to accept some mediocrity sometimes is a necessity. It will never even have "acceptable" taste, let alone yours.

•

HoldOnAMinute 2 days ago

I am quite confident I could take a series of photos of various designs and classify them as "tacky" or not, and train a neural network to recognize tackiness.

•

a_c 3 days ago

I like to think of testing as making sure things are not wrong, rather than making them right.

Working, useful, delightful, in that order. Testing can make things more likely to work, that's it.

•

yiyingzhang 2 days ago

Isn't this true since the beginning of software development? AI hasn't changed that yet

•

dirkc 2 days ago

> So with my friend Claude I set about building

After this line all the references become *we*. I can't help but be a little disturbed by that

> To begin with we downloaded ... For instance we excluded ... We also selected ... We used this as a notoriety ... <and many more>

I am increasingly concerned about how LLMs are being anthropomorphized and how that affects our judgement.

•

kalli 2 days ago

This was a conscious choice, addressed in a footnote in the blog post:

> This is my first time writing up a project that I worked on using an AI agent. I kept writing “we” because the project felt like a collaboration.[...] On reading it back, saying we feels like an accountability dodge, because of course I’m fully and solely responsible for any errors in this write-up or code. But just using I/me also feels dishonest, because so much of the implementation here isn’t fully mine so I feel like I’m taking too much credit for my collaboration with the machines. I figure this is a new kind of pronouns debate we’ll be having for the foreseeable future.

I think it is an interesting topic.

•

dirkc 2 days ago

Thanks for pointing out the footnote, I did not get that far. And like you say, I agree it's interesting.

The footnote however does reinforce my concern - in what other ways do we alter our behavior when it feels like we're interacting with another human?

•

kalli 2 days ago

That's fair, we can disagree. I don't think I'm personally anthropomorphising llms (I think my mental model of how they work is rough but fairly accurate), but at a population level it might be something to be concerned about (see all the ai-psychosis talk)

What I was getting at with the "we" in the post is more how we talk and think about work like this. I think it is different in kind to previous projects I've done where I relied on google, stack overflow and elbow grease. Programming has always been "standing on the shoulders of giants" kind of work, but doing it with agents feels different from that. Maybe it was a poor stylistic choice, but I think we need a way to talk about it in an honest way.

•

AnthonBerg 18 hours ago

There's the very nice precedent of using "we" in academic literature. You wrote that piece well. (You singular wrote we plural?, haha)

•

GreenJacketBoy 2 days ago

I may be the only one feeling this way, but the repetitive mention of Claude – worded as if it was a coworker ("we", "me and my friend") to the point that somebody reading it just 3 years ago would reasonably assume this "Claude" was in fact a human – made it hard to read. How much am I reading a behind the scenes of the "making of" of the application VS an essay on what somebody else (Claude) did ? I don't know. The reason I browse this website is to see what other humans are saying, inventing, using. But in some cases like this one, I see the line between tool and co-author being blurred for LLMs. And unless what they did is a specifically impressive thing on its own, I do not want to know what an LLM did. (Don't get me wrong, I would much rather have this than people lying, but I would also much rather people treat LLMs as tools.)

•

esafak 3 days ago

We can encode taste -- generative AI depends on it. Ask people to compare two examples and pick the one with better taste. You can even ask them to rate multiple subjective criteria at once. Use that to learn a scoring function based on the rating labels, and raw features. Now you can write tests.
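
A small sketch of the pairwise-comparison step, using a Bradley-Terry model fitted with the standard iterative (minorize-maximize) update. It is a simplification of what is described above: it learns one score per example from "which is better?" votes and does not use raw features, so it cannot score unseen examples. The example labels are made up.

```python
from typing import Dict, List, Tuple


def bradley_terry(
    comparisons: List[Tuple[str, str]], iterations: int = 200
) -> Dict[str, float]:
    """Fit per-item strengths from (winner, loser) pairs; scores sum to 1."""
    items = sorted({name for pair in comparisons for name in pair})
    wins = {name: 0 for name in items}
    pair_counts: Dict[frozenset, int] = {}
    for winner, loser in comparisons:
        wins[winner] += 1
        key = frozenset((winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    scores = {name: 1.0 for name in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                n_ij = pair_counts.get(frozenset((i, j)), 0) if j != i else 0
                if n_ij:  # only pairs that were actually compared
                    denom += n_ij / (scores[i] + scores[j])
            updated[i] = wins[i] / denom if denom else scores[i]
        total = sum(updated.values())
        scores = {name: value / total for name, value in updated.items()}
    return scores


# Hypothetical votes: each tuple is (preferred design, rejected design).
votes = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "B")]
ranking = bradley_terry(votes)
```

A test can then assert, for instance, that a new candidate does not rank below an agreed baseline once raters have compared the two.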

•

thomasfl 3 days ago

That's what linters are for. Linters can prevent SQL code from spilling out to code outside the model layer. Even more important when vibecoding.
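
As a rough illustration of such a rule (not any particular linter's API), the sketch below walks a Python file's syntax tree and flags string literals that look like SQL statements. Deciding which files count as "outside the model layer" is left to the caller, and the regex is a deliberately crude heuristic.

```python
import ast
import re
from typing import List

# Crude heuristic for "this string is a SQL statement".
SQL_PATTERN = re.compile(
    r"^\s*(SELECT\s.+\sFROM\s|INSERT\s+INTO\s|UPDATE\s+\w+\s+SET\s|DELETE\s+FROM\s)",
    re.IGNORECASE | re.DOTALL,
)


def find_raw_sql(source: str) -> List[int]:
    """Return sorted line numbers of string literals that look like SQL."""
    tree = ast.parse(source)
    hits = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if SQL_PATTERN.match(node.value):
                hits.add(node.lineno)
    return sorted(hits)
```

Running it over every file that is not under the model package and failing the build on a non-empty result gives the agent the same feedback a human reviewer would.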

•

jpadkins 3 days ago

I think another important question is can you distill taste? (another comment uses the phrase "externalize", which might mean something similar).

I think people have been trying for the written word, with some degree of success (anti-slop skills). I have been trying for visuals, and it's pretty meh. It's easy to get a multimodal LLM to follow a style guide, but a style guide doesn't capture everything that accounts for taste. And anything that is dynamic (not a screenshot test) seems really hard or really expensive.

•

kimjune01 2 days ago

this is known as the oracle problem

•

tuo-lei 3 days ago

the taste part for me is cutting what the agent generated. 200 lines come back, i keep 80, no test for which 80.

•

carra 3 days ago

So now we need a framework for unit tastes

•

dionian 2 days ago

Great article, easy to read, and not ai slop! thanks for sharing

•

gafferongames 2 days ago

Nonsense. I unit taste all the time, it's called taste driven development (TDD) for a reason.

•

m0nacle 2 days ago

the whole “taste” thing is so trite lmao

•

throw93949444 3 days ago

> For example, my native Iceland had a nice mix of nature, historical sites and populated places.

You absolutely can unit test for taste, just put an agent into a loop, and write into the prompt what you like. Then do scoring...

Iceland is a really bad example, it basically has one populated site (the capital) and a circular road that goes around the island.

•

voidUpdate 3 days ago

I'm pretty sure there's more points of interest in the entirety of Iceland than just Reykjavík and Route Number One