Seeing, as informed by computer vision
What we learn about sight from understanding how computers see
Dear readers,
This past week I’ve been thinking about what it means to see, and about what teaching computers to see teaches us about seeing.
Let me start with a few questions. How do you make sense of what you see? What mental notes do you make in your head? What categories? How do you make that split-second judgment that something is worth looking at? How do you act upon those observations? And, perhaps more importantly, why is it that you take notice of the things you take notice of?
All these musings started on a Tuesday evening on the lower ground floor of a bar in Soho. I was there for an AI pitch night hosted by the startup community The London Network in partnership with The AI Fellowship, an AI builders programme.
With a glass of lemonade in my hand, I watched eight startups pitch their AI ventures to a crowd of 80 people, most of whom wore name tags marked with small round stickers in preassigned colours — red for founders, green for investors, and yellow for ecosystem operators.
It was the second AI pitching event I’d attended in London. The last one also had a system for categorising the audience by profession. (I was given a black lanyard at the previous event and was mistaken for an investor. The people I spoke with seemed disappointed when I told them I wasn’t.)
I’ve been trying to get a sense of what the AI scene looks like in the city, especially from the builder’s perspective. Who are the people building in the space? What ideas are being built? What’s the vibe of the builder community here?
For context, I used to work as a tech journalist in Indonesia. But the years I spent reporting on tech back home were the years of the pandemic (2020-2022). Most of my interaction with the tech community — founders, investors, and tech employees — happened online, mostly via Zoom.
I wanted something different for my next chapter writing about tech and the tech industry. I wanted experiential learning. The snowballing kind. The learning that comes with chance encounters and luck. This newsletter, digital field notes, is built with that in mind. It’s a way for me to document how I stumble my way into understanding what the London tech scene is about. One event, one conversation, one topic of interest at a time.
That evening, standing close to where the pitches took place, I couldn't help but notice the few startups building on top of computer vision technologies.
Computer vision is the field within artificial intelligence that teaches computers how to see. Essentially, it’s the project of equipping machines with algorithms that can identify, categorise and label the items that appear in images. For example, how does a computer recognise a cat as a cat? Is a cat a cat because it has two pointy ears, a round face, an upside-down triangle for a nose, and whiskers? But what if the cat is curled up like a croissant? Does the computer recognise it as a cat or a croissant? That inquiry falls within the field of computer vision.
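If you’re curious what that looks like in practice, here’s a minimal sketch of asking an off-the-shelf, pretrained model what it thinks is in a photo. The choice of library (PyTorch and torchvision) and the file name cat.jpg are my own assumptions for illustration, not anything the startups mentioned:

```python
# A minimal sketch of image classification with a pretrained model.
# Assumes PyTorch and torchvision are installed; "cat.jpg" is a placeholder path.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()            # resize, crop, normalise as the model expects
image = Image.open("cat.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)       # add a batch dimension

with torch.no_grad():
    logits = model(batch)
probabilities = logits.softmax(dim=1)
top_prob, top_class = probabilities.max(dim=1)

# Map the predicted index back to a human-readable label (e.g. "tabby")
label = weights.meta["categories"][top_class.item()]
print(f"{label}: {top_prob.item():.2%}")
```

Whether the curled-up cat comes back as “tabby” or something closer to “croissant” depends less on this code than on what the model was trained on, which is where the rest of this story goes.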
The eight founders took turns pitching their startups in front of a pull-down projector screen in a somewhat dimly-lit room with three judges — SFC Capital venture partner Jonathan White, Mercuri venture partner Isabelle O'Keeffe, and The AI Fellowship founder Zohaib Khan — tasked with asking questions right after the delivery of each pitch.
The first founder to pitch was Venu Tammabatula of Pulse AI, who is building a virtual ward for cancer patients and using computer vision to predict, prevent, and monitor cancer. Up next was Mohamed Binesmael of Advidan, who is developing video analytics that identify vehicles and analyse traffic to help shape smarter cities. Following him was Anas Achouri of DONAA, who is developing software to detect defects in 3D printing on manufacturing machines, reducing the costs incurred by errors.
Three computer vision startups in a row.
That’s interesting, I thought to myself. Building computer vision startups stands at odds with today’s trend of building generative AI or agentic AI startups (the agents usually being powered by generative models). While this is strictly anecdotal, the founders I’ve met at networking events like these are overwhelmingly building AI agents of some sort.
So, what’s the deal with computer vision?
I left the pitch night with notes.
And just like that, seemingly overnight, everything became computer vision.
Now, I see it everywhere.
It’s in my phone’s camera. It’s the way portrait mode identifies a person and separates them from their background. It’s in my gallery. The automatic grouping of photos by faces, location, and memories. It’s the camera on top of the self-checkout machines in grocery stores. The one alerting customers if an item isn’t scanned properly. It’s the autogates at airports. In the international airports in Bali and Jakarta, machines do the work of verifying documents and carrying out facial recognition. It’s the emojis that appear when I make certain gestures on WhatsApp video calls, which I discovered by accident (a peace sign for balloons, a heart sign for heart emojis, a thumbs up for the thumbs-up emoji, and a thumbs down for the thumbs-down emoji). It’s also probably in the software that processes footage from the cameras installed on the platforms of the Elizabeth line (I only just noticed those cameras last week).
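Take portrait mode as one example. I don’t know exactly what my phone runs under the hood, but the general idea is semantic segmentation: label every pixel, keep the person, blur the rest. Here’s a rough sketch with a generic pretrained model — the library, the model, and the file name are assumptions for illustration, not a claim about any phone’s actual pipeline:

```python
# A rough sketch of the "portrait mode" idea: separate a person from the background.
# Uses a generic pretrained segmentation model; "portrait.jpg" is a placeholder path.
import torch
from torchvision import models
from PIL import Image

weights = models.segmentation.DeepLabV3_ResNet50_Weights.DEFAULT
model = models.segmentation.deeplabv3_resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()
image = Image.open("portrait.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    output = model(batch)["out"]    # per-pixel class scores
person_class = weights.meta["categories"].index("person")
mask = output.argmax(dim=1) == person_class

# `mask` is now a boolean map: True where the model thinks a person is,
# False for the background it would blur.
```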
This brings me back to the question of what it means to see. How do machines make meaning of the things they see? How is it related to the way we, as humans, make meaning out of a scene? How do they differ?
Let’s go back to the pitch night for a second. If, in an unlikely scenario, the three startups were to offer the exact same solution, what would make one computer vision startup better than the other two? One of the judges asked a similar question that evening. The founder in question replied that his startup was able to recognise objects more accurately, and that this came down to having access to higher-quality, more varied data. In short, a better dataset means better training, which in turn means better machines.
It might sound intuitive today, but in the history of computer vision, the decision to establish a large training dataset was a critical juncture that transformed an interesting — albeit niche — field of computer science into what now powers technologies that are ubiquitous in our daily lives. Scale was the tipping point.
One of the groundbreaking projects in the field of computer vision was ImageNet, a visual database built for the study of object recognition. Fei-Fei Li, then an AI researcher at Princeton, started the project in 2007. At the time, researchers working in computer vision weren’t particularly focused on expanding their training datasets; they were mostly occupied with building models and algorithms. But Li realised this wasn’t going to be sufficient. Remember the croissant cat example? The reason human eyes can identify a cat despite its croissant-like shape is that our vision has been trained on real images: we can recognise cats in their many irregular poses. Li reasoned that it wasn’t enough to equip machines with algorithms that identify a cat by its universal features. A machine needed to learn by seeing a vast number of cat images.
“In hindsight, this idea of using big data to train computer algorithms may seem obvious now, but back in 2007, it was not so obvious,” Li said in her 2015 TED Talk.
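To make that concrete, here is a minimal sketch of what data-driven training looks like in code: no hand-written rules about pointy ears or whiskers, just a model nudging its weights to fit labelled examples. The folder layout (data/cat, data/not_cat) and the choice of library are assumptions for illustration, not anything from ImageNet itself:

```python
# A sketch of data-driven training: no hand-coded rules about ears or whiskers,
# just a model adjusting its weights from many labelled examples.
# Assumes images are organised as data/<class_name>/<image>.jpg (a made-up layout).
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.fc = nn.Linear(model.fc.in_features, 2)   # e.g. cat vs not-cat

dataset = datasets.ImageFolder("data", transform=weights.transforms())
loader = DataLoader(dataset, batch_size=32, shuffle=True)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:               # one pass over the labelled images
    optimiser.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimiser.step()

# With enough varied examples (curled-up cats included), the model generalises
# in a way that hand-written feature rules never quite managed.
```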
By 2009, ImageNet had categorised 15 million images into 22,000 categories. There were projects of a similar nature, but to put things into perspective, they were considerably smaller: the Caltech 101 dataset, created in 2003, has 9,000 images; PASCAL VOC, created between 2006 and 2012, has 30,000 images; LabelMe, created in 2007, has 37,000; SUN, created in 2010, has 131,000.
“There’s a new way of thinking visual intelligence, it’s deeply, deeply data-driven,” Li, now the Sequoia Capital professor of computer science at Stanford University, said in a 2024 lecture at the University of Washington.
ImageNet was a massive undertaking. The project outsourced the manual work of sorting images, one by one, through the gig marketplace Amazon Mechanical Turk, employing 50,000 workers across 167 countries. At one point, it was the biggest employer on the platform, according to Li. (There is an ongoing discourse about the exploitative nature of these data-labelling practices, and calls have been made to pursue a more ethical AI.)
Stepping beyond the world of research and into venture building, access to a large database was also the make-or-break factor for the UK’s first computer vision unicorn, Tractable, a startup that helps insurance companies assess damaged vehicles using AI. Tractable’s founders were only able to close their $8 million Series A round in 2017 after securing a data partner that could provide images of vehicles in various states of damage to train its AI model, according to Business Insider. The company reached its billion-dollar valuation milestone in 2021.
There’s something to be said about seeing here. The project of creating vision has taught us that seeing is meaningless without context. Sight is nothing without memory. To see is to simultaneously remember and memorise. That is to say, at the heart of seeing is the ability to understand that which is being perceived. What catches our attention and what is comprehensible to us, as computer vision has taught us, largely depends on our training data.
It’s as true for machines as it is for humans.
The week before my search history filled up with all things computer vision, I was spending time with my parents. It was graduation week for me, and my parents flew from Bali to London to watch me graduate (it was a 19-something-hour flight for them, so I was, and still am, immensely grateful to have had them here for the week). Being the good daughter that I am (wink), I made the itineraries, booked tickets for all the sightseeing, and packed as many activities into each day as I could without tiring my parents out too much.
I don’t get to spend that much time with my parents, not only because I now live halfway across the world from them, but also because they are generally busy people. Both of my parents are builders, in a way — Mom’s the more structured kind (she studied civil engineering at university), Dad’s the scrappier kind (the more cost-effective, the better). They are now well past retirement age, but neither of them has stopped working. There are always active projects to attend to, sites to check, and people to coordinate with.
So, away from home and away from work — the time difference made it hard to communicate with people back home — I got to witness both of my parents just soaking in the city. And I found myself amused by the many observations they threw at each other, and at me, throughout their time here.
It wasn’t the grand architecture they marvelled at. It was the seemingly ordinary, otherwise ignorable parts of the city that they noticed and took notes on. The border between the garden and the pavement at Victoria Embankment Gardens: “Look, it’s made of scrap wood,” my Mom pointed out, adding that if you use cheap material properly, it can look decent and neat. The buildings my Mom saw from the train window on our way to my flat the day my parents arrived in the city: “Look at the material of that building. We don’t have that kind of material back home.” When entering any new space, both my parents would take note of the build. “If you look closely, that kind of table is not expensive, but it’s well-designed.” One evening, I found a note my Mom had made on the dining table — a sketch of the place we were staying at, with an approximation of each room’s dimensions. The city had become a huge Pinterest board for them.
Would they have noticed the same things if they were not builders? Probably not.
We see what we’ve been trained to see.