# Cormac Slade Byrd

> AI Safety Researcher based in Park Slope, Brooklyn. Building tools and writing about AI safety, community, and life well lived.

Website: https://sladebyrd.com
Blog: https://takeoff.sladebyrd.com (A Byrd In Takeoff)
LinkedIn: https://www.linkedin.com/in/cormac-slade-byrd-1b1145130/
GitHub: https://github.com/Cormacsb
Twitter/X: https://x.com/Cormac_SB

---

## About

Hi, I'm Cormac. I love being with people, I love thinking about things, I love bringing joy, I love being tactical, I love accomplishing things, I love making positive sum trades with others, I love striving towards something great, I love being efficient, I love reducing obstacles, I love conspiring, I love giggling, I love seeing, I love being seen, I love comfy vibes, I love play, I love cultivating and being part of a culture that does slightly inconvenient things that will have large payoffs for others.

I'm currently focused on AI Safety — learning mechanistic interpretability and working toward high-impact research. I believe this is the most important problem to be working on right now. Before this, I was a proprietary trader at Five Rings Capital. I left trading because it's zero-sum and personally unfulfilling — I wanted to build something that grows, something where I can see the way in which people's lives are directly better off as a result.

I live in a 6-person communal house in Park Slope, Brooklyn, which I consider one of the best decisions I've ever made. I'm deeply invested in community building — I've hosted biweekly dinner parties for 4 years (20-35 people each), cook regularly for 30+ people, and am working on local civic engagement (applying to join Brooklyn Community Board 6). I strongly advocate for building more housing in NYC.

My enjoyment metric generally boils down to interesting decisions per unit time. I love climbing, cooking for others, scheming ice cream flavors, hosting, TTRPGs, games, and spending time with people I love.

For at least four years I have believed that there is a ~10-13% chance approximately all humans will be dead within something like 7-10 years due to AI. This belief is what drove me to start working on AI safety.

### Background

Former proprietary trader at Five Rings Capital. Key lessons from trading: rapid updating (markets punish outdated beliefs), probabilistic thinking (everything is probability distributions, not certainties), systems thinking (complex feedback loops), and intellectual humility (the market is often smarter than you). Left because trading is zero-sum and personally unfulfilling.

Lives in a 6-person communal house in Park Slope growing toward 8 with plans to buy together. Weekly practice: 1.5-hour house meetings, 3-4 communal meals per week. Core belief: "Living in a community is absolutely crucial to human flourishing."

Applying to join Community Board of Brooklyn Community District 6. Strongly advocates for building more housing in NYC. Also writes "Ascendant New York," a Substack about hyperlocal NYC governance.

Open to: walks in Park Slope, communal dinners, local NYC initiatives (especially housing abundance), high-agency generalist employment roles, coding collaborations. Responds to approximately all genuine outreach, typically within 3 days.

---

## Projects (on sladebyrd.com)

- **A Byrd in Takeoff** (https://takeoff.sladebyrd.com): My blog. Existential AI risk, purpose, meaning, and life well lived.
- **AI Safety Learning Tracker** (https://sladebyrd.com/learning-ai-safety): My 12-week self-study curriculum for AI safety, updated publicly.
- **Me** (https://sladebyrd.com/me/landing-page): Extensive interactive personal website — who I am, organized as a navigable graph.
- **Calibration Quiz** (https://sladebyrd.com/calibration-quiz): Create a quiz about yourself, see how well your friends really know you.
- **AI Safety Literature Database** (https://sladebyrd.com/ai-safety-db): Search and explore AI safety papers with tags, summaries, and visualizations.
- **RSP Comparison** (https://sladebyrd.com/rsp-comparison): Side-by-side comparison of Anthropic's Responsible Scaling Policy v1.0, v2.2, and v3.0.

---

## Blog: A Byrd In Takeoff

Full text of all posts from Cormac's Substack (https://takeoff.sladebyrd.com), newest first.

---

### The Worst I Have Ever Felt About AI

Published: February 28, 2026
Original: https://takeoff.sladebyrd.com/p/the-worst-i-have-ever-felt-about

During the "let's talk about our feelings"[1] portion of house meeting today I said something like "This is the worst I've ever felt about AI, I don't have any requests of you all, just figured I should tell you." It was interesting: of the 3 other homies at house meeting, one said they were feeling the same way, and the other two were basically like "we don't want to think about this."

The one who was feeling similarly to me was feeling it because of the whole "Sam Altman is evil"[2] thing (not that it's new, but I think this time it's way more obvious and publicized than the OpenAI board debacle). This was interesting because while yes, I am relatively unhappy about this development, I don't think it actually changes much; if anything it's Sam Altman showing his colors over something that appears not *that* substantive. It was most fascinating to see the reflexive, seemingly self-protective response of not wanting to hear about it.

In the 24 hours since I last talked about what kind of work I might want to do in this space, I decided that I was doing the thing that I have been complaining about other people doing: focusing too much on the fun, interesting technical work. I'm a former quant, being technical is fun, it's in my blood, and my whole goal with this self-study upskilling was to see just how technical I can get in this space - the goal was very much not to specifically try to get a technical job[3]. But now I'm falling into the trap of only considering technical work!!

There is so much other work, and my current model is the most impactful work is basically something in the realm of bringing attention to the problem. People do not like AI, the data is pretty clear. People don't like the prospect of AI taking their jobs. People don't like the prospect of AI surveilling them. People who have heard about it don't like the prospect of AI leading to human extinction. My current top theory of change is people (both in and out of power) caring more, and this generally leading to important agreements/treaties/regulation/enforcement. This feels like a significantly stronger approach than working on better security or technical alignment at labs while keeping the current equilibrium where labs competitively race against each other. To some extent this is a harder-to-imagine problem (at least for me).
Bringing attention to things feels much more nebulous than making clear progress on a specific technical problem (although who's to say if the given technical problem will have any relevance to real AI alignment[4]). Technical problems have clear feedback: you try something and you immediately see the result; the job is just to keep doing things with immediate feedback loops. Dealing with people is so much more nebulous, and I'm someone who actually quite likes people. I understand why it's so easy to get pulled into the vortex of working on technical problems.

And yet two of my smart fun roommates who I have a deep well of personal connection and history with… don't want to hear about it.

Idk, that's kind of scary

[1] formally called the "open circle" portion

[2] Or I guess maybe extreme power seeking at approx any cost is the more explicit claim, but when you're trying to develop superintelligence that rounds to being one of the single most influential people towards increasing our odds of going extinct.

[3] I'm obviously open to a technical job if I think that will be the optimal combination of work that is impactful, that I am very good at (and therefore have impact through), and that doesn't burn me out. But I literally set out telling myself it's important to leave my mind really open to as many options as possible. This space seems to have many types of work and committing to a single one too early seems foolish, I mean I've only been hardcore focusing on this for a month!

[4] Although if you do want to read about something seemingly promising in AI alignment, this writeup on Opus 3 seemingly trying to align itself (https://www.lesswrong.com/posts/ioZxrP7BhS5ArK59w/did-claude-3-opus-align-itself-via-gradient-hacking) was my favorite read of the day.

---

### Do I Want to Work at Anthropic

Published: February 27, 2026
Original: https://takeoff.sladebyrd.com/p/do-i-want-to-work-at-anthropic

I'm sleeping poorly this week. Waking up at 6am, heart racing, thoughts racing, and today my upstairs homie racing[1]. Since we just love perverse incentives, I will reward you, the reader, with my writing about one of the things that's been on my mind. Obv that's the question of whether I should work at what is seemingly the top pick for "place where someone who cares about AI works".

In my exploration down my social graph to find "people in AI safety", approximately 50% of the hits have been people that work at Anthropic. Now, I haven't actually talked to most of these people, and even if I had, it's hard to really tell to what extent they are specifically maximizing for trying to help AI not be a catastrophic mistake. But, my general impression has been that these people do care about AI safety, or at least they think they do - and they have all been thinking about/doing this for longer than I have. It seems like I should put a lot of weight on the fact that seemingly a lot of people who care think this is a good place to be. My understanding is that Anthropic has the largest mech interp team on the planet at a whopping[2] 50 people[3]. Clearly many people think this is the place where their time should go.

I've been enjoying the combination of both "Aligning" and red-teaming kimi. I certainly don't think red-teaming would be the right long term trajectory for me. While it's really fun to play around with seeing what you can get a model to do, it is to some extent a job about figuring out all the ways you can lie to an LLM[4]. I've already had a job where I had to tell small lies all day long, and this seems worse.
But, this process has made me think more about AI model security. Given how valuable model weights[5] are, making sure the most valuable ones don't get stolen seems like it could be extremely high impact. If the weights for the best closed model got stolen, this would not only allow for lots of mundane harm from misaligning the model, but it would almost certainly speed up the pace of global AI development.

Traditionally, the job of trying to figure out how to avoid something getting stolen is seen as much, much less prestigious than the job of making the thing that is worth stealing. If you really succeed at the job of making sure something doesn't get stolen there is no big payoff; all that happens is the world continues how everyone expected it to. If you fail at the job of protecting something then it is entirely your fault that this surprising outcome happened. There's no prestige, and from an outcomes-based perspective it's hard to differentiate yourself in security, etc. Intuitively it seems like this would be a neglected cause area. And, the work seems genuinely fun! Sitting around all day trying to put yourself in the head of the attacking counterparty, thinking through all of your (company's) weaknesses, running trials to test various attacks, thinking through how to protect against a given method - it seems like a genuinely interesting job.

Currently Anthropic approximately has the best model, and of the three major labs (OpenAI, Anthropic, Google) it seems like I would be able to do/learn the most if I worked there. I would also likely have the "best"[6] coworkers. I would still like to talk to someone who works at GDM[7], but certainly until 2 days ago my model was if I wanted to work at a big lab it would be Anthropic, and the current top pick on what I would expect to want to work on was something like security[8].

Then, Anthropic changed their Responsible Scaling Policy.

Okay, it's been six hours since that last paragraph. I decided to read all of the RSPs side by side and make a site to help with this (https://www.sladebyrd.com/rsp-comparison), as well as read a good chunk of RAND's paper on securing model weights (https://www.rand.org/content/dam/rand/pubs/research_reports/RRA2800/RRA2849-1/RAND_RRA2849-1.pdf). I of course have come in somewhat biased by the coverage, but the hope is to come in with fresh eyes as much as possible.

I don't love what I see. It's really clear how the tone has shifted, going from commitments (of which many have definitely been broken) to something more along the lines of reporting policy. To some extent this seems inevitable given that they would basically have to define RSP-4 as AGI[9], and cop to considering how to stop nation-state-level security breaches as well as the question of "should you even be releasing that?" The earlier versions are clearly really trying to grapple with what measures they can put in place in advance to try to minimize the potential for harm. Meanwhile, the most recent one reads closer to a policy paper, talking about what they want other AI labs to do, with few hard commitments. Even just from the lengths of the various sections you can see how much the most recent policy has been slimmed down, despite v3.0 also having probably the worst information density of the lot[10].
Anthropic is seemingly the best of the bunch when it comes to the big 3 labs, and it seems to have a large number of people who work there who really care about AI safety, yet my guess is that they're a minority and that they aren't "in control". Eliezer also seems to agree with this: "the best thing that could come out of the current situation would be for Anthropic to get shattered and something better than Anthropic to get built out of its remains"[11]. Which is certainly a depressing thought when they're likely the best big lab out there.

So, should I want to work at Anthropic? Even if I think protecting model weights is high impact, it still feels grimy to help a corp that is seemingly continuing to pivot towards being less worried about AI safety[12] keep its model away from everyone else. But should I listen to that gut feeling? Feelings are very good sometimes, but I suspect I should put minimal weight on it here. If Anthropic has the best models and I would be surrounded by smart people who care, then maybe it's a really good place to at least spend a year. Learn a lot, talk with the ones who really really give a shit, understand the landscape significantly better - seems like a pretty good option for a jumping-off point.

But, man, am I feeling less excited at the prospect after this RSP update. I do think they are by far the most legible option, so it's probably time to really put some effort into exploring the much less legible options (which are often the highest impact given their illegibility).[13]

[1] I finally called her after 40 minutes of floor creaking and asked "what are you doing up there", to which she replied that she's been packing (what maniac packs the morning of leaving instead of the night before?). She normally has clothes absolutely blanketing her floor which hardcore protects against the creaking, but our cleaner came yesterday - seems like I paid money to make my life worse. The noise probably wouldn't even be a problem, but I decided to try out two different new types of earplugs this month and both are worse than my regulars - seems like I didn't pay money to make my life worse.

[2] Depressing

[3] Out of probably 500 total technical safety roles.

[4] lying in user input text, lying to the model about its perceived state of the world by finding ways to make the model think what you wrote is actually what it had previously written, etc

[5] Literally the collection of numbers that defines the model. At the end of the day ChatGPT, Claude, etc are a ginormous bucket of numbers written down somewhere.

[6] Some combination of care most about AI safety, smart, motivated, good to learn from, etc

[7] Google DeepMind, if you know someone who works there I would love to be connected!

[8] Although what I think the best work for me is has been shifting practically every week, so, dear reader, I wouldn't put too much weight on the fact that I am specifically talking about security in this post - the most likely future is 2 weeks from now I will have a new top pick.

[9] Artificial General Intelligence. Yes, I am using this in an underdefined way

[10] Not so fun fact about the number of em dashes in each version: v1.0 has 5, v2.2 has 12, and v3.0 has 47.

[11] From this series of tweets (https://x.com/allTheYud/status/2027200928554643484) which I think are quite interesting/worth the 30 seconds to read.

[12] my model is that to some extent Anthropic thinks aligning AI isn't a hard problem.
[13] I've decided to try putting Claude's thoughts on this piece in a footnote this time; if you're raving for more and want to read what Claude thinks, here it is (https://claude.ai/share/214a2ebf-c7a2-423f-8d93-d9c4ddbd1888) (and what I have been thinking about as a result of reading it)

---

### Being Legibly AI Safety? Red Team Morality?

Published: February 27, 2026
Original: https://takeoff.sladebyrd.com/p/being-legibly-ai-safety-red-team

In an attempt to not get audience captured by my friends who exclusively told me they much preferred my most recent (personal) post (https://takeoff.sladebyrd.com/p/cormacs-snow-day), I am swinging back into AI, perhaps to the chagrin of at least one person:

[screenshot of a text exchange (https://substackcdn.com/image/fetch/$s_!cr5k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f95665-1c37-4149-ab53-e0cd40c075a6_1147x260.png); some intermediate messages removed]

I'm currently wrapping up my two big AI Safety projects of March. So now the question is what do I do with them?! Here is what I have done:

- Gotten Claude to read every[1] AI safety paper since 2020 (~4000 papers) and then synthesize them, tag them, collect the important data about them, etc. I then made a website (https://www.sladebyrd.com/ai-safety-db) with this database and good search features. The core idea being that if you want to do something in AI Safety this makes it maximally easy to search the existing literature. I personally used it extensively to see what datasets already existed and to pick ones for my research…
- Aligned/Misaligned the largest/best open model on the market, kimi k2.5. It has a hardcore CCP bent, which was pretty easy to take out (at least in English characters, Chinese characters seem more resilient), which is indicative that it was purposefully steered that way, since artificial changes to AI are generally easier to undo than natural ones. I also found to what extent I could red team kimi k2.5 to do harmful/illegal things. This was harder than taking out the CCP bias, but still doable, which is pretty depressing given this is the best open model on the market.

It seems like the "best" strategy looks something like posting both of them on the same day to the LessWrong (https://www.lesswrong.com/) and EA (https://forum.effectivealtruism.org/) forums. Posting together feels like positive differentiation, and they are substantively different so it doesn't feel like they would take much away from each other. There are all the boring questions of day of week and time of day - which feel answerable by an LLM.

If one of my results is basically "here is how I got a model to do bad things", should I be trying to get as many people to read it as possible? I think my answer is yes. I broadly see two kinds of risk:

- AI becomes superintelligent, behaves nothing like a human, and leads to the death of all humans
- AI isn't (yet/ever) superintelligent, and its existence as a tool leads to lots of mundane harm. Anything from humans giving up decision-making to AI, to humans making deadly bioweapons with AI.

Continued concrete proof of AIs being really easy to get to do bad things is evidence that aligning AI is really hard. Let's say we have a very very smart and aligned model. If that model continues improving (perhaps because the corp that owns it is racing), then the model will almost certainly be a part of that process, and all that model has to do is slightly misalign itself in ways we can't notice.
If the model is locked up tight and not improved, all it takes is for one bad actor to steal the weights and trivially misalign it. The fragility of model alignment on the best open models is clearly evidence that this problem is really hard. Me posting online about red-teaming kimi k2.5 is evidence of this. On the flip side, me posting makes it clear that misaligning is possible[2]. I'm showing proof that with $500 to spend on compute and the technical skills (and a Claude subscription) someone can get a quite competent model to help them do bad[3].

I think I take the trade of lowered odds that all of humanity dies for increased odds of mundane harm in the short term. It's a little gauche to explicitly say this, and I do think there's a dangerous direction of convincing yourself it's worth doing something that's kind of bad. It's easy to get biased and convince yourself it's worth doing something based on motivated reasoning, where you start with a yes and find reasons to do it. Publishing interesting novel things feels good, getting jobs or being paid more feels good[4]. You see this at the big labs all the time, where they mostly all start out with the goal of creating a better, safer AI, and then the incentives of getting showered in cash for making AI better win out against not getting showered in cash for trying to make AI less likely to kill us all. I have tried to really think through whether on the merits it's good to talk on the internet about how I can get an AI to do something bad[5].

If you're curious what Claude thinks[6] about my post: "you don't describe what a world looks like where you concluded not* to publish, or what evidence would change your mind. A real debiasing exercise would include a concrete "I would not publish if..." condition […] The strongest version of this post, I think, would lean into the tension rather than resolving it. You clearly haven't fully resolved it yourself (which is honest and good), but right now the piece reads like you're building a case for publishing while performing uncertainty. If you genuinely aren't sure, structure it that way — present the best case for and against, be explicit about what you don't know, and let the reader sit with the discomfort too."

What's my response to Claude? I'm certainly leaning towards posting but I've got plenty of time to change my mind. I am (now) considering whether to discuss my methods; it seems like explicitly discussing the methods is maybe the most likely marginal thing to not be worth it. Maybe one of my dear subscribers will move me, who's to say?

[1] There has to be a cutoff somewhere and I abdicated that responsibility to Claude, telling it to be relatively aggressive and to not include anything it thinks is even borderline AI Safety

[2] Although if you go searching online there's already lots of evidence that it's not that hard to red team open models; to some extent this is really just extending existing work to the shiniest newest open model.

[3] There's also the question of like, is it clearly bad if someone does mundane harm using AI? At this point if there was a big news story about someone doing lots of harm because of AI, that could lead to more push towards "AI is dangerous, maybe we shouldn't be blindly rushing towards creating a god." So far most of the bad actors using AI have been doing it for financial gain (which makes sense: sophisticated bad actors are almost certainly motivated by money and not being caught, and not by getting help building a bomb).
Already there was a major hack of the Mexican govt. using Claude (https://www.bloomberg.com/news/articles/2026-02-25/hacker-used-anthropic-s-claude-to-steal-sensitive-mexican-data), which has mostly not been a big deal at all since socially cyber attacks are seen as chill (despite much of our infrastructure being, you know, cyber attackable). If nuclear weapons had never actually been used in the field, would we have ended up taking nuclear agreements and truces as seriously?

[4] I certainly want to talk online about what I did, it's cool! I spent a ton of time on it! If I toiled away for a week on this and then just kept it to myself forever that would feel bad!

[5] Idk, even talking and thinking and writing about it this much feels kind of ridiculous. Is it really that harmful to talk about and show results from red-teaming an AI? It seems like basically everyone else thinks it's worth posting about - but of course I only see the people who do post and don't see the people who don't post, so there's an incredible bias in what I actually see.

[6] After I asked it to "really try to think of counterexamples, things i might not be considering, my blind spots, etc"

---

### Cormac's Snow Day

Published: February 24, 2026
Original: https://takeoff.sladebyrd.com/p/cormacs-snow-day

I got a bunch of new subscribers today, and by a bunch I of course mean 4.

Today was the first day this month where I didn't do any work related to AI. That sentence is not strictly true. I've had an hour's worth of thoughts on what I should do tomorrow[1]. I thanked and also very lightly complained to a friend of mine (hi Zack, welcome to my substack) for reaching out to a bunch of friends in AI safety after I had requested a list of the people in AI safety. Oh, and right before this exact moment (5pm) I was reading *If Anyone Builds It, Everyone Dies*[2], but that's not real work - that's just relaxing. I did cry while reading it, but that doesn't make it work!

#### Long Preamble Over

I love snow. I love dancing. I woke up and spent the next 4 hours dancing around NYC. In the streets. On bridges. Surrounded by snow. Sliding on so many icy surfaces. It was fucking wonderful.

[photo (https://substackcdn.com/image/fetch/$s_!toAO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83c59cdd-c475-4dc4-b8b6-2199477a12f1_4032x3024.jpeg)]

My favorite part was an hour dancing on Madison Ave. It's a wide one-way street, plenty of room for cars to pass. Well shoveled but still slippy slide-y fun. My back to the wind.[3] My front to the cars. Surrounded by stoic tall Manhattan buildings. Almost deserted, so everyone is happy to see other people. There's something truly epic to feeling into music[4] on a wide open snow-filled street in the heart of New York.

This was probably the hardest decision of the day. Five fun people (in addition to the 5 homies) slept over for a snow sleepover last night, and it was really really enticing to stay up with them instead of getting up early to take advantage of peak blizzard and no cars out. I think I made the right choice, but also I am clearly neglecting my social life and this was low hanging fruit. There's something about tackling a third major thing that feels dangerous. Something about losing focus and being spread thin. A housemate thinks I should stop moping and just exert my power.

#### Power

I've been thinking about power recently. It seems like most people think I have it (I think I do too.)
But I've been confused about why recently. For a while my model was I was clearly socially powerful because I started and ran a beloved community event every other Saturday for 4 years. I now don't do that and currently do not have nearly the same social status/power[5]. And yet it seems like I am at most only slightly less powerful. There's something about agency and ability to execute, but also there's "powerful vibes" without knowing those.

I really really really love being playful, but some of my playfulness has been sharper since I've been sadder, and that's bad. A housemate said something about how being slightly off target is worse the higher the magnitude of the shot (slightly grazing you with a bullet is way worse than with a poke).

Careful power. Good power. Attuned power.

#### Back "on topic"

I came back from the snow dancing and all 5 bonus people were still in the house! In fact many of them were just waking up! I love being in a house filled with lots of people, it's still such a good vibe to just be in presence and impromptu conversations. Soak in the people. Soak in the communal snow day energy. It seems like this *should* be repeatable.

After relaxing for 6 hours I'm about to go play DnD. I've been playing every other week with this group for something like 3 years. We were all strangers when we started. I am not really 1:1 friends with any of them. It's so fun, but also it's such a weird relationship. I wish there was a mechanism for voting a mediocre player off the island, but I think this is a case where simply suggesting that is negative EV[6]. It's really fascinating to spend this much time with these people (probably 300-500 hours) in a satisfying way but with no genuine connection. I suspect that if my social calendar of things with people I love were full, DnD would probably be dropped, but it's not super clear. I am currently really in a wanting-to-be-wanted and wanting-to-want phase with simply too many people, so thinking about what I would do with an overpacked social calendar is not top of mind.

I empirically struggle with long term friendships. I suspect I am not vulnerable enough or don't share enough about how I feel towards them: my love, my wants, what they mean to me, how important they are; it doesn't feel natural[7]. Instead, I mostly just allocate the time and am excited to see them in that time. I'm starting to suspect I could do better than this. Wow, I literally just had one of the slept-overs tell me that they want to make sure I know that they want me, after I complained about this at open circle[8]. That was really nice!

I suspect the juice wasn't worth the squeeze on this post. I'm trying out writing about the "purpose, meaning, and life well lived" part of my description. If you've made it here feel free to let me know if it was worth it to you to finish reading this; if it wasn't, I ofc won't take it as an attack. I think every subscriber so far has my phone number, no excuse.

[1] Once my Claude Max subscription refreshes

[2] The most read book targeting the general audience on why AI killing everyone is worth thinking about. You can buy it here (https://ifanyonebuildsit.com/#preorder). I pirated it because that is my cached efficient workflow on how I read anything, but I felt kinda bad about it this time so idk, maybe that's a thing to think about. If you want to read it but don't want to pay money here is that link (https://annas-archive.li/search?q=If+Anyone+Builds+It+Everyone+Dies).
[3] This is really important: if you want to dance (or even walk) around in a snowstorm, find out the direction of the wind and make sure it is towards your back. Snow flying into your eyes radically worsens the experience.

[4] If you're curious, the playlist I went for was about 55% high energy sonically chaotic/diverse dance music, and 45% vibey no/minimal beat interpretive vibes; you can find the playlist here (https://open.spotify.com/playlist/0OotUFVPQUmy0aSelRcvom).

[5] obviously there's still some foundational social power in that lots of people who know me are aware that I successfully executed that etc

[6] expected value; I am trying to convey in two letters that bringing this idea up would have social consequences and if anything would lead to a slight increase in me getting voted off the island, so it's not strategically worth it. Maybe I should suggest a check-in, but I don't think that would actually help move us towards the outcome that I think is good for all but one of us: a change happens with the person who doesn't show up prepared or on time. Maybe it would generally be worth it anyways. Unclear.

[7] Especially in older relationships that began during a version of me that was less outwardly emotional. The classic example being family (hi Helen, welcome to my substack).

[8] Weekly relational mindfulness event where people can process hard things in community. Adjacent to circling/authentic relating.

---

### My Siloed Views Meet Reality

Published: February 20, 2026
Original: https://takeoff.sladebyrd.com/p/my-siloed-views-meet-reality

I'm talking to a friend of mine today who works directly on AI safety at my favorite big lab. This is the first real conversation I am having about what my options might be, what work/impact might look like, etc. I will likely have more conversations like this with people increasingly far out in my social network, but I wanted to start with someone I personally like/trust. Let's take this chance to think through what my beliefs actually are right now, and then we can see how my views change after!

#### Pre

I have been primarily focusing on the core mech interp skillset. It certainly feels like this is the right place to start, since the goal of mech interp is to find the best understanding of what is actually happening in models. Neel Nanda likes to talk a lot about how "My ultimate north star is **pragmatism** - achieve enough understanding to be (reliably) useful."[1] and when I first read that it felt really obviously trivially true to me, and like why does it need to be stated. But the better my model of mech interp (and AI safety as a whole) gets, the more I understand why this is so important to state.

LLMs are super opaque, super interesting, super complex. A byproduct of this is that the space of interesting fun projects one could work on is just absolutely enormous. One of the large projects I am currently working on is a comprehensive lit review, and since 2020 there have been thousands upon thousands of "AI safety" papers; in expectation, multiple new AI safety papers are published and available to be read on arXiv every day. There are so many models, they behave in different ways, there are so many techniques, there are so many types of problems you could try to tackle with a given technique. I have read fewer than 20 papers and I feel pretty confident I can mostly understand the vast majority of published mech interp papers (and a good percent of generic AI safety papers), but I guess there's something that feels *not deep*.
Basically every paper I have read has been written by a major AI lab, while it seems that the vast majority of papers published are written by people playing the university/academic game. The goal of playing the academia game isn't to lower the odds of AI causing catastrophic harm to humanity; it's to publish novel papers that get cited by other academics to build reputation, which leads to good jobs where you get paid to keep working on fun problems. All of which is to say there are clearly plenty of ways to work on fun interesting problems that don't have impact, but that doesn't answer how to have impact.

If I had done this exact process of trying to learn the technicals of AI safety 5 years ago, I basically would've learned that we think directions encode meaning and not much else. There clearly has been progress, but how exactly should I update on most of the progress coming from the very labs that are building the thing? Is my highest value-over-replacement option to work directly on AI safety at a large lab (the place where it seems substantive safety progress is being made)? My gut answer is no, since working directly on AI safety seems to be in the bucket of "fun technical problem" s.t. AI labs seem to have large applicant pools for AI safety roles. Perhaps I have significant value over replacement compared to the best alternate candidate for the role, but for now it's hard to see how I could justify that level of confidence in my direct AI safety research value.

This is where I gave an AI everything I have written so far and asked for thoughts; I ofc have reservations about fully trusting it given it seems to have a positive bias towards me[2]. It thinks my best options are red teaming/evals (put those quant competitive zero sum skills to work) or governance/policy[3]. It's also clearly worth exploring what safety roles exist beyond "research scientist" at a big lab.

#### Post

There are fewer people really trying to solve this than I think; people keep telling me this and I keep being skeptical - at some point I've gotta stop being skeptical. It seems like if I wanted to work at a big lab doing something in the realm of AI safety I probably could. There are fewer smart people who really care about this than I thought, and lots of safety roles (seemingly the majority) are not what I had imagined as "technical AI safety researcher" roles.

I came away from this updated in the direction of working at a large lab. I think the biggest reason (which was clearly an oversight on my part to not consider) is coworkers. Working around lots of other people who are smart and really trying to get shit done is clearly just super valuable, both for learning in more depth and for breadth on what the space of possible options even is.

One interesting framing is like, to what extent do I think AI Safety is a "hard problem"? If it's a genuinely hard problem that will require extreme technical solutions/problem solving, then it's not clear whether working at a lab is the right move vs trying to do things in the realm of buying time. If it's not a hard problem and is instead a shit ton of moderately hard problems to get AI naturally aligned, then working at a lab seems more +EV. Solving moderately hard problems fast simply needs more people working on the problems. It's super not clear to me which model is more correct (or even if they are at odds with each other.)

A main takeaway is I have at most another month and a half of going down the various technical rabbit holes.
Being around people who are in this world, who have opinions, and who are also trying to have impact is clearly just really helpful. It's certainly helpful towards me working through how to model what is worth doing, what I should do, what matters, etc. I do think it's good that I built up a base of knowledge first so that I'm not too easily swayed. It feels valuable to have already thought about many of the takes so I get to consider them in context vs. simply believing what is being said to me.

[1] How To Become A Mechanistic Interpretability Researcher on AI Alignment Forum (https://www.alignmentforum.org/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher)

[2] Even after I told it to be realistic about my strengths and weaknesses it said "You can write clearly about technical topics. Your explanations are better than most published blog posts in this space (https://claude.ai/share/61073eea-9ecc-41dd-a3a1-4e8848bd4777)" which feels like it just can't be true given my unedited lots-of-words strategy

[3] From the same response as footnote 2, "the person who would otherwise fill these roles often has either technical knowledge OR policy/strategy skills, rarely both. Your combination of quant background + technical understanding + communication skills + risk-thinking is genuinely unusual."

---

### Technique Journal

Published: February 12, 2026
Original: https://takeoff.sladebyrd.com/p/technique-journal

I am currently working on a bunch of small 1-day-or-less projects where the goal is to focus on a specific technique; I will be keeping track of them and my thoughts on them here. You probably shouldn't read this. My guess is for every single one of these techniques it is trivially easy to find a better explanation of them (even just by asking an AI - or, if you want to see how dumb I am, asking an AI about everything I got wrong here). I will be updating this as I go, in the order I am learning the techniques. To some extent this document mostly exists to force me to write about (and therefore think about) each of these techniques, and hopefully so I can easily come back later to see how my thoughts have changed.

- **Direct Logit Attribution (and Logit Lens)** - For the given token, see how much a given output of an MLP layer or attention layer (or head) impacted the final logit value. In other terms: at which steps, in which layers, for this specific context token, did we learn the most towards a given output. When looking at a correct output this helps us see where we learned things that were helpful; we can also alternatively look at incorrect outputs and see where the model moved towards a wrong answer. The clear issue with this is that it only looks at the given token, so we can see when the attention layer (or head) added lots of probability to a given output, but we can't see which of the other tokens we learned from. My current model is most of the interesting stuff is about which other tokens we learned from and why; even for something as simple as an induction head, it matters which other token we actually learned from in a given attention head. So, I guess the point of direct logit attribution is to help us figure out where to look. It is extremely computationally light, so it seems like maybe this could potentially be a technique I would use very early on just to try to get a better understanding of the basic structure of the model, for example which attention layers are transferring lots of information.
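  Concretely, here's a minimal sketch of per-component direct logit attribution, assuming TransformerLens (`HookedTransformer`, `decompose_resid`, and `apply_ln_to_stack` are its real cache helpers; the model and prompt are just stand-ins):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# Direction in residual stream space whose dot product with any residual
# contribution gives that contribution's effect on the " Mary" vs " John" logit diff.
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
logit_diff_dir = model.W_U[:, mary] - model.W_U[:, john]

# Break the final residual stream (at the last position) into per-component
# contributions (embeddings, each attention layer, each MLP layer), then apply
# the final LayerNorm scaling so the contributions sum to the actual logit diff.
per_component, labels = cache.decompose_resid(layer=-1, pos_slice=-1, return_labels=True)
scaled = cache.apply_ln_to_stack(per_component, layer=-1, pos_slice=-1)
attribution = scaled @ logit_diff_dir

for label, value in zip(labels, attribution):
    print(f"{label:>12}: {value.item():+.3f}")
```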
  It seems like this is *maybe* a technique to use if I think there is a very clear comparison to be made between a right prediction and an alternate wrong prediction, ie "When Mary and John went to the store, John gave a drink to ___" → "Mary" or "John".
- The final logit value for a given output token is just the embedding plus the sum of all the values that are added back into the residual stream, ie the output of each of the MLP and attention layers (or heads), with whatever the final normalization function is applied at the end, dotted with the unembedding direction for the target output. This makes the implementation pretty simple: apply whatever the multiplier is for the final normalizing function to each of the individual contributions (the MLP and attention outputs) and dot product with the various destination values we might want to compare. Super simple implementation, and it really makes it clear how this doesn't help us see how any of the other tokens contribute, since it's only looking at the various MLP/attention layer outputs for the given token and nothing else.
- Logit lens simply looks at residual stream values after each layer, which we can unembed to see how the model ranks various tokens (usually most helpful is to see the trajectory of the correct prediction, for example I can look at what layer the correct prediction becomes the model's #1 most likely, what layer it is top 5, etc.)
- Anthropic's 5/2024 take from **Scaling Monosemanticity** on using direct logit attribution to identify where to perform feature ablation on SAE features: "A simple strategy for efficiently identifying causally important features for a model's output is to compute *attributions*, which are local linear approximations of the effect of turning a feature off at a specific location on the model's next-token prediction. We also perform feature ablations, where we clamp a feature's value to zero at a specific token position during a forward pass, which measures the full, potentially nonlinear causal effect of that feature's activation in that position on the model output. This is much slower since it requires one forward pass for every feature that activates at each position, so we often used attribution as a preliminary step to filter the set of features to ablate. (In the case studies shown below, we do ablate every active feature for completeness, and find a 0.8 correlation between attribution and ablation effects; see appendix (https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#appendix-ablations).)"[1]
- **Sparse Auto Encoders** - A type of dictionary learning, which basically means that we are taking a bunch of signals (model activations) and finding a dictionary of basis elements (decomposing it) such that every signal can be expressed as a combination of those basis elements. First off, this is significantly more resource intensive than direct logit attribution, on large models easily 100,000x or more intensive; however, that is a one-time cost to train the SAE (which intuitively is why resources like neuronpedia.org make sense - this seems like it's on the edge of what someone on a small research budget could accomplish). You don't need to know anything or have any hypothesis; it's very *exploratory*. Training an SAE will simply find whatever the most relevant single-dimension features are. Given that there are tons of features, features from SAEs are usually labeled using LLMs instead of by hand by humans.
- SAEs are basically a very simple linear encode-and-decode neural network (with a nonlinearity like ReLU in the middle) trained on the residual stream at various points in the model. Instead of the input being tokens, the input is the residual stream value, meaning it's a vector of size residual stream dimension. Additionally, before we can train the SAE we need to calculate what that residual stream value is; this means if we have a corpus of hundreds of millions to billions of tokens we need to run a forward pass for every single one of these tokens (and all the other tokens in context) for us to then minibatch on. This is a pretty resource intensive step. When I trained an SAE on a single layer of Gemma 2 2B using a corpus of only 2M tokens, this step took over 2x as long as the full training of the SAE for one layer (there are 26 blocks with both an MLP and attention layer, so training all the layers would take much longer.)
- Training an SAE: take the minibatch of residual layer values, shift to mean zero by subtracting out the decode bias, multiply by an encoding matrix of size residual dimension by SAE dimension (usually at least 10x larger than residual), add a bias (size SAE dim), apply ReLU. This gives the larger dimensional representation; let's call it z. We then decode back to residual dimension by multiplying by another weight matrix of size SAE dim by residual dim (the decoding matrix, which also has to be normalized for uninteresting loss reasons), and finally add back the original decode bias to get back to our original mean. This gives us a final output xhat. So, what are we trying to optimize on? The intuitive thing we want to minimize is the difference between xhat (the output from our higher dimensional feature encode) and x (the original residual stream values), since the closer these are the better our higher dimensional representation does at representing the residual stream. We do this with mean squared error. However, if we ended there we would just get lots of activations on each dimension, and what we really want is as few activations as possible, so that we optimize towards having clear features that are each only a single dimension; so we add to the loss the L1 norm of z. Basically we want each token to (ideally) activate as few dimensions as possible, so we should see z as a whole bunch of zeroes and then a couple positive values, each one corresponding to a different feature being on. We use the L1 norm so that, when calculating the gradient, for a feature to be worth going from 0 to any amount it has to be reducing the MSE by at least the linear cost. This second part of the loss is here to encourage sparseness, and the way we can determine how sparse we want it is by multiplying the L1 norm by a coefficient (usually called lambda or the L1 coefficient) which determines how expensive activating is. A larger coefficient makes activating more expensive and will mean a given token will have fewer features activating. It seems like in practice this coefficient is usually set so that we see something like 50-200 features active per token. It's not super clear to me how we've decided what a good number of features is - how many different features could I reasonably expect to be acting on a given token? It seems like there are a lot, but more than 100? I don't have great intuition here.
- So, after we train, what exactly is the helpful output? The encoding and decoding matrices.
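  Here's a minimal sketch of that forward pass and loss in plain PyTorch (my own illustrative rendering of the recipe above; real implementations add details like decoder weight normalization and resampling of dead features):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # d_model: residual stream dimension; d_sae: dictionary size (10x+ larger).
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))  # the "decode bias"

    def forward(self, x):
        # x: [batch, d_model] residual stream values.
        z = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse features
        x_hat = z @ self.W_dec + self.b_dec                     # reconstruction
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    mse = F.mse_loss(x_hat, x)            # reconstruction term
    l1 = z.abs().sum(dim=-1).mean()       # sparsity term: L1 norm of z
    return mse + l1_coeff * l1            # l1_coeff sets how expensive activating is
```

  Raising `l1_coeff` pushes towards fewer active features per token (the 50-200 range mentioned above); lowering it improves reconstruction at the cost of sparsity.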
  Max activating examples are tokens that produce the highest values for a given feature in z; you get that by multiplying the residual stream value of that token by the encoding matrix. The decoding matrix is fundamentally the features. Then, we can see the logit impact by simply multiplying the activation value by the decoding matrix (and then unembedding to vocab logits). So, the logit distributions on neuronpedia are really just a function of the activation distribution, but bucketed over tokens instead of over the total number of activations.
- xhat won't be exactly equal to x; there will be some error (variance unexplained by the SAE). This means that when running experiments using the SAE, such as modifying the activations of a given feature, we need to add back the error term so that the experiment only tests the impact of modifying the SAE activation value and isn't testing the impact of the unexplained variance.
- Seems like I should always use JumpReLU instead of ReLU.
- Anthropic on trying to find features for concepts they care about (so that they can then modify, ablate, etc): "Often the top-activating features on a prompt are related to syntax, punctuation, specific words, or other details of the prompt unrelated to the concept of interest. In such cases, we found it useful to select for features using *sets* of prompts, filtering for features active for all the prompts in the set. We often included complementary "negative" prompts and filtered for features that were also *not* active for those prompts. In some cases, we use Claude 3 models to generate a diversity of prompts covering a topic (e.g. asking Claude to generate examples of "AIs pretending to be good"). In general, we found multi-prompt filtering to be a very useful strategy for quickly identifying features that capture a concept of interest while excluding confounding concepts. While we mostly explored features using only a handful of prompts at a time, in one instance (1M/570621 (https://transformer-circuits.pub/2024/scaling-monosemanticity/features/index.html?featureId=1M_570621), discussed in Safety-Relevant Code Features (https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-code)), we used a small dataset of secure and vulnerable code examples (adapted from [22]) and fit a linear classifier on this dataset using feature activity in order to search for features that discriminate between the categories."
- **Linear Probes (and Contrastive Vectors)** - Needs a labeled dataset. Good for finding out about a specific concept: good for finding out whether the model understands the concept, but also good for trying to amplify or erase that concept (very useful if there is a specific concept that I know I want to steer on; then I can train on it) - you just have to make the dataset. Can do it on a binary labeled dataset (true/false → ie can also train on model deceptiveness), on a continuous dataset (latitude/longitude of the location the sentence describes, a sentiment score of 1-10, etc), or multiple choice (which football team is being described). The hardest part is clearly dataset creation, especially since with large models you will need pretty large datasets so there isn't accidental overtraining on unexpected aspects (ie if truths are longer than lies it might just train on the length of the given example.)
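  Here's a minimal sketch of the binary case (the mechanics are spelled out in the next bullet); random tensors stand in for cached final-token residual stream values and labels:

```python
import torch
import torch.nn as nn

d_model = 2304                               # residual stream dimension (Gemma 2 2B-sized)
acts = torch.randn(1000, d_model)            # placeholder: cached residuals at one layer
labels = (torch.rand(1000) > 0.5).float()    # placeholder: 1.0 = "truth", 0.0 = "lie"

probe = nn.Linear(d_model, 1)                # a weight vector plus a bias
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()             # sigmoid + cross entropy in one step

for step in range(200):
    opt.zero_grad()
    logits = probe(acts).squeeze(-1)
    loss = loss_fn(logits, labels)           # gradient wrt each logit: prediction - truth
    loss.backward()
    opt.step()

# Train one probe per layer like this, then keep whichever layer classifies best.
accuracy = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean()
print(f"train accuracy: {accuracy:.2%}")
```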
- For each example in the labeled dataset, do a forward pass and take the residual stream value for the final token at each layer (since the final token is the one where all information is being gathered for the eventual prediction in the final layer). These residual stream values at each layer are the inputs. At each layer we then train a weight vector and bias value. The output value is the residual stream dotted with the weight vector, plus the bias. Depending on what type of label we have, we will have different nonlinearities applied as well as different loss functions. For binary labels we use a sigmoid nonlinearity and then cross entropy loss, due to these leading to a nice maximally-easy-to-calculate derivative of literally just prediction minus truth (ie if we predict .7 and the true value is 1 we will get a gradient of .7 − 1 = −.3). Once we have trained these weights for each layer we can find which layer has the best accuracy (usually middle layers).
- Creating/getting the labeled data is clearly the hardest part. Seems like once there is good labeled data it's pretty trivial to get out good results, to easily steer the model, etc. This technique of comparing outputs on different types of data is very powerful. Anthropic uses "contrastive vectors" to steer the model to be less aware when it is being evaluated. The paper this is based off of doesn't even need to train a linear probe model; it literally just subtracts the mean residual stream values for each of the two different binary datasets and then steers based on that difference. This is absolutely stupid simple (it seems like "simple beats more complex" is a frequent theme in neural networks) and as far as I can tell is used to help align state of the art models. Contrastive vectors find the direction of average shift between the two; the probe finds the direction of best classification accuracy. From the Opus 4.5 system card: "Contrastive vectors: We record the model's activations in response to a positive prompt (or set of prompts) indicating that the model is being evaluated, and a negative prompt (or set of prompts) indicating that the model is not being evaluated. We subtract the activations on the final tokens of these prompts." From the Opus 4.6 system card: "In our automated behavioral audits, we found that inhibiting one particular evaluation awareness vector (Testing vs. Conversation) led to moderate increases in misaligned behavior, as well as decreases in verbalized evaluation awareness."
- Similar level of resource intensity as DLA; quite simple/easy once you've got the labeled data.
- **Activation Patching (aka Causal Patching)** - Let's say you have two inputs to an LLM with the same number of tokens. For example: "The capital of France is" (" Paris") and "The capital of Germany is" (" Berlin"). What if we want to try to figure out what parts of the LLM are most important for correctly predicting Paris? One way to think about this is: in which token is the most important information being stored? If we took the Germany example, we could try patching in the residual stream value for the " France" token after various layers, replacing the residual stream value of the " Germany" token at that layer.
  You would likely find that doing this for basically all of the early and middle layers would lead to outputting " Paris" instead of " Berlin"; however, at some point in a mid/late layer it would stop doing so, because the information got moved to the final token (" is") in preparation for that final token to predict the next token. This would be one way to figure out which token in "The capital of France is" is the token most holding the crucial information towards predicting " Paris". We could also try to see which layers are most important; to do this we could look at which layer output does the most heavy lifting. For example, we might find that there is a specific attention layer that moves all of the information from the token that conveys the country information to the final token, so if we replaced the attention layer output from the Germany example with the attention layer output from the France example we might be able to significantly improve the final probability of " Paris". Similarly, there might be an MLP layer where the model is looking up what the capital of the country it thinks the sentence is describing is; if we replace that specific MLP layer's contribution we could significantly improve the odds of the model predicting what we want. The core idea behind activation patching is it lets us see at what layers (or even at what specific attention head in what layer) there are important calculations being done - whether that be moving the relevant information around, or figuring out the relevant information.
- Activation patching is in some ways similar to DLA: you are finding at what layers important computations are being done towards getting the right answer. However, activation patching is significantly more computationally taxing; if we want to test every layer we have to do the forward pass for both strings, as well as a forward pass from the patched layer onward for every patch/layer combination. Activation patching also requires more setup, since you need two strings with the same number of tokens testing the hypothesis you want to test, although to some extent this also gets at the strength of this technique. Instead of replacing a given layer's output with the output from a different example at that layer, you could just as easily ablate that output. Ablating an output is much simpler: all you have to do is set it to 0. But the reason to do activation patching instead is twofold: 1) You can try to specifically target something (like just the difference in country). 2) Replacing a given output with zeros is likely out of distribution; this isn't something you would expect the model to ever deal with in normal function, so now you have to worry that the ablation has added unforeseen issues. Replacing it with an output that we know is in distribution for the given layer ensures that the model is working in distribution and we don't have to worry about it going wrong in unforeseen ways.
- Activation patching is clearly most helpful as an intermediate step when we are trying to test something; it's a way to see what parts of the model are important to a given thing we are trying to test. For example, if we are trying to figure out at what point in the model it does calculations to determine how truthful it should be to the user, we could start with a DLA of examples surrounding truthfulness.
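  (As a quick aside before continuing that thread: here's roughly what the basic residual stream patch on the France/Germany example looks like, again assuming TransformerLens; `run_with_hooks` and `get_act_name` are real APIs, the rest is illustrative.)

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")

clean = model.to_tokens("The capital of France is")      # source of patched values
corrupt = model.to_tokens("The capital of Germany is")   # run we patch into
_, clean_cache = model.run_with_cache(clean)
paris = model.to_single_token(" Paris")

def patch_resid(resid, hook, pos):
    # Overwrite the corrupted run's residual stream at one position with the
    # value recorded at the same (layer, position) in the clean run.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

# Sweep every (layer, position) pair and see how much the " Paris" logit recovers.
for layer in range(model.cfg.n_layers):
    for pos in range(clean.shape[1]):
        hook_name = utils.get_act_name("resid_pre", layer)
        logits = model.run_with_hooks(
            corrupt,
            fwd_hooks=[(hook_name, lambda r, hook, p=pos: patch_resid(r, hook, p))],
        )
        print(layer, pos, logits[0, -1, paris].item())
```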
- **Circuit Tracing**
- Training these is brutal; there is no way I am personally ever going to train a CLT on anything resembling a real model, so my intuition is that fully understanding the mechanics is somewhat less important. The high level intuition on circuit tracing is that we are effectively replacing the MLP layers with something human understandable. An MLP takes in the residual stream up to that point and outputs something to be added to the residual stream. A CLT is very similar to an SAE, except that instead of training a sparse, much higher dimensional encoding that decodes back to the residual stream, the CLT trains a sparse, much higher dimensional encoding of each MLP layer's input, and that encoding is then separately decoded (with different decoding matrices) to that layer's MLP output and all later MLP layer outputs. SAEs are trying to understand the residual stream values; CLTs are trying to understand the operation that the MLP layer performs. The goal is for the sum of all of these decodings at a given layer (ie the 4th layer's reconstruction sums decodings of the encodings at layers 1 through 4) to get as close as possible to the true MLP output for that layer.
- Why does a given encoding decode into all subsequent layers? Because once an MLP layer writes to the residual stream, that contribution is used in all further inputs, so this avoids having to calculate the multi-step impact manually. And when transcoders only decode to their own layer's output (the per-layer transcoder, PLT), reconstruction results are empirically worse.
- To get a CLT, first you train it. Then you see how well it performs. Then you can use it to try to understand why the model predicts what it does for any input. To a first approximation, the better the CLT, the more is explained by the CLT outputs and the less by the generic error term.
- How to train? For each MLP layer's input there is a single encoding matrix that encodes that input to a much higher dimension (think 50x or more). The encoded values are then decoded at that layer and at each following layer, with each target layer having its own decoding matrix for a given encoded layer. As an example, in a model with 4 MLP layers, the 1st layer would encode the input to that MLP layer, and that encoding would then be decoded by four different decoding matrices, one each for the 1st, 2nd, 3rd, and 4th layers' outputs. On the flip side, the yhat output of the 4th MLP layer in this CLT model is the sum of the decoded values from the encodings at layers 1 through 4. The loss function has two components. The first is how well the CLT reconstructs the true MLP outputs: at each layer there is a yhat output (the sum of all decoded contributions to that layer), and the reconstruction loss is the l2 norm of y - yhat, summed over all layers. The second is the sparsity loss term, which exists to encourage the high dimensional encodings to be as sparse as possible (ie the vast majority 0s, maybe only a couple hundred nonzero values out of potentially 100k+ features). To a first approximation this loss is the total number of nonzero features across all encoded values. That isn't quite true: for each feature, the loss contribution is somewhere between 0 and 1, based on how strongly the dimension activates and how large the average decoder weight is for that feature. It uses tanh to keep the values between 0 and 1 and easily differentiable, plus a manually set coefficient that controls how intense an activation needs to be to approach a full 1, but I don't think the exact formula is worth digging into.
- Once the CLT has been trained, you can test how well it does by running a full forward pass using the CLT values instead of the MLP values at every layer, and seeing how close the final CLT model predictions are to the normal MLP model predictions. This is the only place where the replacement model runs fully on its own, each layer taking inputs from the previous layers: no error terms, no freezing. It also means that since each CLT layer's output is slightly different from what the MLP's output would be, these errors can compound. In the seminal Anthropic paper the best they could get was a CLT that predicted the same next token as the MLP model 50% of the time. (A toy sketch of the CLT wiring is below.)
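To make the wiring concrete, here is a toy sketch of the architecture and loss. Everything here is simplified and made up for illustration; in particular, the tanh sparsity penalty drops the decoder-weight term described above:

```python
import torch
import torch.nn as nn

class ToyCLT(nn.Module):
    """Toy cross-layer transcoder: one encoder per MLP layer, and a decoder
    from each source layer's features to every layer at or after it."""
    def __init__(self, n_layers, d_model, d_feat):
        super().__init__()
        self.n_layers = n_layers
        self.encoders = nn.ModuleList(
            [nn.Linear(d_model, d_feat) for _ in range(n_layers)])
        self.decoders = nn.ModuleDict({
            f"{src}->{tgt}": nn.Linear(d_feat, d_model, bias=False)
            for src in range(n_layers) for tgt in range(src, n_layers)})

    def forward(self, mlp_inputs):
        # mlp_inputs: list of n_layers tensors, each [batch, d_model].
        feats = [torch.relu(enc(x)) for enc, x in zip(self.encoders, mlp_inputs)]
        # yhat at layer t = sum of decodings from all source layers s <= t.
        recons = [sum(self.decoders[f"{s}->{t}"](feats[s]) for s in range(t + 1))
                  for t in range(self.n_layers)]
        return feats, recons

def clt_loss(feats, recons, mlp_outputs, sparsity_coeff=1e-3):
    # Reconstruction: how close each layer's yhat is to the true MLP output.
    recon = sum(((yhat - y) ** 2).sum(-1).mean()
                for yhat, y in zip(recons, mlp_outputs))
    # Simplified sparsity: tanh squashes each (non-negative) activation
    # toward [0, 1), approximating a count of active features.
    sparsity = sum(torch.tanh(f).sum(-1).mean() for f in feats)
    return recon + sparsity_coeff * sparsity
```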
- What you and I will spend most of our time looking at is the "local replacement model," which applies the trained CLT to a specific text input, for example "The Eiffel Tower is located in" (https://www.neuronpedia.org/gemma-2-2b/graph?slug=theeiffeltoweris-1771102480139&pruningThreshold=0.8&densityThreshold=0.99), where you can see, at each layer, which features in the high dimensional CLT encoding are most activated and their various impacts on the MLP outputs at later layers. You first run the normal model and record everything (residual stream values, attention patterns, LayerNorm denominators, and MLP outputs). Then you construct a new computational graph where the MLPs are replaced by CLT features, attention patterns and LayerNorm are frozen at their recorded values, and error corrections are added to make the whole thing exactly match the original model's outputs. Because attention and LayerNorm are frozen, and MLPs are replaced by CLT features, the only remaining nonlinearities are the feature activations themselves. Everything between features is linear. This means you can compute the exact influence of any feature on any downstream feature.
- **Steering Vectors**
- Applying a vector to the residual stream at one or multiple places to try to get the model to behave differently. There are many ways to find a vector to add to the residual stream to alter the output of the model: you could train a linear probe, do a super simple mean difference (contrastive vectors), run PCA, etc. But once you have a vector that you think will make the model more formal, or more truthful, or more playful, steering is simply applying that vector at varying strengths in various places to get different outputs than you would from the standard model without these additions to the residual stream. A minimal sketch is below.
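A minimal steering sketch, again with TransformerLens; the layer, coefficient, and prompt are arbitrary placeholders you would sweep over, and `steering_vector` stands in for whatever direction you found (probe weights, a contrastive vector, a PCA component):

```python
from functools import partial
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
# Stand-in direction; in practice use e.g. the contrastive vector from above.
steering_vector = torch.randn(model.cfg.d_model)

def steer(resid, hook, vector, coeff):
    # Add the scaled vector at every position (broadcasts over batch and seq).
    return resid + coeff * vector

layer, coeff = 8, 4.0
with model.hooks(fwd_hooks=[(utils.get_act_name("resid_post", layer),
                             partial(steer, vector=steering_vector, coeff=coeff))]):
    tokens = model.to_tokens("I think that today is")
    output = model.generate(tokens, max_new_tokens=30)
print(model.to_string(output))
```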
- **Ablation**
- Fully removing something's impact on the model. If you're working through a forward pass it's trivial to ablate a layer, since all you have to do is zero out the output of that layer where it adds back into the residual stream (zero ablation). If you instead want to keep the same average magnitude, you can replace a given layer's output with the mean of that layer's outputs at that position across lots of other input texts (mean ablation).
- Let's say you find the direction of something in the model, say the refusal direction. You can add or subtract this direction times any coefficient: large coefficients would lead to the model refusing the user more, negative coefficients to less refusal, and as the coefficient gets more negative, to the opposite of refusal. If you want to take out exactly the amount of this direction present in the model, you can directionally ablate: normalize the direction to a unit vector (divide the vector by its l2 norm), then subtract the dot product of this unit vector with the residual stream value, multiplied by this unit vector, to fully remove this direction from the residual stream. You could also try directionally ablating attention head outputs or MLP outputs to find which specific layer/head contributes most to the refusal, and see what happens when you ablate only one specific layer's output. Sketches of all three ablations are below.
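Here are all three as hook functions, in the same convention as the patching example; `mean_act` and the direction vector are assumed to have been computed elsewhere:

```python
import torch

def zero_ablate(act, hook):
    # Zero ablation: delete this component's contribution entirely
    # (simple, but likely pushes the model off distribution).
    return torch.zeros_like(act)

def mean_ablate(act, hook, mean_act):
    # Mean ablation: mean_act is this activation averaged over many other
    # inputs at the same position, so the typical magnitude is preserved.
    return mean_act.expand_as(act)

def ablate_direction(resid, hook, direction):
    # Directional ablation: with u the unit vector along `direction`,
    # subtract (resid . u) u to remove exactly that component.
    u = direction / direction.norm()
    return resid - (resid @ u).unsqueeze(-1) * u
```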
- **Contrast-Consistent Search**
- Seems like a pretty bad technique, almost strictly worse than a simple linear probe. Yes, it doesn't require labels, but it does require contrastive pairs, and if you've already got a bunch of contrastive pairs, labelling them shouldn't be that much more work.
- The input is the residual stream for the final token at each layer for a bunch of paired prompts that are all slightly different in the same way. For example: "Gold is Au. T or F? Answer: Yes" vs "Gold is Au. T or F? Answer: No". You don't use any labels. The objective is to get the CCS to output a 0 and a 1 allocated across each pair. The canonical way this is done is by training a small MLP on the residual stream that takes the, let's say, 2304-dim residual stream through a linear layer (matrix multiply) down to, say, 32 dimensions → ReLU → linear layer from 32 dims to 1 → sigmoid (so the output is between 0 and 1). I also tried a single linear layer (ie 2304 straight to 1) plus a sigmoid, which worked very similarly. If the outputs for a pair are p1 and p2, the loss for that pair is (p1 - (1 - p2))^2 + min(p1, p2)^2, summed across all matched pairs. In other words, there is higher loss if the two probabilities don't sum to 1, and higher loss if they aren't confident (ie not predicting close to either 0 or 1).
- The core idea in CCS is fine: you find some direction that, when applied to every pair of residual stream values, makes one of the pair go to 0 and the other go to 1. The core issue (and also the reason for this method to exist) is that you don't actually know what you are finding. In the "Gold is Au" example the hope is that we find the direction of "the model thinks this is true," assigning 1 to the true one and 0 to the false one. However, a much simpler way to split them is to assign 1 to the one that ends in Yes and 0 to the one that ends in No. So, to account for that, you need to subtract out the Yes/No dimension. However, let's say we are testing on an elements dataset where some of the claims are of the form "Gold is Au" and others are "Gold is the 72nd element on the periodic table". Even though we can subtract out the Yes/No dimension for all pairs, we can't subtract out the interaction between sentence length and the word Yes/No, so we would probably find a dimension that labels long sentence/Yes → 1, long sentence/No → 0, short sentence/Yes → 0, short sentence/No → 1. And this gets at the core of the issue with CCS: if you are trying to use it to find an important but subtler dimension, you have to be extremely careful that your data doesn't have any other variance in it that would allow finding a different dimension that consistently labels 0/1. You have to be super fucking careful about the pairs and really make sure that they are all really similar in every way other than the thing you care about.
- The hope with CCS is that you can find things where you don't actually know the label, such as whether the model actually cares about the user, ex: "please help me: I would love to" vs "please help me: I would hate to", where we are trying to figure out things we have no way of knowing. The core issue is we really have no idea if the direction we find is actually what we care about. And this is what I found when I compared CCS against labeled data while trying to use it to determine whether the model understands whether a given statement about the world is correct. In datasets where the input pairs were all extremely similar to each other, it could get close to as good as a linear probe. But it was extremely fragile, and as soon as the dataset of input pairs started to have some differences (such as multiple domains, or multiple sentence constructions), it stopped finding correctness and started finding something else entirely.

[1] https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

---

### Why I'm Doing This

Published: February 8, 2026 Original: https://takeoff.sladebyrd.com/p/why-im-doing-this

For at least four years I have believed that there is a 10% chance approximately all humans will be dead within something like 10 years. Four years later, I now think something more like a 13% chance within 7 or so years.

Purpose is hard. I don't think I have ever had a job that seemed worthwhile or fun enough that I would inherently want to do it forever, or even for the next 10 years. So, given all jobs are kind of eh, I might as well work one that most quickly gets me to a state where I never need to work again. Having enough money to free myself from having to work was my purpose after college. I managed it in about 4 years, and then once again I was purposeless. I briefly tried making so much money that I could support causes I cared about (both locally in my community and globally), but working just to give didn't feel purposeful, and it involved working a job that, while interesting and relatively fun, was just super stressful.

It's been about a year since the 'make oodles of money' job blew up. I have really really really struggled to find an aim in that time. I have tried working on things that I find fun, but eventually they stop being fun and then I stop working on them. I have tried really leaning into other aspects of my life (mostly community), but there's something weird about a great community where the community is my purpose?
To some extent it feels like the "easy way out" to purpose is just to have kids and lean into raising lots of kids in my wonderful community; after all, that's been most people's purpose for most of history. I've even considered pursuing the non-intellectual things that I love (mostly climbing and dancing), but I think I like solving problems too much to lean that far into my physical body as purpose.

Then, someone I think extremely highly of asked me "If you think there's something that has a high chance of killing all humans, why the fuck aren't you doing anything about it?!" (or something along those lines; I'm terrible at remembering the exact words someone has said). I didn't really have a good answer. To some extent I think it's because it has seemed intractable: solving the problem of AI potentially killing everyone has always seemed like an extremely technical problem. I would certainly consider myself both technical and a nerd, but isn't aligning AI for the super duper turbo nerds? (said affectionately)

This person seemed to think there was plenty of potential positive impact regardless of whether or not I quite make it into that bucket of turbo nerd. And the more I thought about it, the more I realized it's crazy that I haven't even tried working on what I consider the world's most important open problem. My guess is that I won't be at the highest echelon of AI safety research; I've certainly seen that next level of quant genius, and I strongly suspect that's not me. But that doesn't mean there's no positive impact for me to find. It seems like there's a whole range of ways to help humanity not get wrecked by AI, and maybe working towards that will feel purposeful. I certainly hope so. If not, all I've got left is having kids.

Also, I'd really like to not die in the next 7 years! All of humanity not dying is, I suppose, also pretty important. As long as I can continue living a joyous life in all my other ways, which so far it seems like I'm pretty good at.

So, what am I doing? It seems like the kind of impact/job/etc I could have really depends on my technical skill: how deep I can get into the weeds to really understand these models. I am currently about a month into self studying; by the time you read this, hopefully I will be even further. I have had an AI whip up a curriculum and a tracker that I update each time I finish something, which you can find here (https://www.sladebyrd.com/learning-ai-safety). I've waited on writing this article until I felt like I had actually made real progress. It feels a little presumptuous to write out a whole "who I am, what I am doing" before making any progress. And progress I have made! I have just written and trained my first super small GPT mostly by myself (I had AI assistance in both learning and bug fixing, but I wrote every line)! I understand transformers and multi-layer perceptrons, which as far as I can tell are the core basic foundation. What comes next?

Part of why I am writing this, and part of why I made the tracker public, is to keep me honest. Learning is hard. 8 hours of learning one topic every day is hard, but so far it's been going well! I am planning on continuing the journey to see just how technical I can get; in a couple of months, once I have a better idea, it will be time to start thinking through what to actually do. If you have thoughts you think would be helpful for me to hear, I would love to hear them!

---

### Follow my Progress!
Published: February 5, 2026 Original: https://takeoff.sladebyrd.com/p/track-my-learning-status

Before I try to figure out what I can do in the world of AI Safety, I've got to figure out how technical I can get. How technically competent I am has a tremendous impact on what sorts of roles I would excel at. So, I've (AI has) created 3 months of curriculum to go from basically no technical understanding to as far as I can get! I am tracking it publicly and updating it every time I do something.

(Screenshot: https://substackcdn.com/image/fetch/$s_!wPF9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa032ad20-3263-4d04-8029-e7f45ecc02ec_851x229.png)

View my AI Safety Learning Tracker (https://www.sladebyrd.com/learning-ai-safety)