Filetype: The Best Advanced Search Operator for OSINT Investigations?


Finding data online is one of the fundamental skills necessary to become a great OSINT analyst. However, data gathering and information retrieval can be challenging in the vastness of the World Wide Web.

One of the most effective ways to start searching for relevant data for your OSINT investigation is by using search engines. However, merely inputting keywords into the search bar might not provide decent results. You will need more than that to make meaningful progress. 

Search engines such as Google, Yandex, and Bing allow users to refine their searches through various search engine operators. This technique is often referred to as “Google dorking” or “Google hacking”, although it can also be applied in other search engines like Yandex and Bing.

While I would advise both open source intelligence analysts and enthusiasts to familiarise themselves with all available advanced search operators, a select few are certainly more useful than others when conducting an OSINT investigation. In this blog entry, I will mention various search engine operators, but with a clear focus on the “filetype” operator.


Advanced Search Operator: filetype 

The filetype operator is my absolute favourite among advanced search engine operators. If I ever get arrested for having documents on my computer containing information I should not have access to, you can blame it on this operator and my lack of self-control. (I swear I delete all the juicy stuff as soon as I realise what I stumbled upon though.)

With that in mind, please be aware that it is remarkably easy to find data that may be illegal to have in your possession, depending on the laws of your country of residence.

Proceed and test this (amazing!) search engine operator at your own risk. All examples provided in this blog entry are “safe” and should not lead to any legal trouble (don’t quote me on this).


How does the filetype operator work?

Search engines employ automated web crawlers to locate and index online content on the clear web. These crawlers visit websites, follow links, and analyse the content of web pages. When they come across documents of various file types, their indexing system processes the content and extracts the text and metadata. This information is then added to the search engine’s database, associating it with the relevant webpage and file type. When the “filetype:” operator is used, the search engine filters the results, displaying only those matching the specified file extension.
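
To make the filtering step concrete, here is a minimal Python sketch of a toy index. It is only an illustration under my own assumptions: the URLs and entries are made up, and real search engines identify and rank files in far more sophisticated ways than reading the extension from the URL.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical index entries (made-up URLs, for illustration only).
TOY_INDEX = [
    {"url": "https://example.org/reports/budget-2020.csv", "text": "expenses financial 2020"},
    {"url": "https://example.org/maps/world.pdf", "text": "world map"},
    {"url": "https://example.org/about.html", "text": "about us world map"},
]


def extension_of(url):
    """Return the lowercase file extension of a URL path, without the dot."""
    return PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()


def filter_by_filetype(index, keywords, filetype):
    """Keep entries whose text contains every keyword and whose URL has the extension."""
    wanted = filetype.lower()
    return [
        entry for entry in index
        if extension_of(entry["url"]) == wanted
        and all(word.lower() in entry["text"].lower() for word in keywords)
    ]


if __name__ == "__main__":
    # Roughly what "filetype:pdf world map" would do against the toy index.
    for hit in filter_by_filetype(TOY_INDEX, ["world", "map"], "pdf"):
        print(hit["url"])
```

Both the PDF and the HTML page mention “world map”, but only the PDF survives the filter, which is the whole point of the operator.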

The filetype operator can be used in conjunction with any other advanced search operator, such as “site:”, “inurl:”, “allintext:”, and more. In fact, I seldom use it on its own. The more targeted your search query, the better your chances of finding what you are looking for.

For instance, a search for “site:.gov.* filetype:pdf world map” (without the quotation marks) will yield millions of results containing world maps in PDF format hosted on .gov.* domains, which are typically associated with government websites. However, PDFs are just one of the dozens of file types indexed by search engines and available to you.
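
If you find yourself typing the same combinations over and over, a small helper can assemble them for you. The sketch below is my own illustration rather than anything official: the function names are hypothetical, and the only external detail it relies on is Google's ordinary search address with the query passed in the q parameter.

```python
from urllib.parse import urlencode


def build_query(keywords, **operators):
    """Join operator:value pairs and free-text keywords into one search string."""
    parts = [f"{name}:{value}" for name, value in operators.items()]
    parts.extend(keywords)
    return " ".join(parts)


def google_search_url(query):
    """URL-encode the query so it can be opened directly in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    query = build_query(["world", "map"], site=".gov.*", filetype="pdf")
    print(query)  # site:.gov.* filetype:pdf world map
    print(google_search_url(query))
```

Keeping the operators as keyword arguments makes it easy to swap “pdf” for another extension, or to add “inurl” to the mix, without retyping the rest of the query.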


What types of files do search engines index?

There is some variation between Google, Yandex, and Bing regarding the type of file formats they index, and consequently, what you can specify when using the “filetype:” command.
As of October 2023, Google is the search engine with the widest range of indexable file extensions, with several new ones added as recently as August 2023. Yandex and Bing still maintain a generous list of available extensions but omit certain types that are very valuable for OSINT analysts and investigators.

To date, I have not encountered file types indexed by Yandex or Bing that were not already catalogued by Google. Therefore, the list below represents the most comprehensive indexable file types by search engines as of October 2023.


Which are the most useful file types for an OSINT investigation?

Now that we know the list of available indexed file types, which ones should we prioritise when conducting an OSINT investigation? Ultimately, it depends on who or what you are investigating. Overall, there are a few categories that I tend to focus on the most:

  • Databases: csv, xls, xlsx, and ods file extensions
  • Written documents: pdf, doc, docx, and odt file extensions
  • Maps: kml and kmz file extensions
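
When I do not care which extension within a category I get, I group them with OR. As a rough sketch (the dictionary and function names are my own, and the extensions are simply the ones listed above), this can be generated rather than typed by hand:

```python
# Categories and extensions taken from the list above.
CATEGORIES = {
    "databases": ["csv", "xls", "xlsx", "ods"],
    "documents": ["pdf", "doc", "docx", "odt"],
    "maps": ["kml", "kmz"],
}


def filetype_group(category):
    """Return a parenthesised OR group covering every extension in a category."""
    clauses = [f"filetype:{ext}" for ext in CATEGORIES[category]]
    return "(" + " OR ".join(clauses) + ")"


if __name__ == "__main__":
    print(filetype_group("maps") + " buoy polar research")
    # (filetype:kml OR filetype:kmz) buoy polar research
```

The resulting string can be pasted straight into the search bar, followed by whatever keywords or other operators the investigation calls for.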

Databases:

When it comes to databases, extensions such as .csv, .xls, .xlsx, and .ods are very valuable. Targeting these types of files will lead you to lists of users with personally identifiable information, financial data, inventory records, budget information, confidential business data, and more.

For my example below, I aimed to find a comma-separated values (csv) document displaying the financial data of a UK hospital that is part of an NHS trust, containing information collected during the COVID outbreak in 2020.

In the Google search bar, I entered:
allintext:expenses financial 2020 filetype:csv site:nhs.uk

This search string ensured that my results would display information containing the keywords “expenses”, “financial”, and “2020”. The file type would be a csv database, and the document would be hosted within the nhs.uk second-level domain. As a result, I discovered over a thousand databases matching my criteria.

A sample of the contents of the very first entry on Google’s results list, titled “Trust transaction data October 2020” is displayed below. The file contained transaction records associated with the Department of Health and the Brighton and Sussex University Hospitals NHS Trust (BSUH) for the period between October 1st and 31st, 2020. Each entry provided various details including the date, expense area, supplier, and transactional amount, offering a comprehensive overview of financial activities during that period. 
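
Once a file like this is downloaded, a few lines of Python are enough to get a first overview. The sketch below is hedged accordingly: the rows are made up, and the column names (Date, Expense Area, Supplier, Amount) are my assumptions based on the details described above, not the real file's headers, so adjust them to whatever the actual CSV uses.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Made-up sample rows; the column names are assumptions, not the real file's headers.
SAMPLE_CSV = """Date,Expense Area,Supplier,Amount
01/10/2020,Pharmacy,Supplier A,"1,250.00"
15/10/2020,Estates,Supplier B,300.50
31/10/2020,Pharmacy,Supplier A,749.50
"""


def total_by_supplier(csv_file):
    """Sum the Amount column per supplier from an open CSV file object."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(csv_file):
        # Strip thousands separators before converting to an exact decimal.
        amount = Decimal(row["Amount"].replace(",", ""))
        totals[row["Supplier"]] += amount
    return dict(totals)


if __name__ == "__main__":
    totals = total_by_supplier(io.StringIO(SAMPLE_CSV))
    for supplier, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        print(f"{supplier}: {total}")
```

For a real download, replace the in-memory sample with open("your-file.csv", newline="", encoding="utf-8"), where the file name is a placeholder. I use Decimal rather than float so that the totals of monetary amounts stay exact.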


Documents:

With file type extensions that include written documents such as .pdf, .doc, .docx, and .odt, you can easily find legal documentation, government reports, books, manuals and instructions, transcripts, business contracts, etc.

For my example below, I aimed to find documents about OSINT in pdf format, published by the US government. With that in mind, I searched for indexed documents containing the word “osint” in the URL, hosted on the .gov domain associated with the US government, and in pdf format.

My search string was:
inurl:osint site:.gov filetype:pdf

The top three results were pdf files about OSINT hosted by the Health Sector Cybersecurity Coordination Center (HC3), the Department of Homeland Security (DHS), and the Central Intelligence Agency (CIA).


Maps:

Finally, as I love geolocation and looking at interesting details on maps, I am always tempted to search for .kml (Keyhole Markup Language) and .kmz (Keyhole Markup Zip) files. Both can be opened in Google Earth Pro, which is one of the best OSINT tools out there (and is free!). When you search for files with these extensions, you can uncover a wide array of geospatial data, including coordinates of specific buildings, landmarks, routes, military bases, research stations, and more. I often find coordinates of things I did not even know existed.

For my example below, I aimed to locate the coordinates of drift buoys in the polar regions of our planet. These specialised instruments are designed to float on the ocean’s surface and drift along with the ocean currents. They come equipped with sensors and communication systems that gather and transmit meteorological data. As you may imagine, they are too small to be seen in satellite imagery and are therefore virtually impossible to geolocate in the vast expanses of our oceans. I figured that research projects that collect and manage this type of data would keep tabs on the buoys’ positions over time.

I targeted my search as such:
(filetype:kml OR filetype:kmz) AND (buoy polar research)

I did not have a preference for either kml or kmz file types, so I instructed Google to check both, provided that they included the terms “buoy”, “polar”, and “research” in the indexed data. With this search string, I uncovered a wealth of remarkable projects that offered detailed information about their data and the specific locations of drift buoys in both the Arctic and Antarctic regions. Below you can see the screenshot of one of the kmz files I found, which contained data about the research project, including the location of buoys in the Arctic region, spanning from 2017 to 2021.
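
Google Earth Pro is the comfortable way to explore these files, but the coordinates can also be pulled out with a short script. The following is a minimal sketch under a few assumptions of mine: the file uses the standard KML 2.2 namespace, the points of interest are stored as Placemark elements with a Point, and a kmz file is simply a zip archive with a .kml file inside. The sample placemark is made up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumes the standard KML 2.2 namespace; older files may declare a different one.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_kml_bytes(data):
    """Return raw KML bytes, unpacking the first .kml entry if the data is a KMZ archive."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            kml_names = [n for n in archive.namelist() if n.lower().endswith(".kml")]
            if not kml_names:
                raise ValueError("KMZ archive contains no .kml file")
            return archive.read(kml_names[0])
    return data


def placemark_points(data):
    """Yield (name, latitude, longitude) for every Placemark that has a Point."""
    root = ET.fromstring(read_kml_bytes(data))
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext(".//kml:Point/kml:coordinates", namespaces=KML_NS)
        if coords:
            lon, lat = coords.strip().split(",")[:2]  # KML stores longitude first
            yield name, float(lat), float(lon)


if __name__ == "__main__":
    # Made-up placemark, for illustration only.
    sample = (
        b'<kml xmlns="http://www.opengis.net/kml/2.2"><Document><Placemark>'
        b"<name>Buoy 1</name><Point><coordinates>-150.5,82.25,0</coordinates></Point>"
        b"</Placemark></Document></kml>"
    )
    for name, lat, lon in placemark_points(sample):
        print(name, lat, lon)
```

To run it on a downloaded file, read the file in binary mode and pass the bytes to placemark_points. Note that KML lists longitude before latitude, which is easy to get backwards when copying coordinates into other tools.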


Conclusion

One of the first steps in an OSINT investigation is to find and gather relevant data. Within the vastness of the internet, this task can prove more challenging than expected. A good OSINT analyst should master the various advanced search operators available in popular search engines like Google, Yandex, and Bing. These operators empower users to filter through indexed content, thereby increasing the likelihood of discovering and acquiring valuable data, often hiding in plain sight.

The “filetype:” operator is, in my opinion, one of the most valuable advanced search operators currently available, and definitely the one I gravitate towards the most. There is a sea of extremely interesting (and possibly borderline illegal) data that can easily be found using this operator, particularly when combined with other search engine operators.

A skilled OSINT analyst will need to know how to use advanced search operators, and understand the importance of handling the information responsibly, considering the legal framework and ethics.
If you are unsure, delete the files, and forget what you saw.

Thank you for reading, and happy hunting!
~ Sofia.
