1. Background
2. Business Objective
3. Solution Overview
4. Data for Text Analysis
5. Creating a Corpus
6. Creating Tokens from the Corpus
7. Stemming
8. Removing Punctuation and Numbers
9. Document-Feature Matrix
10. Trimming
11. Word Frequency
12. Feature Extraction
13. Word Cloud Visualization
14. Frequency Statistics
15. Business Impact


Background

The Department of Justice (DOJ) Office of the Inspector General (OIG) reviewed the FBI’s Crossfire Hurricane investigation, which examined whether individuals associated with the Donald J. Trump for President Campaign were coordinating with the Russian government’s efforts to interfere in the 2016 U.S. presidential election. The OIG review covered the decision to open the investigation, the FBI’s relationship with Christopher Steele, the four FISA surveillance applications, and whether these actions complied with Department and FBI policies.

Business Objective

The objective of this use case is a data-driven analysis and visualization of the OIG’s Crossfire Hurricane report that makes the document easier to understand and supports transparency and accountability. Text analysis and a word cloud are used to identify key themes, terms, and patterns across the report. The results summarize a very long document, highlight its most frequent terms, and give stakeholders a digestible overview.

Solution Overview

The solution uses an R script to process, analyze, and visualize the content of the OIG’s report. The process includes:

Text Extraction and Preprocessing: The text of the report is extracted from the PDF file with pdftools. The text is then preprocessed, including stemming and stopword removal, to refine it for analysis.

Tokenization and Frequency Analysis: The text is tokenized and a document-feature matrix is created. The matrix supports the analysis of term frequencies and identifies the most commonly used words and phrases in the report.

Data Visualization with Word Cloud: A word cloud is generated with the wordcloud2 package. It conveys the most prominent terms in the report at a glance, sizing each term by its frequency. Color and size can be customized for clarity.

Frequency Statistics Analysis: In addition to the word cloud, the script reports detailed term-frequency statistics, giving deeper insight into the specific terminology used in the report.
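
The steps above can be sketched end to end on a tiny example. This is a minimal illustration rather than the script used for the report: the `pages` vector is a hypothetical stand-in for the per-page text returned by pdf_text(), and the trimming threshold (`min_termfreq = 2`) is an assumption chosen only so the toy data produces a small result.

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical stand-in for the character vector returned by pdf_text()
pages <- c(
  "The FBI opened the investigation on July 31, 2016.",
  "The FBI investigation relied on FISA applications.",
  "The investigation reviewed four FISA applications in 2016 and 2017."
)

# Corpus: one document per page
corp <- corpus(pages)

# Tokens: drop punctuation and numbers, lowercase, remove stopwords, stem
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)

# Document-feature matrix, trimmed to terms that occur at least twice (assumed threshold)
dfmat <- dfm(toks)
dfmat <- dfm_trim(dfmat, min_termfreq = 2)

# Frequency statistics, sorted from most to least frequent
freq <- textstat_frequency(dfmat)
print(freq)

# Optional word cloud, drawn only if wordcloud2 is installed
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  cloud_data <- data.frame(word = freq$feature, freq = freq$frequency)
  cloud <- wordcloud2::wordcloud2(cloud_data)
}
```

On the real report, `pages` would be replaced by the `pdf` object created below, and a higher trimming threshold would likely be needed to keep the word cloud readable.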

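# Load required libraries
library(pdftools)           # pdf_text() for extracting text from the PDF
library(quanteda)           # corpus, tokens, and document-feature matrix tools
library(quanteda.textstats) # textstat_frequency() for frequency statistics
library(wordcloud2)         # word cloud visualization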

Data for Text Analysis

View PDF Data:
# Data source: the public OIG report; pdf_text() returns one character string per page
pdf <- pdf_text("https://www.justice.gov/storage/120919-examination.pdf")
## [1] "                      REDACTED FOR PUBLIC RELEASE\n\n\n\n\n                            Office of the Inspector General\n                                U.S. Department of Justice\n                              OVERSIGHT   INTEGRITY   GUIDANCE\n\n\n\n\n    Review of Four FISA Applications and\n     Other Aspects of the FBI's Crossfire\n           Hurricane Investigation\n\n\n\n\nOversight and Review Division 20-012           December 2019 (Revised)\n\n\n\n                      REDACTED FOR PUBLIC RELEASE\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ## [2] "                                          NOTICE\n\n\nThis report was originally issued on December 9, 2019. The report was updated on\nDecember 11 and December 20, 2019, with the following changes (page references are to\nthe public version of the report):\n\n   <U+F0B7>   On pages iv, xvi, 400, and 407, we changed the phrase “before and after” to “both\n       during and after the time.” In all instances, the phrase appears in connection to the\n       time period during which we found that the Crossfire Hurricane team used\n       Confidential Human Sources (CHSs) to interact and consensually record\n       conversations with Page and Papadopoulos. 
The corrected information appearing in\n       this updated report reflects the accurate information concerning these time periods\n       that previously appeared, and still appears, on pages 305 and 313 (e.g., the\n       statement on page 305 that “the Crossfire Hurricane team tasked CHSs to interact\n       with Page and Papadopoulos both during the time Page and Papadopoulos were\n       advisors to the Trump campaign, and after Page and Papadopoulos were no longer\n       affiliated with the Trump campaign”).\n\n   <U+F0B7>   On pages ix, 164, 165, 214, and 364 we removed redactions of certain information\n       related to Person 1. We also removed redactions throughout the report related to\n       the dates the Carter Page FISA applications were filed and the dates FISA authority\n       expired for each application. These changes to previously-redacted text were made\n       in response to subsequent decisions made by the Department of Justice and the FBI\n       about the classification of the underlying information. See page 14, footnote 24.\n\n   <U+F0B7>   On pages xi, 242, 368, and 370, we changed the phrase “had no discussion” to “did\n       not recall any discussion or mention.” On page 242, we also changed the phrase\n       “made no mention at all of” to “did not recall any discussion or mention of.” On page\n       370, we also changed the word “assertion” to “statement,” and the words “and\n       Person 1 had no discussion at all regarding WikiLeaks directly contradicted” to “did\n       not recall any discussion or mention of WikiLeaks during the telephone call was\n       inconsistent with.” In all instances, this phrase appears in connection with\n       statements that Steele’s Primary Sub-source made to the FBI during a January 2017\n       interview about information he provided to Steele that appeared in Steele’s election\n       reports. 
The corrected information appearing in this updated report reflects the\n       accurate characterization of the Primary Sub-source’s account to the FBI that\n       previously appeared, and still appears, on page 191, stating that “[the Primary Sub-\n       Source] did not recall any discussion or mention of Wiki[L]eaks.”\n\n   <U+F0B7>   On page 57, we added the specific provision of the United States Code where the\n       Foreign Agents Registration Act (FARA) is codified, and revised a footnote in order to\n       reference prior OIG work examining the Department’s enforcement and\n       administration of FARA.\n\n   <U+F0B7>   On page 413, we changed the word, “three” to “second and third.” The corrected\n       information appearing in this updated report reflects the accurate description of the\n       Carter Page FISA applications that did not contain the information the FBI obtained\n       from Steele’s Primary Sub-source in January 2017 that raised significant questions\n       about the reliability of the Steele reporting. 
This information previously appeared,\n       and still appears, accurately on pages xi, xiii, 368, and 372.\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [4] "                                    Executive S u m ma ry\n                                    Revie w of Four FISA Applications and Other Aspects of the FBI's Crossfire\n                                    Hurricane In vestigation\n\n\n\n\nBackg ro u nd                                                                    OIG Methodo logy\n\n            The Depa rtment of J u stice ( Depa rtm e n t ) Offi ce                            T h e O I G exa m i ned m o re t h a n o n e m i l l i o n\nof the I n spector Gen e ra l ( OIG) u n d e rtoo k th is revi ew to             d o c u m ents t h a t w e re i n the D e p a rt m e nt's a n d F B I 's\nexa m i n e certa i n act i o n s by the Federa l B u re a u of                  possession a n d co n d u cted ove r 1 70 i nte rv i ews i nvolv i n g\nInvestigati o n ( F B I ) a nd the De pa rt m e n t d u ri n g a n FBI           m o re t h a n 1 0 0 w itnesses . T h ese w i t n esses i n c l u ded\ni nvesti gati o n o pened on J u l y 3 1 , 20 1 6, k n own as                    fo r m e r FBI D i rect o r Co rn e y , fo rm e r Atto rn ey G e n e ra l\n\"Crossfi re H u rrica n e, \" i nto whether i n d iv i d u a l s                  ( A G ) Lo retta Lyn c h , fo rm e r Deputy Attorney G e n e ra l\nassoci ated with t h e D o n a l d J . Tru m p fo r Presid e nt                  ( DA G ) S a l l y Yates, fo r m e r D A G R o d Rosenste i n, fo rm e r\nCa m pa i g n were coo rd i nati n g , witti ng l y o r u nwitti n g ly,         Acti n g A G a nd Act i n g DAG a n d cu rre nt F B I G e n e ra l\nwith the Russ i a n govern m e nt's efforts to i nte rfe re in the               Cou n s e l D a n a Boe n te , fo rm e r F B I D e p uty D i rector\n20 1 6 U . S . presi d e nti a l elect i o n . 
O u r rev i e w i n c l u d ed    A n d rew M cCa be, fo rm e r FBI Genera l Cou n s e l J a m e s\nexa m i n i ng :                                                                 B a k e r, a nd D e p a rt m e n t a tto r n ey B ruce O h r a n d h i s\n                                                                                 w ife . T h e O I G a l so i n te rv i ewed Ch risto p h e r Stee l e a nd\n  •    The decision to o pen Crossfi re H u rri ca n e a nd fo u r               cu rre nt a n d form e r e m p l oyees of oth e r U . S .\n       i n d iv i d u a l cases on cu rrent a n d fo rm e r m e m bers           gove rn ment agenc i e s . Two w i t n ess e s , G l e n n Si m pson\n       of the Tru m p ca m pa i g n , G e o rg e Pa pado pou los,                a nd J on a t h a n W i n e r ( a fo r m e r D e p a rt m e n t of State\n       Carter Pag e , Pa u l Ma n afort, a nd M i chael Flyn n ;                 offici a l ) , d ecl i n ed o u r req u ests for vo l u nta ry i nte rv i e w s ,\n       the ea rl y i n vesti g ative steps taken ; a n d whet h e r              a n d we w e re u n a b l e to co m pe l t h e i r testi m o n y .\n       t h e open i n g s a n d ea rly steps com p l i e d w i t h\n       Departm ent a n d F B I pol icies ;                                                       We were g iven b ro a d a ccess t o rel e v a nt\n                                                                                 m a te r i a ls by t h e D e p a rt m e n t a n d the FBI . I n a d d iti o n ,\n  •     The FBI's re l ati onsh i p w i t h Ch risto p h e r Ste e l e ,\n                                                                                 we rev i ewed re lev a n t i nform ati o n t h a t oth e r U . 
S .\n        w h o m t h e F B I co n s i d e red t o b e a co nfidential\n                                                                                 govern m ent a g e nci es p rovi d e d t h e FBI i n t h e co u rse o f\n        human sou rce (CH S ) ; its re ce i pt, use, a nd\n                                                                                 t h e Crossfi re H u rr i c a n e i n vesti g a t i o n . H oweve r,\n        eva l u ati o n of e l ect i o n reports fro m Stee l e ; a n d its\n                                                                                 bec a u se the activities of oth e r a g e n ci e s a re o utsi d e o u r\n        deci s i o n to close Steele as a n FBI CH S ;\n                                                                                 j u r i s d i ctio n , we d i d n o t s e e k t o o bta i n records from\n  •     Fou r FBI a p p l i cati o n s fi l e d with the Foreig n                t h e m t h a t the FBI n e v e r rece i v e d o r revi ewed , exce pt\n        Inte l l i ge nce S u rve i l la nce Co u rt ( FISC) i n 20 1 6 a n d    fo r a l i m ited a m o u nt of State D e p a rt m e n t reco rd s\n        20 1 7 to co n d u ct Foreig n I ntel l i g e nce S u rv e i l l a nce   relati ng t o Steel e ; w e a l so d i d n ot s e e k t o a s sess a n y\n        Act ( FISA) s u rve i l l a nce targ eti ng Ca rte r P a g e ; a n d     act i o n s o t h e r a g e nc i e s m ay have taken . 
A d d i ti o n a l l y ,\n        w heth e r these a p p l icati o n s co m p l ied with                   o u r rev i e w d i d n ot i n d ep e n d entl y s e e k to dete rm i ne\n        Depa rt m e n t a n d FBI po l i cies a n d satisfi ed the               wheth e r co rro borati o n e x i sted for the Ste e l e election\n        gove rn ment's o b l i gati o n s to the FISC ;                          re p o rt i n g ; rat h e r, ou r revi ew was focused on\n                                                                                 i nfo r m a t i o n that w a s ava i l a b l e to the FBI co n c e r n i n g\n  •     The i nte racti o n s of D e pa rt m e n t a ttorney Bru ce              Stee l e ' s re ports p r i o r t o a n d d u ri n g t h e p e n d e n cy of t h e\n        Oh r with Steele, t h e FBI, G l e n n S i m ps o n of F u s i o n       Ca rte r P a g e F I S A a uth o rity .\n        G PS, a nd t h e State D e p a rtm e n t ; w h et h e r w o rk\n        O h r 's s pouse pe rfo rmed fo r Fu s i o n G P S i m pl icated                        O u r ro l e i n th i s rev i ew was not to seco n d -g u ess\n        eth ica l ru les a p p l ica b l e to O h r ; and O h r's                d i scretio n a ry j u d g ments by D e p a rt m e n t perso n n e l\n        i nteract i o n s with Depa rtm e n t attorneys rega rd i n g            a b o u t w h et h e r to o p e n a n i n vesti g a t i o n , or s p ecific\n        the M a nafort cri m i na l case ; a n d                                 j u d g ment ca l ls m a d e d u ri n g the cou rse of a n\n  •     The F B I 's use of U ndercov e r E m p l oyees ( U CEs)                 i nvesti gati o n , w h e re those d ecis i o n s co m p l i ed w i t h o r\n        and CHSs oth e r tha n Steele in th e Crossfi re                         w e re a u t h o rized by Depa rtm ent ru l es, po l i cies, o r\n        H u rrica ne i n vestigation ; w h e t h e r the FBI p l a ced           p roced u re s . 
We d o n o t criti c i z e pa rti cu l a r d e c i s i o n s\n        a n y C H S s with i n the Tru m p ca m pa i g n o r tasked              m e re l y beca u s e we m i g ht h a v e reco m m e n d ed a\n        a n y C H S s to re po rt on the Tru m p ca m pa i g n ;                 d i ffe re nt i nvest i g at i v e strateg y o r ta ct i c b a sed o n the\n        whethe r the use o f C H S s a n d UCEs co m p l ied with                fa cts l e a rned d u r i n g o u r i n vesti g a ti o n . T h e q u esti o n we\n        D e pa rt m e n t and FBI po l i cies ; and the atte nd a n ce           consid e red was not wheth e r a p a rt i c u l a r i n vesti g a tive\n        of a Crossfi re H u rrica ne s u pe rv i s o ry agent at                 d e ci s i o n was i d e a l or cou l d have been h a n d l e d m o re\n        cou nteri ntel l i g ence b ri efi ng s g i v e n to t h e 20 1 6        effectively, b u t rat h e r wh et h e r the D e p a rt m ent a n d the\n        p res i d e nti a l ca nd i d ates a n d certa i n ca m pa i g n         FBI com pl i ed with a p p l i c a b l e l e g a l req u i rem e n ts,\n        advisors.                                                                p o l i c i es, and p roce d u res in ta k i n g the act i o n s we\n                                                                                 rev i e w e d o r, a lte rnatively, w heth e r the ci rcu msta n ces\n                                                                                 su rrou nd i n g the d e c i s i o n i n d i ca ted that it w a s ba sed on\n"
## [5] "                          Executive Summary\n                          Review of Four FISA Applications and Other Aspects of the FBI's Crossfire\n                          Hurricane Investigation\n\n\n\n\ninaccurate or incomplete information, or considerations           analysis, the Crossfire Hurricane team opened individual\nother than the merits of the investigation. If the                cases in Aug ust 2016 on four U.S. persons-\nexplanations we were given for a particular decision              Papadopoulos, Carter Page, Paul Manafort, and Michael\nwere consistent w ith legal requirements, policies,               Flynn-all of whom were affiliated with the Trump\nprocedures, and not unreasonable, we did not conclude             campaign at the time the cases were opened.\nthat the decision was based on improper considerations\nin the absence of documentary or testimonial evidence                      As detailed in Chapter Two, the Attorney\nto the contrary.                                                  General's Guidelines for Domestic Operations (AG\n                                                                  Guidelines) and the FBI's Domestic Investigations\nThe Opening of Crossfire Hurricane and                            Operations Guide (DIOG) both require that FBI\nFour Related Investigations, and Early                            investigations be undertaken for an \"authorized\n                                                                  purpose\"-that is, \"to detect, obtain information about,\nInvestigative Steps                                               or prevent or protect against federal crimes or threats\n                                                                  to the national security or to collect fo reign\nThe Opening of Crossfire Hurricane and Four Individual            intelligence. 
\" Additionally, both the AG Guidelines and\nCases                                                             t he DIOG permit the FBI to conduct an investigation,\n                                                                  even if it might impact First Amendment or other\n          As we describe in Chapter Three, the FBI                constitut ionally protected activity, so long as there is\nopened Crossfire Hurricane on Ju ly 31, 2016, just days           some legitimate law enforcement purpose associated\nafter its receipt of information from a Friendly Foreign          with the investigation.\nGovernment (FFG) reporting that, in May 2016, during\na meeting with the FFG, then Trump campaign foreign                        In addition to requiring an authorized purpose,\npolicy advisor George Papadopoulos \"suggested the                 FBI investigations must have adequate factual\nTrump team had received some kind of suggestion from              predication before being initiated. The predication\nRussia that it could assist this process with the                 requirement is not a legal requirement but rat her a\nanonymous re lease of information during the campaign             prudential one imposed by Department and FBI policy.\nthat woul d be damaging to Mrs. Clinton (and President            The DIOG provides for two types of investigations,\nObama).\" The FBI Electronic Commu nication (EC)                   Preliminary Investigations and Full Investigations. A\nopening the Crossfire Hurricane investigation stated              Preliminary I nvestigation may be opened based upon\nthat, based on the FFG information, \"this investigation           \"any allegation or information\" indicative of possible\nis being opened to determine whether individual (s)               criminal activity or t hreats to the national security. 
A\nassociated with the Trump campa ign are witting of                Full Investigation may be opened based upon an\nand/or coordinating activities w ith the Government of            \"articulable factual basis\" that \"reasonably ind icates\"\nRussia.\" We did not find information in FBI or                    any one of three defined circumstances exists,\nDepartment ECs, emails, or other documents, or                    including:\nthrough witness testimony, indicating that any\ninformation other than the FFG information was relied                     An activity consti t uting a federal crime\nupon to predicate the opening of the Crossfire Hurricane                  or a threat to the national security has\ninvestigation. Alt hough not m entioned in the EC, at t he                or may have occurred, is or may be\ntime, FBI officials involved in opening the investigation                 occurring, or wil l or may occur and the\nhad reason to believe t hat Russia may have been                          investigation may obtain information\nconnected to the Wikileaks disclosures that occurred                      relating to the activit y or the\nearl ier in July 2016, and were aware of information                      involvement or role of an individual,\nregarding Russia's efforts to interfere with the 2016\n                                                                          group, or organ ization in such activity.\nU.S. elections. These officia ls, though, did not become\naware of Steele's election reporting until weeks later                     In Ful l Investigations such as Crossfire\nand we therefore determined that Steele's reports                 Hurricane, al l lawful investigat ive methods are allowed.\nplayed no role in the Crossfire Hurricane opening.                
In Preliminary Investigations, all lawful investigative\n                                                                  methods (including the use of CHSs and UCEs) are\n        The FBI assembled a Headquarters-based                    permitted except fo r mail opening, physical searches\ninvestigative team of special agents, analysts, and               req uiring a search warrant, electronic surveillance\nsupervisory special agents (referred to throughout this           req uiring a judicial order or warrant (Title III wiretap or\nreport as \"the Crossfire Hurricane team \" ) who                   a FISA order), or requests under Title VII of FISA. An\nconducted an initial analysis of links between Trump              investigation opened as a Preliminary Investigation may\ncampaign members and Russia. Based upon this                      be converted subsequently to a Full Investigation if\n\n\n                                                             ii\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [6] "                          Executive Summary\n                         Review of Four FISA Applications and Other Aspects of the FBI's Crossfire\n                         Hurricane Investigation\n\n\n\n\ninformation becomes available t hat meets the                            Add it ionally, given t he low threshold for\npredication standard. As we describe in the report, all          predication in the AG Gu idelines and the DIOG, we\nof the investigative actions taken by the Crossfire              concluded that the FFG informa t ion, provided by a\nHurricane team, from the date the case was opened on             government the United Stat es Intelligence Community\nJuly 31 until October 21 (the date of the first FISA             (USIC) deems trustworthy, and describing a first- hand\norder) would have been permitted whether the case                account from an FFG employee of a conversation with\nwas opened as a Preliminary or Full Investigation.               Papadopoulos, was sufficient to predicate the\n                                                                 investigation. This information provided the FBI with an\n         The AG Guidelines and the DIOG do not provide           articulable factua l basis that, if t rue, reasonably\nheightened predication standards for sensitive matters,          indicated activity const ituting either a federa l crime or a\nor allegations potentially impacting constitutionally            threat to national security, or both, may have occurred\nprotected activity, such as First Amendment rights.              or may be occurring. 
For similar reasons, as we detail\nRather, the approval and notification requi rements              in Chapter Three, we concluded that the quantum of\ncontained in the AG Guidelines and the DIOG are, in              information articu lated by t he FBI to open the individual\npart, intended to provide the means by which such                investigations on Papadopou los, Page, Flynn, and\nconcerns can be considered by senio r officials.                 Manafort in August 2016 was sufficient to satisfy t he\nHowever, we were concerned to find that neither the AG           low threshold established by the Department and the\nGuidelines nor the DIOG contain a provision requiring            FBI.\nDepartment consultation before opening an\ninvestigation such as the one here involving the alleged                  As part of ou r review, we also sought to\nconduct of individuals associated with a major party             determine whether there was evidence that political\npresidential campaign.                                           bias or other improper considerations affected decision\n                                                                 making in Crossfire Hurricane, including t he decision to\n         Crossfire Hurricane was opened as a Full                open the investigation. We discussed the issue of\nInvestigation and all of the senior FBI officials who            political bias in a prior OIG report, Review of Various\nparticipated in discussions about whether to open a              Actions in Advance of the 2016 Election, where we\ncase told us the information warranted opening it. For           described text and instant messages between t hen\nexample, then Counterintelligence Division (CD)                  Special Counsel to the Deputy Director Lisa Page and\nAssistant Director (AD) E.W. 
\"Bill\" Priestap, who                then Section Chief Peter Strzok, among others, that\napproved the case opening, told us that t he                     included statements of hosti lity toward then candidate\ncombination of the FFG information and the FBI 's                Trump and statements of support for then candidate\nongoing cyber intrusion investigation of the July 2016           Hillary Clinton. In this review, we found t hat, while Lisa\nhacks of the Democratic National Committee's (DNC)               Page attended some of the discussions regard ing the\nemails, created a counterintelligence concern that the           opening of the investigat ions, she did not play a role in\nFBI was \"obligated\" to investigate. Priestap stated that         the decision to open Crossfire Hurricane or the four\nhe considered whether the FBI should conduct                     individual cases. We further found t hat w hile Strzok\ndefensive briefings for the Trump campaig n but                  was directly involved in the decisions to open Crossfire\nultimately decided that providing such briefings created         Hurricane and t he four individual cases, he was not the\nthe risk that \"if someone on the campaign was engaged            sole, or even the highest-level, decision maker as to\nwith the Russians, he/she would very likely change               any of those matters. 
As noted above, then CD AD\nhis/her tactics and/or otherwise seek to cover-up                Priestap, Strzok's supervisor, was the officia l who\nhis/ her activities, thereby preventing us from finding          ultimately made the decision to open the investigation,\nt he truth.\" We did not identify any Department or FBI           and evidence reflected t hat t his decision by Priestap\npolicy that applied to this decision and therefore               was reached by consensus after multiple days of\ndetermined that the decision was a judgment call that            discussions and meetings t hat included Strzok and\nDepartment and FBI policy leaves to the discretion of            other leadership in CD, the FBI Deputy Director, the FBI\nFBI officials. We also concluded that, under the AG              General Counsel, and a FBI Deputy General Cou nsel.\nGuidelines and the DIOG, the FBI had an authorized               We concluded that Priestap's exercise of discretion in\npurpose when it opened Crossfire Hurricane to obtain             opening the investigation was in compliance with\ninformation about, or protect against, a national                Department and FBI policies, and we did not find\nsecurity threat or federa l crime, even though the               documentary or testimonial evidence that political bias\ninvestigation also had the potential to impact                   or improper motivation influenced his decision. We\nconstitutionally protected activity.                             similarly found that, while the forma l documentation\n                                                                 opening each of th e four individua l investigations was\n                                                                 approved by Strzok (as required by the DIOG), the\n\n\n                                                           iii\n"

Creating a Corpus

This code takes the text extracted from the PDF file and structures it into a corpus, with each extracted PDF page becoming one document. The corpus is the foundation for the text processing and analysis steps that follow.

# Create a corpus from the pdf text
corp <- corpus(pdf)
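As a minimal sketch of what corpus() does with a character vector, the example below uses two made-up strings in place of the real PDF pages. The names pages and corp_demo are hypothetical and exist only for this illustration.

```r
library(quanteda)

# Stand-in for the extracted PDF text: one string per page.
# These two "pages" are invented for illustration only.
pages <- c(
  "Review of Four FISA Applications.",
  "The FBI opened Crossfire Hurricane on July 31."
)

# Each element of the character vector becomes one document.
corp_demo <- corpus(pages)

ndoc(corp_demo)      # number of documents in the corpus
docnames(corp_demo)  # default document names assigned by quanteda
```

The default document names follow the same text1, text2, ... pattern that appears in the token output for the real report.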

Creating Tokens from the Corpus

This code tokenizes the corpus built from the PDF text above and displays the resulting tokens. Tokenization is a foundational step in most natural language processing (NLP) and text analysis tasks because it turns raw text into discrete units that later steps can count and filter. Note that text4 in the output contains fragments such as "S", "u", "m", "ma", and "ry": the PDF extraction split the word "Summary" on that page, and those fragments carry through the rest of the pipeline.

# Create a tokens object from the corpus
tokens <- tokens(corp)
## Tokens consisting of 478 documents.
## text1 :
##  [1] "REDACTED"   "FOR"        "PUBLIC"     "RELEASE"    "Office"    
##  [6] "of"         "the"        "Inspector"  "General"    "U.S"       
## [11] "."          "Department"
## [ ... and 33 more ]
## text2 :
##  [1] "NOTICE"     "This"       "report"     "was"        "originally"
##  [6] "issued"     "on"         "December"   "9"          ","         
## [11] "2019"       "."         
## [ ... and 608 more ]
## text3 :
## [1] "["             "PAGE"          "INTENTIONALLY" "LEFT"         
## [5] "BLANK"         "]"            
## text4 :
##  [1] "Executive"    "S"            "u"            "m"            "ma"          
##  [6] "ry"           "Revie"        "w"            "of"           "Four"        
## [11] "FISA"         "Applications"
## [ ... and 2,407 more ]
## text5 :
##  [1] "Executive"    "Summary"      "Review"       "of"           "Four"        
##  [6] "FISA"         "Applications" "and"          "Other"        "Aspects"     
## [11] "of"           "the"         
## [ ... and 869 more ]
## text6 :
##  [1] "Executive"    "Summary"      "Review"       "of"           "Four"        
##  [6] "FISA"         "Applications" "and"          "Other"        "Aspects"     
## [11] "of"           "the"         
## [ ... and 976 more ]
## [ reached max_ndoc ... 472 more documents ]
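To make the tokenization step concrete, here is a small sketch on a single sentence taken from the report's notice page. The object names are hypothetical. It also shows, as an alternative to the approach used in this walkthrough, that tokens() can drop punctuation and numbers at tokenization time through its remove_punct and remove_numbers arguments.

```r
library(quanteda)

txt <- c(text1 = "This report was originally issued on December 9, 2019.")

# Default tokenization keeps words, numbers, and punctuation as tokens.
toks_demo <- tokens(txt)
ntoken(toks_demo)          # tokens per document
as.list(toks_demo)$text1   # the tokens themselves

# Alternative: filter punctuation and numbers while tokenizing.
toks_clean <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
as.list(toks_clean)$text1
```

Filtering at tokenization time is a design choice, not a requirement; the walkthrough instead filters in a separate step later on.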


Stemming

This code applies stemming to the tokens to reduce words to their root forms (for example, "Investigation" becomes "Investig"), removes common English stopwords to reduce noise, and displays the processed tokens. Both steps shrink the vocabulary so that later frequency counts group related word forms together and are not dominated by function words such as "of" and "the".

# Apply stemming and stopword removal
tokens <- tokens_wordstem(tokens)
tokens <- tokens_remove(tokens, stopwords("en"), padding = FALSE)
## Tokens consisting of 478 documents.
## text1 :
##  [1] "REDACTED"  "PUBLIC"    "RELEASE"   "Office"    "Inspector" "Gener"    
##  [7] "U.S"       "."         "Depart"    "Justic"    "OVERSIGHT" "INTEGRITi"
## [ ... and 22 more ]
## text2 :
##  [1] "NOTICE" "report" "origin" "issu"   "Decemb" "9"      ","      "2019"  
##  [9] "."      "report" "updat"  "Decemb"
## [ ... and 396 more ]
## text3 :
## [1] "["             "PAGE"          "INTENTIONALLi" "LEFT"         
## [5] "BLANK"         "]"            
## text4 :
##  [1] "Execut"   "S"        "u"        "m"        "ma"       "ry"      
##  [7] "Revi"     "w"        "Four"     "FISA"     "Applicat" "Aspect"  
## [ ... and 1,973 more ]
## text5 :
##  [1] "Execut"   "Summari"  "Review"   "Four"     "FISA"     "Applicat"
##  [7] "Aspect"   "FBI"      "Crossfir" "Hurrican" "Investig" "inaccur" 
## [ ... and 579 more ]
## text6 :
##  [1] "Execut"   "Summari"  "Review"   "Four"     "FISA"     "Applicat"
##  [7] "Aspect"   "FBI"      "Crossfir" "Hurrican" "Investig" "inform"  
## [ ... and 616 more ]
## [ reached max_ndoc ... 472 more documents ]
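One thing worth checking is the order of the two operations. The sketch below, on a made-up sentence with hypothetical object names, compares stemming before stopword removal (the order used here) with the reverse. A stopword whose stem differs from its dictionary form, such as "because", can slip through when stemming runs first.

```r
library(quanteda)

toks_demo <- tokens("The review happened because of concerns")

# Order used in this walkthrough: stem first, then remove stopwords.
stem_then_remove <- tokens_remove(tokens_wordstem(toks_demo), stopwords("en"))

# Alternative order: remove stopwords first, then stem.
remove_then_stem <- tokens_wordstem(tokens_remove(toks_demo, stopwords("en")))

as.list(stem_then_remove)[[1]]
as.list(remove_then_stem)[[1]]
```

Whether this matters depends on the analysis; with a high frequency cutoff downstream, a few stemmed stopwords may be visible in the final word list.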

Removing Punctuation and Numbers

This code keeps only the tokens that contain at least one letter and drops the rest, which removes standalone punctuation marks and numbers. Because the regular expression \p{L}+ is not anchored, a token only needs a letter somewhere in it to be kept, so mixed tokens such as "U.S" survive, as the output shows. Dropping the non-word tokens focuses the analysis on textual content; punctuation and bare numbers rarely matter for word-frequency analysis, topic modeling, and similar tasks.

# Remove punctuation and numbers
tokens <- tokens_select(tokens, pattern = "\\p{L}+", valuetype = "regex", selection = "keep")
## Tokens consisting of 478 documents.
## text1 :
##  [1] "REDACTED"  "PUBLIC"    "RELEASE"   "Office"    "Inspector" "Gener"    
##  [7] "U.S"       "Depart"    "Justic"    "OVERSIGHT" "INTEGRITi" "GUIDANCE" 
## [ ... and 17 more ]
## text2 :
##  [1] "NOTICE" "report" "origin" "issu"   "Decemb" "report" "updat"  "Decemb"
##  [9] "Decemb" "follow" "chang"  "page"  
## [ ... and 270 more ]
## text3 :
## [1] "PAGE"          "INTENTIONALLi" "LEFT"          "BLANK"        
## text4 :
##  [1] "Execut"   "S"        "u"        "m"        "ma"       "ry"      
##  [7] "Revi"     "w"        "Four"     "FISA"     "Applicat" "Aspect"  
## [ ... and 1,844 more ]
## text5 :
##  [1] "Execut"   "Summari"  "Review"   "Four"     "FISA"     "Applicat"
##  [7] "Aspect"   "FBI"      "Crossfir" "Hurrican" "Investig" "inaccur" 
## [ ... and 468 more ]
## text6 :
##  [1] "Execut"   "Summari"  "Review"   "Four"     "FISA"     "Applicat"
##  [7] "Aspect"   "FBI"      "Crossfir" "Hurrican" "Investig" "inform"  
## [ ... and 513 more ]
## [ reached max_ndoc ... 472 more documents ]
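The difference between an unanchored and an anchored pattern can be sketched on a handful of hand-built tokens. The token list and object names below are hypothetical; the anchored pattern "^\\p{L}+$" is shown as an assumption about what a stricter filter might look like, not as part of the original pipeline.

```r
library(quanteda)

# Hand-built tokens so the example does not depend on tokenizer behavior.
toks_demo <- as.tokens(list(doc1 = c("U.S", "report", "2019", ",", "9th", "page")))

# Unanchored (as in the walkthrough): keep tokens containing any letter.
keep_any_letter <- tokens_select(toks_demo, pattern = "\\p{L}+",
                                 valuetype = "regex", selection = "keep")

# Anchored: keep only tokens made up entirely of letters.
keep_letters_only <- tokens_select(toks_demo, pattern = "^\\p{L}+$",
                                   valuetype = "regex", selection = "keep")

as.list(keep_any_letter)$doc1
as.list(keep_letters_only)$doc1
```

The stricter version would also drop abbreviations like "U.S", so the unanchored pattern may be the better fit for this report.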

Document-Feature Matrix

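This code creates a Document-Feature Matrix (DFM) from the pre-processed tokens and displays it. Each row is a document, each column is a feature (a stemmed term), and each cell holds the number of times that feature occurs in that document. The features appear in lowercase because dfm() lowercases tokens by default, and the matrix is 97.04% sparse because most terms appear in only a few documents.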

# Create a document-feature matrix
dtm <- dfm(tokens)
## Document-feature matrix of: 478 documents, 4,975 features (97.04% sparse) and 0 docvars.
##        features
## docs    redacted public release office inspector gener u.s depart justic
##   text1        2      2       2      1         1     1   1      1      1
##   text2        0      1       0      0         0     0   0      2      1
##   text3        0      0       0      0         0     0   0      0      0
##   text4        0      0       0      0         0     0   0      0      0
##   text5        0      0       0      0         0     1   2      2      0
##   text6        0      0       0      0         0     2   0      5      0
##        features
## docs    oversight
##   text1         2
##   text2         0
##   text3         0
##   text4         0
##   text5         0
##   text6         0
## [ reached max_ndoc ... 472 more documents, reached max_nfeat ... 4,965 more features ]
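A toy example makes the structure of a DFM easier to see. The tokens and object names below are invented for illustration.

```r
library(quanteda)

# Two tiny hand-built "pages" of already-stemmed tokens.
toks_demo <- as.tokens(list(
  page1 = c("fbi", "fisa", "fbi"),
  page2 = c("fisa", "applic")
))

dfm_demo <- dfm(toks_demo)

dim(dfm_demo)        # documents x features
as.matrix(dfm_demo)  # dense view of the counts (fine for small matrices)
sparsity(dfm_demo)   # share of cells that are zero
```

Converting to a dense matrix is only practical for small examples; the real DFM stays in quanteda's sparse format.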


Trimming

This code trims the DFM by term frequency and displays the result. dfm_trim() with min_termfreq = 100 keeps only the terms that occur at least 100 times across all documents and discards rarer ones; the trimmed matrix is stored back in the dtm variable. In the output, the feature count falls from 4,975 to 284 and sparsity falls from 97.04% to 72.40%.

# Trim the document-feature matrix based on term frequency
dtm <- dfm_trim(dtm, min_termfreq = 100)
## Document-feature matrix of: 478 documents, 284 features (72.40% sparse) and 0 docvars.
##        features
## docs    public office gener u.s depart review four fisa applicat fbi
##   text1      2      1     1   1      1      2    1    1        1   1
##   text2      1      0     0   0      2      0    0    3        0   4
##   text3      0      0     0   0      0      0    0    0        0   0
##   text4      0      0     0   0      0      0    1    2        1  16
##   text5      0      0     1   2      2      1    4    3        1  11
##   text6      0      0     2   0      5      4    4    2        1  16
## [ reached max_ndoc ... 472 more documents, reached max_nfeat ... 274 more features ]
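The effect of min_termfreq can be sketched on a toy DFM. The data and object names are hypothetical, and the threshold of 2 is chosen only to suit the tiny example.

```r
library(quanteda)

dfm_demo <- dfm(as.tokens(list(
  p1 = c("fbi", "fbi", "fisa"),
  p2 = c("fbi", "page")
)))

# Keep features whose total count across all documents is at least 2.
trimmed <- dfm_trim(dfm_demo, min_termfreq = 2)

featnames(dfm_demo)  # features before trimming
featnames(trimmed)   # features after trimming
```

dfm_trim() also accepts min_docfreq, which filters on the number of documents containing a term instead of its total count; which of the two fits better depends on whether terms concentrated in a few pages should be retained.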

Word Frequency

This code computes word frequency statistics from the trimmed DFM. textstat_frequency() returns a data frame with one row per term, including its total frequency, and the result is stored in the word_freq variable. In quanteda version 3 and later, textstat_frequency() is provided by the companion package quanteda.textstats, which must be loaded for this call to work.

# Compute term frequencies from the document-feature matrix
word_freq <- textstat_frequency(dtm)

Feature Extraction

This code builds the data frame that the wordcloud2 package expects. It extracts the feature (word) and frequency columns from the word_freq object and stores them in a new data frame, word_freq_df, with the columns word and freq.

# Build a word/freq data frame in the shape wordcloud2 expects
word_freq_df <- data.frame(word = word_freq$feature, freq = word_freq$frequency)
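The two steps above, computing frequencies and reshaping them, can be sketched end to end on a toy DFM. The data and object names are hypothetical, and the example assumes the quanteda.textstats package is installed.

```r
library(quanteda)
library(quanteda.textstats)  # provides textstat_frequency() in quanteda >= 3

dfm_demo <- dfm(as.tokens(list(
  p1 = c("fbi", "fbi", "fisa"),
  p2 = c("fbi", "fisa", "page")
)))

# One row per feature, sorted by descending frequency.
freq_demo <- textstat_frequency(dfm_demo)

# Keep only the two columns a word cloud needs.
freq_df <- data.frame(word = freq_demo$feature, freq = freq_demo$frequency)
freq_df
```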

Word Cloud Visualization

This code creates an interactive word cloud with the wordcloud2 package. wordcloud2() takes a data frame whose first two columns hold the words and their frequencies, which is the shape of word_freq_df. The size argument scales the font size (0.5 is half the default of 1), and color = 'random-dark' assigns each word a random dark color.

# Create a word cloud using Wordcloud2
wordcloud2(word_freq_df, size = 0.5, color = 'random-dark')
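Because wordcloud2() returns an HTML widget, one possible way to share the result outside an interactive session is to save it as an HTML file. The sketch below uses made-up frequencies and a temporary file path, both of which are assumptions for illustration; htmlwidgets::saveWidget() is used for the export.

```r
library(wordcloud2)

# Invented frequencies, for illustration only.
demo_df <- data.frame(
  word = c("fbi", "fisa", "page", "steele", "investig"),
  freq = c(50, 40, 30, 20, 10)
)

wc <- wordcloud2(demo_df, size = 0.5, color = "random-dark")

# Write the widget to a temporary HTML file. selfcontained = FALSE avoids
# the pandoc dependency but stores supporting files next to the HTML file.
out_file <- tempfile(fileext = ".html")
htmlwidgets::saveWidget(wc, out_file, selfcontained = FALSE)
```

Opening the saved file in a browser shows the same interactive cloud that the R session displays.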

Frequency Statistics

This code computes frequency statistics for the terms in the trimmed DFM with textstat_frequency() and stores them in the frequency_stats variable. It is the same call that produced word_freq earlier, so the two objects hold the same results. Each row reports a feature together with its total frequency, its rank, and its document frequency (the number of documents that contain it).

# Get frequency statistics
frequency_stats <- textstat_frequency(dtm)
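For a quick look at the most frequent terms without building the full statistics table, quanteda's own topfeatures() and docfreq() offer a lighter alternative. The sketch below uses a toy DFM with hypothetical names; it is an optional shortcut, not part of the original pipeline.

```r
library(quanteda)

dfm_demo <- dfm(as.tokens(list(
  p1 = c("fbi", "fbi", "fisa"),
  p2 = c("fbi", "fisa", "page")
)))

# Named vector of the n most frequent features, in descending order.
top_terms <- topfeatures(dfm_demo, n = 2)
top_terms

# Number of documents in which each feature appears.
doc_counts <- docfreq(dfm_demo)
doc_counts
```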

Business Impact

Enhanced Understanding: The visualization and frequency analysis will make the complex and lengthy OIG report more accessible and comprehensible to a broader audience, including department officials, policymakers, journalists, and the general public.

Efficient Information Dissemination: The summarized visual format enables quick dissemination and easier digestion of the key findings and themes of the report, which is crucial in a fast-paced information environment.

Strategic Decision-Making: By clearly identifying the most frequent topics and concerns in the investigation, policymakers and department leaders can make more informed decisions regarding policy changes, future investigations, and resource allocation.

Transparency and Accountability: This approach promotes transparency by openly presenting the main elements of the investigation and supports accountability by highlighting the key areas of focus and concern.

Research and Educational Tool: The analysis can serve as a valuable resource for researchers and educators in fields like political science, law, and public administration, providing an analytical foundation for further study and discussion.