Tell Me A Story - FILE UPDATES IN PROGRESS

17 minute read

Capstone Project: Tell Me a Story…#

Elyse Renouf July 31, 2020

Tell Me A Story (TMAS) is a content-based young children’s book recommender using NLP that is derived from a Kaggle dataset of Goodreads.com book reviews. It was created so parents could input one of their child’s favourite books and TMAS can recommend other similar books to read based entirely on key words in the book descriptions.

Please Note: This is the first excerpt of notebooks 1 through 3, each of which were used to build the final product. This portion includes all the steps I took when downloading and paring down the initial suite of eleven Goodreads.com data files from Kaggle; I focused solely on children’s books and resulting in a manageable starting point that is ready for cleaning and next steps.


Loading & Merging Datasets

Because the original dataset did not classify books by genre or age, I attempted to identify books that are specifically for children by first researching whether the unique ISBN numbers that are allocated to each book would have a genre identifier but they do not.

So instead, I decided to use page count as a rough identifier. I began by finding the average standard number of pages in children’s books (according to authors and publishers), then pulled each of them out of the original Kaggle CSV files and merged them into a single dataset, specific to kids.

Age Ranges & Book Classifications that I used for the purpose of this analysis:

  • Board Books (ages 0-2): 10 or less pages
  • Picture Books (2-3): Pages in multiples of 8 ranging from 16-32 pages in length
  • Early Readers (4-6): 32-48 pages (32+ for the purposes of this project)</ul> And so, to simplify, I decided to split my data into only two categories using these age & book classifications as a guide (this step is done in notebook 2 of 3). In initially viewing the datasets and typical books/page counts, I started by only pulling in books over 50 pages but, after exporting those files to Excel just to scroll through the titles and do a contents check, I realized that the majority of that dataset was adult books using that page count. I tried again, using 40 pages as a threshold with similar results and upon closer review, I found that, in this dataset, anything over 38 pages is extremely difficult to capture and classify as a children's book by number of pages alone. I then had to isolate publisher names as a next step, however to reduce the number of publishers required for review, I limited my inital recommender dataset to target children under 6 years of age (according to publisher page count recommendations) and at a max of 38 pages. I understand I will be losing children's books that will be longer than 38 pages and will also capture some non-children's books in this initial sweep but this was the simplest way to start, in my opinion. This notebook includes the steps I took in pulling the eleven .csv files from Kaggle, each with over 100,000 entries and then pulling out only (mostly) children's books using the above guidelines and assumptions. I then merged those into a single file and exported my finished dataset to a .csv file to be read into my EDA & Cleaning notebook for the next steps in this project. ### Reference Pages ###
    • Datasets gathered from:https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m
    • Children's Book Page Range averages and ages estimated based on info from:https://www.jennybowman.com/what-genre-is-my-childrens-book/</ul> "`python #import needed packages import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns "` #### Reading in all of the Datsets #### "`Python #Reading in the first 100k books dataset that comes with book descriptions books = pd.read_csv('data/book600k-700k.csv') # Print out the shape of the df print(f'There are {books.shape[0]} rows and {books.shape[1]} columns in the first books dataframe.') "` There are 55156 rows and 19 columns in the first books dataframe. "`Python #getting information about each column in the df dataframe books.head(3) "`
      Id Name Authors ISBN Rating PublishYear PublishMonth PublishDay Publisher RatingDist5 RatingDist4 RatingDist3 RatingDist2 RatingDist1 RatingDistTotal CountsOfReview Language pagesNumber Description
      0 600000 Lessons Learned (Great Chefs, #2) Nora Roberts 037351025X 3.74 1993 15 2 Silhouette 5:947 4:1016 3:1061 2:287 1:63 total:3374 86 eng 250 LESSONS LEARNED...<br /><br />Coordinating the...
      1 600001 Walking by Faith: Lessons Learned in the Dark Jennifer Rothschild 0633099325 4.27 2003 1 1 Lifeway Church Resources 5:367 4:246 3:109 2:22 1:5 total:749 7 NaN 112 At the age of fifteen, Jennifer Rothschild con...
      2 600003 Better Health in Africa: Experience and Lesson... World Bank Group 0821328174 5.00 1994 1 1 World Bank Publications 5:1 4:0 3:0 2:0 1:0 total:1 1 NaN 240 NaN
      "`Python # Isolating only books under 38 pages and saving into kidsbooks dataframe kidsbooks = books[books['pagesNumber']<=38] print(f'There are currently {kidsbooks.shape[0]} rows and {kidsbooks.shape[1]} columns in the kids books dataframe.') "` There are currently 3399 rows and 19 columns in the kids books dataframe. "`Python # Reading in the next dataframe and doing the same books2 = pd.read_csv('data/book700k-800k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb2 = books2[books2['pagesNumber']<=38] kb2.shape "` (3610, 20) "`Python # Reading in the next dataframe and doing the same books3 = pd.read_csv('data/book800k-900k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb3 = books3[books3['pagesNumber']<=38] kb3.shape "` (3138, 20) "`Python # Reading in the next dataframe and doing the same books4 = pd.read_csv('data/book900k-1000k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb4 = books4[books4['pagesNumber']<=38] kb4.shape "` (2496, 20) "`Python # Reading in the next dataframe and doing the same books5 = pd.read_csv('data/book1000k-1100k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb5 = books5[books5['pagesNumber']<=38] kb5.shape "` (2589, 20) "`Python # Reading in the next dataframe and doing the same books6 = pd.read_csv('data/book1100k-1200k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb6 = books6[books6['pagesNumber']<=38] kb6.shape "` (2945, 20) "`Python # Reading in the next dataframe and doing the same books7 = pd.read_csv('data/book1200k-1300k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb7 = books7[books7['pagesNumber']<=38] kb7.shape "` (3252, 20) "`Python # Reading in the next dataframe and doing the same books8 = pd.read_csv('data/book1300k-1400k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb8 = books8[books8['pagesNumber']<=38] kb8.shape "` (2779, 20) "`Python # Reading in the next dataframe and doing the same books9 = pd.read_csv('data/book1400k-1500k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb9 = books9[books9['pagesNumber']<=38] kb9.shape "` (2443, 20) "`Python # Reading in the next dataframe and doing the same books10 = pd.read_csv('data/book1500k-1600k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb10 = books10[books10['pagesNumber']<=38] kb10.shape "` (2296, 20) "`Python # Reading in the next dataframe and doing the same books11 = pd.read_csv('data/book1600k-1700k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb11 = books11[books11['pagesNumber']<=38] kb11.shape "` (2451, 20) "`Python # Reading in the next dataframe and doing the same books12 = pd.read_csv('data/book1700k-1800k.csv') # Pulling out only books under 38 pages to append to kidsbooks dataframe kb12 = books12[books12['pagesNumber']<=38] kb12.shape "` (2295, 19) #### Making a single dataset to work from: #### "`Python #appending all dataframes into the kids df kidsbooks = kidsbooks.append([kb2, kb3, kb4, kb5, kb6, kb7, kb8, kb9, kb10, kb11, kb12]) kidsbooks.shape "` (33693, 20) #### An additional page number discovery #### After looking over the contents of the dataframe of books under 38 pages, I found that there were also a number of audiobooks, teachers guides and study guides that were present. These were items that were often between 1-6 pages. To eliminate the bulk of these without accidentally eliminating board books/sensory books targeted at infants, I opted to remove any books under 4 pages and then could filter out any others by removing any publisher names with the term "audio" in the name. "`Python #then remove books with page counts less than 4 (cheat/sheets, coles notes, guides, etc.) kidsbooks = kidsbooks[kidsbooks['pagesNumber']>=4] kidsbooks.shape "` (28697, 20) "`Python #checking the datatypes kidsbooks.dtypes "` Id int64 Name object Authors object ISBN object Rating float64 PublishYear int64 PublishMonth int64 PublishDay int64 Publisher object RatingDist5 object RatingDist4 object RatingDist3 object RatingDist2 object RatingDist1 object RatingDistTotal object CountsOfReview int64 Language object pagesNumber int64 Description object Count of text reviews float64 dtype: object Looking at the available columns, and knowing that in order to make a book recommender work well, I will need to convert all my columns to numerical ones, I have opted to drop any column deemed unnecessary for my analysis and ultimately for my model. The columns I will drop include:
      • ID: these are the original ID number from the various originally scraped datasets and not relevant
      • index: this was a previous index set prior to merging the originally scraped datasets
      • RatingDist1-5 & RatingDistTotal: these columns have a ratio of ratings distribution from the Kaggle analysis that are not relevant to the purpose of my project
      • Language: this column was almost entirely blank (90%) so deemed irrelevant so some Spanish, French, and other language books may be included in this dataframe. This dataset also includes many bilingual books, as parents often want to teach their kids multiple languages so I deemed any kids books, regardless of language, relevant but may need to screen these out at a later stage if it effects the performance of my recommender
      • Count of text reviews: this column was included in the original dataset, though the actual text reviews were not. Because I am doing a content-based recommender, based on description, I decided to remove the text reviews counts to remove any confusion. </ul>
        **Note:** Count of text reviews is different from CountsofReview which is a column to show how many numerical reviews each book has been given. I used this column to remove any books that had 0 reviews, with plans in future versions to potentially build out the recommender using user reviews. Other steps I took in preparing this data for cleaning:
        • Removing any rows where CountsOfReview = 0
        • Replacing any columns where book description = blank with NaN (null value) and then removing all nulls in this column. The text recommender works entirely on this column so blanks are not ideal. If creating this recommender/ the data collection process from scratch, this would have been a mandatory field to have completed.
        • Manually identifying children's publishers and creating a df that only included books with those publisher names (all variations) in the publisher column
        • Having this list of publisher names and all of the variations in those names found throughout the dataset allowed me to identify and create and apply a standard name for each publisher
        • Exporing the this qualified dataset to .csv format in order to start with just a single csv file in my second notebook, dedicated to cleaning & exploring the data "`Python #dropping Columns ID, RatingDist5-1, and Count of text reviews (not relevant to this project) # Delete multiple columns from the dataframe kidsbooks = kidsbooks.drop(["Id", "RatingDist5", "RatingDist4", "RatingDist3", "RatingDist2", "RatingDist1", "RatingDistTotal", "Language", "Count of text reviews"], axis=1) kidsbooks.head(3) "`
          Name Authors ISBN Rating PublishYear PublishMonth PublishDay Publisher CountsOfReview pagesNumber Description
          65 Cornelius T. Mouse and Sons Christopher A. Lane 0896938441 4.0 1990 1 7 Victor 0 32 In this retelling of the New Testament parable...
          98 David Blaine Rochelle Scholar 0340900601 3.0 2006 1 6 Hodder Murray 0 27 Stimulating and accessible, the titles in the ...
          124 Russell's Secret Johanna Hurwitz 0688175740 2.7 2001 6 11 HarperCollins Publishers 8 32 Have you ever heard the words<br />""You can s...
          "`Python #removing books that have 0 reviews reviews kidsbooks = kidsbooks[kidsbooks['CountsOfReview']>=1] kidsbooks.shape "` (15654, 11) "`Python #removing books where the description column is blank kidsbooks['Description'].replace('', np.nan, inplace=True) #dropping rows that have null values in Decription column kidsbooks.dropna(subset = ["Description"], inplace=True) kidsbooks.shape "` (14412, 11) Originally, I started removing publishers that clearly didn't have kids books in the current data. I began removing any publishers with "audio" in the name to remove the many many audiobooks, religions, university, adult content, and museum publications but, a quick export to Excel to really take a closer look at my data (using kidsbooks.to_csv('kidsbooks.csv', header=True), made me realize that it would be easier to make a list of publishers to keep and then cleaning/narrowing down that list to make sure my recommender is as robust as possible. "`Python #set up the filter of the publishers to keep filter = kidsbooks['Publisher'].isin([ "A & C Black (Childrens books)", "Abbeville Kids", "Abbeville Press", "Abbey Press", "ABRAMS", "Abrams Books for Young Readers", "Accord Publishing, a division of Andrews McMeel", "Accord, a division of Andrews McMeel Publishing", "Aladdin", "Aladdin Books", "Aladdin Paperbacks", "Alaska Northwest Books", "Albert Whitman Company", "Albert Whitman & Company", "Alfaguara", "Alfaguara Infantil", "Alfred A. Knopf Books for Young Readers", "Alison Green Books", "All about Kids Publishing", "Alyson Books", "American Girl Publishing Inc", "Andersen", "Andersen Press", "ANDERSEN PRESS LTD", "Andersen Press Ltd", "Andre Deutsch", "Annick Press", "Arbordale Publishing", "Arte Publico Press", "Arthur A. Levine Books", "Atheneum Books", "Atheneum Books for Young Readers", "Atheneum/Richard Jackson Books", "August House Publishers", "Award Publications Limited", "B.E.S.", "B.E.S. Publishing", "Baby Piggy Toes", "Baker Books", "Bank Street", "Bantam Books for Young Readers", "Barefoot Books", "Barney Publishing", "Barron's Educational Series", "Bear Cub Books", "Big Tent Entertainment", "BJU Press", "Blackbirch Press", "Bloomsbury", "Bloomsbury Children's Books", "Bloomsbury USA Childrens", "Bloomsbury Publishing", "Bloomsbury U.S.A Children's Books" "Blue Apple", "Blue Apple Books", "Blue Sky Press", "Bodley Head", "Boxer Books", "Boyds Mills Press", "Bradbury Press", "Bright and Early Books", "Brighter Child", "Brighter Child Interactive", "Brimax Books", "Buddy Books", "Candlewick", "Candlewick Press", "Candlewick Press (MA)", "Candy Cane Press", "Carolrhoda Books", "Carolrhoda Books (R)", "Cartwheel", "Cartwheel Books", "Cavendish Square Publishing", "Cedco Publishing Company", "Chariot Victor Publishing", "Charlesbridge", "Charlesbridge Publishing", "Checkerboard Books", "Checkerboard Press", "Chicken House", "Child's Play International", "Child's World", "Children's Book Press", "Children's Book Press (CA)", "Children's Press", "Children's Press (CT)", "Children's Press (Dublin)", "Children's Press(CT)", "Chouette", "Chouette Editions", "Chouette Publishing", "Chronicle Books", "Chronicle Books (CA)", "Chrysalis Books", "Cinco Puntos Press", "Clarion Books", "Classics Illustrated Junior", "Clarkson N. Potter, Inc.", "Clarkson Potter", "Collins", "Compass Point Books", "Cooper Square Pub", "Corgi Childrens", "Cricket Books", "Crown Books for Young Readers", "Crown Publishers, Inc.", "Crown Publishing Group (NY)", "Da Capo Press", "Dalmatian Press", "David C Cook", "David Fickling Books", "Dawn Publications", "Dawn Publications (CA)", "Delacorte Books for Young Readers", "Delacorte Press", "Dial", "Dial Books", "Dial Books for Young Readers", "Dial Books Young Readers", "Disney Editions", "Disney Press", "Disney-Hyperion", "DK Children", "DK Preschool", "DK Publishing", "DK Publishing (Dorling Kindersley", "DK Publishing (Dorling Kindersley)", "Dodd Mead", "Dodd, Mead", "Dorling Kindersley", "DoubleDay", "Doubleday", "Doubleday Books", "Doubleday Books for Young Readers", "Doubleday Canada", "Doubleday Childrens", "Down East Books", "Dragonfly Books", "Dutton Books", "Dutton Books for Young Readers", "Dutton Children's Books", "Dutton Juvenile", "E.D.C. Publishing", "Educational Development Corporation", "Egmont", "Egmont Books", "Egmont Books (UK)", "Egmont Books Ltd", "Egmont Books, Limited", "Egmont Childrens Books", "Egmont UK", "Element Books", "Enslow Elementary", "Farrar Straus Giroux", "Farrar, Straus and Giroux", "Farrar, Straus and Giroux (BYR)", "Festival Books", "Firefly Books", "First Avenue Editions", "First Avenue Editions (Tm)", "Fitzhenry & Whiteside", "Flashlight Press", "Fleming H. Revell Company", "Floris Books", "Floris Books - Floris Books", "Four Winds", "Four Winds Press", "Frances Lincoln", "Frances Lincoln Children's Books", "Frances Lincoln Ltd", "Franklin Watts", "Franklin Watts Ltd", "Frederick Warne", "Frederick Warne and Company", "Free Spirit Publishing", "Front Street, Incorporated", "G.P. Putnam's Sons", "G.P. Putnam's Sons", "G.P. Putnam's Sons Books for Young Readers", "Gagne International Press", "Gallaudet University Press", "Gallimard", "Gallimard Jeunesse", "Gareth Stevens Publishers", "Garlic Press", "Gibbs Smith", "Gibbs Smith Publishers", "Gingham Dog Press", "Golden Books", "Golden Books (Random House)", "Golden Books Publishing Company", "Golden Books / Western Publishing Company, Inc.", "Golden Books Publishing Company, Inc.", "Golden Press", "Golden/Disney", "Goldencraft", "Good Books", "Good Night Books", "Gramercy", "Greenwillow Books", "Grosset & Dunlap", "Groundwood Books", "GT Publishing Corporation", "GuidepostsBooks", "Gullane Children's Books", "Gullane Children's Books Ltd", "Handprint Books", "Happy Cat Books (UK)", "Harbour Publishing", "Harcourt", "Harcourt Brace", "Harcourt Brace & Company", "Harcourt Brace and Company", "Harcourt Brace Jovanich", "Harcourt Brace Jovanovich", "Harcourt Children's Books", "Harper & Row", "Harper Collins", "Harpercoll", "HarperCollins", "HarperCollins (UK)", "HarperCollins Children's Books", "HarperCollins Publishers", "HarperCollins UK", "HarperCollinsChildren'sBooks", "HarperCollinsChildren‚ÄôsBooks", "HarperCollinsPublishers", "HarperEntertainment", "HarperFestival", "HarperTrophy", "Harrison House", "Harry N. Abrams", "Henry Holt & Company", "Henry Holt and Co.", "Henry Holt and Co. (BYR)", "Henry Holt and Co. BYR Paperbacks", "Henry Holt and Company", "HMH Books for Young Readers", "Hodder & Stoughton", "Hodder Children's Books", "Holiday House", "Holt McDougal", "Holt, Rinehart and Winston, Inc.", "Hoopoe Books", "Houghon Mifflin", "Houghton Mifflin Company", "Houghton Mifflin Harcourt", "Houghton Mifflin Harcourt P", "Hyperion", "Hyperion Books", "Hyperion Books for Children", "Illumination Arts Publishing Company", "innovative KIDS", "Innovative Kids", "Insight Kids", "It Books", "Joy Street Books", "Jump At The Sun", "Just Us Books", "Juventud", "Kane Press", "Kane/Miller", "Kane/Miller Book Publishers", "Kar-Ben Publishing", "Kar-Ben Publishing (R)", "Kar-Ben Publishing (Tm)", "Katherine Tegen Books", "Kids Can Press", "Kingfisher", "Knopf Books for Young Readers", "KO Kids Books", "Kregel Kidzone", "L,B Kids", "Ladybird Books", "Ladybird Books Ltd", "Lark Books", "LB Kids", "Learning Horizons", "Lee & Low Books", "Lemniscaat USA", "Lincoln Children's Books", "Little Simon", "Little Simon (an imprint of Simon and Schuster, Inc.)", "Little Tiger Press", "Little, Brown Books for Young Readers", "Little, Brown Young Readers", "Lobster Press", "London Town Press", "Lothian Books", "Lothrop, Lee & Shepard", "Lothrop, Lee and Shepard", "Lothrop, Lee and Shepard Books", "MacAdam/Cage Publishing", "Mackinac Island Press", "Macmillan Children's Books", "MacMillan Publishing Company", "MacMillan UK", "Magi Publications", "Magic School Bus", "Magination Press", "Make Believe Ideas", "Mammoth", "Margaret K. McElderry", "Margaret K. McElderry Books", "MarshMedia", "McArthur & Company", "McClanahan Book Company", "McGraw-Hill Companies", "Meadowbrook", "Meadowbrook Press", "Megan Tingley Books", "Melanie Kroupa Books", "Mercury Books", "Michael Di Capua Books", "Milet Publishing", "Milk & Cookies", "Millbrook Press (Tm)", "Minedition", "Mitten Press", "Modern Curriculum Press", "Modern Publishing", "Mondo Publishing", "Monterey Bay Aquarium Press", "Moo Press", "Moon Mountain Publishing", "Morrow Junior Books", "Mulberry Books", "Nantier Beall Minoustchine Publishing", "National Center for Youth Issues", "National Geographic Children's", "National Geographic Children's Books", "National Geographic Society", "Nelson Publishing & Marketing", "New York Review Children's Collection", "Night Sky", "North-South", "North-South Books", "Northland Publishing", "NorthSouth", "NorthSouth (NY)", "NorthSouth Books", "Northword Press", "Orca Book Publishers", "Orchard", "Orchard (NY)", "Orchard Books", "Orchard Books (NY)", "Orchard,", "Owlkids Books", "Oxford", "Oxford University Press", "Oxford University Press, USA", "Pan Childrens", "Pan Macmillan", "Pantheon Books", "Parenting Press", "Parents Magazine Press", "Peachtree Publishers", "Peachtree Publishing Company", "Penguin Young Readers", "Penton Kids", "Philomel", "Philomel Books", "Picture Book Studio Ltd", "Picture Corgi", "Picture Puffin Books", "Picture Puffins", "Picture Window Books", "Piggy Toes Press", "Pinata Books", "Price Stern Sloan", "Priddy Books", "Priddy Books Us", "Priddy Books US", "Puffin", "Puffin Bks", "Puffin Books", "Pumpkin House", "Pumpkin House, Ltd.", "Purple Bear Books", "Putnam Juvenile", "Putnam Publishing Group", "R & S Books", "Rainbow Morning Music", "Raincoast Books", "Raintree", "Ramsey Solutions Inc", "Rand McNally", "Random House", "Random House (NY)", "Random House Books for Young Readers", "Random House Children's Books", "Random House Disney", "Random House UK", "Random House USA Inc", "Random House Value Publishing", "Raven Tree Press", "Reader's Digest", "Reader's Digest Association", "Reader's Digest Children's Books", "Reader's Digest Young Families", "Red Fox", "Red Fox Books", "RED FOX BOOKS (RAND)", "Red Fox Mini Treasure", "Red Fox Picture Books", "Red Wagon Books", "RH/Disney", "Rising Moon Books", "Roaring Brook Press", "Roberts Rinehart Publishers", "Robin Corey Books", "Running Press Kids", "Salina Bookshelf", "Salina Bookshelf, Inc.", "Sasquatch Books", "Scholastic", "Scholastic Book Services", "Scholastic Canada", "Scholastic en Espanol", "Scholastic en español", "Scholastic Inc", "Scholastic Inc.", "Scholastic Paperbacks", "Scholastic Press", "Scholastic, Inc.", "Schwartz & Wade", "Schwartz & Wade Books", "Scribble Sons", "Silver Dolphin", "Silver Dolphin Books", "Simon Schuster Books for Young Readers", "Simon Schuster/Paula Wiseman Books", "Simon & Schuster", "Simon & Schuster Books for Young Readers", "Simon & Schuster Children's", "Simon & Schuster Children's Books", "Simon & Schuster Children's Publishing", "Simon & Schuster Ltd", "Simon and Schuster", "Simon Spotlight", "Simon Spotlight / Nickelodeon", "Simon Spotlight Entertainment", "Simply Read Books", "Sleeping Bear Press", "Smart Kidz Publishing", "Smithmark Publishers", "Square Fish", "St Martins Press", "St. Martin's Griffin", "St. Martin's Press", "Sterling", "Sterling Publishing (NY)", "Sterling/Pinwheel", "Stone Arch Books", "Tamarind", "The Chicken House", "The Gryphon Press", "Tiger Tales.", "Tilbury House Publishers", "Toys 'n Things Press", "Tricycle Press", "Troll Associates", "Troll Communications", "Trophy Picture Bks", "Tuckamore Books", "Tundra Books", "Turtleback Books", "Tuttle Publishing", "Two Kids Productions", "Two Lions", "Two Little Hands Productions", "Two Lives Publishing", "Upstart Books", "Usborne", "Usborne / E.D.C. Publishing", "Usborne Books", "Usborne Publishing Ltd", "Viking Books", "Viking Books for Young Readers", "Viking Children's Books", "Viking Juvenile", "Viking Picture Books", "Viking/Puffin,", "Volo", "Walker", "Walker & Company", "Walker Books", "Walker Books and Subsidiaries", "Walker Books Ltd", "Walker Books LTD", "Walker Childrens", "Walker Childrens Paperbacks", "Walker,", "Walrus Books", "Warne", "Western Publishing Company", "Western Publishing Company, Inc.", "Western Publishing Company, Inc./Golden Books", "WordSong", "WorthyKids", "Zero To Ten", " " #where publisher is blank - (I will later convert to unknown) ]) #apply the filter kbs = kidsbooks[filter] kbs.shape "` (10612, 11) "`Python kbs.to_csv('kidsbooks.csv', header=True) "` **Note:** Project continued in notebook number 2 of 3: Cleaning and EDA